Crews are currently working to clean up a train derailment that took place on March 7 across from the Golden Municipal Campground.
At approximately 4:25 p.m. MT yesterday a westbound freight train derailed 8 cars containing grain near Golden.
What started as a short news item in the Golden Star two weeks ago ended up being a fairly big headache for many network operators, including those who operate the Canadian Research and Education (R&E) network – and its member connections – from coast to coast.
Unbeknownst to most residents of Golden, BC, when the train derailment occurred, it took with it the main fibre route between Alberta and BC. By extension, this meant it also took out the CANARIE primary connection between Vancouver and Calgary.
CANARIE's disaster recovery plan was put into action, notices were sent out to all users, and connections were adjusted to reroute around the problem to other peering points, such as the Toronto Internet Exchange (TORIX) and the New York International Internet Exchange (NYIIX). There was enough capacity in Eastern Canada to keep the discomfort experienced by end users to a minimum.
This should have resolved the crisis and made the R&E connection – while not as good as an intact primary connection to Vancouver and Seattle – good enough for users to attend to their daily online chores.
Unfortunately, CANARIE and the provincial network agencies soon realized that something was not quite right, specifically with Google. All other content traffic was being served out of Toronto or New York, but Google and Facebook traffic had no way back to Seattle from TORIX. As it turns out, this was because Google does not rely on normal routing decisions to serve its content. It uses a mix of metrics, including geographical DNS data, to assign the location from which content should be served. In the case of Cybera's member institutions in Alberta, the Google content has to be served from either the local Google Global Cache in Calgary (for cacheable content), or from Seattle. The only route to Seattle that worked on Monday, March 7, was the secondary Winnipeg-Chicago-Seattle connection, which has limited capacity and was already being used for all other traffic to BCNET and WestGrid.
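To make the difference concrete, here is a toy sketch in Python contrasting the two ways of choosing a serving location. It is an illustration only, not Google's actual logic: the path labels, the candidate list, the geography map and both function names are assumptions made up for this example.

```python
# Toy model only: every site, path label and mapping below is a hypothetical
# stand-in used to illustrate the idea, not real network configuration.

# Paths from Alberta that were still up after the fibre cut.
REACHABLE_PATHS = {
    "Toronto": "eastbound path",
    "New York": "eastbound path",
    "Seattle": "secondary Winnipeg-Chicago-Seattle path",
}


def serve_by_routing(candidate_sites, congested_paths):
    """Return the first candidate site reachable over an uncongested path."""
    for site in candidate_sites:
        path = REACHABLE_PATHS.get(site)
        if path is not None and path not in congested_paths:
            return site
    return None  # no usable site


def serve_by_geo_assignment(client_region, geo_map):
    """Return the site assigned by geography, ignoring the state of the path."""
    return geo_map[client_region]


congested = {"secondary Winnipeg-Chicago-Seattle path"}

# Routing-style selection skips the congested path and falls back eastward.
routed_site = serve_by_routing(["Seattle", "Toronto", "New York"], congested)

# Geography-style assignment keeps pointing Alberta at Seattle regardless.
assigned_site = serve_by_geo_assignment("Alberta", {"Alberta": "Seattle"})
```

In this sketch the routing-style choice moves to Toronto once the Seattle path is congested, while the geography-style assignment stays on Seattle, which is the behaviour that put the extra load on the secondary connection.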
What caused even more problems was the fact that all other Google Global Cache devices hosted by CANARIE are normally 'fed' from Seattle, because of its direct connection with Google. When all this traffic attempted to go back to Seattle, the Google caches in other provinces started to give up, as they could not update their content due to the congestion. Eventually, Google and CANARIE decided it simply made more sense to have Google's return traffic use commercial transit.
Once the fibre was repaired (at 00:10:10 MDT on March 7), everything returned to normal.
CANARIE is currently working on a path from Edmonton to Vancouver, to provide redundancy. Any fibre path through the Rocky Mountains is prone to experience occasional breaks because of avalanches, car accidents, critters munching on the cable, and even the rare train derailment… The redundant path should be in production this summer.
Despite all the machinery, routers and capability to reroute things automatically that one would expect from a modern network, the most important factor is still the people managing it, and their ability to quickly intervene and use good judgement. Having skilled network operators is the most important component of running a large, complex national or provincial network.
Fortunately, in this instance, things were resolved in a fairly efficient manner, with some service degradation but no hard failures.