Lessons learned from our recent outage

On Saturday, February 26, one of our network team’s worst nightmares occurred: a major incident took place during planned maintenance at one of our data centres. Thankfully, this occurred on a weekend, our major services were restored the same day, and access was restored to all Cybera services by Monday, February 28.

As a team, we have used such scenarios to brainstorm what gaps exist in our response plans. However, there’s nothing like the real thing to find out where the gaps really are. We’d love to share with you what happened, and what we learned.

What Happened?

Mid-day on the Saturday, staff at the data centre where we house our Calgary network router (as well as the Calgary region of the Rapid Access Cloud) prudently notified us that they needed to shut off all power due to a sudden emergency. We cannot overstate how helpful the data centre’s staff were in communicating with us during a situation no one wants to be in: when well-laid-out plans and multiple contingencies fail.

This event led to an outage on Cybera’s network, the Rapid Access Cloud, our Virtual Firewall Service in Calgary, and the Callysto Hub.

The impact on Cybera’s members was quite extensive, reinforcing the importance of this data centre to our southern Alberta operations.

Specifically, the following groups were disrupted by the data centre shut-down: 

  • Members who connect to Cybera’s network through Calgary, including:
    • Lethbridge members
    • Red Deer members
    • Directly connected Calgary members
  • Anyone trying to access the Rapid Access Cloud and Callysto Hub
  • Calgary-based Virtual Firewall Service users

Members who connect via the SuperNet would have been redirected to Edmonton by way of our BGP failover.

At 12:55 pm, we were notified that power would need to be cut at the data centre until necessary maintenance and clean up were completed. By 1:39 pm, all of Cybera’s infrastructure had been proactively shut down, and the centre’s power was cut shortly thereafter. 

The data centre was able to restore power about five hours later (by 6:43 pm), at which point we began turning our systems back on. This was a meticulous process of confirming that servers and services started correctly, and of taking a “how to start from scratch” process from hypothetical to reality.

The vast majority of our services were restored within roughly three and a half hours of power returning (by 10:15 pm), with some less urgent services restored early on Monday, February 28.
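For readers who want to check the durations, the timeline above can be worked through as simple arithmetic. This is a minimal sketch using only the clock times stated in this post; the year (2022) is an assumption that is not stated here, and the event names are illustrative labels.

```python
from datetime import datetime

# Assumption: the year is not stated in the post. It does not affect the
# durations, because every event below falls on the same Saturday.
DAY = "2022-02-26"


def at(clock: str) -> datetime:
    """Parse a 24-hour clock time on the day of the outage."""
    return datetime.strptime(f"{DAY} {clock}", "%Y-%m-%d %H:%M")


# Clock times as reported in the post (converted to 24-hour format).
events = {
    "notified": at("12:55"),
    "infrastructure_shut_down": at("13:39"),
    "power_restored": at("18:43"),
    "most_services_restored": at("22:15"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    return int((events[end] - events[start]).total_seconds() // 60)


shutdown_minutes = minutes_between("notified", "infrastructure_shut_down")
dark_minutes = minutes_between("infrastructure_shut_down", "power_restored")
recovery_minutes = minutes_between("power_restored", "most_services_restored")

for label, minutes in [
    ("Notice to clean shutdown", shutdown_minutes),
    ("Shutdown to power restored", dark_minutes),
    ("Power restored to most services back", recovery_minutes),
]:
    print(f"{label}: {minutes // 60}h {minutes % 60:02d}m")
```

Keeping a timeline in this form during an incident makes it easy to reuse in the post-mortem without recalculating anything by hand.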

A post-mortem was done later that week to explore what we did well, and what we did less well. We believe heavily in blameless post-mortems, which aim to understand why certain decisions were made at the time, without judgement. Some of our learnings felt painfully obvious (hindsight bias alert). Other parts, where we recognized our team reacted well, still gave us insight into major gaps in our documentation.

So we learned things…

An emergency (or ideally, a game day) is a fantastic opportunity to identify critical gaps that one would not normally encounter in day-to-day work. We’re very proud of how well and quickly our staff responded to the disruption, from shutting down hardware cleanly to prevent unexpected hardware failure, to sending out notifications to members via our Revere system.

We did, however, discover critical gaps in our response.

Communication is always important — both internal and external

While we had posted notices to the outages page on our website and on Twitter, some of our core groups (e.g., Callysto users) should have received direct notifications from us, but didn’t. We apologize for the delays in notification and restoration of service for these Callysto users. This is a fantastic example of something that feels painfully obvious in retrospect.

We are adjusting our standard operating procedure for outages to make sure our contact lists are up to date, and that all impacted users and staff (not just our network folks) are made aware of outages like this. Making something explicit in our notification chain will make it that much easier to avoid inadvertently forgetting people, or leaving groups out of the communication loop for too long.
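One way to make a notification chain explicit is to write down, per service, which groups must be contacted directly. The following is a minimal sketch of that idea, not a description of our actual tooling: the service keys, group names, and the `groups_to_notify` helper are all hypothetical.

```python
# Hypothetical mapping from each service to the groups that must be told
# directly when it is down. All names here are illustrative only.
NOTIFY = {
    "network-calgary": ["network-members", "staff"],
    "rapid-access-cloud": ["rac-users", "staff"],
    "virtual-firewall-calgary": ["vfs-users", "staff"],
    "callysto-hub": ["callysto-users", "staff"],
}


def groups_to_notify(affected_services):
    """Return the de-duplicated groups to contact, in first-seen order.

    Raises KeyError if a service has no entry, so a gap in the chain is
    found when the list is maintained rather than in the middle of an outage.
    """
    missing = [s for s in affected_services if s not in NOTIFY]
    if missing:
        raise KeyError(f"no notification entry for: {', '.join(missing)}")

    seen = []
    for service in affected_services:
        for group in NOTIFY[service]:
            if group not in seen:
                seen.append(group)
    return seen


if __name__ == "__main__":
    # Every service listed in the hypothetical table is affected.
    print(groups_to_notify(list(NOTIFY)))
```

The design choice worth noting is that an unknown service is an error rather than a silent no-op: a service with no contact list is exactly the kind of gap that left a group out of the loop.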

Our internal communication, thankfully, was an all-around success: a constant, clear line of communication between responders allowed others to follow along. This saved considerable time when coordinating work and notification updates. We split communication between several Slack channels, each focused on an individual project, which also removed the need for a phone bridge or video call. It also allowed summaries to be shared easily between channels, so everyone knew where we were on different items.

When planning proves to be too vague

The response to this incident offered multiple instances where our previous planning proved insufficient. It is a fair expectation that, over time, some documentation or planning components age poorly. But which of our steps were left too vague? Were we comfortable enough with the task that we did not need more explicit documentation? Each of these considerations can lead to delays that add up quickly.

We are very thankful that, when the vast majority of issues and action items came up, someone knew exactly what to do. Identifying that we were relying on personal knowledge, rather than information that was written down and available to the whole team, was key. Had someone else been responding, this could have created longer delays as they attempted to seek out solutions.

For example, we realised this was the first time in several years that we had needed to physically restart some of our major hardware (e.g., our router). This is a testament to the longevity of the hardware, but it also shows that, as a team, we had been confident that “restart our router” was sufficient detail in a plan. Looking back, it’s easy to point out steps we should rehearse in more depth. In this instance, that means documenting the extra steps needed to confirm the router is ready to work, and which additional items should be checked.

Restarting from nothing is a great way to reveal catch-22s

While we had previously worked to avoid catch-22s (situations where we have a circular dependency), we still ran into a known issue that caused a minor delay.

Our current system does not allow you to manage our Active Directory servers unless you’re on the VPN or at the Calgary office. The VPN requires Active Directory for authentication, ergo a circular dependency.
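Circular dependencies like this can be caught ahead of time by writing the dependencies down as a graph and searching it for cycles. Below is a minimal sketch using a depth-first search. The node names and the `find_cycle` helper are hypothetical, and the graph is a simplification that assumes the Calgary office is unreachable, so the VPN is the only route to managing Active Directory.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of nodes, or None if there is none.

    `deps` maps each component to the components it needs before it can be
    used. The search is a depth-first traversal that tracks the current path.
    """
    IN_PROGRESS, DONE = 1, 2
    state = {}   # node -> IN_PROGRESS or DONE; absent means unvisited
    path = []    # nodes on the current depth-first path

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for needed in deps.get(node, ()):
            if state.get(needed) == IN_PROGRESS:
                # Reached a node already on the current path: a cycle.
                return path[path.index(needed):] + [needed]
            if needed not in state:
                found = visit(needed)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in list(deps):
        if node not in state:
            found = visit(node)
            if found:
                return found
    return None


# Simplified model of the situation described above (names are illustrative).
cold_start = {
    "vpn": ["active-directory"],   # VPN logins authenticate against AD
    "active-directory": ["vpn"],   # managing AD remotely requires the VPN
}

# The same model with a hypothetical second, independent access path to AD.
with_second_path = {
    "vpn": ["active-directory"],
    "active-directory": [],
}

if __name__ == "__main__":
    print(find_cycle(cold_start))
    print(find_cycle(with_second_path))
```

Running a check like this against a maintained dependency map is one way to surface a catch-22 during planning instead of during a cold start.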

Even though we knew how to work around the catch-22, that did not lessen the feeling of egg on our faces, given it was a known “gotcha.” While we had a planned fix for it, the event and post-mortem led to a much more holistic view of how to deal with such catch-22s. A re-evaluation of how we do authentication, ensuring we have multiple points of access that are suitably secured and audited, should result in significantly less fragility and fewer delays in future events.

What to do when your documentation is unavailable?

Another item we’ve had on our list of things to eventually resolve is the question of “what happens when your documentation isn’t available due to the outage?” Do we know our systems well enough to put them back together without looking up our internal documentation or diagrams?

We’re very thankful that our staff were well positioned on this, and know our systems well enough to restore things without requiring documentation. As with some other items, we can celebrate the skill and experience of our staff, but we must also be aware that relying on it does not set us up for success in the future.

As such, exploration into how to keep an accessible, searchable static export (either PDF or HTML) of our internal documentation wikis is high on our priority list.

Sometimes the details matter a whole lot

Undocumented states of services can be a landmine. For example, we have to ensure certain VLAN states are set up correctly on our Virtual Firewall Service (VFS). It failed to run after the power was restored because its IPv6 connectivity was broken. We had already written a fix for the IPv6 connectivity, but were waiting to deploy it alongside a separate upgrade. We forgot that the fix was not yet in place, which caused a minor delay: we had to investigate why the connectivity was not working before realising it was a “fixed problem”, and then run the script manually.
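One lightweight way to avoid rediscovering a “fixed problem” is to keep a small record of fixes that have been written but not yet deployed, and to consult it whenever a service is restarted. The sketch below illustrates the idea only. The `Fix` record, the service key, and the script path are hypothetical and do not describe our real tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fix:
    """A fix that exists, and whether it is live on the running service."""
    service: str
    summary: str
    deployed: bool
    manual_step: str  # what to run by hand while the fix is not deployed


def manual_steps_for(service, fixes):
    """List the manual steps a responder must run when restarting `service`."""
    return [
        fix.manual_step
        for fix in fixes
        if fix.service == service and not fix.deployed
    ]


# Hypothetical entry modelled on the IPv6 example above.
PENDING = [
    Fix(
        service="virtual-firewall-calgary",
        summary="Restore IPv6 connectivity after restart",
        deployed=False,
        manual_step="run the IPv6 fix script (hypothetical path: ./fix_ipv6.sh)",
    ),
]

if __name__ == "__main__":
    for step in manual_steps_for("virtual-firewall-calgary", PENDING):
        print("Before declaring the service healthy:", step)
```

Once the fix ships with the upgrade, flipping `deployed` to `True` removes the step from the restart checklist.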

Thank you

Last but certainly not least, we want to celebrate what went well. First of all, a big thank you to our members for their understanding and patience during the outage. We also want to thank our network and technical operations teams for their hard work over the weekend.

For those curious, whenever we experience an outage, we endeavour to follow Atlassian’s post-mortem template. Our goal is always to learn how we can improve as a team and how we can improve our systems to be more resilient.

If you have any more questions about the procedure our team followed during the outage, or suggestions for ways we can improve our communications to affected members, please let us know!
