On Friday, March 8, Cybera experienced an outage that impacted several of our members.
On Friday at 4:50 pm we began maintenance work on the DWDM in Edmonton to reset a backup power supply. This work was successfully completed by 5:00 pm. Shortly after (at 5:10 pm), we started receiving alerts notifying us that the DWDM was down. The main and backup power supplies failed, impacting members in Edmonton without regional redundancy. Peering, IBG (Internet Buying Group), VFS (Virtual Firewall Service), and shared hardware firewall services became unavailable to them. This outage continued until we were able to find and install a working power supply at 7:45 pm.
Once the DWDM came back online, the Peering and IBG services were restored for all members. Unfortunately, the DWDM outage triggered a second failure. It hit the firewalls that were already affected, then spread to the previously working VFS firewalls across both regions. After several hours of debugging, we determined the cause was a Layer 2 spanning tree loop. By 2:00 am on Saturday, the issue was resolved for many of our members.
We did not realize at the time that the Calgary region was experiencing similar issues. Later on Saturday morning, some of our VFS members in Calgary let us know they were having issues (thank you!). A review of the symptoms showed the same problems that Edmonton had experienced, and we were able to resolve the issue in the same way by 10:00 am on Saturday.
Two schools, one in Edmonton and one in Calgary, continued to have firewall issues. Our team worked with them throughout the day to debug, and resolved their issues by late afternoon.
We are compiling a list of lessons learned, which we will review next week. Once we’ve determined our next steps and how we will improve our procedures going forward, we will share that information with our members.