What Caused the Outage at CareGroup? What Are the Root Causes?
A knowledge management application (Napster for Health Care) designed to locate and copy information was left running. The software was new to the system; it had not been configured for this network, nor had it gone through any testing. The application explored the surrounding network and began collecting and copying information. The longer it ran, the more information it tried to copy and transfer. This enormous data transfer monopolized the central switch. When other users issued queries, the switch was so tied up with the large data transfers that it was unable to respond.
The knowledge management application that was left running was merely the final straw. Several underlying issues were at the center of the outage. First, it should be noted that Halamka’s background in networking at this time was in building home networks. In the days leading up to the outage, Cisco had also informed him that the system was becoming dated.
We identified three root causes: the complexity of the system, the Ethernet protocol, and the redundancy that was in place.
Complex system- The system grew as smaller networks were added one at a time. This slow growth had caused the system to go “out of spec”. Each network then had to determine the primary versus the back-up, and multiple small networks all trying to make these connections caused confusion. The algorithms were now failing because the system was out of spec.
Ethernet Protocol- With the switch tied up, the additional queries coming in confused the system. Each new query was another computer trying to pass information; multiple queries kept arriving, and the system was unable to respond to any of them.
The redundant systems that were intended to work together, primary and back-up, were confused and both became primary. Due to this they began