Network-upgrade horror story

IT executive learns key lessons during four-year effort to get revamp off the ground


It sounded like a no-brainer back in 2003. Replace the aging, 155Mbps ATM-over-SONET network running at the University of California, San Francisco with a state-of-the-art network based on 10G Ethernet over dense wavelength division multiplexing.

During the nine years that the ATM-over-SONET system has been in place, the metropolitan network has grown to 55,000 nodes encompassing two San Francisco campuses, four hospitals and more than 200 remote sites, including regional clinics spread throughout California. The campus network also has evolved into an essential, mission-critical utility, right up there with water and electric power.

Reliability had become a worry, however. Of great concern was the ticking clock: network devices that were at -- or rapidly heading toward -- end-of-life. End-of-life means the vendor no longer provides such essentials as software patches, technical support and replacement of failed hardware components. Cisco's support for the Catalyst 5500s and LS-1010s was waning.

In addition, the demands of video distribution, telemedicine and medical-imaging technologies were quickly making the network outdated. It lacked both QoS and multicast capabilities. That meant e-mail, Web surfing, video and medical images all got the same "best effort" treatment. Video packets were broadcast indiscriminately, causing bottlenecks and congestion. Applications that needed greater bandwidth or QoS, such as those used for remote clinician consultation, patient diagnosis and medical research, could not be carried efficiently -- or at all -- on the network.

Clear sailing in the design phase

In the summer of 2003, a design team of network technologists from campus IT, several campus departments and the medical center began to think about a new network. We considered what technologies offered the best mix of price and performance and which offered the greatest capability for expansion and the lowest risk of downtime.

DWDM quickly became the front-runner among the candidate technologies. It can scale over time from eight lambdas (light-wave channels) all the way to 32 protected lambdas or 64 unprotected lambdas.

DWDM would provide a graceful evolution for the network's ever-increasing demands for capacity and capability. Each individual lambda, running as fast as 10Gbps, can carry a different service. For example, we could run the production Ethernet network over one lambda and a high-definition video feed over another. Or we could choose to provide a secure second Ethernet network for the medical center to connect the university's hospital facilities. This would let secure, electronic, protected health information move across the medical center's clinical network without coming in contact with student and faculty traffic on the campus network.
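To put the scaling numbers in perspective, here is a rough back-of-the-envelope sketch in Python. The function name and the loop labels are ours, invented for illustration. The per-lambda rate is a parameter; the 10Gbps used below is the core-site rate cited in this article, and treating every lambda as running at that rate is an assumption, not a description of how NGMAN was actually provisioned.

```python
def ring_capacity_gbps(lambdas: int, gbps_per_lambda: float) -> float:
    """Aggregate capacity if every lambda runs at the same rate."""
    return lambdas * gbps_per_lambda


# Assumption: every lambda runs at the 10Gbps core-site rate.
LAMBDA_RATE_GBPS = 10.0

# Lambda counts taken from the text: 8 to start, 32 protected or 64 unprotected.
capacities = {
    label: ring_capacity_gbps(count, LAMBDA_RATE_GBPS)
    for label, count in [
        ("initial", 8),
        ("maximum protected", 32),
        ("maximum unprotected", 64),
    ]
}

for label, gbps in capacities.items():
    print(f"{label}: {gbps:.0f} Gbps")
```

The protected figure is half the unprotected one because each protected lambda reserves a second channel running the other way around the ring.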

Then there is the matter of protected and unprotected lambdas. The bane of any optical-fiber-based network is the feared fiber cut. DWDM offers the option of protected lambdas, which run in one direction in the DWDM ring, while working lambdas run in the other direction.

Most DWDM gear has protection-switching that senses the loss of signal from the failed working lambdas and switches to the protected lambdas in less than 50 milliseconds. There are few if any network applications that would notice that short an outage.

To add even more resiliency, we engineered in topology reliability. The new network was designed with diversely routed, dual-concentric rings at the main sites. Thus, a fiber cut or optical failure would have to take out both rings to cause a network failure. Even then, protected lambdas would take over.
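The reasoning behind ring survivability can be illustrated with a small Python sketch. It models a single ring only, with a hypothetical four sites numbered 0 to 3, and checks which sites remain reachable after fiber spans are cut. The function name and the site count are ours, not part of the NGMAN design.

```python
from collections import deque


def reachable(num_sites: int, cut_spans: set, start: int = 0) -> set:
    """Sites reachable from `start` on a ring with some spans cut.

    Span i joins site i and site (i + 1) % num_sites.
    """
    neighbors = {site: [] for site in range(num_sites)}
    for i in range(num_sites):
        if i in cut_spans:
            continue  # this stretch of fiber is down
        j = (i + 1) % num_sites
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Breadth-first search over the surviving spans.
    seen = {start}
    queue = deque([start])
    while queue:
        site = queue.popleft()
        for nxt in neighbors[site]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


SITES = 4  # hypothetical site count
one_cut = reachable(SITES, {1})      # a single fiber cut
two_cuts = reachable(SITES, {1, 3})  # two cuts on the same ring
```

With one cut, every site is still reachable by going the other way around the ring. With two cuts, the single ring splits into two islands, which is the case a second, diversely routed ring is meant to cover.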

Now we had the basis for the new network, which we christened UCSF's Next Generation Metropolitan Area Network (NGMAN).

NGMAN is made up of core and secondary sites. The core consists of the two main campuses and a central administrative building. San Francisco General Hospital, Mount Zion Medical Complex, Laurel Heights Conference Center and the Veterans Administration Medical Center are secondary sites.

Core sites are the locations with the heaviest traffic demands. They also are the sites with the most users. Therefore, they have the highest bandwidth (10Gbps) and the most resiliency. Most secondary sites connect to the core in a point-to-point fashion using unprotected lambdas running at 1Gbps or 10Gbps, depending on their traffic requirements.

The product of building reliability on top of reliability was a resilient, redundant and self-healing network that could survive such events as earthquakes and bioterrorism -- not an unimportant consideration for a patient care network in a seismically active area. In fact, NGMAN's design let it achieve five-nines of reliability -- no more than 5.26 minutes of downtime a year.
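The five-nines figure is simple arithmetic, sketched here in Python. The function name is ours, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def max_downtime_minutes(availability: float) -> float:
    """Maximum yearly downtime, in minutes, allowed by an availability target."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = max_downtime_minutes(0.99999)
print(f"{five_nines:.2f} minutes per year")
```

At 99.999% availability the allowance works out to roughly 5.26 minutes a year. Dropping to four nines (99.99%) raises it tenfold, to about 52.6 minutes.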

UCSF has a "build it and they will come" philosophy. We don't build things frivolously, but we do build them on faith. The university built an entirely new campus at Mission Bay hoping to attract top medical researchers from around the world. A number of educators and researchers in fact made their way to UCSF and wound up doing their research in the new state-of-the-art Mission Bay buildings, which were outfitted with high-performance networks.

There was an element of "build it and they will come" in the NGMAN project as well. The network was built to support future medical applications. It needed to be high-performance and support QoS and multicast. It had to support high-definition video distribution, IP telephony and real-time medical imaging. And it had to be scalable.

We chose a modular approach to minimize forklift upgrades. Modularity extended to more than just the equipment. We intended the modular concept to allow for adding and deleting secondary sites easily. If a site didn't need the full capabilities of DWDM, we could bring it online via alternative technologies, such as optical metropolitan Ethernet service or leased services.