Research Computing Team Efficiently Reboots SRCF Data Center

While many were taking a well-deserved break over the Labor Day weekend, the Research Computing team was laboring on the Stanford Research Computing Facility (SRCF) Reboot project.

“When the SRCF was built, we knew that after five years of operation, the electrical backbone — a 12kV system fed from the SLAC substation — would require maintenance to keep it running smoothly for the next five years,” said research computing strategist Phil Reese.

To safely do the maintenance, the SRCF would have to be shut down for up to 72 hours.

The facility is home to more than 150 racks of high-density servers and petabytes of data that are used by more than 10,000 faculty, research staff and students across Stanford and SLAC. An outage of this magnitude is highly disruptive, so the Research Computing team began planning for the event well in advance.

Planning the reboot

Planning for the reboot involved ongoing coordination with researchers and system administrators across campus that began back in the fall of 2016.

The reboot team selected Labor Day weekend for the 72-hour shutdown because they found, through years of monitoring, that the facility consistently had lower power use during that weekend.

The plan was to bring down the systems, perform the maintenance, and bring the systems back up in as little time as possible. The communication effort began in earnest in February, and consisted of a variety of approaches, including individual meetings with researchers, email, event notices on the web page, and a poster which was distributed throughout the SRCF.

Three-part process over three-day weekend

The weekend began with the reboot team working with researchers and system administrators across campus to ensure that the existing systems were shut down safely by 7:00am on Saturday. The electrical workers and the team performed the maintenance in three steps.

First, the emergency procedures, including the Emergency Power Off (EPO) system and fire suppression, were tested successfully.

Second, the primary electrical work began. The high and medium voltage circuit breakers were removed, cleaned, tested for conformance to standards and reinstalled. In addition, the signal wires within the many electrical cabinets were checked to confirm that they were properly torqued at both ends.

Third, after the building had power again, the team took advantage of the outage to conduct their annual “building fail” test. The test simulated the very real possibility of losing primary power to the building, and it confirmed that the Uninterruptible Power Supply (UPS) worked and that the generators would provide backup electrical power.

Once this work was completed (by early afternoon on Monday), the Research Computing team brought up the centrally-maintained systems and notified researchers and system administrators that the work was complete. The team then helped bring the other systems back up.

Insight gained from a power loss

After two years of planning for a three-day event, Phil and team consider the reboot a success. He said,

“The actual maintenance work proceeded well over the weekend, and, as a bonus, the team gained a better understanding of how important features in the building work.”

