While many were taking a well-deserved break over the Labor Day weekend, the Research Computing team was laboring on the Stanford Research Computing Facility (SRCF) Reboot project.
“When the SRCF was built, we knew that after five years of operation, the electrical backbone — a 12kV system fed from the SLAC substation — would require maintenance to keep it running smoothly for the next five years,” said research computing strategist Phil Reese.
To perform the maintenance safely, the SRCF would have to be shut down for up to 72 hours.
The facility is home to more than 150 racks of high-density servers and petabytes of data that are used by more than 10,000 faculty, research staff and students across Stanford and SLAC. An outage of this magnitude is highly disruptive, so the Research Computing team began planning for the event well in advance.
Planning for the reboot involved ongoing coordination with researchers and system administrators across campus that began back in the fall of 2016.
The reboot team selected Labor Day weekend for the 72-hour shutdown because they found, through years of monitoring, that the facility consistently had lower power use during that weekend.
The plan was to bring down the systems, perform the maintenance, and bring the systems back up in as little time as possible. The communication effort began in earnest in February and took a variety of forms, including individual meetings with researchers, email, event notices on the web page, and a poster that was distributed throughout the SRCF.
Read the rest in the UIT Community Newsletter.