Sherlock Cluster: Adventures in storage

This post is part of our blog series about the behind-the-scenes work we regularly do on Sherlock to keep it up and running in the best possible conditions for our users.
Now that Sherlock’s old storage system has been retired, we can finally tell that story. It all happened in 2016.

Or: How we replaced more than 1 PB of hard drives, while continuing to serve files to unsuspecting users.

TL;DR: The parallel filesystem in Stanford’s largest HPC cluster had been affected by frequent and repeated hard-drive failures since its early days. A defect was identified that affected all 360 disks used in its 6 disk arrays, and a major swap operation was planned to replace the defective drives. Multiple hardware disasters piled up to make matters worse, but in the end, all of the initial disks were replaced, while keeping 1.5 PB of user data intact and the filesystem online the whole time.

