SRCC Open Source Software Projects and Contributions
Several SRCC team members are committed, longtime open source contributors who maintain their projects on GitHub. Below is a list of notable projects through which our team members have met research computing needs while giving back to the open source community.
- sasutils – Serial Attached SCSI (SAS) Linux utilities and Python library
- lauditd – Lustre changelogs audit daemon
- fuse-migratefs – Filesystem overlay for transparent data migration
- ibswinfo – InfiniBand switch monitoring tool
- slurm-spank-gpu_cmode – Slurm SPANK plugin to set GPU compute mode
- ct_gdrive – Lustre/HSM Google Drive copytool
- XSEDE AMIE DB abstraction Python library
- ClusterShell – Python library and tools
- Lmod – An Environment Module System based on Lua
- Lustre File System
- Open OnDemand – Open, Interactive HPC via the Web
- Robinhood Policy Engine
- Slurm: A Highly Scalable Workload Manager
- xCAT – eXtreme Cluster/Cloud Administration Toolkit
sasutils – Serial Attached SCSI (SAS) Linux utilities and Python library
sasutils is a set of command-line utilities and a Python library that helps administrators manage large storage backend fabrics, including those used on Stanford’s own Oak and Sherlock systems. According to HPC systems administrator Stéphane Thiell, manager of the Oak Storage service, sasutils is “quite popular and used by large-scale storage sysadmins around the world, and is readily available in Fedora and EPEL repositories for RHEL.”
lauditd – Lustre changelogs audit daemon
lauditd forwards Lustre Changelogs to log analysis software like Splunk. It was developed at the SRCC and is used in production on Oak storage with Stanford’s Splunk instance to record file system metadata changes. File system auditing is important for maintaining security and meeting compliance requirements, but is also very useful to SRCC system administrators for user support (e.g., to help answer questions like “who deleted these files?”).
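To illustrate the kind of processing a changelog forwarder performs, here is a minimal Python sketch that parses a simplified changelog-style record into JSON suitable for a log analysis backend. Note that the record layout and field names below are illustrative assumptions for demonstration only, not the exact Lustre changelog format that lauditd consumes.

```python
import json

# Simplified, illustrative changelog record. Real Lustre changelog records
# carry additional fields (FIDs, flags, parent directory, etc.); this
# format is an assumption made for the sake of the example.
RECORD = "42 02MKDIR 15:21:37.516131 2018.03.15 results_dir"

def parse_record(line):
    """Split a simplified changelog line into a dict that could be
    forwarded as JSON to a log analysis backend such as Splunk."""
    index, optype, time_, date, name = line.split(maxsplit=4)
    return {
        "index": int(index),
        "operation": optype[2:],  # strip the numeric opcode prefix, e.g. "02"
        "time": time_,
        "date": date,
        "name": name,
    }

event = parse_record(RECORD)
print(json.dumps(event))  # one JSON event per metadata change
```

In a real deployment, a daemon like lauditd reads such records continuously from the Lustre metadata server and streams them to the log platform, which is what makes after-the-fact questions like “who deleted these files?” answerable.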
fuse-migratefs – Filesystem overlay for transparent data migration
fuse-migratefs is a filesystem overlay for transparent, distributed migration of active data across separate storage systems. It intercepts file system calls made by applications and redirects them to a different storage system, facilitating the migration of data between different storage backends without requiring any changes to the application that uses the data.
The tool is particularly useful for migrating data between systems with different performance characteristics or access methods, such as from a local file system to a network file system. Because fuse-migratefs works transparently, applications don't need to be modified to use the new storage system.
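The overlay idea can be sketched in a few lines of Python. This is a conceptual sketch only, not the real implementation (fuse-migratefs intercepts system calls via FUSE in C): reads fall through to the old (“lower”) filesystem when a file has not been migrated yet, while writes land on the new (“upper”) filesystem.

```python
import os
import tempfile

def resolve_read(path, upper, lower):
    """Return the path to read from: the upper layer wins if present."""
    up = os.path.join(upper, path)
    return up if os.path.exists(up) else os.path.join(lower, path)

def write_path(path, upper):
    """All new writes go to the upper (destination) filesystem."""
    full = os.path.join(upper, path)
    os.makedirs(os.path.dirname(full) or ".", exist_ok=True)
    return full

lower = tempfile.mkdtemp()  # stands in for the old storage system
upper = tempfile.mkdtemp()  # stands in for the new storage system

# A legacy file exists only on the old system...
with open(os.path.join(lower, "data.txt"), "w") as f:
    f.write("legacy data")

# ...so reads transparently fall through to it.
with open(resolve_read("data.txt", upper, lower)) as f:
    print(f.read())  # -> legacy data

# Rewriting the file places the new copy on the new system: over time,
# active data migrates to the destination without application changes.
with open(write_path("data.txt", upper), "w") as f:
    f.write("migrated data")
print(resolve_read("data.txt", upper, lower).startswith(upper))  # -> True
```

Because the redirection happens below the application, programs keep using the same paths throughout the migration.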
Stéphane Thiell wrote the tool “as a fork of fuse-overlayfs that I developed for Sherlock’s scratch migration from Regal to Fir.” He adds, “I had a good time developing this project, and we’ve had inquiries as people are making use of it.”
ibswinfo – InfiniBand switch monitoring tool
Kilian Cavalotti, the SRCC’s HPC tech lead and architect, developed ibswinfo as an open source alternative to NVIDIA’s proprietary InfiniBand switch management software. InfiniBand is a high-speed networking technology commonly used in HPC environments.
ibswinfo is a command-line tool for managing the unmanaged InfiniBand switches commonly found in large-scale HPC systems: it monitors their hardware components and facilitates asset inventory.
slurm-spank-gpu_cmode – Slurm SPANK plugin to set GPU compute mode
Kilian Cavalotti also created the slurm-spank-gpu_cmode Slurm plugin, which allows HPC cluster users to dynamically reconfigure GPU compute modes in their jobs.
The project arose from the need to give users some control over GPU settings, so that they could run applications that require a particular GPU compute mode. The plugin provides this flexibility while maintaining general environment defaults to ensure optimal performance in most cases.
ct_gdrive – Lustre/HSM Google Drive copytool
Stéphane Thiell created ct_gdrive to use Google Drive as an HSM storage tier for Lustre (with transparent data migration). “SRCC used it in 2016 for an experimental project to backup Sherlock data to Google Drive,” Thiell recounts.
“Unfortunately, shortly after that experiment, Google added more restrictions to Google Drive. Still, this project can be used as an example to implement the same thing with other cloud backends.”
XSEDE AMIE DB abstraction Python library
The Account Management Information Exchange (AMIE) software system provides the capability for XSEDE to manage accounts and track resource usage. Developed for XStream, xsede-amie-python is a Python library that creates a database abstraction layer to ease the implementation of AMIE packets by local XSEDE sites. The library is no longer in production, as XStream has been decommissioned from XSEDE.
Other projects that SRCC contributes to
ClusterShell – Python library and tools
Stéphane Thiell is the lead developer of this cluster administration framework, a powerful parallel shell designed to replace the traditional “pdsh” tool.
ClusterShell provides a number of features that can simplify the management of cluster environments. For example, it provides a command-line tool called “clush” that can be used to execute commands across multiple nodes simultaneously. It also provides a Python API that can be used to create custom scripts and tools for cluster management.
One of ClusterShell’s key features is its support for multiple communication channels, including SSH, RSH, and Sudo. Administrators can choose the most appropriate channel for their cluster environment, depending on factors such as security, performance, and network topology.
ClusterShell also provides a number of powerful tools for working with large cluster node sets, such as the “NodeSet” class, which allows administrators to specify sets of nodes using intuitive patterns and ranges. This can simplify operations such as file transfers and command execution. ClusterShell is readily available on many Linux distributions, and its documentation can be found in the ClusterShell README.
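The pattern-and-range idea can be illustrated with a small standalone sketch of the expansion that the “NodeSet” class performs. This is not the library itself (the real implementation is far more capable, with set folding, set arithmetic, and multi-dimensional patterns); it is a minimal sketch of the concept, assuming a simple `prefix[ranges]` pattern.

```python
import re

def expand(pattern):
    """Expand a "node[01-03,07]"-style pattern into a list of node names.
    Plain node names without brackets are returned as-is."""
    m = re.fullmatch(r"(\w+)\[([\d,-]+)\]", pattern)
    if not m:
        return [pattern]
    prefix, ranges = m.groups()
    nodes = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            width = len(lo)  # preserve zero padding, e.g. "01" -> width 2
            nodes += [f"{prefix}{i:0{width}d}" for i in range(int(lo), int(hi) + 1)]
        else:
            nodes.append(prefix + part)
    return nodes

print(expand("node[01-03,07]"))  # -> ['node01', 'node02', 'node03', 'node07']
```

With the real library, the same idea is exposed through the `ClusterShell.NodeSet.NodeSet` class, which can also fold an explicit node list back into a compact pattern, which is handy when passing node sets between tools like “clush” and Slurm.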
Lmod – An Environment Module System based on Lua
Lmod is used on Sherlock to manage environment modules, and SRCC team members have made a few contributions to the project.
Lustre File System
Lustre is an open-source parallel distributed file system designed for large-scale cluster computing. Its distributed architecture allows multiple servers to provide access to shared storage devices (or “targets”) over a high-speed network, and its scalability enables it to handle thousands of clients and hundreds of petabytes of data. Lustre is used across a number of systems at Stanford, including the Sherlock cluster and others built and administered by the SRCC.
SRCC is an active member of OpenSFS, a non-profit organization that aims to keep the Lustre filesystem open. Stéphane Thiell contributes occasional patches for bug fixes, small features (like lctl del_ost to be able to remove an OST on a live system), and many bug reports.
Open OnDemand – Open, Interactive HPC via the Web
Funded by the National Science Foundation (NSF), the Open OnDemand portal enables users to access HPC resources through a web browser. The SRCC team has contributed a number of applications, including Jupyter, RStudio, and TensorBoard.
Robinhood Policy Engine
Robinhood is a versatile tool to monitor filesystem contents and schedule actions on filesystem entries. Stéphane Thiell’s contributions include project/directory quota support — used on the Oak storage service and the Sherlock compute cluster — and the “modeguard” plugin for enforcing specific filesystem permissions.
Slurm: A Highly Scalable Workload Manager
xCAT – eXtreme Cluster/Cloud Administration Toolkit
xCAT is the software used to administer and deploy Sherlock, Oak, and Fir (Sherlock’s /scratch).
Kilian Cavalotti and Stéphane Thiell contributed a number of patches and bug fixes.