
Hundreds of Commits: Stanford Research Computing Reflects on its Contributions to Open Source Software

Featuring 14 notable projects where our team members have met research computing needs while giving back to the open source community.
[Illustration: GitHub's Octocat mascot holding the research computing group's logo in one tentacle]

Across academia and industry, software developers are increasingly contributing their skills and time to open source projects. These range from entirely new applications to changes and add-ons to existing programs, such as modules and plug-ins. The open source model allows developers to build on and improve each other’s work, free of proprietary restrictions and priorities.

Scientific and academic researchers benefit from open source software, too. The collaborative and iterative qualities of open source allow for quicker delivery of modifications to meet the specific needs of a given experiment or study. A core principle behind the open source model, and a goal of open source contributors, is that open sharing among coders helps democratize access to cutting-edge technology and creates opportunities for new collaborations across fields and disciplines.

Several SRCC team members are committed, longtime open source contributors, and they maintain all of their projects on GitHub. In this article we’ll discuss 14 notable projects where our team members have met research computing needs while giving back to the open source community.

SRCC-initiated projects

sasutils – Serial Attached SCSI (SAS) Linux utilities and Python library

sasutils is a tool that helps administrators manage large storage backend fabrics, including those used on Stanford’s own Oak and Sherlock systems. According to HPC systems administrator Stéphane Thiell, manager of the Oak Storage service, the sasutils tool is “quite popular and used by large-scale storage sysadmins around the world, and is readily available in Fedora and EPEL repositories for RHEL.”

lauditd – Lustre changelogs audit daemon

lauditd forwards Lustre Changelogs to log analysis software like Splunk. It was developed at the SRCC and is used in production on Oak storage with Stanford’s Splunk instance to record file system metadata changes. File system auditing is important for maintaining security and meeting compliance requirements, but is also very useful to SRCC system administrators for user support (e.g., to help answer questions like “who deleted these files?”).
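To illustrate the general idea, here is a minimal Python sketch — not lauditd itself — of how a raw changelog record might be wrapped into a JSON event of the kind Splunk's HTTP Event Collector accepts. The sourcetype name and the sample record are made up for illustration:

```python
import json
import time

def changelog_to_hec_event(raw_line, source="lustre-changelog"):
    """Wrap a raw Lustre changelog record into a Splunk HEC-style
    JSON event. The record is forwarded verbatim; parsing individual
    fields is left to the indexer."""
    return json.dumps({
        "time": time.time(),
        "source": source,
        "sourcetype": "lustre:changelog",  # hypothetical sourcetype
        "event": raw_line.rstrip("\n"),
    })

# A simplified, made-up changelog record for a file creation
print(changelog_to_hec_event("12 01CREAT 2023-01-01 file.txt"))
```

A real forwarder would batch these events and POST them to the HEC endpoint; the sketch only shows the per-record transformation.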

fuse-migratefs – Filesystem overlay for transparent data migration

fuse-migratefs is a filesystem overlay for transparent, distributed migration of active data across separate storage systems. It intercepts file system calls made by applications and redirects them to a different storage system, facilitating the migration of data between different storage backends without requiring any changes to the application that uses the data.

The tool is particularly useful for migrating data between systems with different performance characteristics or access methods, such as from a local file system to a network file system. Because fuse-migratefs works transparently, applications don't need to be modified to use the new storage system.
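The lookup policy behind this transparency can be sketched in a few lines. This is a toy illustration only — fuse-migratefs implements it at the FUSE layer in C, intercepting actual system calls — but the policy is the same: prefer the new tier, fall back to the old one for files not yet migrated:

```python
import os

def resolve(relpath, new_root, old_root):
    """Toy illustration of a migration overlay's lookup policy:
    prefer the file on the new storage tier, and fall back to the
    old tier if the file has not been migrated yet."""
    new_path = os.path.join(new_root, relpath)
    if os.path.exists(new_path):
        return new_path
    return os.path.join(old_root, relpath)
```

Because every open goes through a resolver like this, the application keeps using one logical path while the data moves underneath it.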

Stéphane Thiell wrote the tool “as a fork of fuse-overlayfs that I developed for Sherlock’s scratch migration from Regal to Fir.”

“I had a good time developing this project, and we’ve had inquiries as people are making use of it.”

ibswinfo – InfiniBand switch monitoring tool

Kilian Cavalotti, the SRCC’s HPC tech lead and architect, developed ibswinfo as an open source alternative to NVIDIA’s proprietary InfiniBand switch management software. InfiniBand is a high-speed networking technology that is commonly used in HPC environments.

ibswinfo is a command-line tool that can be used to manage unmanaged InfiniBand switches commonly used in large-scale HPC systems, monitor their hardware components and facilitate asset inventory.

slurm-spank-gpu_cmode – Slurm SPANK plugin to set GPU compute mode

Kilian Cavalotti also created the slurm-spank-gpu_cmode Slurm plugin, which allows HPC cluster users to dynamically reconfigure GPU compute modes in their jobs.

The project arose from the need to have some control over GPU settings so that users could run different applications requiring a particular GPU compute mode. The plugin provides this flexibility while also maintaining general environment defaults to ensure optimal performance in most cases.
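In practice, a SPANK plugin like this exposes a per-job option in the batch script. The fragment below is a hypothetical sketch — the exact flag name and accepted values should be verified against the project's README:

```shell
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --gpu_cmode=exclusive   # hypothetical flag: request exclusive-process compute mode

srun ./my_gpu_application
```

The plugin then sets the requested compute mode on the job's GPUs before the application starts, and restores the cluster default afterwards.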

ct_gdrive – Lustre/HSM Google Drive copytool

Stéphane Thiell created ct_gdrive to use Google Drive as an HSM storage tier for Lustre (with transparent data migration). “SRCC used it in 2016 for an experimental project to backup Sherlock data to Google Drive,” Thiell recounts.

“Unfortunately, shortly after that experiment, Google added more restrictions to Google Drive. Still, this project can be used as an example to implement the same thing with other cloud backends.”

XSEDE AMIE DB abstraction Python library

The Account Management Information Exchange (AMIE) software system provides the capability for XSEDE to manage accounts and track resource usage. Developed for XStream, xsede-amie-python is a Python library that creates a database abstraction layer to ease the implementation of AMIE packets by local XSEDE sites. It is no longer in production, as XStream has been decommissioned from XSEDE.

Other projects that SRCC contributes to ...

ClusterShell – Python library and tools

Stéphane Thiell is the lead developer of this cluster administration framework, which is also a powerful parallel shell whose purpose is to replace the traditional “pdsh” tool.

ClusterShell provides a number of features that can simplify the management of cluster environments. For example, it provides a command-line tool called “clush” that can be used to execute commands across multiple nodes simultaneously. It also provides a Python API that can be used to create custom scripts and tools for cluster management.

One of ClusterShell’s key features is its support for multiple communication channels, including SSH, RSH, and Sudo. This means that administrators can choose the most appropriate channel for their cluster environment, depending on factors such as security, performance, and network topology.

ClusterShell also provides a number of powerful tools for working with large cluster node sets, such as the “NodeSet” class that allows administrators to specify sets of nodes using intuitive patterns and ranges. This can simplify operations such as file transfers and command execution. ClusterShell is readily available on many Linux distributions. Its documentation can be found at the ClusterShell ReadMe, and there’s a post about it on the Sherlock cluster news site.
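As a toy illustration of the pattern expansion that ClusterShell's NodeSet class performs (the real API is imported with `from ClusterShell.NodeSet import NodeSet` and handles far more, including padding, nested ranges, and set operations), here is a minimal expander for simple `name[i-j,k]` patterns:

```python
import re

def expand(pattern):
    """Expand a simple node pattern like "node[1-3,5]" into a list of
    host names -- a tiny subset of what ClusterShell's NodeSet supports
    (no zero-padding, nested ranges, or multi-dimensional sets)."""
    m = re.fullmatch(r"(\w+)\[([\d,\-]+)\]", pattern)
    if not m:
        return [pattern]  # plain host name, nothing to expand
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            hosts.extend(f"{prefix}{i}" for i in range(int(lo), int(hi) + 1))
        else:
            hosts.append(f"{prefix}{part}")
    return hosts

print(expand("node[1-3,5]"))  # ['node1', 'node2', 'node3', 'node5']
```

Compact patterns like these are what make commands such as `clush -w node[1-100] uptime` practical on large clusters.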

Lmod – An Environment Module System based on Lua

Lmod is used on Sherlock to manage software environment modules, and SRCC team members have made a few contributions to the project.

Lustre File System

The open-source Lustre file system is used across a number of systems at Stanford, including the Sherlock cluster and others built and administered by the SRCC.

Lustre is an open-source parallel distributed file system designed for large-scale cluster computing that uses a distributed architecture allowing multiple servers to provide access to shared storage devices (or “targets”) over a high-speed network. Lustre’s scalability enables it to handle thousands of clients and hundreds of petabytes of data.
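Lustre distributes a file's data across storage targets (OSTs) in fixed-size stripes, RAID-0 style, which is what lets many servers serve one file in parallel. A minimal sketch of that mapping, assuming simple round-robin placement over the file's stripe objects (hypothetical default values; real layouts are configured per file or directory with `lfs setstripe`):

```python
def ost_for_offset(offset, stripe_size=1 << 20, stripe_count=4):
    """Return the index (within the file's layout) of the stripe object
    holding the byte at `offset`, under round-robin striping:
    stripe 0 -> object 0, stripe 1 -> object 1, and so on."""
    stripe_index = offset // stripe_size
    return stripe_index % stripe_count

# With 1 MiB stripes over 4 objects, byte 5 MiB lands on the second object
print(ost_for_offset(5 * (1 << 20)))  # 1
```

Spreading consecutive stripes across different servers this way is what allows aggregate bandwidth to scale with the number of OSTs.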

SRCC is an active member of OpenSFS, a non-profit organization that aims to keep the Lustre filesystem open. Stéphane Thiell contributes occasional patches for bug fixes, small features (like lctl del_ost to be able to remove an OST on a live system), and many bug reports.

Open OnDemand – Open, Interactive HPC via the Web

Funded by the National Science Foundation (NSF), the Open OnDemand portal enables HPC centers to give their users web-based access to HPC resources. The SRCC team has contributed a number of applications, including Jupyter, RStudio, and TensorBoard.

Robinhood Policy Engine

Robinhood is a versatile tool to monitor filesystem contents and schedule actions on filesystem entries. Stéphane Thiell’s contributions include project/directory quota support — used on the Oak storage service and the Sherlock compute cluster — and the “modeguard” plugin for enforcing specific filesystem permissions.

Slurm: A Highly Scalable Workload Manager

The Slurm Workload Manager is used to manage resources and schedule jobs on the SRCC’s computer clusters. Kilian Cavalotti has contributed patches.

xCAT – eXtreme Cluster/Cloud Administration Toolkit

xCAT is the software used to administer and deploy Sherlock, Oak, and Fir (Sherlock’s /scratch).

Kilian Cavalotti and Stéphane Thiell contributed a number of patches and bug fixes.
