OLCF staff gave talks and technical presentations at an international Lustre workshop March 3 and 4 in Annapolis, Md. ORNL participants in the event included (front row, left to right) Lora Wolfe, Neena Imam, Sarp Oral, and Feiyi Wang;
(back row, left to right) Rick Mohr, Michael Brim, Matt Ezell, Jason Hill, Corliss Thompson, and Blake Caldwell; (not pictured) Josh Lothian and Joel Reed.
Talks contribute to ORNL initiative to expand parallel file system’s capabilities
High-performance supercomputers need high-performance file systems to manage the movement and storage of large amounts of data. For many of the fastest supercomputers in the world, including Titan at the US Department of Energy’s (DOE’s) Oak Ridge National Laboratory (ORNL), the Lustre parallel file system fills that need.
Because of its open source licensing, ability to reduce I/O constraints, and scalability, Lustre has been adopted widely by high-performance computing (HPC) users worldwide. But as the needs of HPC users evolve, so too must Lustre.
To that end, the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science User Facility, played a significant role in an ORNL event to share knowledge and discuss the future development of the parallel file system. The International Workshop on the Lustre Ecosystem: Challenges and Opportunities, which took place March 3 and 4 in Annapolis, Maryland, brought together Lustre users from academia, industry, and government to explore improvements in the parallel file system’s performance and flexibility. OLCF staff members gave talks and technical presentations on both days of the workshop, passing on lessons in managing and optimizing Lustre environments that other users could apply.
The event was organized by the US Department of Defense (DOD)-HPC Research Program at ORNL, a collaboration between DOD and ORNL. The program has interests and competencies in extreme-scale HPC, particularly advanced architectures, metrics, benchmarks, system evaluations, programming environments, fully distributed data centers, and parallel file systems. Neena Imam, Mike Brim, and Sarp Oral of ORNL’s Computing and Computational Sciences Directorate were the workshop co-chairs.
“Historically, the OLCF has been a leader in deploying the largest known Lustre production file system,” said Brim, a research associate in ORNL’s Computer Science and Mathematics Division. “Because of this, we oftentimes run into problems before anyone else. This workshop gave us an opportunity to share the challenges we’ve overcome and make our solutions available to a wider audience who may be following the same path.”
The first day of the program featured a keynote presentation by Eric Barton, lead architect of the High Performance Data Division at Intel and a long-time proponent of Lustre. On day two, presentations covered technical topics, including burst buffer systems, dynamic file striping, and monitoring toolkits for Lustre.
Jason Hill, the OLCF’s HPC Operations storage team leader and tutorial chair for the workshop, led sessions covering networking and the OLCF’s efforts to minimize the effects of file system hardware and software failures.
“Lustre has a lot of flexibility in the way you can configure it,” Hill said. “That’s one of its great powers, but that’s also one of its downfalls. You either have to be an expert in all the areas of the ecosystem that you create or obtain that support from a vendor. The hope is that other members of the Lustre community can benefit from our experience.”
A major focus of the workshop was adapting Lustre to handle diverse, nonscientific workloads efficiently, such as those produced by big data applications. ORNL is currently spearheading this initiative.
“Lustre was designed with scientific simulation in mind, which means it’s good at sequential read and write I/O workloads,” said Oral, file and storage systems team lead for the OLCF Technology Integration Group. “Big data workloads are different, requiring lots of small data reads and randomized access. Lustre is not well suited for these read-heavy, random I/O workloads today. Much of the discussion focused on what could be done to improve Lustre’s capabilities in this area.”
The first step in diversifying Lustre’s I/O workload capabilities is to create tools that measure how the parallel file system currently handles big data workloads, Brim said. “After we’ve characterized those workloads, we can start talking about what changes are necessary to make Lustre a more general purpose, high-performance parallel file system.”
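To make the distinction between the two workload types concrete, here is a minimal Python sketch of what characterizing a read workload might look like. It is an illustration only, not one of the tools discussed at the workshop: the `ReadProfile` class, the `characterize_reads` function, and the 64 KiB cutoff for a "small" read are all hypothetical choices, and the input is assumed to be a list of (offset, size) pairs recorded for a single file.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ReadProfile:
    total_reads: int
    sequential_fraction: float  # share of reads that begin where the previous read ended
    small_fraction: float       # share of reads smaller than the chosen threshold


def characterize_reads(
    reads: Iterable[Tuple[int, int]],
    small_threshold: int = 64 * 1024,  # hypothetical cutoff, in bytes
) -> ReadProfile:
    """Summarize a list of (offset, size) reads issued against one file."""
    reads = list(reads)
    if not reads:
        return ReadProfile(0, 0.0, 0.0)

    sequential = 0
    small = 0
    expected_offset = None
    for offset, size in reads:
        # A read counts as sequential if it starts exactly where the last one ended.
        if expected_offset is not None and offset == expected_offset:
            sequential += 1
        if size < small_threshold:
            small += 1
        expected_offset = offset + size

    # The first read has no predecessor, so it is left out of the denominator.
    transitions = max(len(reads) - 1, 1)
    return ReadProfile(len(reads), sequential / transitions, small / len(reads))


if __name__ == "__main__":
    mib = 1024 * 1024
    # Simulation-style access: large reads, one after another.
    streaming = [(i * mib, mib) for i in range(8)]
    # Analytics-style access: small reads scattered across the file.
    scattered = [(40960, 4096), (0, 4096), (81920, 4096), (8192, 4096)]
    print(characterize_reads(streaming))
    print(characterize_reads(scattered))
```

In this sketch, a simulation-style trace scores high on the sequential fraction and low on the small-read fraction, while an analytics-style trace shows the opposite pattern. A real characterization effort would presumably also need to account for writes, metadata operations, and many clients accessing files at once.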
Enhanced workload capability could help expand Lustre’s user base, historically a niche market, to include organizations and businesses in a growing number of sectors that are leveraging data mining and analytics tools. Increased capability also could benefit long-time Lustre adherents. For example, a more robust Lustre could give computational scientists improved data analysis capabilities, such as real-time data visualization.
“If we can improve the productivity of analysis workloads on Lustre, we can improve the productivity of scientists by giving them insights more quickly,” Brim said.
—Jonathan Hines
Oak Ridge National Laboratory is supported by the US Department of Energy’s Office of Science. The single largest supporter of basic research in the physical sciences in the United States, the Office of Science is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.