From Oak Ridge National Laboratory: “Faces of Summit: Preparing to Launch”

5.1.18
Katie Elyce Jones

HPC Support Specialist Chris Fuson in the Summit computing center. No image credit.

OLCF’s Chris Fuson works with Summit vendors and OLCF team members to ready Summit’s batch scheduler and job launcher.

The Faces of Summit series shares stories of people working to stand up America’s next top supercomputer for open science, the Oak Ridge Leadership Computing Facility’s Summit. The next-generation machine is scheduled to come online in 2018.

ORNL IBM Summit Supercomputer

At the Oak Ridge Leadership Computing Facility (OLCF), supercomputing staff and users are already talking about what kinds of science problems they will be able to solve once they “get on Summit.”

But before they run their science applications on the 200-petaflop IBM AC922 supercomputer later this year, they will have to go through the system’s batch scheduler and job launcher.

“The batch scheduler and job launcher control access to the compute resources on the new machine,” said Chris Fuson, OLCF high-performance computing (HPC) support specialist. “As a user, you will need to understand these resources to utilize the system effectively.”

A staff member in the User Assistance and Outreach (UAO) Group, Fuson has worked on five flagship supercomputers at OLCF—Cheetah, Phoenix, Jaguar, Titan, and now Summit.

ORNL OLCF Jaguar Cray Linux supercomputer

ORNL Cray XK7 Titan Supercomputer

With a background in programming and computer science, Fuson said he likes to focus on solving the unexpected issues that come up during installation and testing, such as fixing bugs or adding new features to help users navigate the system.

Fuson can often be found standing at his desk listening to background music while he sorts through new tasks, user requests, and technical issues related to job scheduling.

“As the systems change and evolve, the detective work involved in helping users solve problems as they run on a new machine keeps it interesting,” he said.

Of course, the goal is to make the transition to a new system as smooth as possible for users. While still responding to day-to-day tasks related to the OLCF’s current supercomputer, Titan, Fuson and the UAO group also work with IBM to learn, incorporate, and document the IBM Load Sharing Facility (LSF) batch scheduler and the parallel job launcher jsrun for Summit. LSF allocates Summit resources, and jsrun launches jobs on the compute nodes.

“The new launcher provides similar functionality to other parallel job launchers, such as aprun and mpirun, but requires users to take a slightly different approach in determining how to request and lay out resources for a job,” Fuson said.

IBM developed jsrun to meet the unique computing needs of two CORAL partners, the US Department of Energy’s (DOE’s) Oak Ridge and Lawrence Livermore National Laboratories.

“We relayed our workload and scheduling requirements to IBM,” Fuson said. “For example, as a leadership computing facility, we provide priority for large jobs in the batch queue. We work with LSF developers to incorporate our center’s policy requirements and diverse workload needs into the existing scheduler.”
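As a rough illustration of what "priority for large jobs" can mean in a batch queue, the toy Python sketch below orders pending jobs by requested node count, largest first, and breaks ties by submission order. It is not LSF's actual policy or API: the function `dispatch_order`, the job names, and the node counts are hypothetical, and a production scheduler weighs other factors that this article does not describe.

```python
import heapq


def dispatch_order(jobs):
    """Return job names in dispatch order.

    `jobs` is a list of (name, nodes) tuples in submission order.
    Larger node counts go first; equal-sized jobs keep submission order.
    This is a toy model, not how LSF is implemented.
    """
    # Negate the node count so the min-heap pops the largest job first;
    # the submission index breaks ties in first-come, first-served order.
    heap = [(-nodes, seq, name) for seq, (name, nodes) in enumerate(jobs)]
    heapq.heapify(heap)

    order = []
    while heap:
        _, _, name = heapq.heappop(heap)
        order.append(name)
    return order


# Hypothetical queue: names and node counts are made up for illustration.
queue = [
    ("small_test", 2),
    ("climate_run", 4000),
    ("materials_run", 900),
    ("full_machine", 4000),
]
print(dispatch_order(queue))
# ['climate_run', 'full_machine', 'materials_run', 'small_test']
```

In this sketch the two 4,000-node jobs are dispatched ahead of everything else, and the earlier of the two goes first.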

OLCF Center for Accelerated Application Readiness team members, who are optimizing application codes for Summit, have tested LSF and jsrun on Summitdev, an early access system with IBM processors one generation away from Summit’s Power9 processors.

“Early users are already providing feedback,” Fuson said. “There’s a lot of work that goes into getting these pieces polished. At first, it is always a struggle as we work toward production, but things will begin to fall into place.”

To prepare all facility users for scheduling on Summit, Fuson is also developing user documentation and training. In February, he introduced users to jsrun on the monthly User Conference Call for the OLCF, a DOE Office of Science User Facility at ORNL.

“Right now, Summit is a big focus,” he said. “We’ve invested time in learning these new tools and testing them in the Summit environment.”

And what about during his free time when Summit is not the focus? Fuson spends his off-hours scheduling as well. “My hobby is taxiing my kids around town between practices,” he joked.

ORNL is managed by UT-Battelle for the Department of Energy’s Office of Science. DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time.

The Oak Ridge Leadership Computing Facility (OLCF) was established at Oak Ridge National Laboratory in 2004 with the mission of accelerating scientific discovery and engineering progress by providing outstanding computing and data management resources to high-priority research and development projects.

ORNL’s supercomputing program has grown from humble beginnings to deliver some of the most powerful systems in the world. On the way, it has helped researchers deliver practical breakthroughs and new scientific knowledge in climate, materials, nuclear science, and a wide range of other disciplines.

The OLCF delivered on that original promise in 2008, when its Cray XT “Jaguar” system ran the first scientific applications to exceed 1,000 trillion calculations a second (1 petaflop). Since then, the OLCF has continued to expand the limits of computing power, unveiling Titan in 2013, which is capable of 27 petaflops.

Titan is one of the first hybrid architecture systems, combining graphics processing units (GPUs) with the more conventional central processing units (CPUs) that have served as number crunchers in computers for decades. The parallel structure of GPUs makes them uniquely suited to process an enormous number of simple computations quickly, while CPUs are capable of tackling more sophisticated computational algorithms. The complementary combination of CPUs and GPUs allows Titan to reach its peak performance.

The OLCF gives the world’s most advanced computational researchers an opportunity to tackle problems that would be unthinkable on other systems. The facility welcomes investigators from universities, government agencies, and industry who are prepared to perform breakthrough research in climate, materials, alternative energy sources and energy storage, chemistry, nuclear physics, astrophysics, quantum mechanics, and the gamut of scientific inquiry. Because it is a unique resource, the OLCF focuses on the most ambitious research projects—projects that provide important new knowledge or enable important new technologies.