Tagged: TITAN Supercomputer

  • richardmitnick 1:13 pm on March 29, 2017
    Tags: TITAN Supercomputer, What's next for Titan?

    From OLCF via TheNextPlatform: “Scaling Deep Learning on an 18,000 GPU Supercomputer” 


    Oak Ridge National Laboratory




    March 28, 2017
    Nicole Hemsoth

    ORNL Cray XK7 Titan Supercomputer

    It is one thing to scale a neural network on a single GPU or even a single system with four or eight GPUs. But it is another thing entirely to push it across thousands of nodes. Most centers doing deep learning have relatively small GPU clusters for training and certainly nothing on the order of the Titan supercomputer at Oak Ridge National Laboratory.

    In the past, the emphasis on machine learning scalability has often focused on node counts for single-model runs. This is useful for some applications, but as neural networks become more integrated into existing workflows, including those in HPC, there is another way to consider scalability. Interestingly, the lesson comes from an HPC application area like weather modeling where, instead of one monolithic model to predict climate, an ensemble of forecasts run in parallel on a massive supercomputer is meshed together for the best result. Using this ensemble method on deep neural networks allows for scalability across thousands of nodes, with the end result derived from an average of the ensemble, something that is acceptable in an area that does not require the kind of precision (in more ways than one) that some HPC calculations do.
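    The ensemble idea above can be sketched in a few lines: each member is trained independently, and the final answer is the mean of their outputs. The members here are stand-in functions (hypothetical placeholders, not ORNL's actual Caffe networks):

```python
def ensemble_predict(members, x):
    """Average the predictions of all independently trained ensemble members."""
    preds = [m(x) for m in members]
    return sum(preds) / len(preds)

# Three toy members whose outputs disagree slightly, as separately
# trained networks would on the same input.
members = [lambda x: 0.9, lambda x: 0.8, lambda x: 1.0]
result = ensemble_predict(members, None)  # mean of the member outputs
```

    Because the final answer is an average, losing or mistraining one member degrades the result only slightly, which is part of why this approach tolerates scale so well.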

    This approach has been used on the Titan supercomputer at Oak Ridge, which is a powerhouse for deep learning training given its high GPU counts. Titan’s 18,688 Tesla K20X GPUs have proven useful for a large number of scientific simulations and are now pulling double-duty on deep learning frameworks, including Caffe, to boost the capabilities of HPC simulations (classification, filtering of noise, etc.). The next generation supercomputer at the lab, the future “Summit” machine (expected to be operational at the end of 2017) will provide even more GPU power with the “Volta” generation Tesla graphics coprocessors from Nvidia, high-bandwidth memory, NVLink for faster data movement, and IBM Power9 CPUs.

    ORNL IBM Summit supercomputer depiction

    ORNL researchers used this ensemble approach to neural networks and were able to stretch these across all of the GPUs in the machine. This is a notable feat, even for the types of large simulations that are built to run on big supercomputers. What is interesting is that while the frameworks might come from the deep learning world (Caffe in ORNL’s case), the node-to-node communication is rooted in HPC. As we have described before, MPI is still the best method out there for fast communication across InfiniBand-connected nodes, and like researchers elsewhere, ORNL has adapted it to deep learning at scale.

    Right now, the team is using each individual node to train an individual deep learning network, but all of those different networks need to be fed the same data if they are training from the same set. The question is how to feed that same data to over 18,000 different GPUs at almost the same time, on a system that wasn’t designed with that in mind. The answer is a custom MPI-based layer that can divvy up the data and distribute it. With the coming Summit supercomputer, the successor to Titan, which will sport six Volta GPUs per node, the other problem is multi-GPU scaling, something application teams across HPC are tackling as well.
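    A sketch of the data-distribution half of that problem, assuming a simple contiguous layout: each rank takes its own shard of the training set, the way an MPI scatter would hand it out. This is plain Python with no actual MPI, and the layout rule is an illustrative assumption, not ORNL's custom layer:

```python
def shard(dataset, rank, nranks):
    """Return the slice of `dataset` that rank `rank` out of `nranks` trains on.
    Shard sizes differ by at most one item, similar to an MPI_Scatterv layout."""
    base, extra = divmod(len(dataset), nranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return dataset[start:stop]

data = list(range(10))
shards = [shard(data, r, 3) for r in range(3)]
# every sample lands on exactly one rank, none duplicated
```

    The design choice here is that each rank computes its own slice from (rank, nranks) alone, so no rank needs to ask another which data it owns.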

    Ultimately, the success of MPI for deep learning at such scale will depend on how many messages the system and MPI can handle, since results must be exchanged between nodes in addition to the thousands of synchronous updates required for training iterations. Each iteration will cause a number of neurons within the network to be updated, so if the network is spread across multiple nodes, all of that will have to be communicated. That is a large enough task on its own, but there is also the delay of the data that needs to be transferred to and from disk (although a burst buffer can be of use here). “There are also new ways of looking at MPI’s guarantees for robustness, which limits certain communication patterns. HPC needs this, but neural networks are more fault-tolerant than many HPC applications,” Patton says. “Going forward, the same I/O is being used to communicate between the nodes and from disk, so when the datasets are large enough the bandwidth could quickly dwindle.”
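    The synchronous updates described above amount to an allreduce-style average: every node contributes its local gradients and every node receives the same mean. A minimal stand-in (a real run would call something like MPI_Allreduce over the interconnect rather than this pure-Python loop):

```python
def allreduce_mean(local_grads):
    """Mean of the same-shaped gradient vectors contributed by every node;
    each node would then apply this identical averaged update."""
    n = len(local_grads)
    dim = len(local_grads[0])
    return [sum(g[i] for g in local_grads) / n for i in range(dim)]

# Two nodes, each holding a 2-element local gradient.
mean_grad = allreduce_mean([[1.0, 2.0], [3.0, 4.0]])  # [2.0, 3.0]
```

    The message count scales with iterations times nodes, which is exactly why the paragraph above worries about how many messages the system and MPI can absorb.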

    In addition to their work scaling deep neural networks across Titan, the team has also developed a method of automatically designing neural networks for use across multiple datasets. Before, a network designed for image recognition could not be reused for speech, but their own auto-designing code has scaled beyond 5,000 (single GPU) nodes on Titan with up to 80 percent accuracy.

    “The algorithm is evolutionary, so it can take design parameters of a deep learning network and evolve those automatically,” Robert Patton, a computational analytics scientist at Oak Ridge, tells The Next Platform. “We can take a dataset that no one has looked at before and automatically generate a network that works well on that dataset.”
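    A toy sketch of that evolutionary idea, with the network design reduced to a single hypothetical parameter (layer count) and fitness standing in for validation accuracy. The real system evolves many design parameters at once and trains a network per candidate to score it:

```python
import random

def evolve(fitness, population, generations=20, seed=0):
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        keep = pop[: len(pop) // 2]          # elitism: best designs survive
        pop = keep + [max(1, p + rng.choice([-1, 1])) for p in keep]
    return max(pop, key=fitness)

# Hypothetical fitness that peaks at 8 layers; the search starts far from it.
best = evolve(lambda layers: -abs(layers - 8), [2, 4, 16, 32])
```

    Because the best design always survives into the next generation, the search never regresses, even though each mutation direction is random.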

    Since developing the auto-generating neural networks, Oak Ridge researchers have been working with key application groups that can benefit from the noise filtering and data classification that large-scale neural nets can provide. These include high-energy particle physics, where they are working with Fermi National Lab to classify neutrinos and subatomic particles. “Simulations produce so much data and it’s too hard to go through it all or even keep it all on disk,” says Patton. “We want to identify things that are interesting in data in real time in a simulation so we can snapshot parts of the data in high resolution and go back later.”

    It is with an eye on “Summit” and the challenges to programming the system that teams at Oak Ridge are swiftly figuring out where deep learning fits into existing HPC workflows and how to maximize the hardware they’ll have on hand.

    “We started taking notice of deep learning in 2012 and things really took off then, in large part because of the move of those algorithms to the GPU, which allowed researchers to speed the development process,” Patton explains. “There has since been a lot of progress made toward tackling some of the hardest problems, and by 2014 we started asking: if one GPU is good for deep learning, what could we do with 18,000 of them on the Titan supercomputer?”

    While large supercomputers like Titan have the hybrid GPU/CPU horsepower for deep learning at scale, they are not built for these kinds of workloads. Some hardware changes in Summit will go a long way toward easing some bottlenecks, but the right combination of hardware might include some non-standard accelerators like neuromorphic devices and other chips to bolster training or inference. “Right now, if we were to use machine learning in real-time for HPC applications, we still have the problem of training. We are loading the data from disk and the processing can’t continue until the data comes off disk, so we are excited for Summit, which will give us the ability to get the data off disk faster in the nodes, which will be thicker, denser and have more memory and storage,” Patton says.

    “It takes a lot of computation on expensive HPC systems to find the distinguishing features in all the noise,” says Patton. “The problem is, we are throwing away a lot of good data. For a field like materials science, for instance, it’s not unlikely for them to pitch more than 90 percent of their data because it’s so noisy and they lack the tools to deal with it.” He says this is also why his teams are looking at integrating novel architectures to offload to, including neuromorphic and quantum computers—something we will talk about more later this week in an interview with ORNL collaborator, Thomas Potok.


    See the full article here.

    Please help promote STEM in your local schools.

    STEM Education Coalition

    ORNL is managed by UT-Battelle for the Department of Energy’s Office of Science. DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time.


    The Oak Ridge Leadership Computing Facility (OLCF) was established at Oak Ridge National Laboratory in 2004 with the mission of accelerating scientific discovery and engineering progress by providing outstanding computing and data management resources to high-priority research and development projects.

    ORNL’s supercomputing program has grown from humble beginnings to deliver some of the most powerful systems in the world. On the way, it has helped researchers deliver practical breakthroughs and new scientific knowledge in climate, materials, nuclear science, and a wide range of other disciplines.

    The OLCF delivered on that original promise in 2008, when its Cray XT “Jaguar” system ran the first scientific applications to exceed 1,000 trillion calculations a second (1 petaflop). Since then, the OLCF has continued to expand the limits of computing power, unveiling Titan in 2013, which is capable of 27 petaflops.

    ORNL Cray XK7 Titan Supercomputer

    Titan is one of the first hybrid architecture systems—a combination of graphics processing units (GPUs), and the more conventional central processing units (CPUs) that have served as number crunchers in computers for decades. The parallel structure of GPUs makes them uniquely suited to process an enormous number of simple computations quickly, while CPUs are capable of tackling more sophisticated computational algorithms. The complementary combination of CPUs and GPUs allows Titan to reach its peak performance.

    The OLCF gives the world’s most advanced computational researchers an opportunity to tackle problems that would be unthinkable on other systems. The facility welcomes investigators from universities, government agencies, and industry who are prepared to perform breakthrough research in climate, materials, alternative energy sources and energy storage, chemistry, nuclear physics, astrophysics, quantum mechanics, and the gamut of scientific inquiry. Because it is a unique resource, the OLCF focuses on the most ambitious research projects—projects that provide important new knowledge or enable important new technologies.

  • richardmitnick 2:39 pm on October 2, 2014
    Tags: TITAN Supercomputer

    From ORNL via Cray Supercomputer Co.: “Q&A: Diving Deep Into Our Solar System” 


    Oak Ridge National Laboratory

    October 1, 2014
    Anthony Mezzacappa

    Anthony Mezzacappa, director of the University of Tennessee–Oak Ridge National Laboratory Joint Institute for Computational Sciences, and a team of computational astrophysicists are conducting one of the largest supernova simulations to date on ORNL’s “Titan” supercomputer. Titan, which is a hybrid Cray® XK7™ supercomputer, is managed by the Oak Ridge Leadership Computing Facility on behalf of the Department of Energy. Dr. Mezzacappa answers our questions about his team’s work on Titan.

    Cray Titan at ORNL

    Q: Why is understanding what triggers a supernova explosion so important?

    A: Supernovae are ultimately responsible for why you and I are here. The class of supernova that our team studies is known as core-collapse supernovae [Type II], and this type of supernova is arguably the most important source of elements in the universe. Core-collapse supernovae are the death throes of massive stars (by massive stars, I’m referring to stars of eight to 10 solar masses and greater). Supernovae are basically stellar explosions that obliterate these stars, leaving the core behind. They are responsible for the lion’s share of elements in the periodic table between oxygen and iron, including the oxygen you breathe and the calcium in your bones, and they are believed to be responsible for half the elements heavier than iron. So through supernova explosions, we’re tied to the cosmos in an intimate way.

    Twenty years ago, astronomers witnessed one of the brightest stellar explosions in more than 400 years. The titanic supernova, called SN 1987A, blazed with the power of 100 million suns for several months following its discovery on Feb. 23, 1987. Observations of SN 1987A, made over the past 20 years by NASA’s Hubble Space Telescope and many other major ground- and space-based telescopes, have significantly changed astronomers’ views of how massive stars end their lives. Astronomers credit Hubble’s sharp vision with yielding important clues about the massive star’s demise.

    This Hubble telescope image shows the supernova’s triple-ring system, including the bright spots along the inner ring of gas surrounding the exploded star. A shock wave of material unleashed by the stellar blast is slamming into regions along the inner ring, heating them up, and causing them to glow. The ring, about a light-year across, was probably shed by the star about 20,000 years before it exploded.

    NASA Hubble Telescope
    NASA/ESA Hubble

    Q: Why is supernova research critical to the progression of astrophysics?

    A: In addition to releasing a lot of the elements that make up ourselves and the nature around us, core-collapse supernovae give birth to neutron stars, which can become pulsars or black holes. So these supernovae are also responsible for the birth of other important objects in the universe that we want to understand.

    Another reason to study supernovae as a key component of astrophysics is that we can actually use supernovae as nuclear physics laboratories. With supernovae, we’re dealing with very high-density physics, with systems that are rich in neutrons and conditions that are difficult to produce in a terrestrial lab. We can use supernova models in conjunction with observations to understand fundamental nuclear physics.

    In all these ways, the “supernova problem,” as we call it, is certainly one of the most important and most challenging problems in astrophysics being answered today.

    Q: Back in 2003, what role did the simulations done on “Phoenix,” the Cray® X1E™ supercomputer, have on supernova research?

    A: The simulations back in 2002, which we published in 2003, led to the discovery of the SASI, or standing accretion shock instability.

    Phoenix was a magnificent machine, and we got a lot of science out of it. On Phoenix, we discovered the SASI and learned that the supernova shock wave, which generates the supernova, is unstable and this instability distorts its shape. The shock wave will become prolate or oblate (cigar-like or pancake-like), which has important ramifications for how these stars explode.

    I think if you take a look at supernova theory from about 1980 and onward, the results we see in our 2D and basic 3D models suggest the SASI is the missing link in obtaining supernova explosions in models that have characteristics commensurate with observations.

    Q: The work in 2003 unlocked the key SASI simulation result that was recently proven through observation. Can you explain the importance of that breakthrough now?

    A: As an x-ray observatory, NuSTAR — which delivered these supporting observations — can see the x-rays emitted from the decay of titanium-44. The reason titanium-44 is so important is because it is produced very deep in the explosion, so it can provide more information about the explosion mechanism than other radiative signatures.


    The map of the titanium-44 x-rays gave researchers a map of the explosion and, as such, it gave us a fingerprint, if you will, of the explosion dynamics, which was consistent with the active presence of the SASI. This is a rare example of a computational discovery being made before there was observational evidence to support it because these latest NuSTAR observations occurred a decade after the SASI was simulated on Phoenix. I think computational discovery is likely to happen more often as models develop and the machines they run on develop with them.

    Q: There are still some nuances in supernova research that aren’t explained by SASI. What is being done to fill in those gaps?

    A: Since the SASI was discovered, all supernova groups have considered the SASI an integral part of supernova dynamics. There is some debate on its importance to the supernova explosion mechanism. Some experts believe it’s there but it’s subcritical to other phenomena; however, everyone believes it’s there and needs to be understood. I think, when all is said and done, we’ll find the SASI is integral to the explosion mechanism.

    The key thing is that Mother Nature works in three dimensions, and earlier simulations have been in 2D. The simulation we are running on Titan now is among the first multiphysics, 3D simulations ever performed.

    It’s not even a nuance so much as this: if you’re trying to understand the role of the SASI and the other parts of the explosion mechanism, it has to be done in 3D, an endeavor that is only now beginning.

    Q: How is working with Titan unlocking better ways to simulate supernova activity?

    A: Unlike earlier 2D simulations, the Titan simulations will model all the important physical properties of a dying massive star in 3D, including gravity, neutrino transport and interaction, and fluid instability.

    For gravity, we’re modeling gravitational fields dictated by the general relativistic theory of gravity (or [Albert] Einstein’s theory of gravity). It’s very important that models include calculations for relativistic gravity rather than Newtonian gravity, which you would use to understand the orbits of the planets around the sun, for instance, although even here a deeper description in terms of Einstein’s theory of gravity can be given.

    Second, the model includes neutrinos. We believe neutrinos actually power these explosions. They are nearly massless particles that behave like radiation in this system and emerge from the center of the supernova. The center of the supernova is like a neutrino bulb radiating at 10^45 watts, and it’s energizing the shock wave by heating the material underneath the wave. There’s a lot of energy in neutrinos, but you only have to tap into a fraction of that energy to generate a supernova.
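    Canonical order-of-magnitude numbers make that “fraction” concrete. These values are illustrative textbook figures, not output of the Titan models: a core-collapse supernova radiates a few times 10^46 joules in neutrinos, while a typical explosion carries about 10^44 joules of kinetic energy (one Bethe, 10^51 erg):

```python
# Rough, canonical energies for a core-collapse supernova (order of magnitude).
neutrino_output_j = 3e46    # total energy radiated in neutrinos
explosion_energy_j = 1e44   # typical explosion kinetic energy (1 Bethe)

# Only this small fraction of the neutrino output must couple to the shock.
fraction = explosion_energy_j / neutrino_output_j
```

    So only of order one percent of the neutrino energy needs to be deposited behind the shock, which is why the heating has to be modeled so carefully.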

    So neutrinos likely power these explosive events, and their production, transport and interaction with the stellar material has to be modeled very carefully.

    Finally, the stellar material is fluid-like, and because you have a heat source (the neutrinos) below that stellar material, convection is going to occur. If you heat a pot of water on the stove, the bubbles that occur during boiling are less dense than the water around them — that’s an instability and that’s why those bubbles rise. Convection is a similar instability that develops. The shock wave is a discontinuity in the stellar fluid, and the SASI is an instability of this shock wave. So convection and the SASI are both operative and are the main fluid instabilities we must model.

    Those are the main components. There are other properties — rotation, magnetic fields, thermonuclear reactions and more — that are important for understanding the formation of elements as well as the explosion mechanism, and these will all be modeled on Titan.

    Q: What are some of the core goals that make up the INCITE project?

    A: The current Titan simulation is representative of supernovae that originate in stars of about 15 solar masses. Later, we will do other runs on Titan at different solar masses — 10, 20, 25. Fifteen solar masses is in the middle of the range of stellar masses we believe result in supernovae, and we’ll compare it to observed supernovae whose progenitor mass is determined to have been at or near 15 solar masses.

    This INCITE project is focusing on how the explosion mechanism works, which is not just limited to the SASI. When we run the model we’ll wait to see: Does it explode? If it does, was it a weak or a robust explosion? Was the explosion energy commensurate with observed energies for stars of that solar mass? What kind of neutron star is left behind after the explosion?

    Q: If you had to sum up the value of this supernova research, considering everything that has been learned from 2003’s simulations to today, what would you say has been the most important lesson?

    A: I would say the most important lesson is that computation is a critical mode of discovery. It is arguably the best tool to understand phenomena that are nonlinear and have many interconnected components. It is very difficult to understand phenomena like core-collapse supernovae analytically with pencil and paper. The SASI had to be discovered computationally because it’s a nonlinear phenomenon. Computation is not just about getting the details. You don’t go into computation knowing the answer but wanting to get the details; there is discovery and surprise in computation. And there’s no better example of that than the discovery of the SASI.

    In addition to his role as the director of ORNL’s Joint Institute for Computational Sciences, Anthony Mezzacappa is the Newton W. and Wilma C. Thomas chair and a professor in the department of physics and astronomy at the University of Tennessee. He is also ORNL corporate fellow emeritus. The team simulating a core-collapse supernova on Titan includes Mezzacappa, Steve Bruenn of Florida Atlantic University, Bronson Messer and Raph Hix of ORNL, Eric Lentz and Austin Harris of the University of Tennessee, Knoxville, and John Blondin of North Carolina State University.

    See the full article here.



    ScienceSprings relies on technology from

    MAINGEAR computers



  • richardmitnick 4:52 pm on August 11, 2014
    Tags: Dark Sky Simulations, TITAN Supercomputer

    From Symmetry: “Open access to the universe” 


    August 08, 2014
    Lori Ann White

    A team of scientists generated a giant cosmic simulation—and now they’re giving it away.

    A small team of astrophysicists and computer scientists has created some of the highest-resolution snapshots yet of a cyber version of our own cosmos. Called the Dark Sky Simulations, they’re among a handful of recent simulations that use more than 1 trillion virtual particles as stand-ins for all the dark matter that scientists think our universe contains.

    Courtesy of Dark Sky Simulations collaboration

    They’re also the first trillion-particle simulations to be made publicly available, not only to other astrophysicists and cosmologists to use for their own research, but to everyone. The Dark Sky Simulations can now be accessed through a visualization program in coLaboratory, a newly announced tool created by Google and Project Jupyter that allows multiple people to analyze data at the same time.

    To make such a giant simulation, the collaboration needed time on a supercomputer. Despite fierce competition, the group won 80 million computing hours on Oak Ridge National Laboratory’s Titan through the Department of Energy’s 2014 INCITE program.


    In mid-April, the group turned Titan loose. For more than 33 hours, they used two-thirds of one of the world’s largest and fastest supercomputers to direct a trillion virtual particles to follow the laws of gravity as translated to computer code, set in a universe that expanded the way cosmologists believe ours has for the past 13.7 billion years.
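    What “follow the laws of gravity as translated to computer code” means at the level of a single timestep can be sketched directly: a softened, direct-sum leapfrog (kick-drift-kick) step under Newtonian gravity, in units where G = 1. This is only an illustration of the numerical idea; production codes like the one Warren works on use tree methods to avoid the O(N^2) pair sum:

```python
def accel(pos, masses, eps=1e-3):
    """Direct-sum gravitational acceleration on each particle (softened)."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(x * x for x in d) + eps * eps   # softening avoids blowups
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += masses[j] * d[k] * inv_r3
    return acc

def leapfrog_step(pos, vel, masses, dt):
    """One kick-drift-kick step: half kick, full drift, half kick."""
    a = accel(pos, masses)
    vel = [[vel[i][k] + 0.5 * dt * a[i][k] for k in range(3)] for i in range(len(vel))]
    pos = [[pos[i][k] + dt * vel[i][k] for k in range(3)] for i in range(len(pos))]
    a = accel(pos, masses)
    vel = [[vel[i][k] + 0.5 * dt * a[i][k] for k in range(3)] for i in range(len(vel))]
    return pos, vel
```

    Two particles released at rest drift toward each other with equal and opposite momentum, as symmetry requires.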

    “This simulation ran continuously for almost two days, and then it was done,” says Michael Warren, a scientist in the Theoretical Astrophysics Group at Los Alamos National Laboratory. Warren has been working on the code underlying the simulations for two decades. “I haven’t worked that hard since I was a grad student.”

    Back in his grad school days, Warren says, simulations with millions of particles were considered cutting-edge. But as computing power increased, particle counts did too. “They were doubling every 18 months. We essentially kept pace with Moore’s Law.”

    When planning such a simulation, scientists make two primary choices: the volume of space to simulate and the number of particles to use. The more particles added to a given volume, the smaller the objects that can be simulated—but the more processing power needed to do it.
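    That trade-off has a simple quantitative form: the dark matter mass each simulation particle represents. A back-of-envelope sketch, where the critical density constant is the standard 2.775e11 h² M⊙/Mpc³ and the matter fraction Ω_m = 0.3 is an illustrative assumption rather than the Dark Sky parameter set:

```python
def particle_mass_msun(box_mpc, n_particles, omega_m=0.3):
    """Mass per particle (in h^-1 Msun units) for a box of side box_mpc (h^-1 Mpc)."""
    rho_crit = 2.775e11                         # h^2 Msun / Mpc^3, critical density
    total_mass = omega_m * rho_crit * box_mpc ** 3
    return total_mass / n_particles

# Doubling the particle count in a fixed box halves the mass per particle,
# i.e. it lets the simulation resolve objects half as massive.
m1 = particle_mass_msun(1000.0, 1e12)
m2 = particle_mass_msun(1000.0, 2e12)
```

    The processing cost, of course, grows with the particle count, which is exactly the tension described above.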

    Current galaxy surveys such as the Dark Energy Survey are mapping out large volumes of space but also discovering small objects. The under-construction Large Synoptic Survey Telescope “will map half the sky and can detect a galaxy like our own up to 7 billion years in the past,” says Risa Wechsler of the Kavli Institute for Particle Astrophysics and Cosmology (KIPAC), who also worked on the simulation. “We wanted to create a simulation that a survey like LSST would be able to compare their observations against.”

    LSST Telescope

    The time the group was awarded on Titan made it possible for them to run something of a Goldilocks simulation, says Sam Skillman, a postdoctoral researcher with the Kavli Institute for Particle Astrophysics and Cosmology, a joint institute of Stanford and SLAC National Accelerator Laboratory. “We could model a very large volume of the universe, but still have enough resolution to follow the growth of clusters of galaxies.”

    The end result of the mid-April run was 500 trillion bytes of simulation data. Then it was time for the team to fulfill the second half of their proposal: They had to give it away.

    They started with 55 trillion bytes: Skillman, Warren and Matt Turk of the National Center for Supercomputing Applications spent the next 10 weeks building a way for researchers to identify just the interesting bits (no pun intended) and use them for further study, all through the Web.

    “The main goal was to create a cutting-edge data set that’s easily accessed by observers and theorists,” says Daniel Holz from the University of Chicago. He and Paul Sutter of the Paris Institute of Astrophysics helped to ensure the simulation was based on the latest astrophysical data. “We wanted to make sure anyone can access this data—data from one of the largest and most sophisticated cosmological simulations ever run—via their laptop.”

    See the full article here.

    Symmetry is a joint Fermilab/SLAC publication.




  • richardmitnick 10:29 am on July 23, 2014
    Tags: TITAN Supercomputer

    From DOE Pulse: “Ames Lab scientist hopes to improve rare earth purification process” 


    July 21, 2014
    Austin Kreber, 515.987.4885,

    Using the second fastest supercomputer in the world, a scientist at the U.S. Department of Energy’s Ames Laboratory is attempting to develop a more efficient process for purifying rare-earth materials.

    Dr. Nuwan De Silva, a postdoctoral research associate at the Ames Laboratory’s Critical Materials Institute, said CMI scientists are homing in on specific types of ligands they believe will bind only with rare-earth metals. By binding to these rare-earth metals, they believe they will be able to extract just the rare-earth metals without them being contaminated with other metals.

    Nuwan De Silva, scientist at the Ames Laboratory, is developing software to help improve purification of rare-earth materials. Photo credit: Sarom Leang

    Rare-earth metals are used in cars, phones, wind turbines, and other devices important to society. De Silva said China now produces 80-90 percent of the world’s supply of rare-earth metals and has imposed export restrictions on them. Because of these new export limitations, many labs, including the CMI, have begun trying to find alternative ways to obtain more rare-earth metals.

    Rare-earth metals are obtained by extracting them from their ore. The current extraction process is not very efficient, and normally the rare-earth metals produced are contaminated with other metals. In addition, the rare-earth elements needed for various applications must be separated from each other, which is a difficult process, one accomplished through solvent extraction using an aqueous acid solution.

    CMI scientists are focusing on certain types of ligands they believe will bind with just rare-earth metals. They will insert a ligand into the acid solution, and it will go right to the metal and bind to it. They can then extract the rare-earth metal with the ligand still bound to it and then remove the ligand in a subsequent step. The result is a rare-earth metal with little or no contaminants from non rare-earth metals. However, because the solution will still contain neighboring rare-earth metals, the process needs to be repeated many times to separate the other rare earths from the desired rare-earth element.

    The ligand is much like someone being sent to an airport to pick someone up. With no information other than a first name — “John” — finding the right person is a long and tedious process. But armed with a description of John’s appearance, height, weight, and what he is doing, finding him would be much easier. For De Silva, John is a rare-earth metal, and the challenge is developing a ligand best adapted to finding and binding to it.

    To find the optimum ligand, De Silva will use Titan to search through all the possible candidates. First, Titan has to discover the properties of a ligand class. To do that, it uses quantum-mechanical (QM) calculations. These QM calculations take around a year to finish.

    ORNL Titan Supercomputer

    Once the QM calculations are finished, Titan uses a program to examine all the parameters of a particular ligand to find the best ligand candidate. These calculations are called molecular mechanics (MM). MM calculations take about another year to accomplish their task.
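    The screening idea behind those QM and MM passes can be sketched as a selectivity ranking: score each candidate ligand by how much more strongly it binds the target rare earth than any other metal present, then keep the most selective one. The ligand names and binding scores below are made up for illustration; real scores would come from the calculations described above:

```python
# Hypothetical candidates: ligand -> {metal: binding score}.
binding = {
    "L1": {"Nd": 5.0, "Pr": 4.8, "Fe": 1.0},
    "L2": {"Nd": 6.0, "Pr": 3.0, "Fe": 0.5},
}

def selectivity(scores, target="Nd"):
    """Binding margin of the target metal over the strongest competitor."""
    others = [v for metal, v in scores.items() if metal != target]
    return scores[target] - max(others)

best = max(binding, key=lambda lig: selectivity(binding[lig]))
```

    Here L2 wins: although both candidates bind Nd, L1 barely distinguishes Nd from its neighbor Pr, which matters because neighboring rare earths are the hardest to separate.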

    “I have over 2,500,000 computer hours on Titan available to me so I will be working with it a lot,” De Silva said. “I think the short term goal of finding one ligand that works will take two years.”

    The CMI isn’t the only lab working on this problem. The Institute is partnering with Oak Ridge National Laboratory, Lawrence Livermore National Laboratory and Idaho National Laboratory as well as numerous other partners. “We are all in constant communication with each other,” De Silva said.

    See the full article here.

    DOE Pulse highlights work being done at the Department of Energy’s national laboratories. DOE’s laboratories house world-class facilities where more than 30,000 scientists and engineers perform cutting-edge research spanning DOE’s science, energy, national security and environmental quality missions. DOE Pulse is distributed twice each month.


