Tagged: Exascale computing

  • richardmitnick 10:47 am on March 28, 2020
    Tags: “Today’s news provides a prime example of how government and industry can work together for the benefit of the entire nation.”, Ensuring the National Nuclear Security Administration laboratories (LLNL, Sandia National Laboratories and Los Alamos National Laboratory) keep the nation’s nuclear stockpile safe, Exascale computing, HPE Cray Shasta El Capitan supercomputer at LLNL, HPE/Cray

    From Lawrence Livermore National Laboratory: “LLNL and HPE to partner with AMD on El Capitan, projected as world’s fastest supercomputer” 

    From Lawrence Livermore National Laboratory


    Jeremy Thomas

    Lawrence Livermore National Laboratory (LLNL), Hewlett Packard Enterprise (HPE) and Advanced Micro Devices Inc. (AMD) today announced the selection of AMD as the node supplier for El Capitan, projected to be the world’s most powerful supercomputer when it is fully deployed in 2023.

    HPE Cray Shasta El Capitan supercomputer at LLNL

    With its advanced computing and graphics processing units (CPUs/GPUs), El Capitan’s peak performance is expected to exceed 2 exaFLOPS, ensuring the National Nuclear Security Administration (NNSA) laboratories — LLNL, Sandia National Laboratories and Los Alamos National Laboratory — can meet their primary mission of keeping the nation’s nuclear stockpile safe, secure and reliable. (An exaFLOP is one quintillion floating point operations per second.)
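
    The unit arithmetic behind these figures can be checked in a few lines. This is only an illustrative sketch: the function name and constants are hypothetical, and the only numbers taken from the article are the 2 exaFLOPS projection and the definition of an exaFLOP.

```python
# Unit arithmetic for the figures quoted in the article.
EXA = 10**18   # one quintillion
PETA = 10**15  # one quadrillion, shown for scale

def exaflops_to_flops(exaflops):
    """Convert a rate in exaFLOPS to floating point operations per second."""
    return exaflops * EXA

# El Capitan's projected peak, per the article: more than 2 exaFLOPS.
peak = exaflops_to_flops(2)
print(peak)          # operations per second
print(peak // PETA)  # the same rate expressed in petaFLOPS
```

    Python integers are arbitrary precision, so the quintillion-scale values stay exact.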

    Funded by the Advanced Simulation and Computing (ASC) program at the Department of Energy’s (DOE) NNSA, El Capitan will perform complex and increasingly predictive modeling and simulation for NNSA’s vital Life Extension Programs (LEPs), which address weapons aging and emergent threat issues in the absence of underground nuclear testing.

    “This unprecedented computing capability, powered by advanced CPU and GPU technology from AMD, will sustain America’s position on the global stage in high-performance computing and provide an observable example of the commitment of the country to maintaining an unparalleled nuclear deterrent,” said LLNL Director Bill Goldstein. “Today’s news provides a prime example of how government and industry can work together for the benefit of the entire nation.”

    El Capitan will be powered by next-generation AMD EPYC processors, code-named “Genoa” and featuring the “Zen 4” processor core, next-generation AMD Radeon Instinct GPUs based on a new compute-optimized architecture for workloads including HPC and AI, and the AMD Radeon Open Compute platform (ROCm) heterogeneous computing software. The nodes will support simulations used by the NNSA labs to address the demands of the LEPs, whose computational requirements are growing due to the ramping up of stockpile modernization efforts and in response to rapidly evolving threats from America’s adversaries.

    Providing enormous computation capability for the energy used, the GPUs will provide the majority of the peak floating-point performance of El Capitan. This enables LLNL scientists to run high-resolution 3D models quicker, as well as increase the fidelity and repeatability of calculations, thus making those simulations truer to life.

    “We have been pursuing a balanced investment effort at NNSA in advancing our codes, our platforms and our facilities in an integrated and focused way,” said Michel McCoy, Weapon Simulation and Computing Program Director at LLNL. “And our teams and industrial partners will deliver this capability as planned to the nation. Naturally, this has required an intimate, sustained partnership with our industry technology partners and between the tri-labs to be successful.”

    Anticipated to be one of the most capable supercomputers in the world, El Capitan will have a significantly greater per-node capability than any current systems, LLNL researchers said. El Capitan’s graphics processors will be amenable to AI and machine learning-assisted data analysis, further propelling LLNL’s sizable investment in AI-driven scientific workloads. These workloads will supplement scientific models that researchers hope will be faster, more accurate and intrinsically capable of quantifying uncertainty in their predictions, and will be increasingly used for stockpile stewardship applications. The use of AMD’s GPUs also is anticipated to dramatically increase El Capitan’s energy efficiency as compared to systems using today’s graphical processors.

    “El Capitan will drive unprecedented advancements in HPC and AI, powered by the next-generation AMD EPYC CPUs and Radeon Instinct GPUs,” said Forrest Norrod, senior vice president and general manager, Datacenter and Embedded Systems Group, AMD. “Building on our strong foundation in high-performance computing and adding transformative coherency capabilities, AMD is enabling the NNSA Tri-Lab community — LLNL, Los Alamos and Sandia national laboratories — to achieve their mission-critical objectives and contribute new AI advancements to the industry. We are extremely proud to continue our exascale work with HPE and NNSA and look forward to the delivery of the most powerful supercomputer in the world, expected in early 2023.”

    El Capitan also will integrate many advanced features that are not yet widely deployed, including HPE’s advanced Cray Slingshot interconnect network, which will enable large calculations across many nodes, an essential requirement for the NNSA laboratories’ simulation workloads. In addition to the capabilities that Cray Slingshot provides, HPE and LLNL are partnering to actively explore new HPE optics technologies that integrate electrical-to-optical interfaces that could deliver higher data transmission at faster speeds with improved power efficiency and reliability. El Capitan also will feature the new Cray Shasta software platform, which will have a new container-based architecture to enable administrators and developers to be more productive, and to orchestrate LLNL’s complex new converged HPC and AI workflows at scale.

    “As an industry and as a nation, we have achieved a major milestone in computing. HPE is honored to support DOE, NNSA and Lawrence Livermore National Laboratory in a critical strategic mission to advance the United States’ position in security and defense,” said Peter Ungaro, senior vice president and general manager, HPC and Mission Critical Systems (MCS), at HPE. “The computing power and capabilities of this system represent a new era of innovation that will unlock solutions to society’s most complex issues and answer questions we never thought were possible.”

    The exascale ecosystem being developed through the sustained efforts of DOE’s Exascale Computing Initiative will further ensure El Capitan has formidable capabilities from day one. Through funding from NNSA’s ASC program, in collaboration with the DOE Office of Science’s Advanced Scientific Computing Research program, ongoing investments in hardware and software technology will assure highly functional hardware and tools to meet DOE’s needs in the next decade. The El Capitan system also will benefit from a partnership with Oak Ridge National Laboratory, which is taking delivery of a similar system from HPE about one year earlier than El Capitan.

    El Capitan would not have been possible without the investments made by DOE’s Exascale PathForward program, which provided funding for American companies including HPE/Cray and AMD to accelerate the technologies necessary to maximize energy efficiency and performance of exascale supercomputers.

    Besides supporting the nuclear stockpile, El Capitan will perform secondary national security missions, including nuclear nonproliferation and counterterrorism. NNSA laboratories are building machine learning and AI into computational techniques and analysis that will benefit NNSA’s primary missions and unclassified projects such as climate modeling and cancer research for DOE.

    See the full article here.


    Please help promote STEM in your local schools.

    Stem Education Coalition

    Operated by Lawrence Livermore National Security, LLC, for the Department of Energy’s National Nuclear Security Administration
    Lawrence Livermore National Laboratory (LLNL) is an American federal research facility in Livermore, California, United States, founded by the University of California, Berkeley in 1952. A Federally Funded Research and Development Center (FFRDC), it is primarily funded by the U.S. Department of Energy (DOE) and managed and operated by Lawrence Livermore National Security, LLC (LLNS), a partnership of the University of California, Bechtel, BWX Technologies, AECOM, and Battelle Memorial Institute in affiliation with the Texas A&M University System. In 2012, the laboratory had the synthetic chemical element livermorium named after it.
    LLNL is self-described as “a premier research and development institution for science and technology applied to national security.” Its principal responsibility is ensuring the safety, security and reliability of the nation’s nuclear weapons through the application of advanced science, engineering and technology. The Laboratory also applies its special expertise and multidisciplinary capabilities to preventing the proliferation and use of weapons of mass destruction, bolstering homeland security and solving other nationally important problems, including energy and environmental security, basic science and economic competitiveness.

    The Laboratory is located on a one-square-mile (2.6 km²) site at the eastern edge of Livermore. It also operates a 7,000-acre (28 km²) remote experimental test site, called Site 300, situated about 15 miles (24 km) southeast of the main lab site. LLNL has an annual budget of about $1.5 billion and a staff of roughly 5,800 employees.

    LLNL was established in 1952 as the University of California Radiation Laboratory at Livermore, an offshoot of the existing UC Radiation Laboratory at Berkeley. It was intended to spur innovation and provide competition to the nuclear weapon design laboratory at Los Alamos in New Mexico, home of the Manhattan Project that developed the first atomic weapons. Edward Teller and Ernest Lawrence,[2] director of the Radiation Laboratory at Berkeley, are regarded as the co-founders of the Livermore facility.

    The new laboratory was sited at a former naval air station of World War II. It was already home to several UC Radiation Laboratory projects that were too large for its location in the Berkeley Hills above the UC campus, including one of the first experiments in the magnetic approach to confined thermonuclear reactions (i.e. fusion). About half an hour southeast of Berkeley, the Livermore site provided much greater security for classified projects than an urban university campus.

    Lawrence tapped 32-year-old Herbert York, a former graduate student of his, to run Livermore. Under York, the Lab had four main programs: Project Sherwood (the magnetic-fusion program), Project Whitney (the weapons-design program), diagnostic weapon experiments (both for the Los Alamos and Livermore laboratories), and a basic physics program. York and the new lab embraced the Lawrence “big science” approach, tackling challenging projects with physicists, chemists, engineers, and computational scientists working together in multidisciplinary teams. Lawrence died in August 1958 and shortly after, the university’s board of regents named both laboratories for him, as the Lawrence Radiation Laboratory.

    Historically, the Berkeley and Livermore laboratories have had very close relationships on research projects, business operations, and staff. The Livermore Lab was established initially as a branch of the Berkeley laboratory. The Livermore lab was not officially severed administratively from the Berkeley lab until 1971. To this day, in official planning documents and records, Lawrence Berkeley National Laboratory is designated as Site 100, Lawrence Livermore National Lab as Site 200, and LLNL’s remote test location as Site 300.[3]

    The laboratory was renamed Lawrence Livermore Laboratory (LLL) in 1971. On October 1, 2007 LLNS assumed management of LLNL from the University of California, which had exclusively managed and operated the Laboratory since its inception 55 years before. The laboratory was honored in 2012 by having the synthetic chemical element livermorium named after it. The LLNS takeover of the laboratory has been controversial. In May 2013, an Alameda County jury awarded over $2.7 million to five former laboratory employees who were among 430 employees LLNS laid off during 2008.[4] The jury found that LLNS breached a contractual obligation to terminate the employees only for “reasonable cause.”[5] The five plaintiffs also have pending age discrimination claims against LLNS, which will be heard by a different jury in a separate trial.[6] There are 125 co-plaintiffs awaiting trial on similar claims against LLNS.[7] The May 2008 layoff was the first layoff at the laboratory in nearly 40 years.[6]

    On March 14, 2011, the City of Livermore officially expanded the city’s boundaries to annex LLNL and move it within the city limits. The unanimous vote by the Livermore city council expanded Livermore’s southeastern boundaries to cover 15 land parcels covering 1,057 acres (4.28 km²) that comprise the LLNL site. The site was formerly an unincorporated area of Alameda County. The LLNL campus continues to be owned by the federal government.



  • richardmitnick 11:00 am on November 23, 2019
    Tags: Argonne Leadership Computing Facility, Cray Intel SC18 Shasta Aurora exascale supercomputer, Exascale computing

    From Argonne Leadership Computing Facility: “Argonne teams up with Altair to manage use of upcoming Aurora supercomputer” 

    Argonne Lab
    News from Argonne National Laboratory

    From Argonne Leadership Computing Facility

    November 19, 2019
    Jo Napolitano

    Depiction of ANL ALCF Cray Intel SC18 Shasta Aurora exascale supercomputer

    The U.S. Department of Energy’s (DOE) Argonne National Laboratory has teamed up with the global technology company Altair to implement a new scheduling system that will be employed on the Aurora supercomputer, slated for delivery in 2021.

    Aurora will be one of the nation’s first exascale systems, capable of performing a billion billion – that’s a quintillion – calculations per second. It will be nearly 100 times faster than Argonne’s current supercomputer, Theta, which went online just two years ago.

    Aurora will be in high demand from researchers around the world and, as a result, will need a sophisticated workload manager to sort and prioritize requested jobs.
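
    To make “sort and prioritize” concrete, here is a minimal sketch of a priority-ordered job queue. It is a toy illustration only, not the COBALT or PBS Professional implementation; the class, method names, job names and priority values are all hypothetical.

```python
import heapq
import itertools

class JobQueue:
    """Toy job queue: lower priority number runs first, ties run in submission order."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves submission order

    def submit(self, name, priority):
        """Add a job; the counter keeps equal-priority jobs in first-come order."""
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_job(self):
        """Remove and return the name of the job that should run next."""
        _priority, _order, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)

# Hypothetical job names and priorities, for illustration only.
queue = JobQueue()
queue.submit("climate-model", priority=2)
queue.submit("brain-mapping", priority=1)
queue.submit("materials-search", priority=2)

order = [queue.next_job() for _ in range(len(queue))]
print(order)
```

    A production workload manager also has to track which nodes are free, enforce usage policies and backfill idle resources, none of which this sketch attempts.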

    Argonne found a natural partner in Altair to meet that need. Founded in 1985 and headquartered in Troy, Michigan, the company provides software and cloud solutions in the areas of product development, high-performance computing (HPC) and data analytics.

    Argonne was initially planning an update to its own workload manager, COBALT (Component-Based Lightweight Toolkit), which was developed 20 years ago within the lab’s own Mathematics and Computer Science Division.

    COBALT has served the Argonne Leadership Computing Facility (ALCF), a DOE Office of Science User Facility, for years, but after careful consideration of several factors, including cost and efficiency, the laboratory determined that a collaboration with Altair on the PBS Professional™ open source solution was the best path forward.

    “When we went to talk to Altair, we were looking for a resource manager (one of the components in a workload manager) we could use,” said Bill Allcock, manager of the Advanced Integration Group at the ALCF. “We decided to collaborate on the entire workload manager rather than just the resource manager because our future roadmaps were well aligned.”

    Altair was already working on a couple of important features that the laboratory wanted to employ with Aurora, Allcock said.

    And most importantly, the teams meshed well together.

    “Exascale will be a huge milestone in HPC — to make better products, to make better decisions, to make the world a better place,” said Bill Nitzberg, chief technology officer of Altair PBS Works™. ​“Getting to exascale requires innovation, especially in systems software, like job scheduling. The partnership between Altair and Argonne will enable effective exascale scheduling, not only for Aurora, but also for the wider HPC world. This is a real 1+1=3 partnership.”

    Aurora is expected to have a significant impact on nearly every field of scientific endeavor, including artificial intelligence. It will improve extreme weather forecasting, accelerate medical treatments, help map the human brain, develop new materials and further our understanding of the universe.

    It will also play a pivotal role in national security and human health.

    “We want to enable researchers to conduct the most important science possible, projects that cannot be done anywhere else in the world because they demand a machine of this size, and this partnership will help us reach this goal,” said Allcock.

    See the full article here.



    Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation’s first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America’s scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science. For more visit http://www.anl.gov.

    About ALCF
    The Argonne Leadership Computing Facility’s (ALCF) mission is to accelerate major scientific discoveries and engineering breakthroughs for humanity by designing and providing world-leading computing facilities in partnership with the computational science community.

    We help researchers solve some of the world’s largest and most complex problems with our unique combination of supercomputing resources and expertise.

    ALCF projects cover many scientific disciplines, ranging from chemistry and biology to physics and materials science. Examples include modeling and simulation efforts to:

    Discover new materials for batteries
    Predict the impacts of global climate change
    Unravel the origins of the universe
    Develop renewable energy technologies

    Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science

    Argonne Lab Campus

  • richardmitnick 10:34 am on November 29, 2018
    Tags: Exascale computing

    From Science Node: “The race to exascale” 

    From Science Node

    30 Jan, 2018
    Alisa Alering

    Who will get the first exascale machine – a supercomputer capable of 10^18 floating point operations per second? Will it be China, Japan, or the US?

    When it comes to computing power you can never have enough. In the last sixty years, processing power has increased more than a trillionfold.

    Researchers around the world are excited because these new, ultra-fast computers represent a 50- to 100-fold increase in speed over today’s supercomputers and promise significant breakthroughs in many areas. That exascale supercomputers are coming is pretty clear. We can even predict the date, most likely in the mid-2020s. But the question remains as to what kind of software will run on these machines.

    Exascale computing heralds an era of ubiquitous massive parallelism, in which processors perform coordinated computations simultaneously. But the number of processors will be so high that computer scientists will have to constantly cope with failing components.
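
    A short calculation shows why component failures become routine at this scale. The sketch below assumes components fail independently with the same probability during a run; the per-component probability and the component counts are made-up illustrative values, not measurements from any real machine.

```python
def prob_any_failure(n_components, p_fail):
    """Probability that at least one of n independent components fails during a run."""
    return 1.0 - (1.0 - p_fail) ** n_components

# Assumed value for illustration: each component has a one-in-a-million
# chance of failing during a given run.
P_FAIL = 1e-6

for n in (1_000, 100_000, 10_000_000):
    print(n, prob_any_failure(n, P_FAIL))
```

    Under these assumptions the chance of a trouble-free run shrinks rapidly as the component count grows, which is why fault tolerance has to be designed into exascale software rather than treated as an exception.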

    The high number of processors will also likely slow programs tremendously. The consequence is that beyond the exascale hardware, we will also need exascale brains to develop new algorithms and implement them in exascale software.

    In 2011, the German Research Foundation established a priority program, “Software for Exascale Computing” (SPPEXA), to address fundamental research on various aspects of high performance computing (HPC) software, making the program the first of its kind in Germany.

    SPPEXA connects relevant sub-fields of computer science with the needs of computational science and engineering and HPC. The program provides the framework for closer cooperation and a co-design-driven approach. This is a shift from the current service-driven collaboration of groups focusing on fundamental HPC methodology (computer science or mathematics) on the one side with those working on science applications and providing the large codes (science and engineering) on the other side.

    Despite exascale computing still being several years away, SPPEXA scientists are well ahead of the game, developing scalable and efficient algorithms that will make the best use of resources when the new machines finally arrive. SPPEXA drives research towards extreme-scale computing in six areas: computational algorithms, system software, application software, data management and exploration, programming, and software tools.

    Some major projects include research on alternative sources of clean energy; stronger, lighter weight steel manufacturing; and unprecedented simulations of the earth’s convective processes:

    EXAHD supports Germany’s long-standing research into the use of plasma fusion as a clean, safe, and sustainable carbon-free energy source. One of the main goals of the EXAHD project is to develop scalable and efficient algorithms to run on distributed systems, with the aim of facilitating the progress of plasma fusion research.

    EXASTEEL is a massively parallel simulation environment for computational material science. Bringing together experts from mathematics, material and computer sciences, and engineering, EXASTEEL will serve as a virtual laboratory for testing new forms of steel with greater strengths and lower weight.

    TerraNeo addresses the challenges of understanding the convection of Earth’s mantle – the cause of most of our planet’s geological activity, from plate tectonics to volcanoes and earthquakes. Due to the sheer scale and complexity of the models, the advent of exascale computing offers a tremendous opportunity for greater understanding. But in order to take full advantage of the coming resources, TerraNeo is working to design new software with optimal algorithms that permit a scalable implementation.

    Exascale hardware is expected to have less consistent performance than current supercomputers due to fabrication, power, and heat issues. Their sheer size and unprecedented number of components will likely increase fault rates. Fast and Fault-Tolerant Microkernel-based Operating System for Exascale Computing (FFMK) aims to address these challenges through a coordinated approach that connects system software, computational algorithms, and application software.

    Mastering the various challenges related to the paradigm shift from moderately to massively parallel processing will be the key to any future capability computing application at exascale. It will also be crucial for learning how to effectively and efficiently deal with near-future commodity systems for smaller-scale or capacity computing tasks. No matter who puts the first machine online, exascale supercomputing is coming. SPPEXA is making sure we are prepared to take full advantage of it.

    See the full article here.


    Science Node is an international weekly online publication that covers distributed computing and the research it enables.

    “We report on all aspects of distributed computing technology, such as grids and clouds. We also regularly feature articles on distributed computing-enabled research in a large variety of disciplines, including physics, biology, sociology, earth sciences, archaeology, medicine, disaster management, crime, and art. (Note that we do not cover stories that are purely about commercial technology.)

    In its current incarnation, Science Node is also an online destination where you can host a profile and blog, and find and disseminate announcements and information about events, deadlines, and jobs. In the near future it will also be a place where you can network with colleagues.

    You can read Science Node via our homepage, RSS, or email. For the complete iSGTW experience, sign up for an account or log in with OpenID and manage your email subscription from your account preferences. If you do not wish to access the website’s features, you can just subscribe to the weekly email.”

  • richardmitnick 5:57 pm on September 5, 2018
    Tags: Exascale computing

    From PPPL and ALCF: “Artificial intelligence project to help bring the power of the sun to Earth is picked for first U.S. exascale system” 

    From PPPL


    Argonne Lab

    Argonne National Laboratory ALCF

    August 27, 2018
    John Greenwald

    Deep Learning Leader William Tang. (Photo by Elle Starkman/Office of Communications.)

    To capture and control the process of fusion that powers the sun and stars in facilities on Earth called tokamaks, scientists must confront disruptions that can halt the reactions and damage the doughnut-shaped devices.


    Now an artificial intelligence system under development at the U.S. Department of Energy’s (DOE) Princeton Plasma Physics Laboratory (PPPL) and Princeton University to predict and tame such disruptions has been selected as an Aurora Early Science project by the Argonne Leadership Computing Facility, a DOE Office of Science User Facility.

    Depiction of ANL ALCF Cray Shasta Aurora supercomputer

    The project, titled “Accelerated Deep Learning Discovery in Fusion Energy Science,” is one of 10 Early Science Projects on data science and machine learning for the Aurora supercomputer, which is set to become the first U.S. exascale system upon its expected arrival at Argonne in 2021. The system will be capable of performing a quintillion (10^18) calculations per second, 50 to 100 times faster than the most powerful supercomputers today.

    Fusion combines light elements

    Fusion combines light elements in the form of plasma — the hot, charged state of matter composed of free electrons and atomic nuclei — in reactions that generate massive amounts of energy. Scientists aim to replicate the process for a virtually inexhaustible supply of power to generate electricity.

    The goal of the PPPL/Princeton University project is to develop a method that can be experimentally validated for predicting and controlling disruptions in burning plasma fusion systems such as ITER — the international tokamak under construction in France to demonstrate the practicality of fusion energy. “Burning plasma” refers to self-sustaining fusion reactions that will be essential for producing continuous fusion energy.

    Heading the project will be William Tang, a principal research physicist at PPPL and a lecturer with the rank and title of professor in the Department of Astrophysical Sciences at Princeton University. “Our research will utilize capabilities to accelerate progress that can only come from the deep learning form of artificial intelligence,” Tang said.

    Networks analogous to a brain

    Deep learning, unlike other types of computational approaches, can be trained to solve with accuracy and speed highly complex problems that require realistic image resolution. Associated software consists of multiple layers of interconnected neural networks that are analogous to simple neurons in a brain. Each node in a network identifies a basic aspect of data that is fed into the system and passes the results along to other nodes that identify increasingly complex aspects of the data. The process continues until the desired output is achieved in a timely way.
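
    The layer-by-layer flow described above can be sketched with a tiny fully connected network. This is a generic illustration, not the FRNN code: the layer sizes, weights, inputs and the choice of tanh activation are all hypothetical.

```python
import math

def dense_layer(inputs, weights, biases):
    """Apply one fully connected layer followed by a tanh activation."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the inputs through each layer in turn and return the final outputs."""
    values = inputs
    for weights, biases in layers:
        values = dense_layer(values, weights, biases)
    return values

# Hypothetical hand-picked weights: 2 inputs -> 3 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]

output = forward([1.0, 2.0], layers)
print(output)
```

    In a trained network the weights are learned from data rather than chosen by hand, and each layer’s outputs become the next layer’s inputs in exactly this way.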

    The PPPL/Princeton deep-learning software is called the “Fusion Recurrent Neural Network (FRNN),” composed of convolutional and recurrent neural nets that allow a user to train a computer to detect items or events of interest. The software seeks to speedily predict when disruptions will break out in large-scale tokamak plasmas, and to do so in time for effective control methods to be deployed.

    The project has greatly benefited from access to the huge disruption-relevant database of the Joint European Torus (JET) in the United Kingdom, the largest and most powerful tokamak in the world today.

    Joint European Torus, at the Culham Centre for Fusion Energy in the United Kingdom

    The FRNN software has advanced from smaller computer clusters to supercomputing systems that can deal with such vast amounts of complex disruption-relevant data. Running the data aims to identify key pre-disruption conditions, guided by insights from first principles-based theoretical simulations, to enable the “supervised machine learning” capability of deep learning to produce accurate predictions with sufficient warning time.

    Access to Tiger computer cluster

    The project has gained from access to Tiger, a high-performance Princeton University cluster equipped with advanced image-resolution GPUs that have enabled the deep learning software to advance to the Titan supercomputer at Oak Ridge National Laboratory and to powerful international systems such as the Tsubame 3.0 supercomputer in Tokyo, Japan.

    Tiger supercomputer at Princeton University

    ORNL Cray XK7 Titan Supercomputer

    Tsubame 3.0 supercomputer in Tokyo, Japan

    The overall goal is to achieve the challenging requirements for ITER, which will need predictions to be 95 percent accurate with fewer than 5 percent false alarms at least 30 milliseconds before disruptions occur.
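
    One way to read these requirements is as a check on a set of labeled test shots. The sketch below is an assumption about how such a check could be scored, not the team’s evaluation code: it treats “95 percent accurate” as the fraction of disruptive shots flagged at least 30 milliseconds ahead and “false alarms” as alarms raised on shots that did not disrupt. The function name, data layout and sample numbers are hypothetical.

```python
def score_predictor(shots, min_warning_ms=30.0, min_tpr=0.95, max_fpr=0.05):
    """Score alarm results against the stated targets.

    shots: list of (disrupted, alarm_lead_ms) pairs, where disrupted is True if
    the shot ended in a disruption and alarm_lead_ms is how long before the end
    of the shot the alarm fired (None if no alarm fired).
    Returns (true_positive_rate, false_alarm_rate, meets_target).
    """
    disruptive = [lead for disrupted, lead in shots if disrupted]
    quiet = [lead for disrupted, lead in shots if not disrupted]
    if not disruptive or not quiet:
        raise ValueError("need at least one disruptive and one non-disruptive shot")

    # An alarm only counts if it leaves enough warning time to act.
    caught = sum(1 for lead in disruptive if lead is not None and lead >= min_warning_ms)
    false_alarms = sum(1 for lead in quiet if lead is not None)

    tpr = caught / len(disruptive)
    fpr = false_alarms / len(quiet)
    return tpr, fpr, (tpr >= min_tpr and fpr < max_fpr)

# Made-up sample: 20 disruptive shots (19 flagged 45 ms ahead, 1 flagged too late)
# and 40 non-disruptive shots (1 false alarm).
sample = [(True, 45.0)] * 19 + [(True, 10.0)] + [(False, None)] * 39 + [(False, 50.0)]
result = score_predictor(sample)
print(result)
```

    Counting a late alarm as a miss reflects the point that a prediction is only useful if control methods have time to respond.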

    ITER Tokamak in Saint-Paul-lès-Durance, which is in southern France

    The team will continue to build on advances that are currently supported by the DOE while preparing the FRNN software for Aurora exascale computing. The researchers will also move forward with related developments on the SUMMIT supercomputer at Oak Ridge.

    ORNL IBM AC922 SUMMIT supercomputer. Credit: Carlos Jones, Oak Ridge National Laboratory/U.S. Dept. of Energy

    Members of the team include Julian Kates-Harbeck, a graduate student at Harvard University and a DOE Office of Science Computational Science Graduate Fellow (CSGF) who is the chief architect of the FRNN. Researchers include Alexey Svyatkovskiy, a big-data, machine learning expert who will continue to collaborate after moving from Princeton University to Microsoft; Eliot Feibush, a big data analyst and computational scientist at PPPL and Princeton, and Kyle Felker, a CSGF member who will soon graduate from Princeton University and rejoin the FRNN team as a post-doctoral research fellow at Argonne National Laboratory.

    See the full article here.



    PPPL campus

    Princeton Plasma Physics Laboratory is a U.S. Department of Energy national laboratory managed by Princeton University. PPPL, on Princeton University’s Forrestal Campus in Plainsboro, N.J., is devoted to creating new knowledge about the physics of plasmas — ultra-hot, charged gases — and to developing practical solutions for the creation of fusion energy. Results of PPPL research have ranged from a portable nuclear materials detector for anti-terrorist use to universally employed computer codes for analyzing and predicting the outcome of fusion experiments. The Laboratory is managed by the University for the U.S. Department of Energy’s Office of Science, which is the largest single supporter of basic research in the physical sciences in the United States, and is working to address some of the most pressing challenges of our time. For more information, please visit science.energy.gov.

  • richardmitnick 4:23 pm on August 22, 2018 Permalink | Reply
    Tags: ECP update, Exascale computing

    From Exascale Computing Project: “Leadership, Collaboration, and a Focus on Key Exascale Challenges” 

    From Exascale Computing Project

    August 21, 2018
    Doug Kothe, ECP Director

    Dear Colleagues:

    As we traverse the second half of the 2018 calendar year, the Exascale Computing Project (ECP) continues to execute confidently toward our mission of accelerating the delivery of a capable exascale computing ecosystem* to coincide with the nation’s first exascale platforms in the early 2020s.

    Our efforts in the project’s critical research areas of application development, software technology, and hardware and integration, supported by about 1,000 researchers, scientists, vendor participants, and project management experts, further intensify as we make significant strides in addressing the four major challenges of exascale computing: parallelism, memory and storage, reliability, and energy consumption.

    Exascale Challenges

    These four challenges were identified in the Mission Need statement for the ECP in March 2016 and must be addressed to bridge the capability gap between existing HPC and exascale HPC. Drawing upon the original descriptions in ECP’s Mission Need document, let me expand on these challenges just a bit.

    Parallelism: Exascale systems will have parallelism (also referred to as concurrency) a thousand-fold greater than petascale systems. Developing systems and applications software is already challenging at the petascale, and increasing concurrency by a factor of a thousand will make software development efforts even more difficult. To mitigate this complexity, a portion of the project’s R&D investments will be on tools that improve the programmability of exascale systems.

    Memory and Storage: In today’s HPC systems, moving data from computer memory into the CPU consumes the greatest amount of time (compared to basic math operations). This data movement challenge is already an issue in petascale systems, and it will become a critical issue in exascale systems. R&D is required to develop memory and storage architectures to provide timely access to and storage of information at the anticipated computational rates.

    Reliability: Exascale systems will contain significantly more components than today’s petascale systems. Achieving system-level reliability, especially with designs based on projected reductions in power, will require R&D that enables systems to adapt dynamically to a potentially constant stream of transient and permanent component failures, and that enables applications to remain resilient in spite of system and device failures so that they still produce accurate results.

    Energy Consumption: To state the obvious, the operating cost of an exascale system built on current technology would be prohibitive. Through pre-ECP programs like Fast Forward and Design Forward and current ECP elements like PathForward, engineering improvements identified with the vendor partners have the potential to significantly reduce the power required. Current estimates indicate initial exascale systems could operate in the range of 20-40 megawatts (MW). Achieving this efficiency level by the mid-2020s requires R&D beyond what the industry vendors had projected on their product roadmaps.
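
    As a rough illustration of what the 20-40 MW range means, the short Python sketch below converts a power budget into the energy efficiency a machine would need to reach one exaFLOPS. It assumes a nominal rate of exactly 1e18 floating point operations per second; the function name is made up for this example and does not come from ECP.

```python
# Energy efficiency implied by a power budget for a 1-exaFLOPS system.
# Assumption: a nominal rate of 1e18 FLOPS; the 20 and 40 MW bounds are
# the estimates quoted in the text above.

EXAFLOPS = 1e18  # floating point operations per second


def gflops_per_watt(flops: float, megawatts: float) -> float:
    """Return the efficiency, in GFLOPS per watt, needed to hit `flops`."""
    watts = megawatts * 1e6
    return flops / watts / 1e9


if __name__ == "__main__":
    for mw in (20, 40):
        print(f"{mw} MW budget -> {gflops_per_watt(EXAFLOPS, mw):.0f} GFLOPS/W")
```

    Under these assumptions, the tighter 20 MW budget requires twice the efficiency of the 40 MW budget (50 versus 25 GFLOPS per watt).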

    How ECP Breaks It All Down—to Bring it All Together

    ECP is a large, complex, visible, and high-priority DOE project. Managing a project as complex as the ECP requires an extraordinary, diverse team of dedicated professionals working in close collaboration. We are fortunate to have recruited such an experienced and widely respected team, from the leadership level all the way through the depths of this organization. The ECP’s principal and co-principal investigators, Control Account Managers (CAMs), researchers, and scientists span a research expertise spectrum that covers mathematics, energy sciences, earth sciences, nuclear science disciplines, computational chemistry, additive manufacturing, precision medicine, cosmology, astrophysics, metagenomics, and the entire range of software tools and libraries necessary to bring a capable exascale ecosystem online.

    This chart depicts the Work Breakdown Structure of the ECP showing the logical segmentation of ECP’s projects under our key focus areas.

    As with any large project, coordination, collaboration and communications are essential to keep us all working in harmony, and at the heart of this infrastructure is the ECP Deputy Director.

    A New Member of the ECP Leadership Team

    I am pleased to announce the selection of a new ECP Deputy Director to replace Stephen Lee, who has decided to retire after a stellar 31-year career at Los Alamos National Laboratory (LANL). Effective August 7, 2018, Lori Diachin from Lawrence Livermore National Laboratory (LLNL) has taken over as the ECP’s Deputy Director.

    Lori has been serving as the Deputy Associate Director for Science and Technology in the Computation Directorate at LLNL since 2017. She has been at LLNL for 15 years and previously worked at Sandia National Laboratories and Argonne National Laboratory. She has held leadership roles in HPC for over 15 years, with experiences ranging from serving as the Director for the Center for Applied Scientific Computing at LLNL to leading multi-laboratory teams such as the FASTMath Institute in the DOE SciDAC program and serving as the Program Director for the HPC4Manufacturing and HPC4Materials programs for DOE’s Office of Energy Efficiency and Renewable Energy and Office of Fossil Energy.

    We are thrilled to have Lori joining our team, but I’d also like to say a few words about Lori’s predecessor, Stephen Lee. Not only has Stephen had an amazing career at LANL, he has been a significant contributor to the growth of the ECP. Stephen was dedicated to this effort from day one and approached his role as a team leader, a hands-on contributor, a brilliant strategist, and a mentor to many of the team members. Stephen was the architect of the ECP’s Preliminary Design Report, a critical, foundational document that was key to solidifying the credibility and conviction among project reviewers that ECP was determined to succeed and moving forward as a well-integrated machine. I believe I speak for all the ECP team members when I say Stephen Lee will be missed and we wish him well in retirement.

    We are extremely fortunate to have Lori taking over this role at such a critical time for the ECP. Lori brings the experience and leadership skills to drive us forward, and on behalf of the entire team, we welcome Lori to this important project role and we look forward to her leadership and contributions as she assumes the role of ECP Deputy Director.

    Recent Accomplishments and Project Highlights

    Along with the exciting news of our new ECP Deputy Director, I recently sat for a video interview with Mike Bernhardt, our ECP Communications Lead, to talk about some of our most recent accomplishments. During that conversation we discussed the newest ECP Co-Design Center, ExaLearn, which is focused on Machine Learning (ML) technologies and is led by Frank Alexander at Brookhaven National Laboratory. The ExaLearn announcement is timely; the center is a collaboration initially consisting of experts from eight multipurpose DOE labs.

    We also covered the recently published ECP Software Technology Capability Assessment Report, an important document that will serve both our own ECP research community and the broader HPC community. The Capability Assessment Report page on the ECP public website gives our followers a good overview of the document, an explanation from our Software Technology Director, Mike Heroux, and a link for downloading the report.

    Another item we discussed is a recent highlight on the ExaSMR project. SMR stands for small modular reactor. This is a project aimed at high-fidelity modeling of coupled neutronics and fluid dynamics to create virtual experimental datasets for SMRs under varying operational scenarios. This capability will help to validate fundamental design parameters including the turbulent mixing conditions necessary for natural circulation and steady-state critical heat flux margins between the moderator and fuel. It will also provide validation for low-order engineering simulations and reduce conservative operational margins, resulting in higher uprates and longer fuel cycles. The ExaSMR product can be thought of as a virtual test reactor for advanced designs via experimental-quality simulations of reactor behavior. In addition to the highlight document, ECP’s Scott Gibson sat down with the ExaSMR principal investigator, Steven Hamilton (ORNL), to discuss this highlight in more detail.

    We wrapped up by chatting about the key role performance measurement plays for a project such as ECP, and we addressed ECP’s efforts in support of software deployment as it relates to the Hardware and Integration focus of ECP.

    We hope you enjoy this video update and we encourage you to send us your thoughts on our newsletter and ECP Communications overall, as well as ideas on topics you’d like to see covered in the future.

    We’re excited to see such strong momentum, and we sincerely appreciate the support of our sponsors, collaborators, and followers throughout the HPC community.

    I look forward to meeting many of you at upcoming events during the second half of this year.

    Doug Kothe

    ECP Director

    *The exascale ecosystem encompasses exascale computing systems, high-end data capabilities, efficient software at scale, libraries, tools, and other capabilities. This information is stated in the US Department of Energy document Crosscut Report, an Office of Science review sponsored by Advanced Scientific Computing Research, Basic Energy Sciences, Biological and Environmental Research, Fusion Energy Sciences, High Energy Physics, Nuclear Physics, March 9–10, 2017.

    Lab Partner Updates

    Argonne National Laboratory

    The High-Tech Evolution of Scientific Computing

    Realizing the promise of exascale computing, the Argonne Leadership Computing Facility is developing the framework by which to harness this immense computing power to an advanced combination of simulation, data analysis, and machine learning. This effort will undoubtedly reframe the way science is conducted, and do so on a global scale.

    Lawrence Berkeley National Laboratory

    Educating for Exascale: Berkeley Lab Hosts Summer School for Next Generation of Computational Chemists

    Some 25 graduate and post-graduate students recently spent four intense days preparing for the next generation of parallel supercomputers and exascale at the Parallel Computing in Molecular Sciences (ParCompMolSci) Summer School and Workshop hosted by Berkeley Lab.

    Held August 6–9 at the Brower Center in downtown Berkeley, the event aimed to “prepare the next generation of computational molecular scientists to use new parallel hardware platforms, such as the [US Department of Energy’s (DOE’s)] exascale computer arriving in 2021,” said Berkeley Lab Senior Scientist Bert de Jong, an organizer of the summer school and one of the scientists behind the DOE Exascale Computing Project’s NWChemEx effort. NWChemEx belongs to the less talked about, but equally necessary half of building exascale systems: software.

    See the full article here.


    About ECP

    The ECP is a collaborative effort of two DOE organizations – the Office of Science and the National Nuclear Security Administration. As part of the National Strategic Computing Initiative, ECP was established to accelerate delivery of a capable exascale ecosystem, encompassing applications, system software, hardware technologies and architectures, and workforce development to meet the scientific and national security mission needs of DOE in the early-2020s time frame.

    About the Office of Science

    DOE’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, please visit https://science.energy.gov/.

    About NNSA

    Established by Congress in 2000, NNSA is a semi-autonomous agency within the DOE responsible for enhancing national security through the military application of nuclear science. NNSA maintains and enhances the safety, security, and effectiveness of the U.S. nuclear weapons stockpile without nuclear explosive testing; works to reduce the global danger from weapons of mass destruction; provides the U.S. Navy with safe and effective nuclear propulsion; and responds to nuclear and radiological emergencies in the United States and abroad. https://nnsa.energy.gov

    The goal of ECP’s Application Development focus area is to deliver a broad array of comprehensive science-based computational applications that effectively utilize exascale HPC technology to provide breakthrough simulation and data analytic solutions for scientific discovery, energy assurance, economic competitiveness, health enhancement, and national security.

    Awareness of ECP and its mission is growing and resonating—and for good reason. ECP is an incredible effort focused on advancing areas of key importance to our country: economic competitiveness, breakthrough science and technology, and national security. And, fortunately, ECP has a foundation that bodes extremely well for the prospects of its success, with the demonstrably strong commitment of the US Department of Energy (DOE) and the talent of some of America’s best and brightest researchers.

    ECP is composed of about 100 small teams of domain, computer, and computational scientists, and mathematicians from DOE labs, universities, and industry. We are tasked with building applications that will execute well on exascale systems, enabled by a robust exascale software stack, and supporting necessary vendor R&D to ensure the compute nodes and hardware infrastructure are adept and able to do the science that needs to be done with the first exascale platforms.

  • richardmitnick 3:21 pm on July 11, 2018 Permalink | Reply
    Tags: Exascale computing

    From MIT Technology Review: “The US may have just pulled even with China in the race to build supercomputing’s next big thing”

    From MIT Technology Review

    July 11, 2018
    Martin Giles

    Ms. Tech

    The US may have just pulled even with China in the race to build supercomputing’s next big thing.

    The two countries are vying to create an exascale computer that could lead to significant advances in many scientific fields.

    There was much celebrating in America last month when the US Department of Energy unveiled Summit, the world’s fastest supercomputer.

    ORNL IBM AC922 SUMMIT supercomputer. Credit: Carlos Jones, Oak Ridge National Laboratory/U.S. Dept. of Energy

    Now the race is on to achieve the next significant milestone in processing power: exascale computing.

    This involves building a machine within the next few years that’s capable of a billion billion calculations per second, or one exaflop, which would make it five times faster than Summit (see chart). Every person on Earth would have to do a calculation every second of every day for just over four years to match what an exascale machine will be able to do in a flash.
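
    The comparisons in this paragraph can be checked with a few lines of Python. The sketch below rests on two assumptions that are not stated in the article: Summit’s peak is taken as roughly 200 petaflops, and the world population as roughly 7.6 billion (an approximate 2018 figure). The function names are illustrative only.

```python
# Back-of-the-envelope check of the exascale comparisons above.
# Assumptions (not from the article): Summit peak of about 200 petaflops
# and a world population of about 7.6 billion people.

EXAFLOP_PER_SECOND = 1e18          # "a billion billion" calculations per second
SUMMIT_PEAK = 200e15               # assumed peak, in calculations per second
WORLD_POPULATION = 7.6e9           # assumed, approximate 2018 value
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def speedup_over(baseline: float) -> float:
    """How many times faster a one-exaflop machine is than `baseline`."""
    return EXAFLOP_PER_SECOND / baseline


def years_for_humanity(operations: float, population: float) -> float:
    """Years needed if every person performs one calculation per second."""
    seconds = operations / population
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"Speedup over Summit: {speedup_over(SUMMIT_PEAK):.1f}x")
    years = years_for_humanity(EXAFLOP_PER_SECOND, WORLD_POPULATION)
    print(f"Years of human effort per exascale second: {years:.2f}")
```

    With these inputs, the speedup comes out at five and the human-effort figure lands a little above four years, in line with the article’s wording. Different population or peak-performance assumptions would shift both numbers slightly.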

    Top500 / MIT Technology Review

    This phenomenal power will enable researchers to run massively complex simulations that spark advances in many fields, from climate science to genomics, renewable energy, and artificial intelligence. “Exascale computers are powerful scientific instruments, much like [particle] colliders or giant telescopes,” says Jack Dongarra, a supercomputing expert at the University of Tennessee.

    The machines will also be useful in industry, where they will be used for things like speeding up product design and identifying new materials. The military and intelligence agencies will be keen to get their hands on the computers, which will be used for national security applications, too.

    The race to hit the exascale milestone is part of a burgeoning competition for technological leadership between China and the US. (Japan and Europe are also working on their own computers; the Japanese hope to have a machine running in 2021 and the Europeans in 2023.)

    In 2015, China unveiled a plan to produce an exascale machine by the end of 2020, and multiple reports over the past year or so have suggested it’s on track to achieve its ambitious goal. But in an interview with MIT Technology Review, Depei Qian, a professor at Beihang University in Beijing who helps manage the country’s exascale effort, explained it could fall behind schedule. “I don’t know if we can still make it by the end of 2020,” he said. “There may be a year or half a year’s delay.”

    Teams in China have been working on three prototype exascale machines, two of which use homegrown chips derived from work on existing supercomputers the country has developed. The third uses licensed processor technology. Qian says that the pros and cons of each approach are still being evaluated, and that a call for proposals to build a fully functioning exascale computer has been pushed back.

    Given the huge challenges involved in creating such a powerful computer, timetables can easily slip, which could make an opening for the US. China’s initial goal forced the American government to accelerate its own road map and commit to delivering its first exascale computer in 2021, two years ahead of its original target. The American machine, called Aurora, is being developed for the Department of Energy’s Argonne National Laboratory in Illinois. Supercomputing company Cray is building the system for Argonne, and Intel is making chips for the machine.

    Depiction of ANL ALCF Cray Shasta Aurora supercomputer

    To boost supercomputers’ performance, engineers working on exascale systems around the world are using parallelism, which involves packing many thousands of chips with millions of processing units known as cores. Finding the best way to get all these to work in harmony requires time-consuming experimentation.

    Moving data between processors, and into and out of storage, also soaks up a lot of energy, which means the cost of operating a machine over its lifetime can exceed the cost of building it. The DoE has set an upper limit of 40 megawatts of power for an exascale computer, which would roughly translate into an electricity budget of $40 million a year.
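
    To see how a 40 MW cap maps onto roughly $40 million a year, the Python sketch below computes the annual energy use of a machine drawing 40 MW continuously and the electricity price that the article’s figures imply. Continuous full-power operation and a 365-day year are simplifying assumptions made here, not statements from the DoE.

```python
# Annual electricity use and implied price for a 40 MW machine.
# Assumptions: the machine draws its full power budget around the clock,
# and a year has 365 days. Neither is stated in the article.

HOURS_PER_YEAR = 24 * 365


def annual_energy_kwh(megawatts: float) -> float:
    """Energy consumed in one year, in kilowatt-hours."""
    return megawatts * 1000 * HOURS_PER_YEAR


def implied_price_per_kwh(annual_cost_usd: float, megawatts: float) -> float:
    """Electricity price (USD per kWh) implied by a yearly budget."""
    return annual_cost_usd / annual_energy_kwh(megawatts)


if __name__ == "__main__":
    energy = annual_energy_kwh(40)
    price = implied_price_per_kwh(40e6, 40)
    print(f"Annual energy: {energy / 1e6:.1f} million kWh")
    print(f"Implied price: ${price:.3f} per kWh")
```

    Under these assumptions the budget works out to about $1 million per megawatt per year, or a little over 11 cents per kilowatt-hour.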

    To lower power consumption, engineers are placing three-dimensional stacks of memory chips as close as possible to compute cores to reduce the distance data has to travel, explains Steve Scott, the chief technology officer of Cray. And they’re increasingly using flash memory, which uses less power than alternative systems such as disk storage. Reducing these power needs makes it cheaper to store data at various points during a calculation, and that saved data can help an exascale machine recover quickly if a glitch occurs.

    Such advances have helped the team behind Aurora. “We’re confident of [our] ability to deliver it in 2021,” says Scott.

    More US machines will follow. In April the DoE announced a request for proposals worth up to $1.8 billion for two more exascale computers to come online between 2021 and 2023. These are expected to cost $400 million to $600 million each, with the remaining money being used to upgrade Aurora or even create a follow-on machine.

    Both China and America are also funding work on software for exascale machines. China reportedly has teams working on some 15 application areas, while in the US, teams are working on 25, including applications in fields such as astrophysics and materials science. “Our goal is to deliver as many breakthroughs as possible,” says Katherine Yelick, the associate director for computing sciences at Lawrence Berkeley National Laboratory, who is part of the leadership team coordinating the US initiative.

    While there’s plenty of national pride wrapped up in the race to get to exascale first, the work Yelick and other researchers are doing is a reminder that raw exascale computing power isn’t the true test of success here; what really matters is how well it’s harnessed to solve some of the world’s toughest problems.

    See the full article here.


    The mission of MIT Technology Review is to equip its audiences with the intelligence to understand a world shaped by technology.

  • richardmitnick 1:57 pm on July 7, 2018 Permalink | Reply
    Tags: Exascale computing

    From MIT News: “Project to elucidate the structure of atomic nuclei at the femtoscale” 

    From MIT News

    July 6, 2018
    Scott Morley | Laboratory for Nuclear Science

    The image is an artist’s visualization of a nucleus as studied in numerical simulations, created using DeepArt neural network visualization software. Image courtesy of the Laboratory for Nuclear Science.

    Laboratory for Nuclear Science project selected to explore machine learning for lattice quantum chromodynamics.

    The Argonne Leadership Computing Facility (ALCF), a U.S. Department of Energy (DOE) Office of Science User Facility, has selected 10 data science and machine learning projects for its Aurora Early Science Program (ESP). Set to be the nation’s first exascale system upon its expected 2021 arrival, Aurora will be capable of performing a quintillion calculations per second, making it 10 times more powerful than the fastest computer that currently exists.

    Depiction of ANL ALCF Cray Shasta Aurora supercomputer

    The Aurora ESP, which commenced with 10 simulation-based projects in 2017, is designed to prepare key applications, libraries, and infrastructure for the architecture and scale of the exascale supercomputer. Researchers in the Laboratory for Nuclear Science’s Center for Theoretical Physics have been awarded funding for one of the projects under the ESP. Associate professor of physics William Detmold, assistant professor of physics Phiala Shanahan, and principal research scientist Andrew Pochinsky will use new techniques developed by the group, coupling novel machine learning approaches and state-of-the-art nuclear physics tools, to study the structure of nuclei.

    Shanahan, who began as an assistant professor at MIT this month, says that the support and early access to frontier computing that the award provides will allow the group to study the possible interactions of dark matter particles with nuclei from our fundamental understanding of particle physics for the first time, providing critical input for experimental searches aiming to unravel the mysteries of dark matter while simultaneously giving insight into fundamental particle physics.

    “Machine learning coupled with the exascale computational power of Aurora will enable spectacular advances in many areas of science,” Detmold adds. “Combining machine learning to lattice quantum chromodynamics calculations of the strong interactions between the fundamental particles that make up protons and nuclei, our project will enable a new level of understanding of the femtoscale world.”

    See the full article here.

    MIT Seal

    The mission of MIT is to advance knowledge and educate students in science, technology, and other areas of scholarship that will best serve the nation and the world in the twenty-first century. We seek to develop in each member of the MIT community the ability and passion to work wisely, creatively, and effectively for the betterment of humankind.

    MIT Campus

  • richardmitnick 11:18 am on June 3, 2018 Permalink | Reply
    Tags: Exascale computing

    From Science Node: “Full speed ahead” 

    From Science Node

    23 May, 2018
    Kevin Jackson

    US Department of Energy recommits to the exascale race.


    The US was once a leader in supercomputing, having created the first high-performance computer (HPC) in 1964. But as of November 2017, TOP500 ranked Titan, the fastest American-made supercomputer, only fifth on its list of the most powerful machines in the world. In contrast, China holds the first and second spots by a whopping margin.

    ORNL Cray Titan XK7 Supercomputer

    Sunway TaihuLight, China

    Tianhe-2 supercomputer China

    But it now looks like the US Department of Energy (DoE) is ready to commit to taking back those top spots. In a CNN opinion article, Secretary of Energy Rick Perry proclaims that “the future is in supercomputers,” and we at Science Node couldn’t agree more. To get a better understanding of the DoE’s plans, we sat down for a chat with Under Secretary for Science Paul Dabbar.

    Why is it important for the federal government to support HPC rather than leaving it to the private sector?

    A significant amount of the Office of Science and the rest of the DoE has had and will continue to have supercomputing needs. The Office of Science produces tremendous amounts of data like at Argonne, and all of our national labs produce data of increasing volume. Supercomputing is also needed in our National Nuclear Security Administration (NNSA) mission, which fulfills very important modeling needs for Department of Defense (DoD) applications.

    But to Secretary Perry’s point, we’re increasingly seeing a number of private sector organizations building their own supercomputers based on what we had developed and built a few generations ago that are now used for a broad range of commercial purposes.

    At the end of the day, we know that a secondary benefit of this push is that we’re providing the impetus for innovation within supercomputing.

    We assist the broader American economy by helping to support science and technology innovation within supercomputing.

    How are supercomputers used for national security?

    The NNSA arm, which is one of the three major arms of the three Under Secretaries here at the department, is our primary area of support for the nation’s defense. And as various testing treaties came into play over time, having the computing capacity to conduct proper testing and security of our stockpiled weapons was key. And that’s why if you look at our three exascale computers that we’re in the process of executing, two of them are on behalf of the Office of Science and one of them is on behalf of the NNSA.

    One of these three supercomputers is the Aurora exascale machine currently being built at Argonne National Laboratory, which Secretary Perry believes will be finished in 2021. Where did this timeline come from, and why Argonne?

    Argonne National Laboratory ALCF

    ANL ALCF Cetus IBM supercomputer

    ANL ALCF Theta Cray supercomputer

    ANL ALCF MIRA IBM Blue Gene Q supercomputer at the Argonne Leadership Computing Facility

    Depiction of ANL ALCF Cray Shasta Aurora supercomputer

    There was a group put together across different areas of DoE, primarily the Office of Science and NNSA. When we decided to execute on building the next wave of top global supercomputers, an internal consortium named the Collaboration of Oak Ridge, Argonne, and Livermore (CORAL) was formed.

    That consortium developed exactly how to fund the technologies, how to issue requests, and what the target capabilities for the machines should be. The 2021 timeline was based on the CORAL group, the labs, and the consortium in conjunction with the Department of Energy headquarters here, the Office of Advanced Computing, and ultimately talking with the suppliers.

    The reason Argonne was selected for the first machine was that they already have a leadership computing facility there. They have a long history of other machines of previous generations, and they were already in the process of building out an exascale machine. So they were already looking at architecture issues, talking with Intel and others on what could be accomplished, and taking a look at how they can build on what they already had in terms of their capabilities and physical plant and user facilities.

    Why now? What’s motivating the push for HPC excellence at this precise moment?

    A lot of this is driven by where the technology is and where the capabilities are for suppliers and the broader HPC market. We’re part of a constant dialogue with the Nvidias, Intels, IBMs, and Crays of the world in what we think is possible in terms of the next step in supercomputing.

    Why now? The technology is available now, and the need is there for us considering the large user facilities coming online across the whole of the national lab complex and the need for stronger computing power.

    The history of science, going back to the late 1800s and early 1900s, was about competition along strings of types of research, whether it was chemistry or physics. If you take any of the areas of science, including high-performance computing, anything that’s being done by anyone out there along any of these strings causes us all to move along. However, we at the DoE believe America must and should be in the lead of scientific advances across all different areas, and certainly in the area of computing.

    See the full article here.


    Science Node is an international weekly online publication that covers distributed computing and the research it enables.

    “We report on all aspects of distributed computing technology, such as grids and clouds. We also regularly feature articles on distributed computing-enabled research in a large variety of disciplines, including physics, biology, sociology, earth sciences, archaeology, medicine, disaster management, crime, and art. (Note that we do not cover stories that are purely about commercial technology.)

    In its current incarnation, Science Node is also an online destination where you can host a profile and blog, and find and disseminate announcements and information about events, deadlines, and jobs. In the near future it will also be a place where you can network with colleagues.

    You can read Science Node via our homepage, RSS, or email. For the complete iSGTW experience, sign up for an account or log in with OpenID and manage your email subscription from your account preferences. If you do not wish to access the website’s features, you can just subscribe to the weekly email.”

  • richardmitnick 9:26 am on January 10, 2018 Permalink | Reply
    Tags: Exascale computing

    From HPC Wire: “Momentum Builds for US Exascale” 

    HPC Wire

    January 9, 2018
    Alex R. Larzelere


    2018 looks to be a great year for the U.S. exascale program. The last several months of 2017 revealed a number of important developments that help put the U.S. quest for exascale on a solid foundation. In my last article, I provided a description of the elements of the High Performance Computing (HPC) ecosystem and its importance for advancing and sustaining this strategically important technology. It is good to report that the U.S. exascale program seems to be hitting the full range of ecosystem elements.

    As a reminder, the National Strategic Computing Initiative (NSCI) assigned the U.S. Department of Energy (DOE) Office of Science (SC) and the National Nuclear Security Administration (NNSA) to execute a joint program to deliver capable exascale computing that emphasizes sustained performance on relevant applications and analytic computing to support their missions. The overall DOE program is known as the Exascale Computing Initiative (ECI) and is funded by the SC Advanced Scientific Computing Research (ASCR) program and the NNSA Advanced Simulation and Computing (ASC) program.

    Elements of the ECI include the procurement of exascale class systems and the facility investments in site preparations and non-recurring engineering. Also, ECI includes the Exascale Computing Project (ECP) that will conduct the Research and Development (R&D) in the areas of middleware (software stack), applications, and hardware to ensure that exascale systems will be productively usable to address Office of Science and NNSA missions.

    In the area of hardware – the last part of 2017 revealed a number of important developments. First and most visible, is the initial installation of the SC Summit system at Oak Ridge National Laboratory (ORNL) and the NNSA Sierra system at Lawrence Livermore National Laboratory (LLNL).

    ORNL IBM Summit Supercomputer

    LLNL IBM Sierra ATS2 supercomputer

    Both systems are being built by IBM using Power9 processors with Nvidia GPU co-processors. The machines will have two Power9 CPUs per system board and will use a Mellanox InfiniBand interconnection network.

    Beyond that, the architecture of each machine is slightly different. The ORNL Summit machine will use six Nvidia Volta GPUs per two Power9 CPUs on a system board and will use NVLink to connect to 512 GB of memory. The Summit machine will use a combination of air and water cooling. The LLNL Sierra machine will use four Nvidia Voltas and 256 GB of memory connected with the two Power9 CPUs per board. The Sierra machine will use only air cooling. As was reported by HPCwire in November 2017, the peak performance of the Summit machine will be about 200 petaflops and the Sierra machine is expected to be about 125 petaflops.

    Installation of both the Summit and Sierra systems is currently underway with about 279 racks (without system boards) and the interconnection network already installed at each lab. Now that IBM has formally released the Power9 processors, the racks will soon start being populated with the boards that contain the CPUs, GPUs and memory. Once that is completed, the labs will start their acceptance testing, which is expected to be finished later in 2018.

    Another important piece of news about the DOE exascale program is the clarification of the status of the Argonne National Laboratory (ANL) Aurora machine.

    Depiction of ANL ALCF Cray Shasta Aurora supercomputer

    This system was part of the collaborative CORAL procurement that also selected the Sierra and Summit machines. The Aurora system is being manufactured by Intel with Cray Inc. acting as the system integrator. The machine was originally scheduled to be an approximately 180 peak petaflops system using the Knights Hill third generation Phi processors. However, during SC17, we learned that Intel is removing the Knights Hill chip from its roadmap. This explains why, during the September ASCR Advisory Committee (ASCAC) meeting, Barb Helland, the Associate Director of the ASCR office, announced that the Aurora system would be delayed to 2021 and upgraded to 1,000 petaflops (aka 1 exaflops).
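
    As a rough illustration of the scale change described above, the following Python sketch converts the approximate peak figures quoted in this article to a common unit and compares them. The dictionary and function names are hypothetical, chosen only for this illustration, and the comparison covers quoted peak numbers only, not sustained application performance.

```python
# Approximate peak performance figures quoted in the article, in petaflops.
# The names below are illustrative only, not part of any official tool.
PEAK_PETAFLOPS = {
    "Summit": 200.0,
    "Sierra": 125.0,
    "Aurora (original)": 180.0,
    "Aurora (revised)": 1000.0,
}

PETAFLOPS_PER_EXAFLOPS = 1000.0  # 1 exaflops = 1,000 petaflops


def to_exaflops(petaflops: float) -> float:
    """Convert a peak figure from petaflops to exaflops."""
    return petaflops / PETAFLOPS_PER_EXAFLOPS


def speedup(new_system: str, old_system: str) -> float:
    """Return the ratio of the quoted peak performance of two systems."""
    return PEAK_PETAFLOPS[new_system] / PEAK_PETAFLOPS[old_system]


if __name__ == "__main__":
    for name, pf in PEAK_PETAFLOPS.items():
        print(f"{name}: {pf:g} petaflops = {to_exaflops(pf):g} exaflops")
    ratio = speedup("Aurora (revised)", "Aurora (original)")
    print(f"Revised Aurora target is about {ratio:.1f}x the original target")
```

    On these quoted figures, the revised Aurora target is between five and six times the original 180 petaflops target, and five times the expected peak of Summit.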

    The full details of the revised Aurora system are still under wraps. We have learned that it is going to use “novel” processor technologies, but exactly what that means is unclear. The ASCR program subjected the new Aurora design to an independent outside review. It found, “The hardware choices/design within the node is extremely well thought through. Early projections suggest that the system will support a broad workload.” The review committee even suggested that, “The system as presented is exciting with many novel technology choices that can change the way computing is done.” The Aurora system is in the process of being “re-baselined” by the DOE. Hopefully, once that is complete, we will get a better understanding of the meaning of “novel” technologies. If things go as expected, the changes to Aurora will allow the U.S. to achieve exascale by 2021.

    An important, but sometimes overlooked, aspect of the U.S. exascale program is the number of computing systems that are being procured, tested and optimized by the ASCR and ASC programs as part of the buildup to exascale. Other “pre-exascale” systems include the 8.6 petaflops Mira computer at ANL and the 14 petaflops Cori system at Lawrence Berkeley National Lab (LBNL).

    ANL ALCF MIRA IBM Blue Gene Q supercomputer at the Argonne Leadership Computing Facility

    NERSC Cray Cori II supercomputer at NERSC at LBNL

    The NNSA also has the 14.1 petaflops Trinity system at Los Alamos National Lab (LANL). Up to 20 percent of these precursor machines will serve as testbeds to enable computing science R&D needed to ensure that the U.S. exascale systems will be able to productively address important national security and discovery science objectives.

    The last, but certainly not least, bit of hardware news is that the ASCR and ASC programs are expected to start their next computer system procurement processes in early 2018. During her presentation to the U.S. Consortium for the Advancement of Supercomputing (USCAS), Barb Helland told the group that she expects that the Request for Proposals (RFP) will soon be released for the follow-ons to the Summit and Sierra systems. These systems, to be delivered in the 2021-2023 timeframe, are expected to provide in excess of one exaFLOP/s of performance. The procurement process to be used will be similar to the CORAL procurement and will be a collaboration between the DOE-SC ASCR and NNSA ASC programs. The ORNL exascale system will be called Frontier and the LLNL system will be known as El Capitan.

    2017 also saw significant developments for the people element of the U.S. HPC ecosystem. As was previously reported, at last September’s ASCAC meeting, Paul Messina announced that he would be stepping down as the ECP Director on October 1st. Doug Kothe, who was previously the applications development lead, was announced as the new ECP Director. Upon taking the Director job, Kothe, with his deputy, Stephen Lee of LANL, instituted a process to review the organization and management of the ECP. At the December ASCAC conference call, Kothe reported that the review had been completed and resulted in a number of changes. These included paring down ECP from five to four components (applications development, software technology, hardware and integration, and project management). He also reported that ECP has implemented a more structured management approach that includes a revised work breakdown structure (WBS) and additional milestones, new key performance parameters and risk management approaches. Finally, the new ECP Director reported that they had established an Extended Leadership Team with a number of new faces.

    Another important element of the HPC ecosystem is the people doing the R&D and other work needed to keep the ecosystem going. The DOE ECI involves a huge number of people. Last year, about 500 researchers attended the ECP Principal Investigator meeting, and many more are involved in other DOE/NNSA programs and from industry. The ASCR and ASC programs are involved with a number of programs to educate and train future members of the HPC ecosystem. Such programs include the ASCR and ASC co-funded Computational Science Graduate Fellowship (CSGF) and the Early Career Research Program. The NNSA offers similar opportunities. Both the ASCR and ASC programs continue to coordinate with National Science Foundation educational programs to ensure that America’s top computational science talent continues to flow into the ecosystem.

    Finally, in addition to people and hardware, the U.S. program continues to develop the software stack (aka middleware) and end users’ applications to ensure that exascale will be used productively. Doug Kothe reported that ECP has adopted standard Software Development Kits. These SDKs are designed to support the goal of building a comprehensive, coherent software stack that enables application developers to productively write highly parallel applications that effectively target diverse exascale architectures. Kothe also reported that ECP is making good progress in developing applications software. This includes the implementation of innovative approaches, including Machine Learning, to utilize the GPUs that are part of the future exascale computers.

    All in all – the last several months of 2017 have set the stage for a very exciting 2018 for the U.S. exascale program. It has been about 5 years since the ORNL Titan supercomputer came onto the stage at #1 on the TOP500 list.

    ORNL Cray XK7 Titan Supercomputer

    Over that time, other more powerful DOE computers have come online (Trinity, Cori, etc.) but they were overshadowed by Chinese and European systems.

    LANL Cray XC30 Trinity supercomputer

    It remains unclear whether or not the upcoming exascale systems will put the U.S. back on the top of the supercomputing world. However, the recent developments help to reassure us that the country is not going to give up its computing leadership position without a fight. That is great news because for more than 60 years, the U.S. has sought leadership in high performance computing for the strategic value it provides in the areas of national security, discovery science, energy security, and economic competitiveness.

    See the full article here .

    Please help promote STEM in your local schools.

    STEM Icon

    Stem Education Coalition

    HPCwire is the #1 news and information resource covering the fastest computers in the world and the people who run them. With a legacy dating back to 1987, HPC has enjoyed a legacy of world-class editorial and topnotch journalism, making it the portal of choice selected by science, technology and business professionals interested in high performance and data-intensive computing. For topics ranging from late-breaking news and emerging technologies in HPC, to new trends, expert analysis, and exclusive features, HPCwire delivers it all and remains the HPC communities’ most reliable and trusted resource. Don’t miss a thing – subscribe now to HPCwire’s weekly newsletter recapping the previous week’s HPC news, analysis and information at: http://www.hpcwire.com.

  • richardmitnick 11:23 am on October 9, 2017
    Tags: Exascale computing

    From Science Node: “US Coalesces Plans for First Exascale Supercomputer: Aurora in 2021” 

    Science Node

    September 27, 2017
    Tiffany Trader

    ANL ALCF Cray Aurora supercomputer

    At the Advanced Scientific Computing Advisory Committee (ASCAC) meeting, in Arlington, Va., yesterday (Sept. 26), it was revealed that the “Aurora” supercomputer is on track to be the United States’ first exascale system. Aurora, originally named as the third pillar of the CORAL “pre-exascale” project, will still be built by Intel and Cray for Argonne National Laboratory, but the delivery date has shifted from 2018 to 2021 and target capability has been expanded from 180 petaflops to 1,000 petaflops (1 exaflop).


    The fate of the Argonne Aurora “CORAL” supercomputer has been in limbo since the system failed to make it into the U.S. DOE budget request, while the same budget proposal called for an exascale machine “of novel architecture” to be deployed at Argonne in 2021.

    Until now, the only official word from the U.S. Exascale Computing Project was that Aurora was being “reviewed for changes and would go forward under a different timeline.”

    Officially, the contract has been “extended,” and not cancelled, but the fact remains that the goal of the Collaboration of Oak Ridge, Argonne, and Lawrence Livermore (CORAL) initiative to stand up two distinct pre-exascale architectures was not met.

    According to sources we spoke with, a number of people at the DOE are not pleased with the Intel/Cray (Intel is the prime contractor, Cray is the subcontractor) partnership. It’s understood that the two companies could not deliver on the 180-200 petaflops system by next year, as the original contract called for. Now Intel/Cray will push forward with an exascale system that is some 50x larger than any they have stood up.

    It’s our understanding that the cancellation of Aurora is not a DOE budgetary measure as has been speculated, and that the DOE and Argonne wanted Aurora. Although it was referred to as an “interim,” or “pre-exascale” machine, the scientific and research community was counting on that system, was eager to begin using it, and regarded it as a valuable system in its own right. The non-delivery is regarded as disruptive to the scientific/research communities.

    Another question we have is that since Intel/Cray failed to deliver Aurora, and have moved on to a larger exascale system contract, why hasn’t their original CORAL contract been cancelled and put out again to bid?

    With increased global competitiveness, it seems that the DOE stakeholders did not want to further delay the non-IBM/Nvidia side of the exascale track. Conceivably, they could have done a rebid for the Aurora system, but that would leave them with an even bigger gap if they had to spin up a new vendor/system supplier to replace Intel and Cray.

    Starting the bidding process over again would delay progress toward exascale – and it might even have been the death knell for exascale by 2021, but Intel and Cray now have a giant performance leap to make and three years to do it. There is an open question on the processor front as the retooled Aurora will not be powered by Phi/Knights Hill as originally proposed.

    These events raise the question of whether the IBM-led effort, with IBM/Nvidia/Mellanox, is looking very good by comparison. The other CORAL thrusts (Summit at Oak Ridge and Sierra at Lawrence Livermore) are on track, with Summit several weeks ahead of Sierra, although it is looking like neither will make the cut-off for entry onto the November Top500 list as many had speculated.

    ORNL IBM Summit supercomputer depiction

    LLNL IBM Sierra supercomputer

    We reached out to representatives from Cray, Intel and the Exascale Computing Project (ECP) seeking official comment on the revised Aurora contract. Cray and Intel declined to comment and we did not hear back from ECP by press time. We will update the story as we learn more.

    See the full article here .


    Science Node is an international weekly online publication that covers distributed computing and the research it enables.

