From HPC Wire: “IBM Demonstrates Deep Neural Network Training with Analog Memory Devices”


June 18, 2018
Oliver Peckham

Crossbar arrays of non-volatile memories can accelerate the training of fully connected neural networks by performing computation at the location of the data. (Source: IBM)

From smarter, more personalized apps to seemingly ubiquitous Google Assistant and Alexa devices, AI adoption is showing no signs of slowing down – and yet, the hardware used for AI is far from perfect. Currently, GPUs and other digital accelerators are used to speed the processing of deep neural network (DNN) tasks – but all of those systems effectively waste time and energy shuttling data back and forth between memory and processing. As the scale of AI applications continues to increase, those cumulative losses are becoming massive.

In a paper published this month in Nature, IBM researchers demonstrate DNN training on analog memory devices that, they report, achieves accuracy equivalent to that of a GPU-accelerated system. The paper’s authors are Stefano Ambrogio, Pritish Narayanan, Hsinyu Tsai, Robert M. Shelby, Irem Boybat, Carmelo di Nolfo, Severin Sidler, Massimo Giordano, Martina Bodini, Nathan C. P. Farinha, Benjamin Killeen, Christina Cheng, Yassine Jaoudi, and Geoffrey W. Burr. IBM’s solution performs DNN calculations right where the data are located, storing and adjusting weights in memory, which conserves energy and improves speed.

Analog computing, which uses variable signals rather than binary signals, is rarely employed in modern computing due to inherent limits on precision. IBM’s researchers, building on a growing understanding that DNN models operate effectively at lower precision, decided to attempt an accurate approach to analog DNNs.

The research team says it was able to accelerate key training algorithms, notably the backpropagation algorithm, using analog non-volatile memories (NVM). Writing for the IBM blog, lead author Stefano Ambrogio explains:

“These memories allow the “multiply-accumulate” operations used throughout these algorithms to be parallelized in the analog domain, at the location of weight data, using underlying physics. Instead of large circuits to multiply and add digital numbers together, we simply pass a small current through a resistor into a wire, and then connect many such wires together to let the currents build up. This lets us perform many calculations at the same time, rather than one after the other. And instead of shipping digital data on long journeys between digital memory chips and processing chips, we can perform all the computation inside the analog memory chip.”
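The current-summing scheme Ambrogio describes can be illustrated numerically. The following is a minimal Python sketch, not IBM's implementation: it assumes ideal resistors and lossless wires, and the function name `crossbar_mac` and all conductance and voltage values are made up for illustration. Each device contributes a current equal to its conductance times the applied voltage (Ohm's law), and the currents that share a wire add up (Kirchhoff's current law), which is exactly a multiply-accumulate.

```python
# Illustrative model of an analog crossbar multiply-accumulate.
# Assumptions (not from the article): ideal resistors, no wire resistance,
# and hypothetical conductance and voltage values.

def crossbar_mac(conductances, voltages):
    """Return the total current (amperes) on each column wire.

    conductances[i][j] is the conductance (siemens) of the device joining
    input row i to output column j; voltages[i] is the voltage on row i.
    Ohm's law gives each device current as G * V, and Kirchhoff's current
    law sums the currents that share a column wire.
    """
    n_rows = len(conductances)
    n_cols = len(conductances[0])
    if len(voltages) != n_rows:
        raise ValueError("need exactly one voltage per row")
    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            # One multiply (G * V) and one accumulate (sum on the wire).
            currents[j] += conductances[i][j] * voltages[i]
    return currents


if __name__ == "__main__":
    # Hypothetical 3-input, 2-output array.
    G = [[1e-6, 2e-6],
         [3e-6, 4e-6],
         [5e-6, 6e-6]]
    V = [0.2, 0.1, 0.0]
    print(crossbar_mac(G, V))
```

The nested loops here run one step at a time only because this is a software simulation. In the analog array described in the quote, every device conducts simultaneously, so all of the products and sums are formed at once.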

The authors note that their mixed hardware-software approach achieves classification accuracies equivalent to those of purely software-based training using TensorFlow, despite the imperfections of existing analog memory devices. Writes Ambrogio:

“By combining long-term storage in phase-change memory (PCM) devices, near-linear update of conventional Complementary Metal-Oxide Semiconductor (CMOS) capacitors and novel techniques for cancelling out device-to-device variability, we finessed these imperfections and achieved software-equivalent DNN accuracies on a variety of different networks. These experiments used a mixed hardware-software approach, combining software simulations of system elements that are easy to model accurately (such as CMOS devices) together with full hardware implementation of the PCM devices. It was essential to use real analog memory devices for every weight in our neural networks, because modeling approaches for such novel devices frequently fail to capture the full range of device-to-device variability they can exhibit.”

Ambrogio and his team believe that their early design efforts indicate that a full implementation of the analog approach “should indeed offer equivalent accuracy, and thus do the same job as a digital accelerator – but faster and at lower power.” The team is exploring the design of prototype NVM-based accelerator chips as part of an IBM Research Frontiers Institute project.

The team estimates that it will be able to deliver chips with a computational energy efficiency of 28,065 GOP/sec/W and a throughput per area of 3.6 TOP/sec/mm². According to the researchers, this would be an improvement of two orders of magnitude over today’s GPUs.
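To put the efficiency figure in more familiar units, the short Python sketch below converts it to energy per operation. The function names are hypothetical, and the baseline calculation rests on two assumptions that the article does not spell out: that "two orders of magnitude" means a factor of exactly 100, and that the factor applies to energy efficiency.

```python
# Unit arithmetic on the energy-efficiency figure quoted in the article.
# Function names are hypothetical; the 100x reading is an assumption.

PROJECTED_GOP_PER_SEC_PER_WATT = 28_065  # figure quoted in the article


def energy_per_op_femtojoules(gop_per_sec_per_watt):
    """Convert GOP/sec/W to energy per operation in femtojoules.

    One watt is one joule per second, so 1 GOP/sec/W equals 1e9
    operations per joule. Inverting gives joules per operation, and
    multiplying by 1e15 expresses that in femtojoules.
    """
    ops_per_joule = gop_per_sec_per_watt * 1e9
    return 1e15 / ops_per_joule


def implied_baseline(value, orders_of_magnitude=2):
    """Baseline implied by an improvement of N orders of magnitude.

    Assumption: the improvement is read as exactly 10**N.
    """
    return value / 10 ** orders_of_magnitude


if __name__ == "__main__":
    fj = energy_per_op_femtojoules(PROJECTED_GOP_PER_SEC_PER_WATT)
    base = implied_baseline(PROJECTED_GOP_PER_SEC_PER_WATT)
    print(f"Energy per operation: about {fj:.1f} fJ")
    print(f"Implied GPU baseline: about {base:.0f} GOP/sec/W")
```

Under those assumptions, the projected chip would spend a few tens of femtojoules per operation, and the implied GPU baseline would be a few hundred GOP/sec/W.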

The researchers will now turn their attention to demonstrating their approach on larger networks that call for large, fully-connected layers, such as recurrently-connected Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, which have emerging utility for machine translation, captioning and text analytics. As new and better forms of analog memory are developed, they expect continued improvements in areal density and energy efficiency.



