Tagged: Machine learning

  • richardmitnick 3:20 pm on August 1, 2018
    Tags: Machine learning

    From Symmetry: “Machine learning proliferates in particle physics” 

    Symmetry Mag
    From Symmetry

    Manuel Gnida


    A new review in Nature chronicles the many ways machine learning is popping up in particle physics research.

    Experiments at the Large Hadron Collider produce about a million gigabytes of data every second.


    CERN map

    CERN LHC Tunnel

    CERN LHC particles

    Even after reduction and compression, the data amassed in just one hour at the LHC is similar to the data volume Facebook collects in an entire year.

    Luckily, particle physicists don’t have to deal with all of that data all by themselves. They partner with a form of artificial intelligence that learns how to do complex analyses on its own, called machine learning.

    “Compared to a traditional computer algorithm that we design to do a specific analysis, we design a machine learning algorithm to figure out for itself how to do various analyses, potentially saving us countless man-hours of design and analysis work,” says College of William & Mary physicist Alexander Radovic, who works on the NOvA neutrino experiment.

    FNAL NOvA detector in northern Minnesota

    FNAL/NOvA experiment map

    Radovic and a group of researchers summarize current applications and future prospects of machine learning in particle physics in a paper published today in Nature.

    Sifting through big data

    To handle the gigantic data volumes produced in modern experiments like the ones at the LHC, researchers apply what they call “triggers”—dedicated hardware and software that decide in real time which data to keep for analysis and which data to toss out.

    In LHCb, an experiment that could shed light on why there is so much more matter than antimatter in the universe, machine learning algorithms make at least 70 percent of these decisions, says LHCb scientist Mike Williams from the Massachusetts Institute of Technology, one of the authors of the Nature summary.

    CERN LHCb chamber, LHC

    CERN/LHCb detector

    “Machine learning plays a role in almost all data aspects of the experiment, from triggers to the analysis of the remaining data,” he says.

    Machine learning has proven extremely successful in the area of analysis. The gigantic ATLAS and CMS detectors at the LHC, which enabled the discovery of the Higgs boson, each have millions of sensing elements whose signals need to be put together to obtain meaningful results.


    CERN/CMS Detector

    “These signals make up a complex data space,” says Michael Kagan of the US Department of Energy’s SLAC National Accelerator Laboratory, who works on ATLAS and was also an author on the Nature review. “We need to understand the relationship between them to come up with conclusions—for example, that a certain particle track in the detector was produced by an electron, a photon or something else.”

    Neutrino experiments also benefit from machine learning. NOvA [above], which is managed by Fermi National Accelerator Laboratory, studies how neutrinos change from one type to another as they travel through the Earth. These neutrino oscillations could potentially reveal the existence of a new neutrino type that some theories predict to be a particle of dark matter. NOvA’s detectors are watching out for charged particles produced when neutrinos hit the detector material, and machine learning algorithms identify them.

    From machine learning to deep learning

    Recent developments in machine learning, often called “deep learning,” promise to take applications in particle physics even further. Deep learning typically refers to the use of neural networks: computer algorithms with an architecture inspired by the dense network of neurons in the human brain.

    These neural nets learn on their own how to perform certain analysis tasks during a training period in which they are shown sample data, such as simulations, and are told how well they performed.

    Until recently, the success of neural nets was limited because training them used to be very hard, says co-author Kazuhiro Terao, a SLAC researcher working on the MicroBooNE neutrino experiment, which studies neutrino oscillations as part of Fermilab’s short-baseline neutrino program and will become a component of the future Deep Underground Neutrino Experiment at the Long-Baseline Neutrino Facility.


    FNAL LBNF/DUNE from FNAL to SURF, Lead, South Dakota, USA

    “These difficulties limited us to neural networks that were only a couple of layers deep,” he says. “Thanks to advances in algorithms and computing hardware, we now know much better how to build and train more capable networks hundreds or thousands of layers deep.”

    Many of the advances in deep learning are driven by tech giants’ commercial applications and the data explosion they have generated over the past two decades. “NOvA, for example, uses a neural network inspired by the architecture of GoogLeNet,” Radovic says. “It improved the experiment in ways that otherwise could have only been achieved by collecting 30 percent more data.”

    A fertile ground for innovation

    Machine learning algorithms become more sophisticated and fine-tuned day by day, opening up unprecedented opportunities to solve particle physics problems.

    Many of the new tasks they could be used for are related to computer vision, Kagan says. “It’s similar to facial recognition, except that in particle physics, image features are more abstract and complex than ears and noses.”

    Some experiments like NOvA and MicroBooNE produce data that can easily be translated into actual images, and AI can be readily used to identify features in them. In LHC experiments, on the other hand, images first need to be reconstructed from a murky pool of data generated by millions of sensor elements.

    “But even if the data don’t look like images, we can still use computer vision methods if we’re able to process the data in the right way,” Radovic says.

    One area where this approach could be very useful is the analysis of particle jets produced in large numbers at the LHC. Jets are narrow sprays of particles whose individual tracks are extremely challenging to separate. Computer vision technology could help identify features in jets.

    Another emerging application of deep learning is the simulation of particle physics data that predict, for example, what happens in particle collisions at the LHC and can be compared to the actual data. Simulations like these are typically slow and require immense computing power. AI, on the other hand, could do simulations much faster, potentially complementing the traditional approach.

    “Just a few years ago, nobody would have thought that deep neural networks can be trained to ‘hallucinate’ data from random noise,” Kagan says. “Although this is very early work, it shows a lot of promise and may help with the data challenges of the future.”

    Benefiting from healthy skepticism

    Despite all obvious advances, machine learning enthusiasts frequently face skepticism from their collaboration partners, in part because machine learning algorithms mostly work like “black boxes” that provide very little information about how they reached a certain conclusion.

    “Skepticism is very healthy,” Williams says. “If you use machine learning for triggers that discard data, like we do in LHCb, then you want to be extremely cautious and set the bar very high.”

    Therefore, establishing machine learning in particle physics requires constant efforts to better understand the inner workings of the algorithms and to do cross-checks with real data whenever possible.

    “We should always try to understand what a computer algorithm does and always evaluate its outcome,” Terao says. “This is true for every algorithm, not only machine learning. So, being skeptical shouldn’t stop progress.”

    Rapid progress has some researchers dreaming of what could become possible in the near future. “Today we’re using machine learning mostly to find features in our data that can help us answer some of our questions,” Terao says. “Ten years from now, machine learning algorithms may be able to ask their own questions independently and recognize when they find new physics.”

    See the full article here .


    Please help promote STEM in your local schools.

    Stem Education Coalition

    Symmetry is a joint Fermilab/SLAC publication.

  • richardmitnick 12:18 pm on July 17, 2018
    Tags: Machine learning

    From Symmetry: “Rise of the machines” 

    Symmetry Mag
    From Symmetry

    Sarah Charley

    Machine learning will become an even more important tool when scientists upgrade to the High-Luminosity Large Hadron Collider.

    Artwork by Sandbox Studio, Chicago

    When do a few scattered dots become a line? And when does that line become a particle track? For decades, physicists have been asking these kinds of questions. Today, so are their machines.

    Machine learning is the process by which the task of pattern recognition is outsourced to a computer algorithm. Humans are naturally very good at finding and processing patterns. That’s why you can instantly recognize a song from your favorite band, even if you’ve never heard it before.

    Machine learning takes this very human process and puts computing power behind it. Whereas a human might be able to recognize a band based on a variety of attributes such as the vocal tenor of the lead singer, a computer can process other subtle features a human might miss. The music-streaming platform Pandora categorizes every piece of music in terms of 450 different auditory qualities.

    “Machines can handle a lot more information than our brains can,” says Eduardo Rodrigues, a physicist at the University of Cincinnati. “It’s why they can find patterns that are sometimes invisible to us.”

    Machine learning started to become commonplace in computing during the 1980s, and LHC physicists have been using it routinely to help manage and process raw data since 2012. Now, with upgrades to what is already the world’s most powerful particle accelerator looming on the horizon, physicists are implementing new applications of machine learning to help them with the imminent data deluge.

    “The high-luminosity upgrade to the LHC is going to increase our amount of data by a factor of 100 relative to that used to discover the Higgs,” says Peter Elmer, a physicist at Princeton University. “This will help us search for rare particles and new physics, but if we’re not prepared, we risk being completely swamped with data.”

    Only a small fraction of the LHC’s collisions are interesting to scientists. For instance, Higgs bosons are born in roughly one out of every 2 billion proton-proton collisions. Machine learning is helping scientists to sort through the noise and isolate what’s truly important.

    “It’s like mining for rare gems,” Rodrigues says. “Keeping all the sand and pebbles would be ridiculous, so we use algorithms to help us single out the things that look interesting. With machine learning, we can purify the sample even further and more efficiently.”

    LHC physicists use a kind of machine learning called supervised learning. The principle behind supervised learning is nothing new; in fact, it’s how most of us learn how to read and write. Physicists start by training their machine-learning algorithms with data from collisions that are already well-understood. They tell them, “This is what a Higgs looks like. This is what a particle with a bottom quark looks like.”

    After giving an algorithm all of the information they already know about hundreds of examples, physicists then pull back and task the computer with identifying the particles in collisions without labels. They monitor how well the algorithm performs and give corrections along the way. Eventually, the computer needs only minimal guidance and can become even better than humans at analyzing the data.
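
    The train-then-classify loop described above can be illustrated with a deliberately tiny sketch. It is not the method used by any LHC experiment: the two numeric features, the labels "signal" and "background", and the nearest-centroid rule are all assumptions chosen to keep the example short.

```python
from collections import defaultdict


def train_centroids(examples):
    """Training step: average the feature vectors seen for each label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def classify(centroids, features):
    """Prediction step: pick the label whose centroid is closest."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], features))


# Hypothetical labeled events; the feature values are made up for illustration.
labeled_events = [
    ((0.9, 0.8), "signal"),
    ((0.8, 0.9), "signal"),
    ((0.2, 0.1), "background"),
    ((0.1, 0.3), "background"),
]

centroids = train_centroids(labeled_events)

# An unlabeled event is assigned the label of the nearest learned centroid.
prediction = classify(centroids, (0.85, 0.75))
print(prediction)
```

    Real analyses use far richer features and models, but the division of labor is the same: labeled examples first, unlabeled data afterwards.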

    “This is saving the LHCb experiment a huge amount of time,” Rodrigues says. “In the past, we needed months to make sense of our raw detector data. With machine learning, we can now process and label events within the first few hours after we record them.”

    Not only is machine learning helping physicists understand their real data, but it will soon help them create simulations to test their predictions from theory as well.

    Using algorithms in the absence of machine learning, scientists have created virtual versions of their detectors with all the known laws of physics pre-programmed.

    “The virtual experiment follows the known laws of physics to a T,” Elmer says. “We simulate proton-proton collisions and then predict how the byproducts will interact with every part of our detector.”

    If scientists find a consistent discrepancy between the virtual data generated by their simulations and the real data recorded by their detectors, it could mean that the particles in the real world are playing by a different set of rules than the ones physicists already know.

    A weakness of scientists’ current simulations is that they’re too slow. They use series of algorithms to precisely calculate how a particle will interact with every detector part it bumps into while moving through the many layers of a particle detector.

    Even though it takes only a few minutes to simulate a collision this way, scientists need to simulate trillions of collisions to cover the possible outcomes of the 600 million collisions per second they will record with the HL-LHC.
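
    A back-of-the-envelope calculation shows why this adds up. The article gives only "a few minutes" and "trillions," so the sketch below assumes 2 minutes per collision and 1 trillion collisions, both illustrative lower-end guesses rather than figures from the experiments.

```python
# Assumptions (not from the article): concrete stand-ins for
# "a few minutes" and "trillions of collisions".
MINUTES_PER_COLLISION = 2
COLLISIONS_TO_SIMULATE = 1e12
MINUTES_PER_YEAR = 60 * 24 * 365

# Total compute if every collision were simulated one after another.
cpu_years = MINUTES_PER_COLLISION * COLLISIONS_TO_SIMULATE / MINUTES_PER_YEAR


def wall_clock_years(total_cpu_years, cores):
    """Elapsed time if the work were spread perfectly across `cores` processors."""
    return total_cpu_years / cores


print(f"{cpu_years:,.0f} CPU-years on a single core")
print(f"{wall_clock_years(cpu_years, 1_000_000):.1f} years on a million cores")
```

    Under these assumptions the job comes to roughly 3.8 million CPU-years, which would still take years even with a million cores working in perfect parallel.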

    “We don’t have the time or resources for that,” Elmer says.

    With machine learning, on the other hand, they can generalize. Instead of calculating every single particle interaction with matter along the way, they can estimate its overall behavior based on its typical paths through the detector.

    “It’s a matter of balancing quality with quantity,” Elmer says. “We’ll still use the very precise calculations for some studies. But for others, we don’t need such high-resolution simulations for the physics we want to do.”

    Machine learning is helping scientists process more data faster. With the planned upgrades to the LHC, it could play an even larger role in the future. But it is not a silver bullet, Elmer says.

    “We still want to understand why and how all of our analyses work so that we can be completely confident in the results they produce,” he says. “We’ll always need a balance between shiny new technologies and our more traditional analysis techniques.”

    See the full article here .



  • richardmitnick 10:59 am on June 19, 2018
    Tags: Machine learning, Searching Science Data

    From Lawrence Berkeley National Lab: “Berkeley Lab Researchers Use Machine Learning to Search Science Data” 

    Berkeley Logo

    From Lawrence Berkeley National Lab

    A screenshot of image-based results in the Science Search interface. In this case, the user performed an image search for nanoparticles. (Credit: Gonzalo Rodrigo/Berkeley Lab)

    As scientific datasets increase in both size and complexity, the ability to label, filter and search this deluge of information has become a laborious, time-consuming and sometimes impossible task, without the help of automated tools.

    With this in mind, a team of researchers from the Department of Energy’s Lawrence Berkeley National Laboratory (Berkeley Lab) and UC Berkeley are developing innovative machine learning tools to pull contextual information from scientific datasets and automatically generate metadata tags for each file. Scientists can then search these files via a web-based search engine for scientific data, called Science Search, that the Berkeley team is building.

    As a proof-of-concept, the team is working with staff at Berkeley Lab’s Molecular Foundry to demonstrate the concepts of Science Search on the images captured by the facility’s instruments. A beta version of the platform has been made available to Foundry researchers.

    LBNL Molecular Foundry

    “A tool like Science Search has the potential to revolutionize our research,” says Colin Ophus, a Molecular Foundry research scientist within the National Center for Electron Microscopy (NCEM) and Science Search Collaborator. “We are a taxpayer-funded National User Facility, and we would like to make all of the data widely available, rather than the small number of images chosen for publication. However, today, most of the data that is collected here only really gets looked at by a handful of people—the data producers, including the PI (principal investigator), their postdocs or graduate students—because there is currently no easy way to sift through and share the data. By making this raw data easily searchable and shareable, via the Internet, Science Search could open this reservoir of ‘dark data’ to all scientists and maximize our facility’s scientific impact.”

    The Challenges of Searching Science Data

    This screen capture of the Science Search interface shows how users can easily validate metadata tags that have been generated via machine learning, or add information that hasn’t already been captured. (Credit: Gonzalo Rodrigo/Berkeley Lab)

    Today, search engines are ubiquitously used to find information on the Internet but searching science data presents a different set of challenges. For example, Google’s algorithm relies on more than 200 clues to achieve an effective search. These clues can come in the form of key words on a webpage, metadata in images or audience feedback from billions of people when they click on the information they are looking for. In contrast, scientific data comes in many forms that are radically different than an average web page, requires context that is specific to the science and often also lacks the metadata to provide context that is required for effective searches.

    At National User Facilities like the Molecular Foundry, researchers from all over the world apply for time and then travel to Berkeley to use extremely specialized instruments free of charge. Ophus notes that the current cameras on microscopes at the Foundry can collect up to a terabyte of data in under 10 minutes. Users then need to manually sift through this data to find quality images with “good resolution” and save that information on a secure shared file system, like Dropbox, or on an external hard drive that they eventually take home with them to analyze.

    Oftentimes, the researchers that come to the Molecular Foundry only have a couple of days to collect their data. Because it is very tedious and time consuming to manually add notes to terabytes of scientific data and there is no standard for doing it, most researchers just type shorthand descriptions in the filename. This might make sense to the person saving the file, but often doesn’t make much sense to anyone else.

    “The lack of real metadata labels eventually causes problems when the scientist tries to find the data later or attempts to share it with others,” says Lavanya Ramakrishnan, a staff scientist in Berkeley Lab’s Computational Research Division (CRD) and co-principal investigator of the Science Search project. “But with machine-learning techniques, we can have computers help with what is laborious for the users, including adding tags to the data. Then we can use those tags to effectively search the data.”

    In addition to images, Science Search can also be used to look for proposals and papers. This is a screenshot of the paper search results. (Credit: Gonzalo Rodrigo/Berkeley Lab)

    To address the metadata issue, the Berkeley Lab team uses machine-learning techniques to mine the “science ecosystem” (including instrument timestamps, facility user logs, scientific proposals, publications and file system structures) for contextual information. Together, details from these sources, such as the timestamp of the experiment, notes about the resolution and filter used, and the user’s request for time, provide critical context. The Berkeley Lab team has put together an innovative software stack that uses machine-learning techniques, including natural language processing, to pull contextual keywords about the scientific experiment and automatically create metadata tags for the data.

    For the proof-of-concept, Ophus shared with the Science Search team data that facility staff recently collected on the Molecular Foundry’s TEAM I electron microscope at NCEM.

    LBNL National Center for Electron Microscopy (NCEM)

    He also volunteered to label a few thousand images to give the machine-learning tools some labels from which to start learning. While this is a good start, Science Search co-principal investigator Gunther Weber notes that most successful machine-learning applications typically require significantly more data and feedback to deliver better results. For example, in the case of search engines like Google, Weber notes that training datasets are created and machine-learning techniques are validated when billions of people around the world verify their identity by clicking on all the images with street signs or storefronts after typing in their passwords, or on Facebook when they’re tagging their friends in an image.

    “In the case of science data only a handful of domain experts can create training sets and validate machine-learning techniques, so one of the big ongoing problems we face is an extremely small number of training sets,” says Weber, who is also a staff scientist in Berkeley Lab’s CRD.

    To overcome this challenge, the Berkeley Lab researchers used transfer learning to limit the degrees of freedom, or parameter counts, on their convolutional neural networks (CNNs). Transfer learning is a machine learning method in which a model developed for a task is reused as the starting point for a model on a second task, which allows the user to get more accurate results from a smaller training set. In the case of the TEAM I microscope, the data produced contains information about which operation mode the instrument was in at the time of collection. With that information, Weber was able to train the neural network on that classification so it could generate that mode of operation label automatically. He then froze that convolutional layer of the network, which meant he’d only have to retrain the densely connected layers. This approach effectively reduces the number of parameters on the CNN, allowing the team to get some meaningful results from their limited training data.
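
    To make the effect of freezing layers concrete, the sketch below counts trainable parameters before and after the convolutional layers are frozen. The layer sizes are hypothetical and far smaller than a real network, and the sketch assumes the convolutional output is pooled to 64 features before the dense layers; it is not the Science Search architecture.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights of a square convolution plus one bias per output channel."""
    return in_channels * out_channels * kernel * kernel + out_channels


def dense_params(inputs, outputs):
    """Weights of a fully connected layer plus one bias per output."""
    return inputs * outputs + outputs


# Hypothetical toy network: two convolutional layers, then two dense layers.
layers = [
    {"name": "conv1", "params": conv_params(1, 32, 3), "frozen": False},
    {"name": "conv2", "params": conv_params(32, 64, 3), "frozen": False},
    {"name": "dense1", "params": dense_params(64, 16), "frozen": False},
    {"name": "dense2", "params": dense_params(16, 4), "frozen": False},
]


def trainable(network):
    """Count only the parameters that training is still allowed to change."""
    return sum(layer["params"] for layer in network if not layer["frozen"])


before = trainable(layers)

# Transfer learning step: keep the pretrained convolutional weights fixed.
for layer in layers:
    if layer["name"].startswith("conv"):
        layer["frozen"] = True

after = trainable(layers)
print(before, after)
```

    In this toy case the trainable count drops from 19,924 to 1,108, which is the sense in which freezing "reduces the number of parameters": the network is the same size, but a small labeled set only has to fit the dense layers.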

    Machine Learning to Mine the Scientific Ecosystem

    In addition to generating metadata tags through training datasets, the Berkeley Lab team also developed tools that use machine-learning techniques for mining the science ecosystem for data context. For example, the data ingest module can look at a multitude of information sources from the scientific ecosystem—including instrument timestamps, user logs, proposals and publications—and identify commonalities. Tools developed at Berkeley Lab that use natural language-processing methods can then identify and rank words that give context to the data and facilitate meaningful results for users later on. The user will see something similar to the results page of an Internet search, where content with the most text matching the user’s search words will appear higher on the page. The system also learns from user queries and the search results they click on.
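
    The ranking behavior described above, where items with more matching text appear higher, can be sketched as a simple term-count score. The document names and text are invented for illustration, and Science Search itself presumably uses more sophisticated ranking than raw counts.

```python
def score(query, text):
    """Count how many times the query words occur in the text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


def search(query, documents):
    """Return names of matching documents, best match first."""
    scored = [(score(query, text), name) for name, text in documents.items()]
    hits = [(points, name) for points, name in scored if points > 0]
    # Sort by descending score; break ties alphabetically for stable output.
    return [name for points, name in sorted(hits, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical tagged items standing in for images, proposals and papers.
documents = {
    "img_001": "gold nanoparticles imaged in stem mode",
    "proposal_17": "proposal to study gold nanoparticles and nanoparticles catalysis",
    "img_002": "graphene lattice diffraction pattern",
}

results = search("gold nanoparticles", documents)
print(results)
```

    Learning from clicks, as the article mentions, would then amount to adjusting these scores based on which results users actually open.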

    Because scientific instruments are generating an ever-growing body of data, all aspects of the Berkeley team’s science search engine needed to be scalable to keep pace with the rate and scale of the data volumes being produced. The team achieved this by setting up their system in a Spin instance on the Cori supercomputer at the National Energy Research Scientific Computing Center (NERSC).


    NERSC Cray Cori II supercomputer at NERSC at LBNL, named after Gerty Cori, the first American woman to win a Nobel Prize in science

    LBL NERSC Cray XC30 Edison supercomputer

    The Genepool system is a cluster dedicated to the DOE Joint Genome Institute’s computing needs. Denovo is a smaller test system for Genepool that is primarily used by NERSC staff to test new system configurations and software.


    PDSF is a networked distributed computing cluster designed primarily to meet the detector simulation and data analysis requirements of physics, astrophysics and nuclear science collaborations.

    Spin is a Docker-based edge-services technology developed at NERSC that can access the facility’s high performance computing systems and storage on the back end.

    “One of the reasons it is possible for us to build a tool like Science Search is our access to resources at NERSC,” says Gonzalo Rodrigo, a Berkeley Lab postdoctoral researcher who is working on the natural language processing and infrastructure challenges in Science Search. “We have to store, analyze and retrieve really large datasets, and it is useful to have access to a supercomputing facility to do the heavy lifting for these tasks. NERSC’s Spin is a great platform to run our search engine that is a user-facing application that requires access to large datasets and analytical data that can only be stored on large supercomputing storage systems.”

    An Interface for Validating and Searching Data

    When the Berkeley Lab team developed the interface for users to interact with their system, they knew that it would have to accomplish a couple of objectives, including effective search and allowing human input to the machine learning models. Because the system relies on domain experts to help generate the training data and validate the machine-learning model output, the interface needed to facilitate that.

    “The tagging interface that we developed displays the original data and metadata available, as well as any machine-generated tags we have so far. Expert users then can browse the data and create new tags and review any machine-generated tags for accuracy,” says Matt Henderson, who is a Computer Systems Engineer in CRD and leads the user interface development effort.

    To facilitate an effective search for users based on available information, the team’s search interface provides a query mechanism for available files, proposals and papers that the Berkeley-developed machine-learning tools have parsed and extracted tags from. Each listed search result item represents a summary of that data, with a more detailed secondary view available, including information on tags that matched this item. The team is currently exploring how to best incorporate user feedback to improve the models and tags.

    “Having the ability to explore datasets is important for scientific breakthroughs, and this is the first time that anything like Science Search has been attempted,” says Ramakrishnan. “Our ultimate vision is to build the foundation that will eventually support a ‘Google’ for scientific data, where researchers can even search distributed datasets. Our current work provides the foundation needed to get to that ambitious vision.”

    “Berkeley Lab is really an ideal place to build a tool like Science Search because we have a number of user facilities, like the Molecular Foundry, that have decades worth of data that would provide even more value to the scientific community if the data could be searched and shared,” adds Katie Antypas, who is the principal investigator of Science Search and head of NERSC’s Data Department. “Plus we have great access to machine-learning expertise in the Berkeley Lab Computing Sciences Area as well as HPC resources at NERSC in order to build these capabilities.”

    In addition to Antypas, Ramakrishnan and Weber, UC Berkeley Computer Science Professor Joseph Hellerstein is also a principal investigator.

    This work was supported by the DOE Office of Advanced Scientific Computing Research (ASCR). Both the Molecular Foundry and NERSC are DOE Office of Science User Facilities located at Berkeley Lab.

    See the full article here .


    A U.S. Department of Energy National Laboratory Operated by the University of California


  • richardmitnick 9:26 pm on May 29, 2018
    Tags: Machine learning

    From Lawrence Berkeley National Lab: “New Machine Learning Approach Could Accelerate Bioengineering” 

    Berkeley Logo

    From Lawrence Berkeley National Lab

    May 29, 2018
    Dan Krotz

    A new approach developed by Zak Costello (left) and Hector Garcia Martin brings the speed and analytic power of machine learning to bioengineering. (Credit: Marilyn Chung, Berkeley Lab)

    Scientists from the Department of Energy’s Lawrence Berkeley National Laboratory (Berkeley Lab) have developed a way to use machine learning to dramatically accelerate the design of microbes that produce biofuel.

    Their computer algorithm starts with abundant data about the proteins and metabolites in a biofuel-producing microbial pathway, but no information about how the pathway actually works. It then uses data from previous experiments to learn how the pathway will behave. The scientists used the technique to automatically predict the amount of biofuel produced by pathways that have been added to E. coli bacterial cells.

    The new approach is much faster than the current way to predict the behavior of pathways, and promises to speed up the development of biomolecules for many applications in addition to commercially viable biofuels, such as drugs that fight antibiotic-resistant infections and crops that withstand drought.

    The research was published May 29 in the journal npj Systems Biology and Applications.

    In biology, a pathway is a series of chemical reactions in a cell that produce a specific compound. Researchers are exploring ways to re-engineer pathways, and import them from one microbe to another, to harness nature’s toolkit to improve medicine, energy, manufacturing, and agriculture. And thanks to new synthetic biology capabilities, such as the gene-editing tool CRISPR-Cas9, scientists can conduct this research at a precision like never before.

    “But there’s a significant bottleneck in the development process,” said Hector Garcia Martin, group lead at the DOE Agile BioFoundry and director of Quantitative Metabolic Modeling at the Joint BioEnergy Institute (JBEI), a DOE Bioenergy Research Center funded by DOE’s Office of Science and led by Berkeley Lab. The research was performed by Zak Costello (also with the Agile BioFoundry and JBEI) under the direction of Garcia Martin. Both researchers are also in Berkeley Lab’s Biological Systems and Engineering Division.

    “It’s very difficult to predict how a pathway will behave when it’s re-engineered. Trouble-shooting takes up 99% of our time. Our approach could significantly shorten this step and become a new way to guide bioengineering efforts,” Garcia Martin added.

    The current way to predict a pathway’s dynamics requires a maze of differential equations that describe how the components in the system change over time. Subject-area experts develop these “kinetic models” over several months, and the resulting predictions don’t always match experimental results.

    Machine learning, however, uses data to train a computer algorithm to make predictions. The algorithm learns a system’s behavior by analyzing data from related systems. This allows scientists to quickly predict the function of a pathway even if its mechanisms are poorly understood — as long as there are enough data to work with.

    Machine learning approaches, such as the technique recently developed by Berkeley Lab scientists, are hamstrung by a lack of large quantities of quality data. New automation capabilities at JBEI and the Agile BioFoundry will be able to produce these data in a systematic fashion. One example is a liquid handler coupled with an automated fermentation platform at JBEI, which takes samples automatically to produce data for the machine learning algorithms.

    The scientists tested their technique on pathways added to E. coli cells. One pathway is designed to produce a bio-based jet fuel called limonene; the other produces a gasoline replacement called isopentenol. Previous experiments at JBEI yielded a trove of data related to how different versions of the pathways function in various E. coli strains. Some of the strains have a pathway that produces small amounts of either limonene or isopentenol, while other strains have a version that produces large amounts of the biofuels.

    The researchers fed this data into their algorithm. Then machine learning took over: The algorithm taught itself how the concentrations of metabolites in these pathways change over time, and how much biofuel the pathways produce. It learned these dynamics by analyzing data from the two experimentally known pathways that produce small and large amounts of biofuels.

    The algorithm used this knowledge to predict the behavior of a third set of “mystery” pathways the algorithm had never seen before. It accurately predicted the biofuel-production profiles for the mystery pathways, including that the pathways produce a medium amount of fuel. In addition, the machine learning-derived prediction outperformed kinetic models.
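
    The general idea of learning dynamics from time-series data, rather than writing the equations by hand, can be sketched as follows. This is not the researchers' algorithm: the one-parameter model dx/dt = k * x is a deliberately tiny stand-in for a real learner, and the "training" data are synthetic. The sketch estimates rates of change from the data by finite differences, fits the model to them, and then integrates the fitted model to predict a time course it was not trained on.

```python
import math


def finite_differences(times, values):
    """Estimate dx/dt between consecutive samples, paired with the midpoint value."""
    xs, dxs = [], []
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        xs.append(0.5 * (values[i] + values[i + 1]))
        dxs.append((values[i + 1] - values[i]) / dt)
    return xs, dxs


def fit_rate_constant(xs, dxs):
    """Least-squares fit of dx/dt = k * x (single-parameter stand-in model)."""
    return sum(x * dx for x, dx in zip(xs, dxs)) / sum(x * x for x in xs)


def predict(x0, k, dt, steps):
    """Integrate the learned model forward in time with Euler steps."""
    x, out = x0, [x0]
    for _ in range(steps):
        x += k * x * dt
        out.append(x)
    return out


# Synthetic "training" time series: exponential decay with true k = -0.8.
times = [0.1 * i for i in range(50)]
values = [2.0 * math.exp(-0.8 * t) for t in times]

xs, dxs = finite_differences(times, values)
k_learned = fit_rate_constant(xs, dxs)

# Predict one time unit ahead from a starting concentration not in the training data.
predicted = predict(x0=5.0, k=k_learned, dt=0.01, steps=100)
```

    In a realistic setting the fitted model would be a flexible machine learning regressor with many inputs, which is why the amount and quality of training data matter so much.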

    “And the more data we added, the more accurate the predictions became,” said Garcia Martin. “This approach could expedite the time it takes to design new biomolecules. A project that today takes ten years and a team of experts could someday be handled by a summer student.”

    The work was part of the DOE Agile BioFoundry, supported by DOE’s Office of Energy Efficiency and Renewable Energy, and the Joint BioEnergy Institute, supported by DOE’s Office of Science.

    See the full article here.



    Stem Education Coalition

    A U.S. Department of Energy National Laboratory Operated by the University of California


  • richardmitnick 7:24 am on March 4, 2018 Permalink | Reply
    Tags: Barbara Engelhardt, GTEx-Genotype-Tissue Expression Consortium, Machine learning

    From Quanta Magazine: “A Statistical Search for Genomic Truths” 

    Quanta Magazine

    February 27, 2018
    Jordana Cepelewicz

    Barbara Engelhardt, a Princeton University computer scientist, wants to strengthen the foundation of biological knowledge in machine-learning approaches to genomic analysis. Sarah Blesener for Quanta Magazine.

    “We don’t have much ground truth in biology.” According to Barbara Engelhardt, a computer scientist at Princeton University, that’s just one of the many challenges that researchers face when trying to prime traditional machine-learning methods to analyze genomic data. Techniques in artificial intelligence and machine learning are dramatically altering the landscape of biological research, but Engelhardt doesn’t think those “black box” approaches are enough to provide the insights necessary for understanding, diagnosing and treating disease. Instead, she’s been developing new statistical tools that search for expected biological patterns to map out the genome’s real but elusive “ground truth.”

    Engelhardt likens the effort to detective work, as it involves combing through constellations of genetic variation, and even discarded data, for hidden gems. In research published last October [Nature], for example, she used one of her models to determine how mutations relate to the regulation of genes on other chromosomes (referred to as distal genes) in 44 human tissues. Among other findings, the results pointed to a potential genetic target for thyroid cancer therapies. Her work has similarly linked mutations and gene expression to specific features found in pathology images.

    The applications of Engelhardt’s research extend beyond genomic studies. She built a different kind of machine-learning model, for instance, that makes recommendations to doctors about when to remove their patients from a ventilator and allow them to breathe on their own.

    She hopes her statistical approaches will help clinicians catch certain conditions early, unpack their underlying mechanisms, and treat their causes rather than their symptoms. “We’re talking about solving diseases,” she said.

    To this end, she works as a principal investigator with the Genotype-Tissue Expression (GTEx) Consortium, an international research collaboration studying how gene regulation, expression and variation contribute to both healthy phenotypes and disease.


    Right now, she’s particularly interested in working on neuropsychiatric and neurodegenerative diseases, which are difficult to diagnose and treat.

    Quanta Magazine recently spoke with Engelhardt about the shortcomings of black-box machine learning when applied to biological data, the methods she’s developed to address those shortcomings, and the need to sift through “noise” in the data to uncover interesting information. The interview has been condensed and edited for clarity.

    What motivated you to focus your machine-learning work on questions in biology?

    I’ve always been excited about statistics and machine learning. In graduate school, my adviser, Michael Jordan [at the University of California, Berkeley], said something to the effect of: “You can’t just develop these methods in a vacuum. You need to think about some motivating applications.” I very quickly turned to biology, and ever since, most of the questions that drive my research are not statistical, but rather biological: understanding the genetics and underlying mechanisms of disease, hopefully leading to better diagnostics and therapeutics. But when I think about the field I am in — what papers I read, conferences I attend, classes I teach and students I mentor — my academic focus is on machine learning and applied statistics.

    We’ve been finding many associations between genomic markers and disease risk, but except in a few cases, those associations are not predictive and have not allowed us to understand how to diagnose, target and treat diseases. A genetic marker associated with disease risk is often not the true causal marker of the disease — one disease can have many possible genetic causes, and a complex disease might be caused by many, many genetic markers possibly interacting with the environment. These are all challenges that someone with a background in statistical genetics and machine learning, working together with wet-lab scientists and medical doctors, can begin to address and solve. Which would mean we could actually treat genetic diseases — their causes, not just their symptoms.

    You’ve spoken before about how traditional statistical approaches won’t suffice for applications in genomics and health care. Why not?

    First, because of a lack of interpretability. In machine learning, we often use “black-box” methods — [classification algorithms called] random forests, or deeper learning approaches. But those don’t really allow us to “open” the box, to understand which genes are differentially regulated in particular cell types or which mutations lead to a higher risk of a disease. I’m interested in understanding what’s going on biologically. I can’t just have something that gives an answer without explaining why.

    The goal of these methods is often prediction, but given a person’s genotype, it is not particularly useful to estimate the probability that they’ll get Type 2 diabetes. I want to know how they’re going to get Type 2 diabetes: which mutation causes the dysregulation of which gene to lead to the development of the condition. Prediction is not sufficient for the questions I’m asking.

    A second reason has to do with sample size. Most of the driving applications of statistics assume that you’re working with a large and growing number of data samples — say, the number of Netflix users or emails coming into your inbox — with a limited number of features or observations that have interesting structure. But when it comes to biomedical data, we don’t have that at all. Instead, we have a limited number of patients in the hospital, a limited number of genotypes we can sequence — but a gigantic set of features or observations for any one person, including all the mutations in their genome. Consequently, many theoretical and applied approaches from statistics can’t be used for genomic data.

    What makes the genomic data so challenging to analyze?

    The most important signals in biomedical data are often incredibly small and completely swamped by technical noise. It’s not just about how you model the real, biological signal — the questions you’re trying to ask about the data — but also how you model that in the presence of this incredibly heavy-handed noise that’s driven by things you don’t care about, like which population the individuals came from or which technician ran the samples in the lab. You have to get rid of that noise carefully. And we often have a lot of questions that we would like to answer using the data, and we need to run an incredibly large number of statistical tests — literally trillions — to figure out the answers. For example, to identify an association between a mutation in a genome and some trait of interest, where that trait might be the expression levels of a specific gene in a tissue. So how can we develop rigorous, robust testing mechanisms where the signals are really, really small and sometimes very hard to distinguish from noise? How do we correct for all this structure and noise that we know is going to exist?
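
    A small worked example shows why trillions of tests force a very strict significance threshold. The sketch below assumes a Bonferroni correction and a conventional family-wise error rate of 0.05; both are illustrative assumptions, since the interview does not say which correction is used. The figure of 3 trillion tests is the one quoted later in the interview.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_tests = 3_000_000_000_000  # "more than 3 trillion tests"
alpha = 0.05                 # assumed overall error rate

threshold = bonferroni_threshold(alpha, n_tests)

# Without any correction, this many tests would be expected to pass
# the 0.05 threshold by chance alone even if no real association existed.
expected_false_positives = alpha * n_tests
```

    Under these assumptions each individual test has to reach a p-value of roughly 1.7e-14, which is why small but real signals are so easily lost.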

    So what approach do we need to take instead?

    My group relies heavily on what we call sparse latent factor models, which can sound quite mathematically complicated. The fundamental idea is that these models partition all the variation we observed in the samples, with respect to only a very small number of features. One of these partitions might include 10 genes, for example, or 20 mutations. And then as a scientist, I can look at those 10 genes and figure out what they have in common, determine what this given partition represents in terms of a biological signal that affects sample variance.

    So I think of it as a two-step process: First, build a model that separates all the sources of variation as carefully as possible. Then go in as a scientist to understand what all those partitions represent in terms of a biological signal. After this, we can validate those conclusions in other data sets and think about what else we know about these samples (for instance, whether everyone of the same age is included in one of these partitions).
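
    The flavor of a sparse factor can be illustrated with a generic technique: a rank-one factorization in which small loadings are pushed to exactly zero by soft-thresholding. This is only a sketch under stated assumptions, not the models described in the interview; the function names, the penalty value and the toy data (six samples, eight features, of which only the first three carry a shared signal) are all made up for illustration.

```python
import math


def normalise(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v


def soft_threshold(x, penalty):
    """Shrink x toward zero; values within the penalty become exactly zero."""
    if x > penalty:
        return x - penalty
    if x < -penalty:
        return x + penalty
    return 0.0


def sparse_factor(data, penalty, iterations=20):
    """Alternate between per-sample scores and sparse per-feature loadings."""
    n_samples, n_features = len(data), len(data[0])
    loadings = normalise([1.0] * n_features)
    scores = [0.0] * n_samples
    for _ in range(iterations):
        scores = normalise(
            [sum(row[j] * loadings[j] for j in range(n_features)) for row in data]
        )
        raw = [
            sum(data[i][j] * scores[i] for i in range(n_samples))
            for j in range(n_features)
        ]
        loadings = normalise([soft_threshold(r, penalty) for r in raw])
    return scores, loadings


# Toy data: a hidden factor drives features 0-2; features 3-7 hold small noise.
hidden = [1.0, -1.0, 2.0, -2.0, 0.5, -0.5]
weights = [1.0, 0.8, 1.2]
data = []
for i, z in enumerate(hidden):
    signal = [z * w for w in weights]
    noise = [0.025 * ((7 * i + 3 * j) % 5 - 2) for j in range(5)]
    data.append(signal + noise)

scores, loadings = sparse_factor(data, penalty=0.5)
selected_features = [j for j, value in enumerate(loadings) if value != 0.0]
```

    The non-zero loadings play the role of the short list of genes or mutations that a scientist would then inspect by hand in the second step.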

    When you say “go in as a scientist,” what do you mean?

    I’m trying to find particular biological patterns, so I build these models with a lot of structure and include a lot about what kinds of signals I’m expecting. I establish a scaffold, a set of parameters that will tell me what the data say, and what patterns may or may not be there. The model itself has only a certain amount of expressivity, so I’ll only be able to find certain types of patterns. From what I’ve seen, existing general models don’t do a great job of finding signals we can interpret biologically: They often just determine the biggest influencers of variance in the data, as opposed to the most biologically impactful sources of variance. The scaffold I build instead represents a very structured, very complex family of possible patterns to describe the data. The data then fill in that scaffold to tell me which parts of that structure are represented and which are not.

    So instead of using general models, my group and I carefully look at the data, try to understand what’s going on from the biological perspective, and tailor our models based on what types of patterns we see.

    How does the latent factor model work in practice?

    We applied one of these latent factor models to pathology images [pictures of tissue slices under a microscope], which are often used to diagnose cancer. For every image, we also had data about the set of genes expressed in those tissues. We wanted to see how the images and the corresponding gene expression levels were coordinated.

    We developed a set of features describing each of the images, using a deep-learning method to identify not just pixel-level values but also patterns in the image. We pulled out over a thousand features from each image, give or take, and then applied a latent factor model and found some pretty exciting things.

    For example, we found sets of genes and features in one of these partitions that described the presence of immune cells in the brain. You don’t necessarily see these cells on the pathology images, but when we looked at our model, we saw a component there that represented only genes and features associated with immune cells, not brain cells. As far as I know, no one’s seen this kind of signal before. But it becomes incredibly clear when we look at these latent factor components.

    Video: Barbara Engelhardt, a computer scientist at Princeton University, explains why traditional machine-learning techniques have often fallen short for genomic analysis, and how researchers are overcoming that challenge. Sarah Blesener for Quanta Magazine

    You’ve worked with dozens of human tissue types to unpack how specific genetic variations help shape complex traits. What insights have your methods provided?

    We had 44 tissues, donated from 449 human cadavers, and their genotypes (sequences of their whole genomes). We wanted to understand more about the differences in how those genotypes expressed their genes in all those tissues, so we did more than 3 trillion tests, one by one, comparing every mutation in the genome with every gene expressed in each tissue. (Running that many tests on the computing clusters we’re using now takes about two weeks; when we move this iteration of GTEx to the cloud as planned, we expect it to take around two hours.) We were trying to figure out whether the [mutant] genotype was driving distal gene expression. In other words, we were looking for mutations that weren’t located on the same chromosome as the genes they were regulating. We didn’t find very much: a little over 600 of these distal associations. Their signals were very low.

    But one of the signals was strong: an exciting thyroid association, in which a mutation appeared to distally regulate two different genes. We asked ourselves: How is this mutation affecting expression levels in a completely different part of the genome? In collaboration with Alexis Battle’s lab at Johns Hopkins University, we looked near the mutation on the genome and found a gene called FOXE1, for a transcription factor that regulates the transcription of genes all over the genome. The FOXE1 gene is only expressed in thyroid tissues, which was interesting. But we saw no association between the mutant genotype and the expression levels of FOXE1. So we had to look at the components of the original signal we’d removed before — everything that had appeared to be a technical artifact — to see if we could detect the effects of the FOXE1 protein broadly on the genome.

    We found a huge impact of FOXE1 in the technical artifacts we’d removed. FOXE1, it seems, regulates a large number of genes only in the thyroid. Its variation is driven by the mutant genotype we found. And that genotype is also associated with thyroid cancer risk. We went back to the thyroid cancer samples — we had about 500 from the Cancer Genome Atlas — and replicated the distal association signal. These things tell a compelling story, but we wouldn’t have learned it unless we had tried to understand the signal that we’d removed.

    What are the implications of such an association?

    Now we have a particular mechanism for the development of thyroid cancer and the dysregulation of thyroid cells. If FOXE1 is a druggable target — if we can go back and think about designing drugs to enhance or suppress the expression of FOXE1 — then we can hope to prevent people at high thyroid cancer risk from getting it, or to treat people with thyroid cancer more effectively.

    The signal from broad-effect transcription factors like FOXE1 actually looks a lot like the effects we typically remove as part of the noise: population structure, or the batches the samples were run in, or the effects of age or sex. A lot of those technical influences are going to affect approximately similar numbers of genes — around 10 percent — in a similar way. That’s why we usually remove signals that have that pattern. In this case, though, we had to understand the domain we were working in. As scientists, we looked through all the signals we’d gotten rid of, and this allowed us to find the effects of FOXE1 showing up so strongly in there. It involved manual labor and insights from a biological background, but we’re thinking about how to develop methods to do it in a more automated way.

    So with traditional modeling techniques, we’re missing a lot of real biological effects because they look too similar to noise?

    Yes. There are a ton of cases in which the interesting pattern and the noise look similar. Take these distal effects: Pretty much all of them, if they are broad effects, are going to look like the noise signal we systematically get rid of. It’s methodologically challenging. We have to think carefully about how to characterize when a signal is biologically relevant or just noise, and how to distinguish the two. My group is working fairly aggressively on figuring that out.

    Why are those relationships so difficult to map, and why look for them?

    There are so many tests we have to do; the threshold for the statistical significance of a discovery has to be really, really high. That creates problems for finding these signals, which are often incredibly small; if our threshold is that high, we’re going to miss a lot of them. And biologically, it’s not clear that there are many of these really broad-effect distal signals. You can imagine that natural selection would eliminate the kinds of mutations that affect 10 percent of genes — that we wouldn’t want that kind of variability in the population for so many genes.

    But I think there’s no doubt that these distal associations play an enormous role in disease, and that they may be considered as druggable targets. Understanding their role broadly is incredibly important for human health.

    See the full article here.

    Please help promote STEM in your local schools.

    Stem Education Coalition

    Formerly known as Simons Science News, Quanta Magazine is an editorially independent online publication launched by the Simons Foundation to enhance public understanding of science. Why Quanta? Albert Einstein called photons “quanta of light.” Our goal is to “illuminate science.” At Quanta Magazine, scientific accuracy is every bit as important as telling a good story. All of our articles are meticulously researched, reported, edited, copy-edited and fact-checked.

  • richardmitnick 7:31 am on February 27, 2018 Permalink | Reply
    Tags: Machine learning

    From ETH Zürich: “Teaching quantum physics to a computer” 

    ETH Zürich

    Oliver Morsch

    An international collaboration led by ETH physicists has used machine learning to teach a computer how to predict the outcomes of quantum experiments. The results could prove to be essential for testing future quantum computers.

    Using neural networks, physicists taught a computer to predict the results of quantum experiments. (Graphic: http://www.colourbox.com)

    Physics students spend many years learning to master the often counterintuitive laws and effects of quantum mechanics. For instance, the quantum state of a physical system may be undetermined until a measurement is made, and a measurement on one part of the system can influence the state of a distant part without any exchange of information. It is enough to make the mind boggle. Once the students graduate and start doing research, the problems continue: to exactly determine the state of some quantum system in an experiment, one has to carefully prepare it and make lots of measurements, over and over again.

    Very often, what one is actually interested in cannot even be measured directly. An international team of researchers led by Giuseppe Carleo, a lecturer at the Institute for Theoretical Physics of ETH Zürich, has now developed machine learning software that enables a computer to “learn” the quantum state of a complex physical system based on experimental observations and to predict the outcomes of hypothetical measurements. In the future, their software could be used to test the accuracy of quantum computers.

    Quantum physics and handwriting

    The principle of his approach, Carleo explains, is rather simple. He uses an intuitive analogy that avoids the complications of quantum physics: “What we do, in a nutshell, is like teaching the computer to imitate my handwriting. We will show it a bunch of written samples, and step by step it then learns to replicate all my a’s, l’s and so forth.”

    The way the computer does this is by looking at the ways, for instance, in which an “l” is written when it follows an “a”. These may not always be the same, so the computer will calculate a probability distribution that expresses mathematically how often a letter is written in a certain way when it is preceded by some other letter. “Once the computer has figured out that distribution, it could then reproduce something that looks very much like my handwriting”, Carleo says.
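
    The handwriting analogy amounts to estimating a conditional probability distribution from counted examples. A minimal sketch, assuming each observation records the preceding letter, the letter itself and a label for how it was written (the labels "looped" and "straight" and the sample data are hypothetical):

```python
from collections import Counter, defaultdict


def estimate_style_distribution(observations):
    """Estimate P(style | previous letter, letter) from observed samples.

    observations: iterable of (previous_letter, letter, style) tuples.
    """
    counts = defaultdict(Counter)
    for previous_letter, letter, style in observations:
        counts[(previous_letter, letter)][style] += 1

    distribution = {}
    for context, style_counts in counts.items():
        total = sum(style_counts.values())
        distribution[context] = {s: c / total for s, c in style_counts.items()}
    return distribution


# Made-up handwriting samples: how an "l" was written after an "a" or an "e".
observations = [
    ("a", "l", "looped"),
    ("a", "l", "looped"),
    ("a", "l", "looped"),
    ("a", "l", "straight"),
    ("e", "l", "straight"),
    ("e", "l", "straight"),
]
distribution = estimate_style_distribution(observations)
```

    Sampling from such a table, one letter at a time, would reproduce text that resembles the original writing, which is the sense in which the computer imitates the handwriting.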

    A neural network (top) “learns” the quantum state of a spin system from measurement data by trying different possibilities of the spin directions (bottom) and correcting itself step by step. (Graphic: ETH Zürich / G. Carleo)

    Quantum physics is, of course, much more complicated than a person’s handwriting. Still, the principle behind the machine learning algorithm is quite similar. Carleo (who recently moved to the Flatiron Institute in New York) developed it together with Matthias Troyer and Guglielmo Mazzola (both at ETH), Giacomo Torlai from the University of Waterloo, and colleagues at the Perimeter Institute and the company D-Wave in Canada.

    The quantum state of the physical system is encoded in a so-called neural network, and learning is achieved in small steps by translating the current state of the network into predicted measurement probabilities. Those probabilities are then compared to the actually measured data, and adjustments are made to the network in order to make them match better in the next round. Once this training period is finished, one can then use the quantum state stored in the neural network for “virtual” experiments without actually performing them in the laboratory.
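
    The predict, compare and adjust loop described above can be sketched with a much simpler model than a neural network. The example below assumes a plain softmax distribution over four measurement outcomes and made-up measurement counts; it illustrates only the training loop, not the network architecture used by the researchers.

```python
import math


def softmax(logits):
    """Turn unnormalised parameters into probabilities that sum to one."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fit_outcome_probabilities(counts, learning_rate=0.5, steps=2000):
    """Adjust model parameters until predicted probabilities match the data."""
    total = sum(counts)
    observed = [c / total for c in counts]
    logits = [0.0] * len(counts)
    for _ in range(steps):
        predicted = softmax(logits)
        # The cross-entropy gradient with respect to each parameter is
        # (predicted - observed), so every step shrinks the mismatch.
        logits = [
            l - learning_rate * (p - o)
            for l, p, o in zip(logits, predicted, observed)
        ]
    return softmax(logits)


# Made-up counts for the outcomes 00, 01, 10 and 11 of a two-qubit measurement.
counts = [480, 20, 30, 470]
learned = fit_outcome_probabilities(counts)
```

    Once trained, the model can be queried for outcome probabilities without collecting further data, which is the sense of a “virtual” experiment.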

    Faster tomography for quantum states

    “Using machine learning to extract a quantum state from measurements has a number of advantages”, Carleo explains. He cites one striking example, in which the quantum state of a collection of just eight quantum objects (trapped ions) had to be experimentally determined. Using a standard approach called quantum tomography, around one million measurements were needed to achieve the desired accuracy. With the new method, a much smaller number of measurements could do the same job, and substantially larger systems, previously inaccessible, could be studied.

    This is encouraging, since common wisdom has it that the number of calculations necessary to simulate a complex quantum system on a classical computer grows exponentially with the number of quantum objects in the system. This is mainly because of a phenomenon called entanglement, which causes distant parts of the quantum system to be intimately connected although they do not exchange information. The approach used by Carleo and his collaborators takes this into account by using a layer of “hidden” neurons, which allow the computer to encode the correct quantum state in a much more compact fashion.
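
    The exponential growth is easy to quantify: a general state of n two-level quantum objects is described by 2^n complex amplitudes. The short sketch below assumes 16 bytes per amplitude (two 64-bit floating-point numbers), which is an illustrative storage choice rather than a detail from the article.

```python
def state_vector_amplitudes(n_qubits):
    """Number of complex amplitudes in a general n-qubit state."""
    return 2 ** n_qubits


def state_vector_memory_bytes(n_qubits, bytes_per_amplitude=16):
    """Memory needed to store every amplitude explicitly."""
    return state_vector_amplitudes(n_qubits) * bytes_per_amplitude


for n in (8, 30, 100):
    print(n, state_vector_amplitudes(n), state_vector_memory_bytes(n))
```

    Eight qubits need only 256 amplitudes, 30 qubits already need 16 GiB under this assumption, and 100 qubits are far beyond any classical memory, which is why a compact representation matters.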

    Testing quantum computers

    Being able to study quantum systems with a large number of components – or “qubits”, as they are often called – also has important implications for future quantum technologies, as Carleo points out: “If we want to test quantum computers with more than a handful of qubits, that won’t be possible with conventional means because of the exponential scaling. Our machine learning approach, however, should put us in a position to test quantum computers with as many as 100 qubits.”

    Also, the machine learning software can help experimental physicists by allowing them to perform virtual measurements that would be hard to do in the laboratory, such as measuring the degree of entanglement of a system composed of many interacting qubits. So far, the method has only been tested on artificially generated data, but the researchers plan to use it for analysing real quantum experiments very soon.

    Science paper:
    Torlai G, Mazzola G, Carrasquilla J, Troyer M, Melko R, Carleo G: Neural-network quantum state tomography. Nature Physics.

    See the full article here.

    ETH Zürich is one of the leading international universities for technology and the natural sciences. It is well known for its excellent education, ground-breaking fundamental research and for implementing its results directly into practice.

    Founded in 1855, ETH Zürich today has more than 18,500 students from over 110 countries, including 4,000 doctoral students. To researchers, it offers an inspiring working environment; to students, a comprehensive education.

    Twenty-one Nobel Laureates have studied, taught or conducted research at ETH Zürich, underlining the excellent reputation of the university.

  • richardmitnick 8:33 pm on November 16, 2017 Permalink | Reply
    Tags: Machine learning

    From phys.org: “Machine learning used to predict earthquakes in a lab setting” 


    October 23, 2017

    Aerial photo of the San Andreas Fault in the Carrizo Plain, northwest of Los Angeles. Credit: Wikipedia.

    A group of researchers from the UK and the US have used machine learning techniques to successfully predict earthquakes. Although their work was performed in a laboratory setting, the experiment closely mimics real-life conditions, and the results could be used to predict the timing of a real earthquake.

    The team, from the University of Cambridge, Los Alamos National Laboratory and Boston University, identified a hidden signal leading up to earthquakes, and used this ‘fingerprint’ to train a machine learning algorithm to predict future earthquakes. Their results, which could also be applied to avalanches, landslides and more, are reported in the journal Geophysical Research Letters.

    For geoscientists, predicting the timing and magnitude of an earthquake is a fundamental goal. Generally speaking, pinpointing where an earthquake will occur is fairly straightforward: if an earthquake has struck a particular place before, the chances are it will strike there again. The questions that have challenged scientists for decades are how to pinpoint when an earthquake will occur, and how severe it will be. Over the past 15 years, advances in instrument precision have been made, but a reliable earthquake prediction technique has not yet been developed.

    As part of a project searching for ways to use machine learning techniques to make gallium nitride (GaN) LEDs more efficient, the study’s first author, Bertrand Rouet-Leduc, who was then a PhD student at Cambridge, moved to Los Alamos National Laboratory in New Mexico to start a collaboration on machine learning in materials science between Cambridge University and Los Alamos. From there the team started helping the Los Alamos Geophysics group on machine learning questions.

    The team at Los Alamos, led by Paul Johnson, studies the interactions among earthquakes, precursor quakes (often very small earth movements) and faults, with the hope of developing a method to predict earthquakes. Using a lab-based system that mimics real earthquakes, the researchers used machine learning techniques to analyse the acoustic signals coming from the ‘fault’ as it moved and search for patterns.

    The laboratory apparatus uses steel blocks to closely mimic the physical forces at work in a real earthquake, and also records the seismic signals and sounds that are emitted. Machine learning is then used to find the relationship between the acoustic signal coming from the fault and how close it is to failing.

    The machine learning algorithm was able to identify a particular pattern in the sound, previously thought to be nothing more than noise, which occurs long before an earthquake. The characteristics of this sound pattern can be used to give a precise estimate (within a few percent) of the stress on the fault (that is, how much force it is under) and to estimate the time remaining before failure, which gets more and more precise as failure approaches. The team now thinks that this sound pattern is a direct measure of the elastic energy that is in the system at a given time.
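
    The link between an acoustic signal and the time remaining before failure can be sketched with a toy example. Everything here is an assumption made for illustration: the signal is synthetic (a tone whose amplitude grows steadily until "failure"), the feature is the standard deviation of the signal in fixed windows, and the model is a straight-line fit rather than the algorithm used in the study.

```python
import math


def window_std(signal, window):
    """Standard deviation of the signal in consecutive non-overlapping windows."""
    features = []
    for start in range(0, len(signal) - window + 1, window):
        chunk = signal[start:start + window]
        mean = sum(chunk) / window
        features.append(math.sqrt(sum((x - mean) ** 2 for x in chunk) / window))
    return features


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic acoustic signal: amplitude grows steadily until failure at the end.
n_samples, window = 2000, 100
signal = [(0.1 + 0.9 * t / n_samples) * math.sin(0.9 * t) for t in range(n_samples)]

features = window_std(signal, window)
time_to_failure = [n_samples - (i + 1) * window for i in range(len(features))]

slope, intercept = fit_line(features, time_to_failure)
predictions = [slope * f + intercept for f in features]
mean_abs_error = sum(
    abs(p - y) for p, y in zip(predictions, time_to_failure)
) / len(features)
```

    In this toy setup a louder signal means less time remaining, so the fitted slope is negative; real laboratory data are far noisier and call for richer features and models.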

    “This is the first time that machine learning has been used to analyse acoustic data to predict when an earthquake will occur, long before it does, so that plenty of warning time can be given – it’s incredible what machine learning can do,” said co-author Professor Sir Colin Humphreys of Cambridge’s Department of Materials Science & Metallurgy, whose main area of research is energy-efficient and cost-effective LEDs. Humphreys was Rouet-Leduc’s supervisor when he was a PhD student at Cambridge.

    “Machine learning enables the analysis of datasets too large to handle manually and looks at data in an unbiased way that enables discoveries to be made,” said Rouet-Leduc.

    Although the researchers caution that there are multiple differences between a lab-based experiment and a real earthquake, they hope to progressively scale up their approach by applying it to real systems which most resemble their lab system. One such site is in California along the San Andreas Fault, where characteristic small repeating earthquakes are similar to those in the lab-based earthquake simulator. Progress is also being made on the Cascadia fault in the Pacific Northwest of the United States and British Columbia, Canada, where repeating slow earthquakes that occur over weeks or months are also very similar to laboratory earthquakes.

    “We’re at a point where huge advances in instrumentation, machine learning, faster computers and our ability to handle massive data sets could bring about huge advances in earthquake science,” said Rouet-Leduc.

    See the full article here.

    About Phys.org in 100 Words

    Phys.org™ (formerly Physorg.com) is a leading web-based science, research and technology news service which covers a full range of topics. These include physics, earth science, medicine, nanotechnology, electronics, space, biology, chemistry, computer sciences, engineering, mathematics and other sciences and technologies. Launched in 2004, Phys.org’s readership has grown steadily to include 1.75 million scientists, researchers, and engineers every month. Phys.org publishes approximately 100 quality articles every day, offering some of the most comprehensive coverage of sci-tech developments world-wide. Quantcast 2009 includes Phys.org in its list of the Global Top 2,000 Websites. Phys.org community members enjoy access to many personalized features such as social networking, a personal home page set-up, RSS/XML feeds, article comments and ranking, the ability to save favorite articles, a daily newsletter, and other options.

  • richardmitnick 2:04 pm on October 13, 2017
    Tags: Machine learning

    From BNL: “Scientists Use Machine Learning to Translate ‘Hidden’ Information that Reveals Chemistry in Action” 

    Brookhaven Lab

    October 10, 2017
    Karen McNulty Walsh
    (631) 344-8350

    Peter Genzer
    (631) 344-3174

    New method allows on-the-fly analysis of how catalysts change during reactions, providing crucial information for improving performance.

    A sketch of the new method that enables fast, “on-the-fly” determination of three-dimensional structure of nanocatalysts. The neural network converts the x-ray absorption spectra into geometric information (such as nanoparticle sizes and shapes) and the structural models are obtained for each spectrum. No image credit.

    Chemistry is a complex dance of atoms. Subtle shifts in position and shuffles of electrons break and remake chemical bonds as participants change partners. Catalysts are like molecular matchmakers that make it easier for sometimes-reluctant partners to interact.

    Now scientists have a way to capture the details of chemistry choreography as it happens. The method—which relies on computers that have learned to recognize hidden signs of the steps—should help them improve the performance of catalysts to drive reactions toward desired products faster.

    The method—developed by an interdisciplinary team of chemists, computational scientists, and physicists at the U.S. Department of Energy’s Brookhaven National Laboratory and Stony Brook University—is described in a new paper published in the Journal of Physical Chemistry Letters. The paper demonstrates how the team used neural networks and machine learning to teach computers to decode previously inaccessible information from x-ray data, and then used that data to decipher 3D nanoscale structures.

    Decoding nanoscale structures

    “The main challenge in developing catalysts is knowing how they work—so we can design better ones rationally, not by trial-and-error,” said Anatoly Frenkel, leader of the research team who has a joint appointment with Brookhaven Lab’s Chemistry Division and Stony Brook University’s Materials Science Department. “The explanation for how catalysts work is at the level of atoms and very precise measurements of distances between them, which can change as they react. Therefore it is not so important to know the catalysts’ architecture when they are made but more important to follow that as they react.”

    Anatoly Frenkel (standing) with co-authors (l to r) Deyu Lu, Yuewei Lin, and Janis Timoshenko. No image credit.

    Trouble is, important reactions—those that create important industrial chemicals such as fertilizers—often take place at high temperatures and under pressure, which complicates measurement techniques. For example, x-rays can reveal some atomic-level structures by causing atoms that absorb their energy to emit electronic waves. As those waves interact with nearby atoms, they reveal their positions in a way that’s similar to how distortions in ripples on the surface of a pond can reveal the presence of rocks. But the ripple pattern gets more complicated and smeared when high heat and pressure introduce disorder into the structure, thus blurring the information the waves can reveal.

    So instead of relying on the “ripple pattern” of the x-ray absorption spectrum, Frenkel’s group figured out a way to look into a different part of the spectrum associated with low-energy waves that are less affected by heat and disorder.

    “We realized that this part of the x-ray absorption signal contains all the needed information about the environment around the absorbing atoms,” said Janis Timoshenko, a postdoctoral fellow working with Frenkel at Stony Brook and lead author on the paper. “But this information is hidden ‘below the surface’ in the sense that we don’t have an equation to describe it, so it is much harder to interpret. We needed to decode that spectrum but we didn’t have a key.”

    Fortunately Yuewei Lin and Shinjae Yoo of Brookhaven’s Computational Science Initiative and Deyu Lu of the Center for Functional Nanomaterials (CFN) had significant experience with so-called machine learning methods. They helped the team develop a key by teaching computers to find the connections between hidden features of the absorption spectrum and structural details of the catalysts.

    “Janis took these ideas and really ran with them,” Frenkel said.

    The team used theoretical modeling to produce simulated spectra of several hundred thousand model structures, and used those to train the computer to recognize the features of the spectrum and how they correlated with the structure.

    “Then we built a neural network that was able to convert the spectrum into structures,” Frenkel said.
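    The simulate-then-invert workflow described above can be sketched in a few lines. This is only an illustration, not the team's method: a nearest-neighbour lookup is swapped in for the neural network, and the forward model `simulate_spectrum`, the energy grid, and the particle sizes are all hypothetical stand-ins invented for the example.

```python
import math

def simulate_spectrum(size_nm, energies):
    """Toy forward model (hypothetical): one peak whose width shrinks as size grows."""
    width = 1.0 + 2.0 / size_nm
    return [math.exp(-((e - 5.0) / width) ** 2) for e in energies]

def build_library(sizes, energies):
    """Simulate one spectrum per candidate structure, giving labelled training data."""
    return [(s, simulate_spectrum(s, energies)) for s in sizes]

def predict_size(spectrum, library):
    """Return the size whose simulated spectrum is closest (least squares) to the input."""
    def distance(entry):
        _, reference = entry
        return sum((a - b) ** 2 for a, b in zip(spectrum, reference))
    best_size, _ = min(library, key=distance)
    return best_size

energies = [i * 0.25 for i in range(41)]      # 0.0 to 10.0, arbitrary units
sizes = [0.5 + 0.1 * i for i in range(46)]    # candidate sizes, 0.5 to 5.0 nm
library = build_library(sizes, energies)

# Stand-in for an experimental spectrum from a particle not in the library.
measured = simulate_spectrum(2.04, energies)
estimate = predict_size(measured, library)
print(f"estimated size: {estimate:.1f} nm")
```

    The expensive part (building the library, or training a network) happens once up front; each later lookup is cheap, which is the property that makes on-the-fly use plausible.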

    When they tested whether the method could decipher the shapes and sizes of well-defined platinum nanoparticles (using x-ray absorption spectra previously published by Frenkel and his collaborators), it did.

    “This method can now be used on the fly,” Frenkel said. “Once the network is constructed it takes almost no time for the structure to be obtained in any real experiment.”

    That means scientists studying catalysts at Brookhaven’s National Synchrotron Light Source II (NSLS-II), for example, could obtain real-time structural information to decipher why a particular reaction slows down, or starts producing an unwanted product—and then tweak the reaction conditions or catalyst chemistry to achieve desired results. This would be a big improvement over waiting to analyze results after completing the experiments and then figuring out what went wrong.

    In addition, this technique can process and analyze spectral signals from very low-concentration samples, and will be particularly useful at new high flux and high-energy-resolution beamlines incorporating special optics and high-throughput analysis techniques at NSLS-II.

    “This will offer completely new methods of using synchrotrons for operando research,” Frenkel said.

    This work was funded by the DOE Office of Science (BES) and by Brookhaven’s Laboratory Directed Research and Development program. Previously published spectra for the model nanoparticles used to validate the neural network were collected at the Advanced Photon Source (APS) at DOE’s Argonne National Laboratory and the original National Synchrotron Light Source (NSLS) at Brookhaven Lab, now replaced by NSLS-II. CFN, NSLS-II, and APS are DOE Office of Science User Facilities. In addition to Frenkel and Timoshenko, Lu and Lin are co-authors on the paper.

    See the full article here .

    Please help promote STEM in your local schools.

    Stem Education Coalition
    BNL Campus

    One of ten national laboratories overseen and primarily funded by the Office of Science of the U.S. Department of Energy (DOE), Brookhaven National Laboratory conducts research in the physical, biomedical, and environmental sciences, as well as in energy technologies and national security. Brookhaven Lab also builds and operates major scientific facilities available to university, industry and government researchers. The Laboratory’s almost 3,000 scientists, engineers, and support staff are joined each year by more than 5,000 visiting researchers from around the world. Brookhaven is operated and managed for DOE’s Office of Science by Brookhaven Science Associates, a limited-liability company founded by Stony Brook University, the largest academic user of Laboratory facilities, and Battelle, a nonprofit, applied science and technology organization.

  • richardmitnick 8:08 am on April 25, 2017
    Tags: Machine learning

    From SLAC: “Machine Learning Dramatically Streamlines Search for More Efficient Chemical Reactions” 

    SLAC Lab

    April 24, 2017
    Glennda Chui

    A diagram shows the many possible paths one simple catalytic reaction can theoretically take – in this case, conversion of syngas, which is a combination of hydrogen (H2) and carbon monoxide (CO), to acetaldehyde. Machine learning allowed SUNCAT theorists to prune away the least likely paths and identify the most likely one (red) so scientists can focus on making it more efficient. (Zachary Ulissi/SUNCAT)

    Even a simple chemical reaction can be surprisingly complicated. That’s especially true for reactions involving catalysts, which speed up the chemistry that makes fuel, fertilizer and other industrial goods. In theory, a catalytic reaction may follow thousands of possible paths, and it can take years to identify which one it actually takes so scientists can tweak it and make it more efficient.

    Now researchers at the Department of Energy’s SLAC National Accelerator Laboratory and Stanford University have taken a big step toward cutting through this thicket of possibilities. They used machine learning – a form of artificial intelligence – to prune away the least likely reaction paths, so they can concentrate their analysis on the few that remain and save a lot of time and effort.

    The method will work for a wide variety of complex chemical reactions and should dramatically speed the development of new catalysts, the team reported in Nature Communications.

    ‘A Daunting Task’

    “Designing a novel catalyst to speed a chemical reaction is a very daunting task,” said Thomas Bligaard, a staff scientist at the SUNCAT Center for Interface Science and Catalysis, a joint SLAC/Stanford institute where the research took place. “There’s a huge amount of experimental work that normally goes into it.”

    For instance, he said, finding a catalyst that turns nitrogen from the air into ammonia – considered one of the most important developments of the 20th century because it made the large-scale production of fertilizer possible, helping to launch the Green Revolution – took decades of testing various reactions one by one.

    Even today, with the help of supercomputer simulations that predict the results of reactions by applying theoretical models to huge databases on the behavior of chemicals and catalysts, the search can take years, because until now it has relied largely on human intuition to pick possible winners out of the many available reaction paths.

    “We need to know what the reaction is, and what are the most difficult steps along the reaction path, in order to even think about making a better catalyst,” said Jens Nørskov, a professor at SLAC and Stanford and director of SUNCAT.

    “We also need to know whether the reaction makes only the product we want or if it also makes undesirable byproducts. We’ve basically been making reasonable assumptions about these things, and we really need a systematic theory to guide us.”

    Trading Human Intuition for Machine Learning

    For this study, the team looked at a reaction that turns syngas, a combination of carbon monoxide and hydrogen, into fuels and industrial chemicals. The syngas flows over the surface of a rhodium catalyst, which like all catalysts is not consumed in the process and can be used over and over. This triggers chemical reactions that can produce a number of possible end products, such as ethanol, methane or acetaldehyde.

    “In this case there are thousands of possible reaction pathways – an infinite number, really – with hundreds of intermediate steps,” said Zachary Ulissi, a postdoctoral researcher at SUNCAT. “Usually what would happen is that a graduate student or postdoctoral researcher would go through them one at a time, using their intuition to pick what they think are the most likely paths. This can take years.”

    The new method ditches intuition in favor of machine learning, where a computer uses a set of problem-solving rules to learn patterns from large amounts of data and then predict similar patterns in new data. It’s a behind-the-scenes tool in an increasing number of technologies, from self-driving cars to fraud detection and online purchase recommendations.

    Rapid Weeding

    The data used in this process came from past studies of chemicals and their properties, including calculations that predict the bond energies between atoms based on principles of quantum mechanics. The researchers were especially interested in two factors that determine how easily a catalytic reaction proceeds: How strongly the reacting chemicals bond to the surface of the catalyst and which steps in the reaction present the most significant barriers to going forward. These are known as rate-limiting steps.

    A reaction will seek out the path that takes the least energy, Ulissi explained, much like a highway designer will choose a route between mountains rather than waste time looking for an efficient way to go over the top of a peak. With machine learning the researchers were able to analyze the reaction pathways over and over, each time eliminating the least likely paths and fine-tuning the search strategy for the next round.

    Once everything was set up, Ulissi said, “It only took seconds or minutes to weed out the paths that were not interesting. In the end there were only about 10 reaction barriers that were important.” The new method, he said, has the potential to reduce the time needed to identify a reaction pathway from years to months.
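    The round-by-round weeding described above can be illustrated with a toy example. This is a sketch under stated assumptions, not the SUNCAT method: the path names and barrier values (in eV) are invented, and a simple "keep the better half each round" rule stands in for the machine learning model that ranked the paths in the study.

```python
def rate_limiting_barrier(path):
    """The highest step barrier along a path, i.e. its rate-limiting step."""
    return max(path["barriers"])

def prune(paths, keep=1):
    """Repeatedly discard the less favourable half of the paths until `keep` remain."""
    survivors = list(paths)
    rounds = 0
    while len(survivors) > keep:
        # Rank by rate-limiting barrier: lower means the path is more likely.
        survivors.sort(key=rate_limiting_barrier)
        cutoff = max(keep, len(survivors) // 2)
        survivors = survivors[:cutoff]
        rounds += 1
    return survivors, rounds

# Hypothetical candidate pathways with made-up step barriers in eV.
paths = [
    {"name": "A", "barriers": [0.4, 1.3, 0.6]},
    {"name": "B", "barriers": [0.7, 0.8, 0.5]},
    {"name": "C", "barriers": [0.2, 1.9]},
    {"name": "D", "barriers": [0.9, 1.0, 0.3]},
    {"name": "E", "barriers": [1.5, 0.1]},
]

best, rounds = prune(paths, keep=1)
print(best[0]["name"], "survives after", rounds, "rounds")
```

    Here path B survives because its worst step (0.8 eV) is lower than the worst step of any other path, even though other paths contain easier individual steps.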

    Andrew Medford, a former SUNCAT graduate student who is now an assistant professor at the Georgia Institute of Technology, also contributed to this research, which was funded by the DOE Office of Science.

    See the full article here .

    Please help promote STEM in your local schools.

    Stem Education Coalition

    SLAC Campus
    SLAC is a multi-program laboratory exploring frontier questions in photon science, astrophysics, particle physics and accelerator research. Located in Menlo Park, California, SLAC is operated by Stanford University for the DOE’s Office of Science.

  • richardmitnick 5:48 am on October 16, 2015
    Tags: Machine learning, Random Forest technique

    From CAASTRO: “Classifying serendipitous X-ray sources with Machine Learning” 

    CAASTRO ARC Centre of Excellence for All Sky Astrophysics

    The instruments of modern-day observational astronomy have been steadily moving towards bigger telescopes and deeper surveys. A number of facilities have recently been (or will soon be) commissioned to survey the sky in unprecedented detail: at radio wavelengths, the upcoming Square Kilometre Array (SKA) telescope and its operational Australian precursors, the Murchison Widefield Array (MWA) and Australian Square Kilometre Array Pathfinder (ASKAP); in the visible bands, the Large Synoptic Survey Telescope (LSST) and SkyMapper; at higher energies, the soon-to-be-launched Spectrum Roentgen Gamma (SRG) space telescope. These facilities represent a dramatic increase in the amount of data collected that will be immensely challenging to process and utilise in real time.

    SKA Square Kilometre Array

    SKA Murchison Widefield Array (MWA)

    CSIRO Australian Square Kilometre Array Pathfinder (ASKAP)

    LSST, housing telescope, camera

    ANU SkyMapper telescope

    Spectrum Roentgen Gamma (SRG) space telescope

    Novel methods to quickly and accurately identify astrophysical sources and to flag objects of particular rarity are needed to meet this challenge. In the 2014 publication by former CAASTRO PhD student Kitty Lo and colleagues (see press release), the Random Forest supervised ensemble machine learning algorithm was applied to classify the variable X-ray sources in the second XMM-Newton Serendipitous Source catalogue (2XMM). Building on this work, CAASTRO Affiliate Dr Sean Farrell (University of Sydney) led the team that applied the same method to the 3XMM catalogue, the largest X-ray source catalogue ever produced (representing a 40% increase over 2XMM with 372,728 unique sources of which 3,696 are flagged as variable). The variable X-ray sources were classified into six distinct categories of object: Active Galactic Nuclei (AGN), Cataclysmic Variables (CVs), Gamma Ray Bursts (GRBs), stars, Ultraluminous X-ray Sources (ULXs) and X-ray Binaries (XRBs), with a classification accuracy of ~92%. The Random Forest algorithm was also applied for the first time to data quality control and was used to identify spurious detections with an accuracy of ~95%. Quality control is one of the areas in astronomy surveys that is most demanding of human inspection, making this result particularly significant.
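    To give a feel for how an ensemble classifier of this kind votes, here is a much-simplified sketch. It is not the authors' pipeline and not a full Random Forest: each "tree" is a single-split decision stump trained on a bootstrap sample with one randomly chosen feature. The two features, their values, and the two class labels are invented toy data.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature):
    """Best single threshold on one feature; each side predicts its majority label."""
    values = sorted(set(r[feature] for r in rows))
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2                      # midpoint between neighbouring values
        left = [y for r, y in zip(rows, labels) if r[feature] <= t]
        right = [y for r, y in zip(rows, labels) if r[feature] > t]
        l_lab, r_lab = majority(left), majority(right)
        errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:                           # only one distinct value: constant prediction
        lab = majority(labels)
        return (feature, 0.0, lab, lab)
    return (feature, best[1], best[2], best[3])

def fit_forest(rows, labels, n_trees=15, seed=0):
    """Train stumps on bootstrap samples, each using one randomly chosen feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # sample with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Majority vote over all stumps."""
    votes = [l if row[f] <= t else r for f, t, l, r in forest]
    return majority(votes)

# Invented toy data: (feature 0, feature 1) per source, with hypothetical labels.
rows = [(0.1, 1.0), (0.2, 1.4), (0.3, 1.2), (0.15, 1.1), (0.25, 1.3), (0.2, 1.0),
        (0.8, 3.0), (0.9, 3.4), (0.7, 3.2), (0.85, 3.1), (0.75, 3.3), (0.9, 3.0)]
labels = ["star"] * 6 + ["AGN"] * 6

forest = fit_forest(rows, labels)
print(predict(forest, (0.2, 1.2)), predict(forest, (0.8, 3.2)))
```

    A production analysis would more likely use a tested library implementation with full-depth trees and held-out data for estimating accuracy; the point here is only the bootstrap-plus-vote structure.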


    In addition to classifying the entire variable source component of 3XMM, a number of exotic outlier sources were discovered that may be representative of entirely new classes of objects. Three particularly interesting objects were identified: a new candidate supergiant fast X-ray transient (SFXT), an X-ray pulsar with a 400-second period, and an eclipsing binary system with a 5-hour orbital period coincident with a known Cepheid variable star. All these objects are very rare and could provide unique insight into the most extreme physical processes known, highlighting the effectiveness of the Random Forest technique. In the era of large surveys, machine learning appears to be rapidly becoming an invaluable tool for the modern-day astronomer.

    Publication details:
    S. Farrell, T. Murphy and K. Lo (The Astrophysical Journal 2015): Autoclassification of the Variable 3XMM Sources Using the Random Forest Machine Learning Algorithm

    See the full article here .

    Please help promote STEM in your local schools.

    Stem Education Coalition

    Astronomy is entering a golden age, in which we seek to understand the complete evolution of the Universe and its constituents. But the key unsolved questions in astronomy demand entirely new approaches that require enormous data sets covering the entire sky.

    In the last few years, Australia has invested more than $400 million both in innovative wide-field telescopes and in the powerful computers needed to process the resulting torrents of data. Using these new tools, Australia now has the chance to establish itself at the vanguard of the upcoming information revolution centred on all-sky astrophysics.

    CAASTRO has assembled the world-class team who will now lead the flagship scientific experiments on these new wide-field facilities. We will deliver transformational new science by bringing together unique expertise in radio astronomy, optical astronomy, theoretical astrophysics and computation and by coupling all these capabilities to the powerful technology in which Australia has recently invested.


    The University of Sydney
    The University of Western Australia
    The University of Melbourne
    Swinburne University of Technology
    The Australian National University
    Curtin University
    University of Queensland
