Tagged: Big Data

  • richardmitnick 7:37 am on May 22, 2019 Permalink | Reply
    Tags: Big Data, Cancer Cell Line Encyclopedia, Database of Genotypes and Phenotypes, Gene Expression Omnibus, MSU’s Global Impact Initiative, Organoids, Scientists are using a lot of genomic data to identify medical issues sooner in patients but also using it to assist their scientific counterparts in researching diseases better., The Cancer Genome Atlas

    From Michigan State University: “Big data helps identify better way to research breast cancer’s spread” 


    From Michigan State University

    May 15, 2019
    Sarina Gleason
    Media Communications office
    (517) 355-9742
    sarina.gleason@cabs.msu.edu

    Bin Chen
    College of Human Medicine office
    616-234-2819
    chenbi12@msu.edu

    Scientists are using vast amounts of genomic data to identify medical issues in patients sooner, and they’re also using it to help fellow researchers study diseases more effectively.

    In a new study, Michigan State University researchers are analyzing large volumes of data, often referred to as big data, to identify better research models for fighting the spread of breast cancer and for testing potential drugs. Current laboratory models frequently rely on cell lines, cells cultured on flat dishes, to mimic tumor growth in patients.


    The study is published in Nature Communications.

    This spreading, or metastasis, is the most common cause of cancer-related death, with around 90% of patients not surviving it. To date, few drugs can treat cancer metastasis, and knowing which step in the drug discovery process could go wrong can be a shot in the dark.

    “The differences between cell lines and tumor samples have raised the critical question to what extent cell lines can capture the makeup of tumors,” said Bin Chen, senior author and assistant professor in the College of Human Medicine.

    To answer this question, Chen and Ke Liu, first author of the study and a postdoctoral scholar, performed an integrative analysis of data taken from genomic databases including The Cancer Genome Atlas, Cancer Cell Line Encyclopedia, Gene Expression Omnibus and the database of Genotypes and Phenotypes.

    “Leveraging open genomic data to discover new cancer therapies is our ultimate goal,” said Chen, who is part of MSU’s Global Impact Initiative. “But before we begin to pour a significant amount of money into expensive experiments, we need to evaluate early research models and choose the appropriate one for drug testing based on genomic features.”

    By using this data, the researchers found substantial differences between lab-created breast cancer cell lines and actual advanced, or metastatic, breast cancer tumor samples. Surprisingly, MDA-MB-231, a cancer cell line used in nearly all metastatic breast cancer research, showed little genomic similarities to patient tumor samples.

    “I couldn’t believe the result,” Chen said. “All evidence pointed to large differences between the two. But, on the flip side, we were able to identify other cell lines that closely resembled the tumors and could be considered, along with other criteria, as better options for this research.”

    The organoid model was found to mirror patient samples most closely. This newly developed technology uses 3D tissue cultures and can capture more of the complexities of how tumors form and grow.

    “Studies have shown that organoids can preserve the structural and genetic makeup of the original tumor,” Chen said. “We found at the gene expression level, it was able to do this, more so than cancer cell lines.”
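
    The kind of gene-expression comparison described here can be illustrated with a minimal sketch. This is not the authors' pipeline, and the file names are hypothetical: it simply loads two expression tables (genes as rows, samples as columns) and ranks each lab model by its median Spearman correlation against the metastatic tumor samples.

        import pandas as pd
        from scipy.stats import spearmanr

        # Hypothetical inputs: rows are genes, columns are samples.
        models = pd.read_csv("model_expression.csv", index_col=0)   # cell lines / organoids
        tumors = pd.read_csv("tumor_expression.csv", index_col=0)   # metastatic tumor samples

        # Compare only the genes the two tables share.
        shared = models.index.intersection(tumors.index)
        models, tumors = models.loc[shared], tumors.loc[shared]

        # Score each model by its median correlation with the tumor samples.
        scores = {
            m: pd.Series(
                [spearmanr(models[m], tumors[t]).correlation for t in tumors.columns]
            ).median()
            for m in models.columns
        }
        print(pd.Series(scores).sort_values(ascending=False).head())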

    However, Chen and Liu added that both the organoids and cell lines couldn’t adequately model the immediate molecular landscape surrounding a tumor found at different sites in the body.

    They said knowing all these factors will help scientists interpret results, especially unexpected ones, and urge the scientific community to develop more sophisticated research models.

    “Our study demonstrates the power of leveraging open data to gain insights on cancer,” Chen said. “Any advances we can make in early research will help us facilitate the discovery of better therapies for people with breast cancer down the road.”

    See the full article here.



    Please help promote STEM in your local schools.

    Stem Education Coalition

    Michigan State Campus

    Michigan State University (MSU) is a public research university located in East Lansing, Michigan, United States. MSU was founded in 1855 and became the nation’s first land-grant institution under the Morrill Act of 1862, serving as a model for future land-grant universities.

    MSU pioneered the studies of packaging, hospitality business, plant biology, supply chain management, and telecommunication. U.S. News & World Report ranks several MSU graduate programs in the nation’s top 10, including industrial and organizational psychology, osteopathic medicine, and veterinary medicine, and identifies its graduate programs in elementary education, secondary education, and nuclear physics as the best in the country. MSU has been labeled one of the “Public Ivies,” a publicly funded university considered as providing a quality of education comparable to those of the Ivy League.

    Following the introduction of the Morrill Act, the college became coeducational and expanded its curriculum beyond agriculture. Today, MSU is the seventh-largest university in the United States (in terms of enrollment), with over 49,000 students and 2,950 faculty members. There are approximately 532,000 living MSU alumni worldwide.

     
  • richardmitnick 12:05 pm on March 1, 2017 Permalink | Reply
    Tags: A Mind—And an Ear—For Big Data, Big Data, Data Expeditions

    From Duke: “A Mind—And an Ear—For Big Data” 


    Duke University

    February 23, 2017
    Ken Kingery

    At Duke, engineering doctoral student Chris Tralie discovered a passion for analyzing the topology of music—and for teaching undergraduates about the power of data science.

    Chris Tralie with advisors John Harer (Math) and Guillermo Sapiro (ECE)

    Chris Tralie wasn’t even working with big data when he came to Duke as a graduate student. But a movement gaining steam here in 2013 helped him realize he had the technical skillset to reveal structures and patterns where others saw chaos—or nothing.

    “There were people working on Big Data problems in various departments when I first got to campus,” said Tralie, a doctoral candidate in electrical & computer engineering (ECE) and a National Science Foundation Graduate Research Fellow. “Then the Information Initiative at Duke launched. It was brilliant because it brought everyone together and let them learn from each other’s work. There was real and sudden excitement in the air.”

    Tralie found his niche while learning about topology with John Harer, a professor of mathematics with a secondary appointment in ECE. The class boiled down to understanding the “shape” of data. Tralie thought, “Why can’t we do this with music?”

    Tralie designed a program that analyzes many different musical parameters of a song and mathematically reduces each time point into 3D space. The resulting shape can help determine which genre of music a song belongs to and can even recognize covers of songs by other bands.
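
    As a rough illustration of what it means to map each time point of a song into 3D space, here is a minimal sketch that computes standard MFCC audio features and reduces them with PCA. It is a simplified stand-in for Tralie’s actual geometric and topological pipeline, and the file name is hypothetical.

        import librosa
        from sklearn.decomposition import PCA

        # Load a song (hypothetical file) and compute per-frame audio features.
        y, sr = librosa.load("song.mp3")
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape: (20 features, n_frames)

        # Project every time frame onto a point in 3D; the resulting point cloud
        # traces the "shape" of the song as it moves through feature space.
        points_3d = PCA(n_components=3).fit_transform(mfcc.T)
        print(points_3d.shape)   # (n_frames, 3)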

    “Nobody thought you could do that, because of the differences in vocals and instruments,” said Tralie.

    Tralie took his own academic journey and used it to turn other Duke students on to big data—creating a “Data Expedition” using his method for visualizing songs as a fun and approachable way to teach undergraduates how to design data-crunching algorithms.

    Data Expeditions are projects proposed and taught by graduate students within the context of an existing undergraduate course. “Data Expeditions and Data+ both benefit our undergraduates by making technical subjects more relevant and exciting, but they’re also professional development opportunities for our graduate students,” said Robert Calderbank, director of iiD, which sponsors both programs. “Industry and academia both need people who can lead projects and manage multidisciplinary teams, so these experiences can provide a competitive advantage for Duke graduates.”

    “The Data Expeditions were really useful for me growing as a mentor,” said Tralie. “I got to work with really talented students who were still learning the basics and yet had amazing new ideas that I could learn from too. Those skills will translate to my future career, where I hope to be a faculty member advising graduate students of my own someday in engineering or applied math.”

    He also developed a new course for graduate students about using data analytics on video recognition challenges, like tracking heartbeats from video clips. Tralie’s own promising work in that arena can potentially add another element to an app developed by another of his advisors, Guillermo Sapiro, the Edmund T. Pratt, Jr. School Professor of Electrical and Computer Engineering, to recognize signs of autism.

    After defending his dissertation this spring, Tralie plans to stay in academia, at least in part because he loves the teaching experiences he has had while at Duke.

    “Mentoring and teaching forces me to explain my work in simple terms, which raises my own understanding of it,” said Tralie. “Plus the students all end up going out and doing their own interesting things, which they can later teach me about in return. They’re like my eyes and ears out there in the fast developing world of Big Data.”

    See the full article here.

    Please help promote STEM in your local schools.

    STEM Icon

    Stem Education Coalition
    Duke Campus

    Younger than most other prestigious U.S. research universities, Duke University consistently ranks among the very best. Duke’s graduate and professional schools — in business, divinity, engineering, the environment, law, medicine, nursing and public policy — are among the leaders in their fields. Duke’s home campus is situated on nearly 9,000 acres in Durham, N.C., a city of more than 200,000 people. Duke also is active internationally through the Duke-NUS Graduate Medical School in Singapore, Duke Kunshan University in China and numerous research and education programs across the globe. More than 75 percent of Duke students pursue service-learning opportunities in Durham and around the world through DukeEngage and other programs that advance the university’s mission of “knowledge in service to society.”

     
  • richardmitnick 4:23 am on July 26, 2016 Permalink | Reply
    Tags: "Structural causal model", Big Data

    From UCLA: “Solving big data’s ‘fusion’ problem” 


    UCLA

    July 22, 2016
    Matthew Chin

    As the field of “big data” has emerged as a tool for solving all sorts of scientific and societal questions, one of the main challenges that remains is whether, and how, multiple sets of data from various sources could be combined to determine cause-and-effect relationships in new and untested situations. Now, computer scientists from UCLA and Purdue University have devised a theoretical solution to that problem.

    Their research, which was published this month in the Proceedings of the National Academy of Sciences, could help improve scientists’ ability to understand health care, economics, the environment and other areas of study, and to glean much more pertinent insight from data.

    The study’s authors are Judea Pearl, a distinguished professor of computer science at the UCLA Henry Samueli School of Engineering and Applied Science, and Elias Bareinboim, an assistant professor of computer science at Purdue University who earned his doctorate at UCLA.

    Big data involves using mountains and mountains of information to uncover trends and patterns. But when multiple sets of big data are combined, particularly when they come from studies of diverse environments or are collected under different sets of conditions, problems can arise because certain aspects of the data won’t match up. (The challenge, Pearl explained, is like putting together a jigsaw puzzle using pieces that were produced by different manufacturers.)

    Bareinboim and Pearl discovered how to estimate the effect of one variable, X, on another, Y, when data come from disparate sources that differ in another variable, Z. Pictured: Judea Pearl and Elias Bareinboim

    For example, researchers might be interested in combining data about people’s health habits from several unrelated studies — say, a survey of Texas residents; an experiment involving young adults in Kenya; and research focusing on the homeless in the Northeast U.S. If the researchers wanted to use the combined data to answer a specific question — for example, “How does soft drink consumption affect obesity rates in Los Angeles?” — a common approach today would be to use statistical techniques that average out differences among the various sets of information.

    The new study claims that these statistical methods blur distinctions in the data, rather than exploiting them for more insightful analyses.

    “It’s like testing apples and oranges to guess the properties of bananas,” said Pearl, a pioneer in the field of artificial intelligence and a recipient of the Turing Award, the highest honor in computing. “How can someone apply insights from multiple sets of data, to figure out cause-and-effect relationships in a completely new situation?”

    To address this, Bareinboim and Pearl developed a mathematical tool called a structural causal model, which essentially decides how information from one source should be combined with data from other sources. This enables researchers to establish properties of yet another source — for example, the population of another state. Structural causal models diagram similarities and differences between the sources and process them using a new mathematical tool called causal calculus.
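
    A minimal numerical sketch of the flavor of this idea, under the strong simplifying assumption that a single variable Z accounts for all differences between the source and target populations: the experimentally measured effect of X on Y in the source is reweighted by the target’s distribution of Z. The published framework handles far more general situations; all numbers here are made up.

        # P(Y=1 | do(X=1), Z=z) estimated from an experiment in the source population.
        p_y_do_x_given_z = {"young": 0.30, "old": 0.55}

        # P*(Z=z) observed in the target population (say, another state).
        p_z_target = {"young": 0.70, "old": 0.30}

        # Transported effect: P*(Y=1 | do(X=1)) = sum_z P(Y=1 | do(X=1), z) * P*(z),
        # valid only when Z captures every relevant source/target difference.
        effect_in_target = sum(p_y_do_x_given_z[z] * p_z_target[z] for z in p_z_target)
        print(round(effect_in_target, 3))   # 0.375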

    The analysis also had another important result — deciding whether the findings from a given study can be generalized to apply to other situations, a century-old problem called external validity.

    For example, medical researchers might conduct a clinical trial involving a distinct group of people, say, college students. The method devised by Bareinboim and Pearl will allow them to predict what would happen if the treatment they were testing were given to an intended population of people in the real world.

    “A problem that every scientist in every field faces is having observations from surveys, laboratory experiments, randomized trials, field studies and more, but not knowing whether we can learn from those observations about cause-and-effect relationships in the real world,” Pearl said. “With structural causal models, they can ask first if it’s possible, and then, if that’s true, how.”

    See the full article here.

    Please help promote STEM in your local schools.

    STEM Icon

    Stem Education Coalition

    UCLA Campus

    For nearly 100 years, UCLA has been a pioneer, persevering through impossibility, turning the futile into the attainable.

    We doubt the critics, reject the status quo and see opportunity in dissatisfaction. Our campus, faculty and students are driven by optimism. It is not naïve; it is essential. And it has fueled every accomplishment, allowing us to redefine what’s possible, time after time.

    This can-do perspective has brought us 12 Nobel Prizes, 12 Rhodes Scholarships, more NCAA titles than any university and more Olympic medals than most nations. Our faculty and alumni helped create the Internet and pioneered reverse osmosis. And more than 100 companies have been created based on technology developed at UCLA.

     
  • richardmitnick 11:37 am on April 23, 2016 Permalink | Reply
    Tags: Big Data

    From Nautilus: “How Big Data Creates False Confidence” 

    Nautilus

    Apr 23, 2016
    Jesse Dunietz

    If I claimed that Americans have gotten more self-centered lately, you might just chalk me up as a curmudgeon, prone to good-ol’-days whining. But what if I said I could back that claim up by analyzing 150 billion words of text? A few decades ago, evidence on such a scale was a pipe dream. Today, though, 150 billion data points is practically passé. A feverish push for “big data” analysis has swept through biology, linguistics, finance, and every field in between.

    Although no one can quite agree how to define it, the general idea is to find datasets so enormous that they can reveal patterns invisible to conventional inquiry. The data are often generated by millions of real-world user actions, such as tweets or credit-card purchases, and they can take thousands of computers to collect, store, and analyze. To many companies and researchers, though, the investment is worth it because the patterns can unlock information about anything from genetic disorders to tomorrow’s stock prices.

    But there’s a problem: It’s tempting to think that with such an incredible volume of data behind them, studies relying on big data couldn’t be wrong. But the bigness of the data can imbue the results with a false sense of certainty. Many of them are probably bogus—and the reasons why should give us pause about any research that blindly trusts big data.

    In the case of language and culture, big data showed up in a big way in 2011, when Google released its Ngrams tool. Announced with fanfare in the journal Science, Google Ngrams allowed users to search for short phrases in Google’s database of scanned books—about 4 percent of all books ever published!—and see how the frequency of those phrases has shifted over time. The paper’s authors heralded the advent of “culturomics,” the study of culture based on reams of data and, since then, Google Ngrams has been, well, largely an endless source of entertainment—but also a goldmine for linguists, psychologists, and sociologists. They’ve scoured its millions of books to show that, for instance, yes, Americans are becoming more individualistic; that we’re “forgetting our past faster with each passing year”; and that moral ideals are disappearing from our cultural consciousness.

    We’re Losing Hope: An Ngrams chart for the word “hope,” one of many intriguing plots found by xkcd author Randall Munroe. If Ngrams really does reflect our culture, we may be headed for a dark place. No image credit

    The problems start with the way the Ngrams corpus was constructed. In a study published last October, three University of Vermont researchers pointed out that, in general, Google Books includes one copy of every book. This makes perfect sense for its original purpose: to expose the contents of those books to Google’s powerful search technology. From the angle of sociological research, though, it makes the corpus dangerously skewed.

    Some books, for example, end up punching below their true cultural weight: The Lord of the Rings gets no more influence than, say, Witchcraft Persecutions in Bavaria. Conversely, some authors become larger than life. From the data on English fiction, for example, you might conclude that for 20 years in the 1900s, every character and his brother was named Lanny. In fact, the data reflect how immensely prolific (but not necessarily popular) the author Upton Sinclair was: He churned out 11 novels about one Lanny Budd.

    Still more damning is the fact that Ngrams isn’t a consistent, well-balanced slice of what was being published. The same UVM study demonstrated that, among other changes in composition, there’s a marked increase in scientific articles starting in the 1960s. All this makes it hard to trust that Google Ngrams accurately reflects the shifts over time in words’ cultural popularity.

    Go Figure: “Figure” with a capital F, used mainly in captions, rose sharply in frequency through the 20th Century, suggesting that the corpus includes more technical literature over time. That may say something about society, but not much about how most of society uses words.

    Even once you get past the data sources, there’s still the thorny issue of interpretation. Sure, words like “character” and “dignity” might decline over the decades. But does that mean that people care about morality less? Not so fast, cautions Ted Underwood, an English professor at the University of Illinois, Urbana-Champaign. Conceptions of morality at the turn of the last century likely differed sharply from ours, he argues, and “dignity” might have been popular for non-moral reasons. So any conclusions we draw by projecting current associations backward are suspect.

    Of course, none of this is news to statisticians and linguists. Data and interpretation are their bread and butter. What’s different about Google Ngrams, though, is the temptation to let the sheer volume of data blind us to the ways we can be misled.

    This temptation isn’t unique to Ngrams studies; similar errors undermine all sorts of big data projects. Consider, for instance, the case of Google Flu Trends (GFT). Released in 2008, GFT would count words like “fever” and “cough” in millions of Google search queries, using them to “nowcast” how many people had the flu. With those estimates, public health officials could act two weeks before the Centers for Disease Control could calculate the true numbers from doctors’ reports.
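
    To make “nowcasting” concrete, here is a minimal sketch, not GFT’s actual model: fit a simple linear regression of past CDC flu rates on the weekly share of flu-related search queries, then apply it to the current week’s queries. All numbers are invented.

        import numpy as np

        # Hypothetical history: weekly fraction of searches containing flu terms,
        # and the CDC's later-published rate of influenza-like illness (percent).
        query_frac = np.array([0.010, 0.014, 0.022, 0.030, 0.026])
        cdc_ili    = np.array([1.1,   1.6,   2.5,   3.4,   2.9])

        # Fit a straight line mapping query frequency to the CDC rate.
        slope, intercept = np.polyfit(query_frac, cdc_ili, 1)

        # "Nowcast" this week's flu level from search activity alone,
        # roughly two weeks before the official number arrives.
        this_week = 0.034
        print(round(slope * this_week + intercept, 2))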

    Initially, GFT was claimed to be 97 percent accurate. But as a study out of Northeastern University documents, that accuracy was a fluke. First, GFT completely missed the “swine flu” pandemic in the spring and summer of 2009. (It turned out that GFT was largely predicting winter.) Then, the system began to overestimate flu cases. In fact, it overshot the peak 2013 numbers by a whopping 140 percent. Eventually, Google just retired the program altogether.

    So what went wrong? As with Ngrams, people didn’t carefully consider the sources and interpretation of their data. The data source, Google searches, was not a static beast. When Google started auto-completing queries, users started just accepting the suggested keywords, distorting the searches GFT saw. On the interpretation side, GFT’s engineers initially let GFT take the data at face value; almost any search term was treated as a potential flu indicator. With millions of search terms, GFT was practically guaranteed to over-interpret seasonal words like “snow” as evidence of flu.

    But when big data isn’t seen as a panacea, it can be transformative. Several groups, like Columbia University researcher Jeffrey Shaman’s, for example, have outperformed the flu predictions of both the CDC and GFT by using the former to compensate for the skew of the latter. “Shaman’s team tested their model against actual flu activity that had already occurred during the season,” according to the CDC. By taking the immediate past into consideration, Shaman and his team fine-tuned their mathematical model to better predict the future. All it takes is for teams to critically assess their assumptions about their data.

    Lest I sound like a Google-hater, I hasten to add that the company is far from the only culprit. My wife, an economist, used to work for a company that scraped the entire Internet for job postings and aggregated them into statistics for state labor agencies. The company’s managers boasted that they analyzed 80 percent of the jobs in the country, but once again, the quantity of data blinded them to the ways it could be misread. A local Walmart, for example, might post one sales associate job when it actually wants to fill ten, or it might leave a posting up for weeks after it was filled.

    So rather than succumb to “big data hubris,” the rest of us would do well to keep our skeptic hats on—even when someone points to billions of words.

    See the full article here.

    Please help promote STEM in your local schools.

    STEM Icon

    Stem Education Coalition

    Welcome to Nautilus. We are delighted you joined us. We are here to tell you about science and its endless connections to our lives. Each month we choose a single topic. And each Thursday we publish a new chapter on that topic online. Each issue combines the sciences, culture and philosophy into a single story told by the world’s leading thinkers and writers. We follow the story wherever it leads us. Read our essays, investigative reports, and blogs. Fiction, too. Take in our games, videos, and graphic stories. Stop in for a minute, or an hour. Nautilus lets science spill over its usual borders. We are science, connected.

     
  • richardmitnick 7:57 am on March 17, 2016 Permalink | Reply
    Tags: Big Data

    From UNSW: “Size doesn’t matter” 


    University of New South Wales

    Size doesn’t matter in Big Data, it’s what you ask of it that counts

    17 Mar 2016
    Malte Ebach

    OPINION: Big Data is changing the way we do science today. Traditionally, data was collected manually by scientists making measurements, using microscopes or surveys. This data could be analysed by hand or using simple statistical software on a PC.

    Big Data has changed all that. These days, tremendous volumes of information are being generated and collected through new technologies, be they large telescope arrays, DNA sequencers or Facebook.

    Big Data UNSW

    Cray Titan Supercomputer at ORNL

    The data is vast, but the kinds of data and the formats they take are also new. Consider the hourly clicks on Facebook, or the daily searches on Google. As a result, Big Data offers scientists the ability to perform powerful analyses and make new discoveries.

    The problem is that Big Data hasn’t yet changed the way many researchers ask scientific questions. In biology in particular, where tools like genome sequencing are generating tremendous amounts of data, biologists might not be asking the right kinds of questions that Big Data can answer.

    Questions

    Asking questions is what scientists do. Biologists ask questions about the living world, such as “how many species are there?” or “what are the evolutionary relationships between rats, bats and primates?”.

    The way we ask questions says a lot about the type of information we use. For example, systematists like myself study the diversity and relationship between the many species of creatures throughout evolutionary history.

    We have tended to use physical characteristics, like teeth and bones, to classify mammals into taxonomic groups. These shared characteristics allow us to recognise new species and identify existing ones.

    Enter Big Data, and cheap DNA sequencing technology. Now systematists have access to new forms of information, such as whole genomes, which have drastically changed the way we do systematics. But it hasn’t changed the way many systematists frame their questions.

    Biologists are expecting big things from Big Data, but they are finding out that it initially delivers only so much. Rather than find out what these limitations are and how they can shape our questions, many biologists have responded by gathering more and more data. Put simply: scientists have been lured by size.

    Size matters

    Quantity is often seen as a benchmark of success. The more you have, the better your study will be.

    This thinking stems from the idealistic view of complete datasets with unbiased sampling. Statisticians call this “n = all”, which represents a data set that contains all the information.

    If all the data was available, then scientists wouldn’t have the problem of missing or corrupted data. A real world example would be a complete genome sequence.

    Having all the data would tell us everything, right? Not exactly.

    From 2004 to 2006, J. Craig Venter led an expedition to sample genomes in sea water from the North Atlantic. He concluded he had found 1,800 species.

    Not so fast. He did, in fact, find thousands of unique genomes, but to determine whether they are new species will require Venter and his team to compare and diagnose each organism, as well as name them.

    So, in answer to the question: “how many species are there in this bucket of water?”, Big Data gave the answer of 1.045 billion base pairs. But 1.045 billion base pairs could mean any number of species.

    Size doesn’t matter, it is what we ask of our data that counts.

    Wrong questions

    Asking impossible questions has been the bane of Big Data across many fields of research. For example, Google Flu Trends, an initiative launched by Google to predict flu epidemics weeks before the Centers for Disease Control and Prevention (CDC), made the mistake of asking a traditionally framed question: “when will the next flu epidemic hit North America?”.

    The data analysed were non-traditional, namely the number and frequency of Google search terms. When compared to CDC data, it was discovered that Google Flu Trends missed the 2009 epidemic and over-predicted flu trends by more than double between 2012 and 2013.

    In 2013, Google Flu Trends was abandoned as being unable to answer the questions we were asking of it. Some statisticians blamed sampling bias, others blamed the lack of transparency regarding the Google search terms. Another reason could simply be that the question asked was inappropriate given the non-traditional data collected.

    Big Data is being misunderstood, and this is limiting our ability to find meaningful answers to our questions. Big Data is not a replacement for traditional methods and questions. Rather, it is a supplement.

    Biologists also need to adjust the questions aimed at Big Data. Unlike traditional data, Big Data cannot give a precise answer to a traditionally framed question.

    Instead, Big Data sends the scientist down a path toward bigger and bigger discoveries. Used together, big and traditional data can enable biologists to better navigate their way down the path of discovery.

    If Venter actually took the next step and examined those sea creatures, we could make a historic discovery. If Google Flu Trends had asked “what do the frequency and number of Google search terms tell us?”, then we might have made an even bigger discovery.

    As we incorporate Big Data into the existing scientific line of enquiry, we also need to accommodate appropriate questions. Until then, biologists are stuck with impossible answers to the wrong questions.

    See the full article here.

    Please help promote STEM in your local schools.

    STEM Icon

    Stem Education Coalition

    UNSW Campus

    Welcome to UNSW Australia (The University of New South Wales), one of Australia’s leading research and teaching universities. At UNSW, we take pride in the broad range and high quality of our teaching programs. Our teaching gains strength and currency from our research activities, strong industry links and our international nature; UNSW has a strong regional and global engagement.

    In developing new ideas and promoting lasting knowledge we are creating an academic environment where outstanding students and scholars from around the world can be inspired to excel in their programs of study and research. Partnerships with both local and global communities allow UNSW to share knowledge, debate and research outcomes. UNSW’s public events include concert performances, open days and public forums on issues such as the environment, healthcare and global politics. We encourage you to explore the UNSW website so you can find out more about what we do.

     
  • richardmitnick 6:55 pm on December 17, 2015 Permalink | Reply
    Tags: Big Data

    From UC Berkeley: “Seeing Through the Big Data Fog” 

    UC Berkeley

    December 14, 2015
    Wallace Ravven

    A neuroscientist studies how stress affects the brain’s ability to form new memories. Across the campus, another researcher looks for telltale signs of distant planets in a sliver of sky. What each of them seeks may lie hidden in an avalanche of data.

    Joe Hellerstein and his students developed a new programming model for distributed computing which MIT Technology Review named one of the 10 technologies “most likely to change our world”.

    The same is true in industry, where data must be diced, sliced and analyzed to identify changes in customer behavior or the promise of new fabrication techniques.

    Getting data into a shape that will yield to analysis regularly runs into a bottleneck — a human bottleneck, says Berkeley computer science professor Joe Hellerstein.

    In 2011, Sean Kandel, a grad student working with Hellerstein and Stanford computer scientist Jeffrey Heer, interviewed three dozen analysts at 25 companies in different industries to ask them how they spent their time and what their “pain points” were, as Hellerstein puts it.

    “It became very clear that the task of wrangling data takes up the lion’s share of their time,” Hellerstein says. “People come at data differently. They name data differently, or it may be incomplete. You have to sort this out. You find oddball data, and you don’t know if it was input incorrectly or if it’s a meaningful outlier. All this precedes analysis. It’s very tedious.”
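
    A minimal pandas sketch of the kind of drudgery described in that quote, just an illustration with made-up column names rather than Data Wrangler or Trifacta: harmonize names, stack the sources, fill gaps, and flag oddball values for a human to inspect.

        import pandas as pd

        # Two hypothetical exports of "the same" data, named and formatted differently.
        a = pd.DataFrame({"cust_id": [1, 2, 3], "Revenue($)": [100, None, 250]})
        b = pd.DataFrame({"customer": [4, 5], "revenue_usd": [90, 9_999_999]})

        # Harmonize column names so the sources can be stacked.
        a = a.rename(columns={"cust_id": "customer", "Revenue($)": "revenue_usd"})
        df = pd.concat([a, b], ignore_index=True)

        # Handle incomplete records and flag possible outliers for human review.
        df["revenue_usd"] = df["revenue_usd"].fillna(df["revenue_usd"].median())
        df["suspect"] = df["revenue_usd"] > df["revenue_usd"].quantile(0.99)
        print(df)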

    Hellerstein, Heer and Kandel devised a software program to refine and speed the process. They called it, reasonably enough, Data Wrangler, and made it freely available online. Data Wrangler became the core of Trifacta, a startup they founded in 2012.

    Trifacta provides a platform to efficiently convert raw data into more structured formats for analysis. Its flagship product for data wrangling enables data analysts to easily transform data from messy traces of the real world into structured tables and charts that can reveal unsuspected patterns, or suggest new directions for analysis.

    Trifacta was quickly adopted by dozens of companies, from LinkedIn to Lockheed Martin, and typically provides a major productivity gain.

    “What used to take weeks suddenly takes minutes”, Hellerstein says. “So you can experiment a great deal more with the data. This was far and away the most useful piece of research that I have been involved in.”

    In 2014, CRN, a high-profile communications technology magazine, placed Trifacta on its short list of The 10 Coolest Big Data Products.

    Joe Hellerstein and his postdoc Eugene Wu worked on designing a high-level language for crafting interactive visualizations of data. Wu is now a professor at Columbia University. Photo: Peg Skorpinski

    GoPro, the company that makes wearable video recorders, was an early Trifacta client. On YouTube, GoPro videos run the gamut from a skydiver’s death-defying leap to Kama, the surfing pig. (He prefers three- to four-foot waves.)

    After sales of its recorders took off, GoPro moved into developing media software and other online services for customers. The company was soon inundated with coveted consumer data from devices, retail sales, social media and other sources.

    GoPro built a data science team, which brought in Trifacta to clean up the data and present it in an intuitive and accessible format, so the less techy business people could use it to tailor services to customers and offer new products.

    Hellerstein’s research also targets software developers who build Big Data systems — systems that may harness hundreds or thousands of computers to do their work. These “distributed computing” platforms, which also form the foundation of Cloud Computing, create major new hurdles for software engineering.

    Code for a single computer is an ordered list of instructions, and most programming languages were designed for simple, orderly computing on a single machine.

    With a distributed system, Hellerstein says, “If you force order, the machines spend all their time coordinating, and progress is limited by the slowest machine. Working around this with a traditional programming language is incredibly hard, and typically leads to all kinds of tricky bugs and design flaws.”

    With his students, he launched the BOOM (Berkeley Orders of Magnitude) project to develop a new programming model for distributed computers that helps programmers avoid specifying the steps of a computation in a particular order. Instead, it focuses on the information that the program must manage, and the way that information flows through machines and tasks.

    “The main result of the BOOM project is a ‘disorderly’ programming language called Bloom, which has enabled us to write complex distributed programs in simple, intuitive ways — with tens or hundreds of times less code than traditional languages,” Hellerstein says.
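
    The sketch below is not Bloom code; it just illustrates the underlying idea in Python. When a program only ever accumulates facts into a set (a monotone operation), the machines can process messages in any order and still converge on the same answer, so no global ordering has to be coordinated.

        import random

        # Hypothetical link facts arriving from different machines, in no fixed order.
        messages = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]

        def reachable(msgs):
            """Accumulate reachability facts until a fixpoint is reached."""
            facts = set(msgs)
            while True:
                derived = {(x, z) for (x, y1) in facts for (y2, z) in facts if y1 == y2}
                if derived <= facts:      # nothing new: the computation has converged
                    return facts
                facts |= derived

        # Because set union is monotone, every arrival order gives the same result.
        shuffled = messages[:]
        random.shuffle(shuffled)
        assert reachable(shuffled) == reachable(messages)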

    In 2010, Bloom was recognized by MIT Technology Review as one of the 10 technologies “most likely to change our world.”

    Hellerstein has since used it in his courses on “Programming the Cloud” at Berkeley. It has been adopted by a number of research groups and forms the basis of a startup company in the Bay Area called Eve that Hellerstein advises.

    As he describes his work to ease data wrangling and speed cloud programming, Hellerstein turns a small metal hammer in his hands. “It’s true,” he says. “I do like to tinker.” Of course, he does way more than tinker. He’s developing better tools for the trade.

    See the full article here.

    Please help promote STEM in your local schools.

    STEM Icon

    Stem Education Coalition

    Founded in the wake of the gold rush by leaders of the newly established 31st state, the University of California’s flagship campus at Berkeley has become one of the preeminent universities in the world. Its early guiding lights, charged with providing education (both “practical” and “classical”) for the state’s people, gradually established a distinguished faculty (with 22 Nobel laureates to date), a stellar research library, and more than 350 academic programs.

    UC Berkeley Seal

     
  • richardmitnick 2:48 pm on November 11, 2015 Permalink | Reply
    Tags: Big Data

    From Science Node: “Building the US big data machine” 

    Science Node

    11.11.15
    Lance Farrell

    The US National Science Foundation (NSF) just put $5 million (€4.68 million) into big data research, establishing four regional centers to advance innovation and spur collaboration across domains.

    The Big Data Regional Innovation Hubs cover all 50 states and include commitments from more than 250 organizations — from universities and cities to foundations and Fortune 500 corporations — with the ability to expand further over time. Courtesy NSF.

    The US National Science Foundation (NSF) has long been integral to the development of the infrastructure, tools, and training required for gleaning insights from large data sets. With a recent $5 million (€4.68 million) investment, they are creating four big data hubs so scientists can investigate research topics with an everyday impact in their region.

    The four hubs are located at Columbia University (Northeast); Georgia Institute of Technology and the University of North Carolina (South); the University of Illinois at Urbana-Champaign (Midwest); and the University of California, San Diego, the University of California, Berkeley, and the University of Washington (West). Alaskan and Hawaiian researchers will work through the West hub, and US territories are welcome to participate in any regional hub.

    “By establishing partnerships among likeminded stakeholders,” says Jim Kurose, NSF’s assistant director for Computer and Information Science and Engineering, “BD Regional Innovation Hubs represent a unique approach to improving the impact of data science.”

    Each of the hubs chose research foci mirroring regional strengths and challenges. For instance, the Midwest hub is situated near one of the largest freshwater reservoirs in the world. The Midwest is also home to Mayo Clinic, world-renowned leader in healthcare, and Eli Lilly, one of the world’s largest pharmaceutical companies. And, as the third most populous city in the US, Chicago employs many smart city management concepts.

    For these reasons, the Midwest hub will focus on:

    Society (e.g., smart cities and communities; network science; business analytics)
    Natural and built world (e.g., water, food, and energy; digital agriculture; transportation; and advanced manufacturing)
    Healthcare and biomedical research

    “Big data will help us determine how much water to use for raising food, how much for drinking, and how much to leave untouched,” said Edward Seidel, principal investigator of the Midwest hub and director of the National Center for Supercomputing Applications. “It will help us decide how to allocate resources based on current soil, crop, and climate decisions — ultimately, how to make the smartest decisions possible for the benefit of the people who live here.”

    The other hubs will have similar foci:

    Northeast hub: Big data in health, energy, finance, cities/regions, discovery science, and data science in education (connecting research will include data sharing, privacy and security, ethics and policy, and education)

    Southern hub: Healthcare, coastal hazards, industrial big data, materials and manufacturing, and habitat planning

    Western hub: Big data technologies, managing natural resources and hazards, precision medicine, metro data science, and data-enabled scientific discovery and learning

    See the full article here.

    Please help promote STEM in your local schools.
    STEM Icon

    Stem Education Coalition

    Science Node is an international weekly online publication that covers distributed computing and the research it enables.

    “We report on all aspects of distributed computing technology, such as grids and clouds. We also regularly feature articles on distributed computing-enabled research in a large variety of disciplines, including physics, biology, sociology, earth sciences, archaeology, medicine, disaster management, crime, and art. (Note that we do not cover stories that are purely about commercial technology.)

    In its current incarnation, Science Node is also an online destination where you can host a profile and blog, and find and disseminate announcements and information about events, deadlines, and jobs. In the near future it will also be a place where you can network with colleagues.

    You can read Science Node via our homepage, RSS, or email. For the complete iSGTW experience, sign up for an account or log in with OpenID and manage your email subscription from your account preferences. If you do not wish to access the website’s features, you can just subscribe to the weekly email.”

     
  • richardmitnick 9:22 am on August 13, 2015 Permalink | Reply
    Tags: Big Data, Discrimination

    From The Conversation: “Big data algorithms can discriminate, and it’s not clear what to do about it” 

    The Conversation

    August 13, 2015
    Jeremy Kun

    “This program had absolutely nothing to do with race…but multi-variable equations.”

    That’s what Brett Goldstein, a former policeman for the Chicago Police Department (CPD) and current Urban Science Fellow at the University of Chicago’s School for Public Policy, said about a predictive policing algorithm he deployed at the CPD in 2010. His algorithm tells police where to look for criminals based on where people have been arrested previously. It’s a “heat map” of Chicago, and the CPD claims it helps them allocate resources more effectively.

    Chicago police also recently collaborated with Miles Wernick, a professor of electrical engineering at Illinois Institute of Technology, to algorithmically generate a “heat list” of 400 individuals it claims have the highest chance of committing a violent crime. In response to criticism, Wernick said the algorithm does not use “any racial, neighborhood, or other such information” and that the approach is “unbiased” and “quantitative.” By deferring decisions to poorly understood algorithms, industry professionals effectively shed accountability for any negative effects of their code.

    But do these algorithms discriminate, treating low-income and black neighborhoods and their inhabitants unfairly? It’s the kind of question many researchers are starting to ask as more and more industries use algorithms to make decisions. It’s true that an algorithm itself is quantitative – it boils down to a sequence of arithmetic steps for solving a problem. The danger is that these algorithms, which are trained on data produced by people, may reflect the biases in that data, perpetuating structural racism and negative biases about minority groups.

    There are a lot of challenges to figuring out whether an algorithm embodies bias. First and foremost, many practitioners and “computer experts” still don’t publicly admit that algorithms can easily discriminate. More and more evidence supports that not only is this possible, but it’s happening already. The law is unclear on the legality of biased algorithms, and even algorithms researchers don’t precisely understand what it means for an algorithm to discriminate.

    Is bias baked in? Justin Ruckman, CC BY

    Being quantitative doesn’t protect against bias

    Both Goldstein and Wernick claim their algorithms are fair by appealing to two things. First, the algorithms aren’t explicitly fed protected characteristics such as race or neighborhood as an attribute. Second, they say the algorithms aren’t biased because they’re “quantitative.” Their argument is an appeal to abstraction. Math isn’t human, and so the use of math can’t be immoral.

    Sadly, Goldstein and Wernick are repeating a common misconception about data mining, and mathematics in general, when it’s applied to social problems. The entire purpose of data mining is to discover hidden correlations. So if race is disproportionately (but not explicitly) represented in the data fed to a data-mining algorithm, the algorithm can infer race and use race indirectly to make an ultimate decision.
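
    A toy sketch of that indirect route, with synthetic data and made-up feature names: the model below never sees the protected attribute, yet its scores track it closely because a correlated proxy (here, "neighborhood") stands in for it.

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        n = 10_000

        protected = rng.integers(0, 2, n)                        # never given to the model
        neighborhood = (protected + (rng.random(n) < 0.1)) % 2   # proxy, ~90% aligned
        outcome = ((neighborhood == 1) & (rng.random(n) < 0.7)).astype(int)  # biased history

        # Train only on the "neutral" feature.
        model = LogisticRegression().fit(neighborhood.reshape(-1, 1), outcome)
        scores = model.predict_proba(neighborhood.reshape(-1, 1))[:, 1]

        # Average score by protected group: a gap appears even without the attribute.
        print(scores[protected == 1].mean(), scores[protected == 0].mean())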

    Here’s a simple example of the way algorithms can result in a biased outcome based on what they learn from the people who use them. Look at how Google search suggests finishing a query that starts with the phrase “transgenders are”:

    Taken from Google.com on 2015-08-10.

    Autocomplete features are generally a tally. Count up all the searches you’ve seen and display the most common completions of a given partial query. While most algorithms might be neutral on their face, they’re designed to find trends in the data they’re fed. Carelessly trusting an algorithm allows dominant trends to cause harmful discrimination or at least produce distasteful results.
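
    A tally-based completer fits in a few lines; this is only an illustration of the general idea, not Google’s system, and the logged queries are invented.

        from collections import Counter

        # Hypothetical log of past user queries.
        log = [
            "scientists are smart",
            "scientists are underpaid",
            "scientists are underpaid",
            "big data is overhyped",
        ]

        def suggest(prefix, queries, k=3):
            """Return the k most common past completions of a partial query."""
            tally = Counter(q for q in queries if q.startswith(prefix))
            return [q for q, _ in tally.most_common(k)]

        # Whatever dominates the log dominates the suggestions: bias in, bias out.
        print(suggest("scientists are", log))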

    Beyond biased data, such as Google autocompletes, there are other pitfalls, too. Moritz Hardt, a researcher at Google, describes what he calls the sample size disparity. The idea is as follows. If you want to predict, say, whether an individual will click on an ad, most algorithms optimize to reduce error based on the previous activity of users.

    But if a small fraction of users consists of a racial minority that tends to behave in a different way from the majority, the algorithm may decide it’s better to be wrong for all the minority users and lump them in the “error” category in order to be more accurate on the majority. So an algorithm with 85% accuracy on US participants could err on the entire black sub-population and still seem very good.
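
    A toy calculation with made-up numbers shows how that can happen: a model that is right for every majority-group user and wrong for every minority-group user still reports a high overall accuracy.

        # Hypothetical user base: 85% majority group, 15% minority group.
        n_majority, n_minority = 8_500, 1_500

        # Right on every majority user, wrong on every minority user.
        correct = n_majority * 1.00 + n_minority * 0.00

        accuracy = correct / (n_majority + n_minority)
        print(f"overall accuracy: {accuracy:.0%}")   # 85%, yet 0% for the minority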

    Hardt continues to say it’s hard to determine why data points are erroneously classified. Algorithms rarely come equipped with an explanation for why they behave the way they do, and the easy (and dangerous) course of action is not to ask questions.

    Those smiles might not be so broad if they realized they’d be treated differently by the algorithm. Men image via http://www.shutterstock.com

    Extent of the problem

    While researchers clearly understand the theoretical dangers of algorithmic discrimination, it’s difficult to cleanly measure the scope of the issue in practice. No company or public institution is willing to publicize its data and algorithms for fear of being labeled racist or sexist, or maybe worse, having a great algorithm stolen by a competitor.

    Even when the Chicago Police Department was hit with a Freedom of Information Act request, they did not release their algorithms or heat list, claiming a credible threat to police officers and the people on the list. This makes it difficult for researchers to identify problems and potentially provide solutions.

    Legal hurdles

    Existing discrimination law in the United States isn’t helping. At best, it’s unclear on how it applies to algorithms; at worst, it’s a mess. Solon Barocas, a postdoc at Princeton, and Andrew Selbst, a law clerk for the Third Circuit US Court of Appeals, argued together that US hiring law fails to address claims about discriminatory algorithms in hiring.

    The crux of the argument is called the “business necessity” defense, in which the employer argues that a practice that has a discriminatory effect is justified by being directly related to job performance. According to Barocas and Selbst, if a company algorithmically decides whom to hire, and that algorithm is blatantly racist but even mildly successful at predicting job performance, this would count as business necessity – and not as illegal discrimination. In other words, the law seems to support using biased algorithms.

    What is fairness?

    Maybe an even deeper problem is that nobody has agreed on what it means for an algorithm to be fair in the first place. Algorithms are mathematical objects, and mathematics is far more precise than law. We can’t hope to design fair algorithms without the ability to precisely demonstrate fairness mathematically. A good mathematical definition of fairness will model biased decision-making in any setting and for any subgroup, not just hiring bias or gender bias.

    And fairness seems to have two conflicting aspects when applied to a population versus an individual. For example, say there’s a pool of applicants to fill 10 jobs, and an algorithm decides to hire candidates completely at random. From a population-wide perspective, this is as fair as possible: all races, genders and orientations are equally likely to be selected.

    But from an individual level, it’s as unfair as possible, because an extremely talented individual is unlikely to be chosen despite their qualifications. On the other hand, hiring based only on qualifications reinforces hiring gaps. Nobody knows if these two concepts are inherently at odds, or whether there is a way to define fairness that reasonably captures both. Cynthia Dwork, a Distinguished Scientist at Microsoft Research, and her colleagues have been studying the relationship between the two, but even Dwork admits they have just scratched the surface.


    See the full article here.

    Please help promote STEM in your local schools.

    STEM Icon

    Stem Education Coalition

    The Conversation US launched as a pilot project in October 2014. It is an independent source of news and views from the academic and research community, delivered direct to the public.
    Our team of professional editors work with university and research institute experts to unlock their knowledge for use by the wider public.
    Access to independent, high quality, authenticated, explanatory journalism underpins a functioning democracy. Our aim is to promote better understanding of current affairs and complex issues. And hopefully allow for a better quality of public discourse and conversation.

     