06 Mar, 2017
Much of the internet hides like an iceberg below the surface.
This so-called ‘deep web’ is estimated to be 500 times bigger than the ‘surface web’ seen through search engines like Google. For scientists and others, the deep web holds important computer code and licensing agreements.
Nestled further inside the deep web, one finds the ‘dark web,’ a place where images and video are used by traders in illicit drugs, weapons, and human lives.
“Behind forms and logins, there are bad things,” says Chris Mattmann, chief architect in the instrument and science data systems section of the NASA Jet Propulsion Laboratory (JPL) at the California Institute of Technology.
“Behind the dynamic portions of the web, people are doing nefarious things, and on the dark web, they’re doing even more nefarious things. They traffic in guns and human organs. They’re doing these activities and then they’re tying them back to terrorism.”
In 2014, the Defense Advanced Research Projects Agency (DARPA) started a program called Memex to make the deep web accessible. “The goal of Memex was to provide search engines the retrieval capacity to deal with those situations and to help defense and law enforcement go after the bad guys on the deep web,” Mattmann says.
At the same time, the US National Science Foundation (NSF) invested $11.2 million in a first-of-its-kind data-intensive supercomputer – the Wrangler supercomputer, now housed at the Texas Advanced Computing Center (TACC). The NSF asked engineers and computer scientists at TACC, Indiana University, and the University of Chicago if a computer could be built to handle massive amounts of input and output.
Wrangler does just that, enabling the speedy file transfers needed to fly past big data bottlenecks that can slow down even the fastest computers. It was built to work in tandem with number crunchers such as TACC’s Stampede, which in 2013 was the sixth fastest computer in the world.
“Although we have a lot of search-based queries through different search engines like Google, it’s still a challenge to query the system in a way that answers your questions directly,” says Karanjeet Singh.
Singh is a University of Southern California graduate student who works with Chris Mattmann on Memex and other projects.
“The objective is to get more and more domain-specific information from the internet and to associate facts from that information.”
Once the Memex user extracts the information they need, they can apply tools such as named entity recognition, sentiment analysis, and topic summarization. This can help law enforcement agencies find links between different activities, such as illegal weapon sales and human trafficking.
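As a rough illustration of that kind of post-extraction pipeline, the sketch below chains a toy entity pass and a toy sentiment pass over a page of text. These are deliberately simple stand-ins, not the trained models Memex actually uses: real named entity recognition and sentiment analysis rely on statistical taggers and classifiers.

```python
import re

# Toy lexicon standing in for a trained sentiment model; illustration only.
NEGATIVE_TERMS = {"illegal", "trafficking", "weapon"}

def extract_entities(text):
    """Crude entity pass: runs of capitalized words.
    (Real NER uses a trained tagger, not a regex.)"""
    return re.findall(r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)

def sentiment_score(text):
    """Crude lexicon score: minus one per flagged term.
    (Real sentiment analysis uses trained classifiers.)"""
    words = (w.strip(".,") for w in text.lower().split())
    return -sum(w in NEGATIVE_TERMS for w in words)

page = "John Smith advertised an illegal weapon sale in Los Angeles."
print(extract_entities(page))  # ['John Smith', 'Los Angeles']
print(sentiment_score(page))   # -2
```

Chaining passes like these over millions of extracted pages is what lets analysts surface recurring names and flag pages worth a closer look.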
The problem is that even the fastest computers like Stampede weren’t designed to handle the input and output of the millions of files needed for the Memex project.
“Let’s say that we have one system directly in front of us, and there is some crime going on,” Singh says. “What the JPL is trying to do is automate a lot of domain-specific query processes into a system where you can just feed in the questions and receive the answers.”
For that, he works with an open source web crawler called Apache Nutch, which retrieves and collects web page and domain information from the deep web. The MapReduce framework powers those crawls with a divide-and-conquer approach to big data that breaks it into small pieces that run simultaneously.
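In the same spirit, the core of what a crawler gathers from each fetched page can be sketched with Python's standard HTML parser. This is a minimal illustration, not Nutch itself (which is a Java system): it extracts the outgoing links that a crawler would queue for the next fetch round.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects outgoing links from a page, as a crawler does
    before queuing them for the next round of fetches."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag encountered.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/forum">Forum</a> <a href="/login">Login</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/forum', '/login']
```

Pages behind forms and logins, like the `/login` link above, are exactly the parts of the deep web that ordinary crawlers stop at and that Memex is designed to reach.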
Wrangler avoids data overload by virtue of its 600 terabytes of speedy flash storage. What’s more, Wrangler supports the Hadoop framework, which runs using MapReduce.
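To give a flavor of the divide-and-conquer pattern that MapReduce applies across Wrangler's workers, here is a pure-Python emulation of a word count: the data is split into shards, each shard is mapped independently, and the grouped results are reduced into totals. This only sketches the pattern; it is not the Hadoop API.

```python
from collections import defaultdict

def map_phase(shard):
    """Map step: emit a (word, 1) pair for every word in one shard."""
    return [(word, 1) for word in shard.split()]

def reduce_phase(pairs):
    """Shuffle + reduce step: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Each shard stands in for a chunk of data handled by one worker.
shards = ["deep web deep", "web crawl"]
mapped = [pair for shard in shards for pair in map_phase(shard)]
print(reduce_phase(mapped))  # {'deep': 2, 'web': 2, 'crawl': 1}
```

In a real Hadoop deployment the map calls run in parallel on separate nodes, and Wrangler's fast flash storage keeps the shuffle between the two phases from becoming the bottleneck.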
Together, Wrangler and Memex constitute a powerful crime-fighting duo. NSF investment in advanced computation has placed powerful tools in the hands of public defense agencies, moving law enforcement beyond the limitations of commercial search engines.
“Wrangler is a fantastic tool that we didn’t have before as a mechanism to do research,” says Mattmann. “It has been an amazing resource that has allowed us to develop techniques that are helping save people, stop crime, and stop terrorism around the world.”
See the full article here.
Science Node is an international weekly online publication that covers distributed computing and the research it enables.
“We report on all aspects of distributed computing technology, such as grids and clouds. We also regularly feature articles on distributed computing-enabled research in a large variety of disciplines, including physics, biology, sociology, earth sciences, archaeology, medicine, disaster management, crime, and art. (Note that we do not cover stories that are purely about commercial technology.)
In its current incarnation, Science Node is also an online destination where you can host a profile and blog, and find and disseminate announcements and information about events, deadlines, and jobs. In the near future it will also be a place where you can network with colleagues.
You can read Science Node via our homepage, RSS, or email. For the complete iSGTW experience, sign up for an account or log in with OpenID and manage your email subscription from your account preferences. If you do not wish to access the website’s features, you can just subscribe to the weekly email.”