Big data is a big deal. With these huge data sets, analysts can gain unprecedented insight into the hidden patterns of fields like physics, healthcare, and finance. Collecting and analyzing this data has become a relatively easy part of the process. Aggregating and organizing it all has proven to be more difficult.
“An oft-cited statistic is that data scientists spend 80 percent of their time finding, preparing, integrating, and cleaning data sets,” Dong Deng, a postdoctoral associate at the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory (CSAIL), told Digital Trends. “The remaining 20 percent is spent doing the desired analytic tasks.” Deng suspects that 80 percent may even be a low-ball estimate, citing Mark Schreiber, a data officer at Merck, who claimed his data scientists spend 98 percent of their time on “grunt work.”
To minimize this grunt work and help conquer the clutter of big data, Deng created a system called Data Civilizer along with a team of researchers from CSAIL, the Technical University of Berlin, Nanyang Technological University, the University of Waterloo, and the Qatar Computing Research Institute.
To tame data, the system requires that the information be arranged in tables. From there, it analyzes every column in each table and creates a statistical summary of each one, such as the range of values or the most frequently occurring words. It then compares the column summaries with one another to find similar ranges or sets of words, and develops a map to represent the connections between columns.
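The idea of summarizing columns and linking similar ones can be illustrated with a small Python sketch. This is not Data Civilizer's actual code: the function names (`summarize`, `similarity`, `build_map`), the sample tables, the 0.5 threshold, and the choice of range overlap and Jaccard similarity as comparison measures are all assumptions made for illustration.

```python
from collections import Counter


def summarize(values, top_k=3):
    """Summarize one column: its numeric range, or its most frequent words."""
    if values and all(isinstance(v, (int, float)) for v in values):
        return {"kind": "numeric", "range": (min(values), max(values))}
    words = Counter(str(v).lower() for v in values)
    return {"kind": "text", "top": {w for w, _ in words.most_common(top_k)}}


def similarity(a, b):
    """Score two column summaries between 0 and 1 (assumed measures)."""
    if a["kind"] != b["kind"]:
        return 0.0
    if a["kind"] == "numeric":
        (lo1, hi1), (lo2, hi2) = a["range"], b["range"]
        overlap = min(hi1, hi2) - max(lo1, lo2)
        span = max(hi1, hi2) - min(lo1, lo2)
        if span == 0:
            return 1.0  # both ranges are the same single point
        return max(overlap, 0) / span
    # Text columns: Jaccard similarity of the most frequent words.
    union = a["top"] | b["top"]
    return len(a["top"] & b["top"]) / len(union) if union else 0.0


def build_map(tables, threshold=0.5):
    """Link columns from different tables whose summaries look alike."""
    summaries = {
        (table, column): summarize(values)
        for table, columns in tables.items()
        for column, values in columns.items()
    }
    keys = list(summaries)
    links = []
    for i, first in enumerate(keys):
        for second in keys[i + 1:]:
            if first[0] == second[0]:
                continue  # only connect columns across tables
            score = similarity(summaries[first], summaries[second])
            if score >= threshold:
                links.append((first, second, score))
    return links


# Hypothetical sample data, invented for this example.
tables = {
    "patients": {"age": [34, 51, 29], "city": ["Boston", "Berlin", "Boston"]},
    "visits": {"patient_age": [30, 48, 51], "location": ["Berlin", "Boston", "Doha"]},
}

links = build_map(tables)
for first, second, score in links:
    print(first, "<->", second, round(score, 2))
```

On this sample data, the sketch links the two age columns (their ranges nearly coincide) and the two place-name columns (they share frequent words), which is the kind of connection the map described above is meant to surface.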
“Data Civilizer helps users discover interesting data, stitch together relevant data from multiple sources, clean the desired data, and output it to the recipient,” Deng said.
Deng and his team are working to make Data Civilizer more scalable, which means refining the system to include more automated functions. “Data cleaning cannot be a manual process,” he said, “because that will not scale. Hence, we are investigating semi-supervised algorithms for more scalable data cleaning.”
The team is also planning a more approachable user interface that can be used easily by non-programmers. They expect their system to be available sometime in 2017.