
Conquering the clutter: Data Civilizer can sift through heaps of information

Big data is a big deal. With these huge data sets, analysts can gain unprecedented insight into the hidden patterns of fields like physics, healthcare, and finance. Collecting and analyzing this data has become a relatively easy part of the process. Aggregating and organizing it all has proven to be more difficult.

“An oft-cited statistic is that data scientists spend 80 percent of their time finding, preparing, integrating, and cleaning data sets,” Dong Deng, a postdoctoral associate at the Massachusetts Institute of Technology’s Computer Science and Artificial Intelligence Laboratory (CSAIL), told Digital Trends. “The remaining 20 percent is spent doing the desired analytic tasks.” Deng suspects that 80 percent may even be a lowball estimate, citing Mark Schreiber, a data officer from Merck, who claimed his data scientists spend 98 percent of their time on “grunt work.”


To minimize this grunt work and help conquer the clutter of big data, Deng created a system called Data Civilizer along with a team of researchers from CSAIL, the Technical University of Berlin, Nanyang Technological University, the University of Waterloo, and the Qatar Computing Research Institute.

To tame data, the system requires the information be arranged in tables. From there, the system analyzes every column in each table to create a statistical summary of the individual columns, such as the range of values or most frequently occurring words. It then compares each column summary to find similar ranges or sets of words and develops a map to represent the connections.
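
For readers who want a feel for the idea, here is a rough sketch of column profiling and linking in Python. It is not Data Civilizer's actual code: the function names, the sample tables, the overlap and Jaccard scoring, and the 0.5 threshold are all assumptions made for illustration. It summarizes each column as either a numeric range or a set of its most frequent words, compares summaries across tables, and records the pairs that look related.

```python
from collections import Counter
from itertools import combinations


def summarize_column(values, top_k=5):
    """Summarize one column as a numeric range or a set of frequent words."""
    if not values:
        return {"kind": "empty"}
    if all(isinstance(v, (int, float)) for v in values):
        return {"kind": "numeric", "min": min(values), "max": max(values)}
    counts = Counter(str(v).lower() for v in values)
    return {"kind": "text", "top": {word for word, _ in counts.most_common(top_k)}}


def similarity(a, b):
    """Score two summaries from 0.0 to 1.0 (illustrative scoring, an assumption)."""
    if a["kind"] != b["kind"] or a["kind"] == "empty":
        return 0.0
    if a["kind"] == "numeric":
        # Shared part of the two ranges divided by their combined span.
        overlap = min(a["max"], b["max"]) - max(a["min"], b["min"])
        span = max(a["max"], b["max"]) - min(a["min"], b["min"])
        if span == 0:
            return 1.0  # both columns hold the same single value
        return max(overlap, 0) / span
    # Text columns: Jaccard similarity of the frequent-word sets.
    union = a["top"] | b["top"]
    return len(a["top"] & b["top"]) / len(union)


def build_link_map(tables, threshold=0.5):
    """Return (column, column, score) links between columns of different tables."""
    summaries = {
        (table, column): summarize_column(values)
        for table, columns in tables.items()
        for column, values in columns.items()
    }
    links = []
    for (key_a, sum_a), (key_b, sum_b) in combinations(summaries.items(), 2):
        if key_a[0] == key_b[0]:
            continue  # only link columns that live in different tables
        score = similarity(sum_a, sum_b)
        if score >= threshold:
            links.append((key_a, key_b, round(score, 2)))
    return links


# Hypothetical sample data, invented for this sketch.
tables = {
    "patients": {
        "patient_id": [101, 102, 103, 104],
        "city": ["Boston", "Berlin", "Doha", "Boston"],
    },
    "visits": {
        "pid": [102, 103, 104, 105],
        "clinic_city": ["Berlin", "Doha", "Boston", "Waterloo"],
    },
}

links = build_link_map(tables)
for left, right, score in links:
    print(left, "<->", right, score)
```

In this toy data, the two ID columns are linked because their ranges overlap, and the two city columns are linked because they share most of their words. A real system would likely need richer summaries and a faster way to compare columns than checking every pair.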

“Data Civilizer helps users discover interesting data, stitch together relevant data from multiple sources, clean the desired data, and output it to the recipient,” Deng said.

Deng and his team are working to make Data Civilizer into a more scalable module, which means refining the system to include more automated functions. “Data cleaning cannot be a manual process,” he said, “because that will not scale. Hence, we are investigating semi-supervised algorithms for more scalable data cleaning.”

The team is also planning a more approachable user interface that can be used easily by non-programmers. They expect their system to be available sometime in 2017.

Dyllan Furness
Former Digital Trends Contributor