Yahoo just released a ton of data in the name of academia. In what is purported to be the largest cache of Internet data ever granted to researchers, the company is giving universities access to the online behavior of some 20 million anonymous users, including their clicks, hovers, and scrolls across a myriad of Yahoo’s pages. The sheer volume of information, Yahoo says, should allow scientists to further their work on machine learning and deep learning.
“Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research,” the Internet giant said in a blog post about the recent release. “The dataset is available as part of the Yahoo Labs Webscope data-sharing program, which is a reference library of scientifically useful datasets comprising anonymized user data for noncommercial use.”
The decision comes as Yahoo, two decades into its existence, faces an alarmingly static period, even as chief competitors such as Google and various social media companies make huge strides across different fields within the tech industry. So in an effort to innovate, Yahoo is investing heavily in artificial intelligence and allowing researchers to see exactly how people behave when they’re on the Internet.
Although all the data is anonymized, users might be alarmed by how much Yahoo is actually telling these institutions (and only these institutions). “In addition to the interaction data,” Yahoo says, “we are providing categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. On the item side, we are releasing the title, summary, and key-phrases of the pertinent news article.” Further, the company says the dataset includes “the relevant local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.”
The release comes as a huge boon to researchers, who often don’t have enough data to fully realize their projects. “Data is not easy to come by for folks not inside companies,” said Gert Lanckriet, a professor in the Department of Electrical and Computer Engineering at the University of California, San Diego, at an event announcing the data release.
“We hope that this data release will similarly inspire our fellow researchers, data scientists, and machine learning enthusiasts in academia, and help validate their models on an extensive, ‘real-world’ dataset,” Yahoo concluded. “We strongly believe that this dataset can become the benchmark for large-scale machine learning and recommender systems, and we look forward to hearing from the community about their applications of our data.”