Yahoo releases massive research dataset

The data release, part of the company’s Webscope initiative and announced on Yahoo’s Tumblr blog, is intended for researchers to use in validating recommender systems, high-scale learning algorithms, user-behaviour modelling, collaborative filtering techniques and unsupervised learning methods.

Source: Yahoo releases massive research dataset

From: Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers

Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community. The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, collected from the activity of about 20M users between February 2015 and May 2015.
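Since collaborative filtering is one of the intended uses, here is a minimal sketch of the kind of prep work such interaction data supports: stream an event log and count which news items tend to be clicked by the same users. The file name and the (user_id, item_id, timestamp) schema are my assumptions for illustration, not the actual Webscope format.

```python
# Sketch: turn a (hypothetical) tab-separated interaction log into item
# co-occurrence counts, the raw material for a simple item-to-item
# collaborative filter. The real Yahoo dataset is far larger and has its
# own schema; adjust the parsing accordingly.
import csv
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(path):
    """Count how often pairs of items are clicked by the same user."""
    items_by_user = defaultdict(set)
    with open(path, newline="") as f:
        for user_id, item_id, _timestamp in csv.reader(f, delimiter="\t"):
            items_by_user[user_id].add(item_id)

    pair_counts = defaultdict(int)
    for items in items_by_user.values():
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

if __name__ == "__main__":
    counts = cooccurrence_counts("events.tsv")  # assumed file name
    # Items most often co-clicked are candidate "people who read X also read Y" recommendations.
    for (a, b), n in sorted(counts.items(), key=lambda kv: -kv[1])[:10]:
        print(a, b, n)
```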

The Enron E-mails’ Immortal Life

This research has had widespread applications: computer scientists have used the corpus to train systems that automatically prioritize certain messages in an in-box and alert users that they may have forgotten about an important message. Other researchers use the Enron corpus to develop systems that automatically organize or summarize messages. Much of today’s software for fraud detection, counterterrorism operations, and mining workplace behavioral patterns over e-mail has been somehow touched by the data set.

via The Enron E-mails’ Immortal Life | MIT Technology Review.

Information Extraction and Synthesis Laboratory

Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity. It is central to knowledge base construction and also useful for joint inference with other NLP components. Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. We use a method for automatically gathering massive amounts of naturally occurring cross-document reference data to create the Wikilinks dataset, comprising 40 million mentions over 3 million entities. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.

via Wikilinks – Information Extraction and Synthesis Laboratory.
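The anchor-text trick is simple enough to sketch. Below is a rough, standard-library-only illustration of the core idea: scan HTML for links into Wikipedia and record each link's anchor text as a mention of the linked entity. The real Wikilinks pipeline runs over a full web crawl with additional filtering; this is just the gist.

```python
# Sketch of the Wikilinks anchor-text idea: every hyperlink into Wikipedia
# yields a (mention text, entity title) pair "for free".
from html.parser import HTMLParser
from urllib.parse import urlparse, unquote

class WikiAnchorExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._current_entity = None   # Wikipedia title of the currently open <a>, if any
        self._buffer = []
        self.mentions = []            # list of (anchor_text, wikipedia_title)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        parsed = urlparse(href)
        if parsed.netloc.endswith("wikipedia.org") and parsed.path.startswith("/wiki/"):
            self._current_entity = unquote(parsed.path[len("/wiki/"):])
            self._buffer = []

    def handle_data(self, data):
        if self._current_entity is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_entity is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.mentions.append((text, self._current_entity))
            self._current_entity = None

if __name__ == "__main__":
    page = '<p>The <a href="https://en.wikipedia.org/wiki/Apple_Inc.">maker of the iPhone</a> reported earnings.</p>'
    parser = WikiAnchorExtractor()
    parser.feed(page)
    print(parser.mentions)   # [('maker of the iPhone', 'Apple_Inc.')]
```

Note how the mention text ("maker of the iPhone") is a style of reference you rarely get from hand-labeled newswire, which is exactly the point the abstract makes.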

The City of Chicago is on Github

This means that projects like OpenStreetMap will be able to add over 2 GB of Chicago data to their site. This also means that companies and Chicago startups that would like to leverage this data can do so as part of daily business.

via The City of Chicago is on Github – The Changelog.

I downloaded the crime dataset, which supposedly includes all reported crimes since 2001. The CSV file was 1 GB of plain text. They could have compressed it, but it doesn't matter. It contained over 4 million records. Now I have to figure out how to slice and dice this dataset, though for what purpose I don't quite know yet.
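As a first pass, something like the chunked read below should keep memory under control while tallying reported crimes by year and type. The file name and the column names ("Year", "Primary Type") are what I expect from the Chicago data portal export, so treat them as assumptions and adjust to whatever the actual header says.

```python
# Sketch: summarize a ~1 GB crime CSV without loading it all into memory.
# Column names and file name are assumed from the Chicago crime export.
import pandas as pd

CSV_PATH = "Crimes_-_2001_to_present.csv"  # assumed export file name

counts = None
for chunk in pd.read_csv(CSV_PATH,
                         usecols=["Year", "Primary Type"],
                         chunksize=250_000):
    # Tally reported crimes by (year, primary type) within this chunk,
    # then fold the partial counts into the running total.
    chunk_counts = chunk.groupby(["Year", "Primary Type"]).size()
    counts = chunk_counts if counts is None else counts.add(chunk_counts, fill_value=0)

# The most frequently reported crime types, broken out by year.
print(counts.sort_values(ascending=False).head(20))
```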