Google Research Publication: BigTable

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

via Google Research Publication: BigTable.
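The abstract above mentions a "simple data model" without spelling it out. The paper itself describes a table as a sparse, sorted map indexed by row key, column key, and timestamp, with uninterpreted bytes as values. Here is a rough in-memory sketch of that idea. `ToyTable` and its `put`, `get`, and `scan` methods are made-up names for illustration, not the Bigtable API, and the row and column keys are borrowed from the paper's web-table example.

```python
from collections import defaultdict


class ToyTable:
    """Toy stand-in for a (row, column, timestamp) -> bytes map. Hypothetical, not Bigtable's API."""

    def __init__(self):
        # row key -> column key -> {timestamp: value}
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, end_row):
        """Yield (row, column, newest value) for rows in [start_row, end_row), in row-key order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                for column in sorted(self._rows[row]):
                    yield row, column, self.get(row, column)


table = ToyTable()
table.put("com.cnn.www", "contents:", 3, b"<html>v3")
table.put("com.cnn.www", "contents:", 5, b"<html>v5")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")
table.put("org.example", "contents:", 1, b"<html>example")

latest = table.get("com.cnn.www", "contents:")
older = table.get("com.cnn.www", "contents:", timestamp=4)
com_rows = list(table.scan("com.", "com/"))
```

Keeping rows in key order is what makes the range scan cheap to express: every row starting with `com.` sits together, so one scan picks up a whole domain.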

Troll sues Facebook, Amazon and others for using Hadoop

Big data has become the latest front for the patent troll epidemic as a shell company is suing firms for using a common open-source storage framework known as the Hadoop Distributed File System (HDFS).

via Troll sues Facebook, Amazon and others for using Hadoop — Tech News and Analysis.

Hadoop has been built by a large network of contributors, including individual developers and large companies like Yahoo, and is an Apache Software Foundation project. HDFS, its storage component, was based on the Google File System. Parallel Iron’s patent complaints, however, say the whole system was made possible by four men:

Everything You Wanted to Know About Data Mining but Were Afraid to Ask

With data mining it is possible to let the data itself determine the groups. This is one of the black-box types of algorithm that are hard to understand. But in a simple example – again with purchasing behavior – we can imagine that the purchasing habits of different hobbyists would look quite different from each other: gardeners, fishermen and model airplane enthusiasts would all be quite distinct. Machine learning algorithms can detect all of the different subgroups within a dataset that differ significantly from each other.

via Everything You Wanted to Know About Data Mining but Were Afraid to Ask – Alexander Furnas – Technology – The Atlantic.
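To make "letting the data determine the groups" a bit less of a black box, here is a minimal k-means sketch. The article does not name an algorithm, so k-means is my choice for illustration. The spending figures are invented (monthly spend on gardening, fishing, and model-airplane supplies), and seeding the centroids with the first three customers is a simplifying assumption that happens to work on this tidy data.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; leave it alone if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Invented data: (garden, fishing, model-airplane) spend per customer, in no particular order.
customers = [
    (90, 5, 0), (5, 95, 0), (0, 5, 90),
    (10, 85, 5), (80, 10, 5), (10, 0, 80),
    (5, 10, 85), (85, 0, 10), (0, 90, 10),
]

# Assumption: use the first three customers as starting centroids.
centroids, clusters = kmeans(customers, customers[:3])
```

Nobody told the code which customer is a gardener. The three groups fall out of the distances between spending patterns, which is the point the excerpt is making. On messier data the result depends heavily on the starting centroids and on the number of groups you ask for.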

How Web giants store big—and we mean big—data

The Great Disk Drive in the Sky: How Web giants store big—and we mean big—data.

The need for this kind of perpetually scalable, durable storage has driven the giants of the Web (Google, Amazon, Facebook, Microsoft, and others) to adopt a different sort of storage solution: distributed file systems based on object-based storage. These systems were at least in part inspired by other distributed and clustered filesystems such as Red Hat’s Global File System and IBM’s General Parallel File System.

And one more blurb…

Google wanted to turn large numbers of cheap servers and hard drives into a reliable data store for hundreds of terabytes of data that could manage itself around failures and errors. And it needed to be designed for Google’s way of gathering and reading data, allowing multiple applications to append data to the system simultaneously in large volumes and to access it at high speeds.