Rackspace’s dispute is with an IP Nav unit called Parallel Iron, which says it has three patents that cover the open-source Hadoop Distributed File System (HDFS). But remarkably, Rackspace didn’t even know that at first; IP Nav contacted Rackspace and told the company it infringed some patents, while refusing even to reveal the patents’ numbers or owners unless Rackspace signed a “forbearance agreement” promising not to sue first. (Companies threatened by patent trolls can sometimes file a “declaratory judgment” lawsuit, which can help them win a more favorable venue.)
They’re also launching a new distributed database called Plazma, which offers significant improvements over the Hadoop Distributed File System (HDFS): it is more efficient and can compile and parse data at a much faster rate.
Chronos has a number of advantages over regular cron. It allows you to schedule jobs using ISO 8601 repeating-interval notation, which enables more flexibility in job scheduling. Chronos also supports jobs triggered by the completion of other jobs, with arbitrarily long dependency chains.
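As a rough illustration of that repeating-interval notation, here is a minimal Python sketch of a parser for schedules like R5/2013-01-01T00:00:00Z/PT1H (five repetitions, starting January 1, 2013, one hour apart). This is not Chronos’s actual parser, and it handles only a small subset of ISO 8601 durations:

```python
import re
from datetime import datetime, timedelta

def parse_repeating_interval(spec):
    """Parse an ISO 8601 repeating interval, e.g. 'R5/2013-01-01T00:00:00Z/PT1H'.

    Returns (repetitions, start, period); repetitions is None for the
    unbounded form 'R/...'. Only day/hour/minute durations are handled
    in this sketch.
    """
    reps_part, start_part, period_part = spec.split("/")
    reps = None if reps_part == "R" else int(reps_part[1:])
    start = datetime.strptime(start_part, "%Y-%m-%dT%H:%M:%SZ")
    m = re.fullmatch(r"P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?)?", period_part)
    days, hours, minutes = (int(g) if g else 0 for g in m.groups())
    return reps, start, timedelta(days=days, hours=hours, minutes=minutes)

# Five runs, starting 2013-01-01 00:00 UTC, one hour apart.
reps, start, period = parse_repeating_interval("R5/2013-01-01T00:00:00Z/PT1H")
```

The unbounded form (a bare “R” before the first slash) is what an hourly or daily recurring job would use.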
In a complex processing pipeline every step increases the chance of failure. Until December last year, we were relying on a single instance with cron to kick off our hourly, daily and weekly ETL jobs. Cron is a really great tool but we wanted a system that allowed retries, was lightweight and provided an easy-to-use interface giving analysts quick insights into which jobs failed and which ones succeeded.
Intel says the distribution is optimized for the Intel Xeon processor platform. In its announcement, the company states that analyzing one terabyte of data, which previously took more than four hours to fully process, can now be done in seven minutes.
Hadoop Corona is the next version of Map-Reduce. The current Map-Reduce has a single Job Tracker, which reached its limits at Facebook. The Job Tracker manages the cluster’s resources and tracks the state of each job. In Hadoop Corona, cluster resources are tracked by a central Cluster Manager, and each job gets its own Corona Job Tracker that tracks just that one job. The design provides some key improvements:
Because Hadoop uses MapReduce to perform data queries, searches have to be done in batches. So while you can perform highly detailed analysis of historical data, one area you would not want to use Hadoop for is transactional data. Transactional data is, by its very nature, highly complex and fluid: a single transaction on an ecommerce site can generate many steps, all of which have to be executed quickly.
Nor would it be efficient to use Hadoop to process structured data sets that require very low latency, such as a Web site served up by a MySQL database in a typical LAMP stack. That’s a speed requirement Hadoop would serve poorly.
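To make the batch-oriented nature concrete, here is a toy, single-process Python sketch of MapReduce’s map → shuffle/sort → reduce flow, using the classic word-count example. The function names are ours; in a real Hadoop job these phases run as separate tasks spread across many machines, which is why results arrive in batches rather than interactively:

```python
from itertools import groupby
from operator import itemgetter

# Toy single-process simulation of Hadoop's batch flow:
# map -> shuffle/sort -> reduce.

def mapper(line):
    # Emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Hadoop sorts mapper output by key and groups the values per key.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reducer(key, values):
    # Sum the per-word counts emitted by all mappers.
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle(mapped))
```

Every query, however small, pays the cost of running all three phases over the whole data set, which is exactly why Hadoop suits historical analysis rather than low-latency transactional work.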
Expanding supported query languages will be one area of focus for the Drill project. Another will be adding support for additional formats, such as JSON; right now Dremel supports only Google’s Protocol Buffers format.
Big data has become the latest front for the patent troll epidemic as a shell company is suing firms for using a common open-source storage framework known as the Hadoop Distributed File System (HDFS).
Hadoop has been built by a large network of contributors, including individual developers and large companies like Yahoo, and is an Apache Software Foundation project. HDFS, its storage component, was based on the Google File System. Parallel Iron’s patent complaints, however, say the whole system was made possible by four men:
And yet, even as Facebook has embedded itself into modern life, it hasn’t actually done that much with what it knows about us. Now that the company has gone public, the pressure to develop new sources of profit (see “The Facebook Fallacy”) is likely to force it to do more with its hoard of information. That stash of data looms like an oversize shadow over what today is a modest online advertising business, worrying privacy-conscious Web users (see “Few Privacy Regulations Inhibit Facebook”) and rivals such as Google. Everyone has a feeling that this unprecedented resource will yield something big, but nobody knows quite what.
In a kind of passing of the technological baton, Facebook built its data storage system by expanding the power of open-source software called Hadoop, which was inspired by work at Google and built at Yahoo. Hadoop can tame seemingly impossible computational tasks—like working on all the data Facebook’s users have entrusted to it—by spreading them across many machines inside a data center. But Hadoop wasn’t built with data science in mind, and using it for that purpose requires specialized, unwieldy programming. Facebook’s engineers solved that problem with the invention of Hive, open-source software that’s now independent of Facebook and used by many other companies. Hive acts as a translation service, making it possible to query vast Hadoop data stores using relatively simple code. To cut down on computational demands, it can request random samples of an entire data set, a feature that’s invaluable for companies swamped by data.
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
via Welcome to Hive!.
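For illustration, a HiveQL query of the kind described above reads much like ordinary SQL, and the sampling mentioned earlier is a one-clause addition. The table and column names below are hypothetical, and percentage-based block sampling via TABLESAMPLE is available in recent Hive versions:

```sql
-- Count views per country over roughly 1 percent of the data.
-- 'page_views' and its columns are hypothetical example names.
SELECT country, COUNT(*) AS views
FROM page_views TABLESAMPLE (1 PERCENT)
GROUP BY country;
```

Behind the scenes, Hive compiles a query like this into one or more MapReduce jobs, which is what spares analysts the “specialized, unwieldy programming” the passage above describes.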
Hortonworks has unveiled Hortonworks Data Platform (HDP) 1.0, an open-source platform built on Apache Hadoop 1.0 that includes data-management, monitoring, metadata and data-integration features.
For example, the platform’s provisioning interface surveys nodes in the target cluster and recommends optimal software configurations, with the subsequent ability to start the cluster via a single click. The monitoring interface offers a streamlined ability to see the health of the cluster in depth. The data integration services allow users to connect with data services and build transformation logic via graphical interfaces, sparing them from having to write code.