Over the past six months, vendors have responded to the demand for more corporate-friendly analytics by announcing a slew of systems that offer full SQL query capabilities with significant performance improvements over existing Hive/Hadoop systems. These systems are designed to allow full SQL queries over warehouse-size data sets, and in most cases they bypass Hadoop entirely (although some are hybrid approaches). Much faster SQL queries at scale make big data analytics accessible to many more people in the enterprise and fit into existing workflows.
Rackspace’s dispute is with an IP Nav unit called Parallel Iron, which says it has three patents that cover the open source Hadoop Distributed File System (HDFS). But remarkably, Rackspace didn’t even know that at first; IP Nav contacted Rackspace and told the company it infringed some patents while refusing to even reveal the numbers or the owners of the patents, unless Rackspace signed a “forbearance agreement” to not sue first. (Sometimes companies threatened by patent trolls can file a “declaratory judgment” lawsuit, which can help them win a more favorable venue.)
They’re also launching a new distributed database called Plazma, which they say offers significant improvements over HDFS (the Hadoop Distributed File System). Plazma’s advantage over HDFS is efficiency: it is able to compile and parse data at a much faster rate.
Chronos has a number of advantages over regular cron. It allows you to schedule your jobs using ISO 8601 repeating-interval notation, which enables more flexibility in job scheduling. Chronos also supports defining jobs that are triggered by the completion of other jobs, including arbitrarily long dependency chains.
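As a rough sketch of what this looks like in practice, here is a minimal Chronos job definition in JSON (the field names follow Chronos's job schema; the command, schedule, and owner values are invented for illustration):

```json
{
  "name": "hourly-etl",
  "command": "bash /opt/etl/run_hourly.sh",
  "schedule": "R/2014-01-01T00:00:00Z/PT1H",
  "owner": "analytics@example.com"
}
```

The schedule `R/2014-01-01T00:00:00Z/PT1H` reads as: repeat indefinitely (`R`), starting at the given instant, every hour (`PT1H`). A dependent job would omit `schedule` and instead declare something like `"parents": ["hourly-etl"]`, which is how the dependency chains mentioned above are expressed.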
In a complex processing pipeline every step increases the chance of failure. Until December last year, we were relying on a single instance with cron to kick off our hourly, daily and weekly ETL jobs. Cron is a really great tool but we wanted a system that allowed retries, was lightweight and provided an easy-to-use interface giving analysts quick insights into which jobs failed and which ones succeeded.
Intel says the distribution is optimized for the Intel Xeon processor platform. In its announcement, the company states that analyzing one terabyte of data, which previously took more than four hours to fully process, can now be done in seven minutes.
Hadoop Corona is the next version of Map-Reduce. The current Map-Reduce has a single Job Tracker that reached its limits at Facebook. The Job Tracker manages the cluster resource and tracks the state of each job. In Hadoop Corona, the cluster resources are tracked by a central Cluster Manager. Each job gets its own Corona Job Tracker which tracks just that one job. The design provides some key improvements:
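The split described above can be sketched conceptually as follows. This is an illustrative toy model in Python, not Corona's actual code (which is Java inside Hadoop); the class and method names are invented. The point is the separation of concerns: the Cluster Manager knows only about resources, while each per-job tracker knows only about its own job.

```python
class ClusterManager:
    """Tracks cluster resources (slots) only -- never job state."""

    def __init__(self, total_slots):
        self.free_slots = total_slots

    def grant(self, wanted):
        # Hand out up to `wanted` slots; job internals are invisible here.
        granted = min(wanted, self.free_slots)
        self.free_slots -= granted
        return granted

    def release(self, count):
        self.free_slots += count


class CoronaJobTracker:
    """One instance per job: tracks the tasks of exactly one job."""

    def __init__(self, job_id, cluster):
        self.job_id = job_id
        self.cluster = cluster
        self.running_tasks = 0

    def launch_tasks(self, wanted):
        self.running_tasks += self.cluster.grant(wanted)

    def finish(self):
        self.cluster.release(self.running_tasks)
        self.running_tasks = 0


cluster = ClusterManager(total_slots=100)
job_a = CoronaJobTracker("job-a", cluster)
job_b = CoronaJobTracker("job-b", cluster)
job_a.launch_tasks(60)      # job-a receives 60 slots
job_b.launch_tasks(60)      # only 40 slots remain, so job-b gets 40
print(cluster.free_slots)   # 0
job_a.finish()
print(cluster.free_slots)   # 60
```

Because no single component both allocates resources and tracks every task of every job, the bottleneck that the old monolithic Job Tracker hit at Facebook's scale is removed.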
Because Hadoop uses MapReduce to perform data queries, searches have to be done in batches. So, while you can perform highly detailed analysis of historical data, for instance, one area you would not want to use Hadoop for is transactional data. Transactional data, by its very nature, is highly complex and fluid, as a transaction on an ecommerce site can generate many steps that all have to be implemented quickly.
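To make the batch model concrete, here is a toy MapReduce-style word count in plain Python, simulating the map, shuffle, and reduce phases in a single process (a real Hadoop job would distribute each phase across the cluster):

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit a (word, 1) pair for every word in every input record.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as Hadoop does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

logs = ["big data big analytics", "data pipeline"]
print(reduce_phase(shuffle(map_phase(logs))))
# {'big': 2, 'data': 2, 'analytics': 1, 'pipeline': 1}
```

Note that the entire data set is read and processed in one pass before any answer comes back. There is no way to look up or update a single record with low latency mid-run, which is exactly why transactional workloads are a poor fit.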
Nor would it be efficient to use Hadoop to process structured data sets that require very minimal latency, such as a Web site served up by a MySQL database in a typical LAMP stack. That's a speed requirement Hadoop would serve poorly.
Expanding supported query languages will be one area of focus for the Drill project. Another will be adding support for additional formats, such as JSON, since right now Dremel only supports the Google Protocol Buffer Format.
Big data has become the latest front for the patent troll epidemic as a shell company is suing firms for using a common open-source storage framework known as the Hadoop Distributed File System (HDFS).
Hadoop has been built by a large network of contributors, including individual developers and large companies like Yahoo, and is an Apache Software Foundation project. HDFS, its storage component, was based on Google’s Google File System. Parallel Iron’s patent complaints, however, say the whole system was made possible by four men:
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
via Welcome to Hive!
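For a sense of what "projecting structure onto data" looks like, here is a minimal HiveQL sketch (the table name, columns, and paths are invented for illustration; the syntax follows Hive's DDL and query language):

```sql
-- Project a table structure onto files already sitting in HDFS.
CREATE EXTERNAL TABLE page_views (
  user_id STRING,
  url     STRING,
  ts      BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- An ad-hoc summarization query, which Hive compiles into
-- MapReduce jobs behind the scenes.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

The `EXTERNAL TABLE` declaration doesn't move or copy any data; it just tells Hive how to interpret the files already in that HDFS directory, which is what makes ad-hoc analysis over existing datasets so convenient.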
Hortonworks has unveiled Hortonworks Data Platform (HDP) 1.0, an open-source platform built on Apache Hadoop 1.0 that includes data-management, monitoring, metadata and data-integration features.
For example, the platform’s provisioning interface surveys nodes in the target cluster and recommends optimal software configurations, with the subsequent ability to start the cluster via a single click. The monitoring interface offers a streamlined ability to see the health of the cluster in depth. The data integration services allow users to connect with data services and build transformation logic via graphical interfaces, sparing them from having to write code.