Hortonworks Data Platform 1.0 Targets Enterprises

Hortonworks has unveiled Hortonworks Data Platform (HDP) 1.0, an open-source platform built on Apache Hadoop 1.0 that includes data-management, monitoring, metadata and data-integration features.

via Hortonworks Data Platform 1.0 Targets Enterprises.

For example, the platform’s provisioning interface surveys the nodes in a target cluster, recommends optimal software configurations, and can then start the cluster with a single click. The monitoring interface offers a streamlined, in-depth view of cluster health. The data-integration services let users connect to data services and build transformation logic through graphical interfaces, sparing them from having to write code.

VMware’s Serengeti Brings Hadoop to Virtual, Cloud Environments

Hadoop is a framework for reliably running applications on large hardware clusters. Many large enterprises (such as Facebook and IBM) have come to rely on it as a vital part of their respective data-crunching infrastructures. Research firm IDC recently predicted that worldwide revenues from Hadoop and MapReduce, another framework for processing problems across huge datasets, could hit $812.8 million in 2016, a significant uptick from $77 million in revenues last year.

via VMware’s Serengeti Brings Hadoop to Virtual, Cloud Environments.

VMware has positioned Serengeti as a “one click” deployment toolkit that, when used in conjunction with its vSphere platform, can deploy an enterprise-level Hadoop cluster in a matter of minutes. The company claims that vSphere’s virtualization capabilities will boost the “availability and manageability” of Hadoop clusters.

George Takei Helps Facebook Debug MySQL

George Takei Helps Facebook Debug MySQL » Data Center Knowledge.

Wait – what was that last one? In today’s update on the Facebook Engineering blog, Mark Callaghan discusses the challenges in getting MySQL to scale on Facebook’s multi-core servers. The post provides technical insight into Facebook’s scalability initiatives, and then gives a shout out to Takei for helping resolve a database issue.

I have Facebook blocked on my servers, but I will have to come back and look at the blog entry linked there.

IBM Parallel Sysplex

In computing, a Parallel Sysplex is a cluster of IBM mainframes acting together as a single system image with z/OS. Used for disaster recovery, Parallel Sysplex combines data sharing and parallel computing to allow a cluster of up to 32 systems to share a workload for high performance and high availability.

via IBM Parallel Sysplex – Wikipedia, the free encyclopedia.

How Web giants store big—and we mean big—data

The Great Disk Drive in the Sky: How Web giants store big—and we mean big—data.

The need for this kind of perpetually scalable, durable storage has driven the giants of the Web—Google, Amazon, Facebook, Microsoft, and others—to adopt a different sort of storage solution: distributed file systems based on object-based storage. These systems were at least in part inspired by other distributed and clustered filesystems such as Red Hat’s Global File System and IBM’s General Parallel Filesystem.

And one more blurb…

Google wanted to turn large numbers of cheap servers and hard drives into a reliable data store for hundreds of terabytes of data that could manage itself around failures and errors. And it needed to be designed for Google’s way of gathering and reading data, allowing multiple applications to append data to the system simultaneously in large volumes and to access it at high speeds.

Data sharing with a GFS storage cluster

GFS saves its file system descriptors in inodes that are allocated dynamically (referred to as dynamic nodes, or dinodes). Each dinode is placed in a whole file system block (4096 bytes is the standard file system block size in Linux kernels). In a cluster file system, multiple servers access the file system at the same time; pooling multiple dinodes in one block would therefore lead to more competing block accesses and false contention. For space efficiency and fewer disk accesses, file data is saved (stuffed) in the dinode itself if the file is small enough to fit completely inside the dinode. In that case, only one block access is needed to read a small file. For bigger files, GFS uses a “flat file” structure: all pointers in a dinode have the same depth, and there are only direct, indirect, or double-indirect pointers. The tree height grows as much as necessary to store the file data, as shown in Figure 1 of the source article.

via redhat.com | Data sharing with a GFS storage cluster.
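To make the stuffed-dinode idea concrete, here is a small conceptual sketch in Python (not GFS kernel code). The header size, pointer size, and function names are assumptions chosen for illustration; the point is the decision the article describes: data small enough to fit in the 4096-byte dinode block is stored inline and read with a single block access, while larger files fall back to a flat pointer tree whose height grows with file size.

```python
# Conceptual sketch (not GFS kernel code): how a dynamically allocated
# inode ("dinode") occupying one 4096-byte block can either "stuff" a
# small file's data inline or switch to a flat pointer tree whose
# height grows with file size. Sizes below are illustrative assumptions.

BLOCK_SIZE = 4096          # standard Linux file system block size
DINODE_HEADER = 232        # assumed header size, for illustration only
INLINE_CAPACITY = BLOCK_SIZE - DINODE_HEADER
POINTER_SIZE = 8           # assumed size of a block pointer
POINTERS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE
POINTERS_IN_DINODE = INLINE_CAPACITY // POINTER_SIZE


def dinode_layout(file_size: int) -> dict:
    """Pick a layout for a file of file_size bytes.

    Returns either a "stuffed" layout (data lives inside the dinode
    block, so one block access reads the whole file) or a "flat tree"
    layout where every pointer in the dinode sits at the same depth:
    direct, indirect, or double indirect.
    """
    if file_size <= INLINE_CAPACITY:
        return {"layout": "stuffed", "tree_height": 0, "block_accesses": 1}

    # Height 1: the dinode's pointers reference data blocks directly.
    height, reachable = 1, POINTERS_IN_DINODE * BLOCK_SIZE
    # Each extra level multiplies reach by the pointers per metadata block.
    while file_size > reachable:
        height += 1
        reachable *= POINTERS_PER_BLOCK

    return {
        "layout": "flat tree",
        "tree_height": height,          # 1=direct, 2=indirect, 3=double indirect
        "block_accesses": height + 1,   # dinode block plus one block per level
    }


if __name__ == "__main__":
    for size in (1_000, 100_000, 50_000_000):
        print(size, dinode_layout(size))
```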

2. LVS: What is an LVS? Can I use an LVS?

A Linux Virtual Server (LVS) is a cluster of servers which appears to be one server to an outside client. This apparent single server is called here a “virtual server”. The individual servers (realservers) are under the control of a director (or load balancer), which runs a Linux kernel patched to include the ipvs code. The ipvs code running on the director is the essential feature of LVS. Other user level code is used to manage the LVS (set rules for services handled, handle failover). The director is basically a layer 4 router with a modified set of routing rules (i.e. connections do not originate or terminate on the director, it doesn’t send ACKs etc, it’s just a router).

via 2. LVS: What is an LVS? Can I use an LVS?.

The director uses 3 different methods of forwarding.

  • LVS-NAT based on network address translation (NAT)
  • LVS-DR (direct routing) where the MAC addresses on the packet are changed and the packet forwarded to the realserver
  • LVS-Tun (tunnelling) where the packet is IPIP encapsulated and forwarded to the realserver.
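As a rough mental model of the director’s role, here is a toy Python sketch; it is nothing like the actual kernel ipvs code, and the class names and addresses are made up. It only shows the scheduling half of the job: one virtual address, a round-robin choice of realserver for each new connection (round-robin is one of the schedulers LVS offers), and a label for which forwarding method would carry the packet.

```python
# Toy model (not the kernel ipvs code): a "director" that presents one
# virtual service address and assigns each new client connection to a
# realserver using round-robin scheduling. The forwarding_method field
# only labels how a real LVS would pass the packet on (NAT, DR, or
# IPIP tunnelling); no packets are actually forwarded here.
from dataclasses import dataclass
from itertools import cycle


@dataclass
class RealServer:
    address: str
    forwarding_method: str  # "LVS-NAT", "LVS-DR", or "LVS-Tun"


class Director:
    """Schedules connections for a single virtual server address."""

    def __init__(self, virtual_ip: str, realservers: list[RealServer]):
        self.virtual_ip = virtual_ip
        self._rr = cycle(realservers)      # round-robin scheduler
        self.connections: dict[tuple, RealServer] = {}

    def new_connection(self, client: tuple) -> RealServer:
        # Like the director, we only pick a destination; connections
        # never originate or terminate here.
        server = next(self._rr)
        self.connections[client] = server
        return server


if __name__ == "__main__":
    lvs = Director("192.0.2.10", [
        RealServer("10.0.0.1", "LVS-DR"),
        RealServer("10.0.0.2", "LVS-DR"),
    ])
    for port in (40001, 40002, 40003):
        chosen = lvs.new_connection(("198.51.100.7", port))
        print(f"client port {port} -> {chosen.address} via {chosen.forwarding_method}")
```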

Corosync

http://corosync.org/doku.php

The Corosync Cluster Engine is a Group Communication System with additional features for implementing high availability within applications. The project provides four C Application Programming Interface features:

  • A closed process group communication model with virtual synchrony guarantees for creating replicated state machines.
  • A simple availability manager that restarts the application process when it has failed.
  • A configuration and statistics in-memory database that provides the ability to set, retrieve, and receive change notifications of information.
  • A quorum system that notifies applications when quorum is achieved or lost.

Our project is used as a High Availability framework by projects such as Apache Qpid and Pacemaker.

We are always looking for developers or users interested in clustering or participating in our project.

The project is hosted by Fedora Hosted and The Linux Foundation.
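The features above are exposed as C APIs, but the underlying idea can be sketched in a few lines of toy Python. This is not the Corosync API, and every name below is made up: if each member of a closed process group applies the same totally ordered stream of messages, all replicas of the state machine stay identical, and a simple majority quorum rule decides whether the group may keep changing state.

```python
# Conceptual sketch only; the real Corosync interfaces are C libraries.
# This toy model shows the idea they support: a closed process group
# delivering messages in the same total order to every member keeps
# replicated state machines identical, and a quorum rule decides
# whether this partition of the cluster may keep operating.


class GroupMember:
    """One replica of a state machine joined to a process group."""

    def __init__(self, name: str):
        self.name = name
        self.state: dict[str, str] = {}

    def deliver(self, message: tuple[str, str]) -> None:
        # Messages arrive in the same total order at every member,
        # so identical replicas stay identical.
        key, value = message
        self.state[key] = value


class ProcessGroup:
    """Closed group: only joined members send and receive messages."""

    def __init__(self, total_nodes: int):
        self.total_nodes = total_nodes
        self.members: list[GroupMember] = []

    def join(self, member: GroupMember) -> None:
        self.members.append(member)

    def has_quorum(self) -> bool:
        # Simple majority rule, as a quorum system might report it.
        return len(self.members) > self.total_nodes // 2

    def mcast(self, message: tuple[str, str]) -> None:
        if not self.has_quorum():
            raise RuntimeError("quorum lost; refusing to change state")
        for member in self.members:   # same delivery order for everyone
            member.deliver(message)


if __name__ == "__main__":
    group = ProcessGroup(total_nodes=3)
    a, b = GroupMember("node-a"), GroupMember("node-b")
    group.join(a)
    group.join(b)                     # 2 of 3 nodes -> quorum
    group.mcast(("lock-owner", "node-a"))
    print(a.state == b.state)         # True: replicas agree
```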

Fencing and Stonith

Fencing is a very important concept in computer clusters for HA (High Availability). Unfortunately, given that fencing does not offer a visible service to users, it is often neglected.

Fencing may be defined as a method to bring an HA cluster to a known state. But, what is a “cluster state” after all? To answer that question we have to see what is in the cluster.

via Fencing and Stonith.

STONITH (Shoot The Other Node In The Head)

STONITH is our fencing implementation. It provides node-level fencing.
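Here is a minimal sketch of what node-level fencing buys you, with a hypothetical fence agent standing in for a real power switch or management board. When a node stops responding, the cluster cannot know whether it is dead or still writing to shared storage, so it powers the node off first and only then recovers its resources; if fencing fails, recovery is blocked rather than risking corruption.

```python
# Conceptual sketch of node-level fencing (STONITH). The fence agent
# here is a hypothetical stand-in for whatever power-switch or
# management-board interface a real cluster would call; the point is
# the ordering: fence first, recover resources only if fencing succeeds.


def power_off(node: str) -> bool:
    """Hypothetical fence agent: cut power to the node and report success."""
    print(f"fencing {node}: power off requested")
    return True  # a real agent would confirm with the power device


def recover_resources(node: str, cluster_state: dict) -> None:
    """Restart the failed node's services elsewhere in the cluster."""
    for service in cluster_state.pop(node, []):
        print(f"restarting {service} on a surviving node")


def handle_unresponsive_node(node: str, cluster_state: dict) -> None:
    # The node's state is unknown: it may be dead, or it may still be
    # writing to shared storage. Fencing brings the cluster back to a
    # known state before any recovery happens.
    if power_off(node):
        recover_resources(node, cluster_state)
    else:
        print(f"fencing {node} failed; recovery is blocked to avoid corruption")


if __name__ == "__main__":
    state = {"node-1": ["database", "virtual-ip"], "node-2": ["web-frontend"]}
    handle_unresponsive_node("node-1", state)
```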

Gotta love how they come up with those acronyms.  🙂