What is Hadoop?



Apache Hadoop® is an open source software project that enables distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.


IBM Hadoop solution


The IBM Hadoop solution has three layers…

 

Featured customer

Findability Sciences provides customers with actionable insights

“Our joint activity with IBM allows us to take advantage of some really cutting-edge technology and combine it with our own solutions to deliver a great value proposition for clients. We feel very confident going to market because we are backed up by IBM.”


What we offer


The IBM Hadoop offering changes the economics and dynamics of large-scale computing:

 

Scalable

Add new servers and resources to your cluster without disrupting the analytic workflows and applications that depend on it.

Low-cost

Commodity servers connected in parallel radically reduce the cost of storing, and modeling, your data.

Fault tolerant

When a node goes down, the system automatically redirects work and continues processing without missing a beat.

Flexible

Because Hadoop is schema-free, it can manage structured and unstructured data with ease. Join and aggregate multiple sources to enable deep analysis.


Hadoop architecture


Hadoop is composed of four core components—Hadoop Common, Hadoop Distributed File System (HDFS), MapReduce and YARN.


Hadoop Common

A module containing the utilities that support the other Hadoop components.



MapReduce

A framework for writing applications that process large amounts of structured and unstructured data in parallel across a cluster of thousands of machines, in a reliable, fault-tolerant manner.

What is MapReduce?
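
To make the map and reduce phases concrete, here is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API; the class names are illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Minimal word-count sketch; class names are illustrative.
    public class WordCount {
      // Map phase: emit (word, 1) for every word in an input line.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts emitted for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }
    }

A driver class then wires these into a Job with input and output paths; the framework handles scheduling, retries and data movement between the two phases.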


HDFS

A file system that provides reliable data storage and access across all the nodes in a Hadoop cluster. It links together the file systems on many local nodes to create a single file system.

What is HDFS?
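
A minimal sketch of how an application reads and writes HDFS through Hadoop's Java FileSystem API; the NameNode address and paths are assumptions for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");  // illustrative address

        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS replicates its blocks across nodes.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/hello.txt"))) {
          out.writeUTF("hello hadoop");
        }

        // List a directory as if the whole cluster were one file system.
        for (FileStatus status : fs.listStatus(new Path("/tmp"))) {
          System.out.println(status.getPath());
        }
        fs.close();
      }
    }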


Yet Another Resource Negotiator (YARN)

The next-generation resource-management layer split out of MapReduce, which assigns CPU, memory and storage to applications running on a Hadoop cluster. It enables application frameworks other than MapReduce to run on Hadoop, opening up a wealth of possibilities.
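
A short sketch of querying YARN for running applications through its Java client API; it assumes a reachable ResourceManager configured in yarn-site.xml on the classpath.

    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ListApplications {
      public static void main(String[] args) throws Exception {
        YarnClient client = YarnClient.createYarnClient();
        client.init(new YarnConfiguration());  // picks up yarn-site.xml
        client.start();

        // Ask the ResourceManager for every application it knows about.
        for (ApplicationReport app : client.getApplications()) {
          System.out.println(app.getApplicationId() + "  " + app.getName());
        }
        client.stop();
      }
    }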

Hadoop is supplemented by an ecosystem of Apache open-source projects that extend the value of Hadoop and improve its usability.

 


Data access projects


Pig

A high-level programming language (Pig Latin) designed to handle any type of data, helping users focus more on analyzing large data sets and less on writing individual map and reduce programs.

What is Pig?
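
A sketch of running a few Pig Latin statements from Java through the PigServer API; the input file and alias names are illustrative.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
      public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);  // ExecType.MAPREDUCE on a cluster

        // Each registered statement is Pig Latin; Pig plans the
        // underlying map and reduce phases itself. 'input.txt' is illustrative.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        pig.store("counts", "word_counts");  // storing triggers execution
      }
    }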


Hive

A Hadoop runtime component that allows those fluent in SQL to write Hive Query Language (HQL) statements, which are similar to SQL statements. These are broken down into MapReduce jobs and executed across the cluster.

What is Hive?
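
A sketch of submitting an HQL statement to HiveServer2 over JDBC; the host, credentials and table are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Illustrative endpoint, credentials and table.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "analyst", "");
             Statement stmt = conn.createStatement()) {

          // Hive compiles this SQL-style statement into cluster jobs.
          ResultSet rs = stmt.executeQuery(
              "SELECT category, COUNT(*) FROM sales GROUP BY category");
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
          }
        }
      }
    }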


Flume

A distributed, reliable and available service for efficiently collecting, aggregating and moving large amounts of log data. Its main goal is to deliver data from applications to HDFS.

What is Flume?
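
A sketch of a single-agent Flume configuration that tails an application log into HDFS; the agent name, file path and HDFS directory are illustrative.

    # Name the source, channel and sink that make up this agent.
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sinks = sink1

    # Source: follow an application log file (path is illustrative).
    agent1.sources.src1.type = exec
    agent1.sources.src1.command = tail -F /var/log/app.log
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory between source and sink.
    agent1.channels.ch1.type = memory

    # Sink: write the events into HDFS.
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/flume/logs
    agent1.sinks.sink1.channel = ch1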


HCatalog

A table and storage management service for Hadoop data that presents a table abstraction so the user does not need to know where or how the data is stored.


Avro

An Apache open source project that provides data serialization and data exchange services for Hadoop.

What is Avro?
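
For illustration, a sketch of an Avro schema (an .avsc file); the record and field names are assumptions. Writers and readers that agree on such a schema can exchange serialized records across languages.

    {
      "type": "record",
      "name": "LogEvent",
      "namespace": "com.example",
      "fields": [
        {"name": "timestamp", "type": "long"},
        {"name": "level",     "type": "string"},
        {"name": "message",   "type": "string"}
      ]
    }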


Spark

An open-source cluster computing framework with in-memory analytics performance that is up to 100 times faster than MapReduce, depending on the application.

What is Spark?
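
A sketch of an in-memory word count using Spark's Java API; the HDFS paths are illustrative, and the master is supplied by spark-submit.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
          JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");  // illustrative path

          // Intermediate results can stay in memory rather than being
          // written to disk between stages, which is where the speedup comes from.
          JavaPairRDD<String, Integer> counts = lines
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey(Integer::sum);

          counts.saveAsTextFile("hdfs:///tmp/counts");
        }
      }
    }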


Sqoop

An ELT (extract, load, transform) tool that supports the transfer of data between Hadoop and structured data sources such as relational databases.
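
A sketch of a Sqoop import from the command line; the JDBC connection string, user and table name are illustrative.

    # Pull a relational table into HDFS with four parallel map tasks.
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl_user -P \
      --table orders \
      --target-dir /data/orders \
      --num-mappers 4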


HBase

A column-oriented, non-relational (NoSQL) database that runs on top of HDFS and is often used for sparse data sets.

What is HBase?
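
A sketch of writing and reading a single cell with the HBase Java client; the table, column family and row key are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "events" and its column family "d" are illustrative names.
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {

          // Rows are keyed; columns live in families and need no fixed schema,
          // so sparse rows cost nothing for the columns they omit.
          Put put = new Put(Bytes.toBytes("row-1"));
          put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("ok"));
          table.put(put);

          Result result = table.get(new Get(Bytes.toBytes("row-1")));
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("d"), Bytes.toBytes("status"))));
        }
      }
    }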

 


Search projects


Solr

An enterprise search tool from the Apache Lucene project that offers powerful search capabilities, including hit highlighting, as well as indexing capabilities, reliability and scalability, a central configuration system, and failover and recovery.
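
For example, a query with hit highlighting can be issued against Solr's HTTP search API; the host, collection and field names are illustrative.

    # Search a collection for "hadoop" and highlight matches in the body field.
    curl "http://localhost:8983/solr/mycollection/select?q=hadoop&hl=true&hl.fl=body"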

 


Administration and security projects


Kerberos

A network authentication protocol that works on the basis of “tickets” to allow nodes communicating over a non-secure network to prove their identity to one another in a secure manner.
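
In practice, a user obtains a ticket before contacting a Kerberos-secured cluster; the principal and realm below are illustrative.

    kinit analyst@EXAMPLE.COM   # prompts for a password and obtains a ticket
    klist                       # shows the ticket cache Hadoop clients will present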


ZooKeeper

A centralized infrastructure and set of services that enable synchronization across a cluster.

What is ZooKeeper?
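
A sketch of creating and reading a znode with the ZooKeeper Java client; the connection string and path are illustrative.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkExample {
      public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> {});

        // Znodes form a small, highly available tree that all cluster
        // members see consistently; ephemeral nodes vanish with their session.
        zk.create("/leader", "node-1".getBytes(),
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        byte[] data = zk.getData("/leader", false, null);
        System.out.println(new String(data));
        zk.close();
      }
    }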


Oozie

A management application that simplifies workflow and coordination between MapReduce jobs.

What is Oozie?
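
A sketch of a minimal Oozie workflow definition with a single MapReduce action; the names and directories are illustrative.

    <!-- One MapReduce action, then done; failures route to the kill node. -->
    <workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.5">
      <start to="count-words"/>
      <action name="count-words">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.input.dir</name>
              <value>/data/input</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/data/output</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Word count failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>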

Get started with Hadoop