What is Hadoop?
Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Apache Hadoop has two pillars:
- YARN - Yet Another Resource Negotiator (YARN) assigns CPU, memory, and storage to applications running on a Hadoop cluster. The first generation of Hadoop could only run MapReduce applications. YARN enables other application frameworks (like Spark) to run on Hadoop as well, which opens up a wealth of possibilities.
- HDFS - Hadoop Distributed File System (HDFS) is a file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system.
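To make the idea of "one big file system" built from many local ones more concrete, here is a minimal sketch in Python. It is a toy model, not HDFS code: the function names, the node names, the tiny block size, and the round-robin placement are all assumptions chosen for illustration (real HDFS uses much larger blocks and its own placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def store_file(data: bytes, nodes: list[str], block_size: int = 4, replication: int = 2):
    """Spread the blocks of one file across several nodes.

    Returns (placement, storage):
      placement maps block id -> list of nodes holding a copy of that block
      storage   maps node name -> {block id: block bytes} (each node's local store)
    Round-robin placement is an illustrative assumption, not the HDFS policy.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    storage = {node: {} for node in nodes}
    placement = {}
    for block_id, block in enumerate(split_into_blocks(data, block_size)):
        # Pick `replication` distinct nodes, starting at a rotating offset.
        targets = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        placement[block_id] = targets
        for node in targets:
            storage[node][block_id] = block
    return placement, storage


def read_file(placement, storage) -> bytes:
    """Reassemble the file by reading each block from its first listed node."""
    return b"".join(
        storage[placement[block_id][0]][block_id] for block_id in sorted(placement)
    )


# Hypothetical three-node cluster holding one small file.
cluster_nodes = ["node-a", "node-b", "node-c"]
placement, storage = store_file(b"hello hadoop", cluster_nodes)
print(placement)
print(read_file(placement, storage))
```

The caller sees a single file, while each node only stores some of its blocks. Keeping more than one copy of every block is what allows the cluster to survive the loss of a node.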
Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive and ZooKeeper, that extend the value of Hadoop and improve its usability.
So what’s the big deal?
Hadoop changes the economics and the dynamics of large scale computing. Its impact can be boiled down to four salient characteristics.
Hadoop enables a computing solution that is:
- Scalable - New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.
- Cost effective - Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
- Flexible - Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
- Fault tolerant - When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.
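The fault-tolerance point can be illustrated with a small Python sketch. It assumes each block of data is stored on more than one node and shows work being redirected to a surviving copy when a node is lost. The function names, node names, and placement table are hypothetical, and this is not how Hadoop's own schedulers are implemented.

```python
def pick_replica(block_id: int, placement: dict[int, list[str]], live_nodes: set[str]) -> str:
    """Return the first node that holds the block and is still alive."""
    for node in placement[block_id]:
        if node in live_nodes:
            return node
    raise RuntimeError(f"no live replica left for block {block_id}")


def schedule(placement: dict[int, list[str]], live_nodes: set[str]) -> dict[int, str]:
    """Assign the work for each block to a live node that stores a copy of it."""
    return {block_id: pick_replica(block_id, placement, live_nodes) for block_id in placement}


# Hypothetical placement: every block has copies on two different nodes.
placement = {
    0: ["node-a", "node-b"],
    1: ["node-b", "node-c"],
    2: ["node-c", "node-a"],
}

all_nodes = {"node-a", "node-b", "node-c"}
print(schedule(placement, all_nodes))

# Simulate losing node-b: work on block 1 moves to the other copy on node-c.
print(schedule(placement, all_nodes - {"node-b"}))
```

Processing stops only when every node holding a given block is gone, which is why keeping several copies of each block matters.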