Recently, we discussed big data and MapReduce. Hadoop is another way of dealing with big data, and one that builds on Google’s MapReduce programming model.

What is Hadoop?

Hadoop is an open-source software framework developed by the Apache Software Foundation. It was originally built to support the Apache Nutch web crawler, which powers many active (and inactive) search engines, and it was inspired by the MapReduce algorithm that Google used for big data processing.

Apache Hadoop combines MapReduce with the Hadoop Distributed File System (HDFS) to provide a complete big data storage and processing solution. Data is stored at file level across a distributed cluster of machines, and because processing is sent to the machines that hold the data, analytical jobs run far more efficiently than they would if the data had to be pulled back to a single machine.
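To make the MapReduce side concrete, here is a minimal sketch of a job written against Hadoop’s standard Java API – the classic word count. The HDFS input and output paths are placeholders, and it assumes the Hadoop client libraries are on the classpath.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each block of the input, emitting (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts collected for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Placeholder HDFS paths -- replace with real input and output locations.
    FileInputFormat.addInputPath(job, new Path("/data/input"));
    FileOutputFormat.setOutputPath(job, new Path("/data/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is submitted to the cluster with `hadoop jar wordcount.jar WordCount`, and Hadoop takes care of running the map tasks next to the data blocks.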

What is HDFS?

HDFS, the storage layer of Apache Hadoop, is a distributed file system that splits files into blocks and spreads those blocks across the nodes in a cluster of computers. The distributed blocks of data can then be processed in parallel by the MapReduce part of Hadoop.
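As a rough illustration of that block structure, the sketch below uses the HDFS Java client to ask the NameNode which blocks make up a file and which machines hold them. The file path is just a placeholder, and the cluster address is assumed to come from the standard configuration files on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Placeholder path -- any file already stored in HDFS will do.
    Path file = new Path("/data/input/logs.txt");
    FileStatus status = fs.getFileStatus(file);

    // List the blocks of the file and the DataNodes that store each one.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), String.join(",", block.getHosts()));
    }
    fs.close();
  }
}
```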

As well as MapReduce and HDFS, the Hadoop framework includes Hadoop YARN and Hadoop Common. Hadoop YARN manages the compute resources of the machines in the cluster and schedules jobs on them, while Hadoop Common is a set of shared libraries and utilities used by the other Hadoop modules. Additional software packages that run alongside Hadoop include:

  • Apache Flume – a tool for the collection and movement of huge amounts of data
  • Apache Hive – for reading, transforming and querying the datasets
  • Apache Spark – used for fast big data processing and machine learning

Hadoop replication factor

By design, Hadoop scales easily from a few machines to a cluster of thousands of servers. Data stored in HDFS is replicated across the cluster: a Hadoop user can set a replication factor on their data – 1, 3, 5, 10 or whatever they choose – and a factor of at least three (the HDFS default) is usually recommended. A replication factor of three means that each block of data is stored on three different machines. If one of those machines goes down, the data is re-replicated on another machine within the cluster to bring the replication factor back up to three. Hadoop’s redundancy is built on the premise that nodes in the cluster will fail, and that when they do, data is redistributed and nothing is lost.
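To show what changing the replication factor looks like in practice, here is a small sketch using the HDFS Java client. The file path is a placeholder, and the same setting can also be applied cluster-wide through the dfs.replication property.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Default replication for files created through this client (dfs.replication defaults to 3).
    conf.setInt("dfs.replication", 3);
    FileSystem fs = FileSystem.get(conf);

    // Raise the replication factor of an existing file to 5 -- placeholder path.
    Path file = new Path("/data/input/logs.txt");
    boolean requested = fs.setReplication(file, (short) 5);
    System.out.println("Replication change requested: " + requested);
    fs.close();
  }
}
```

The same change can be made from the command line with `hdfs dfs -setrep`, and the NameNode then copies or removes block replicas in the background until the new factor is met.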

What is Hadoop used for?

The combination of HDFS and MapReduce makes Hadoop a common storage and processing solution for businesses that acquire huge amounts of data and need to work through it efficiently. In the financial services industry, for example, Hadoop can be used to process bulk transactional data – it is used at the New York Stock Exchange. Mobile phone companies use it to distribute their data storage, and it is well suited to storing sensor data acquired from Internet of Things devices.

Hadoop can be set up to run on any computer, virtual machine or server. Visit the Fasthosts website for next-generation cloud servers that scale on demand.