Apache Hadoop: Introduction

Hadoop is an open-source Apache framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. A Hadoop application works in an environment that provides distributed storage and computation across clusters of machines. Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

Nowadays, Apache Hadoop is widely used as a back-end technology by many heavily loaded web services, such as Yahoo! and Facebook. It is built around the MapReduce programming model, in which an application is divided into a large number of identical elementary tasks that are executed on the cluster nodes and then combined into the final result.

Hadoop Architecture

The core components of Apache Hadoop are:

  • Hadoop Distributed File System (HDFS) – a distributed file system that stores data across the different nodes of a computing cluster and provides high-throughput access to that data within the cluster.
  • Hadoop YARN – a resource management component responsible for managing computing resources and scheduling tasks.
  • Hadoop Common – a set of common libraries and utilities used by the other Hadoop modules and related projects.
  • Hadoop MapReduce – the component for programming and executing MapReduce computations.

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is designed to store huge amounts of data in fixed-size blocks that are distributed across multiple nodes of a cluster. HDFS provides a replication mechanism that keeps the distributed system stable in the case of node failures. The organization of HDFS is close to that of traditional file systems, which consist of file descriptors and data. In contrast to a traditional file system, instead of file descriptors HDFS uses a special server, the namenode, which stores the file system metadata and meta-information about the distribution of data blocks. The other nodes of the cluster are declared as datanodes and store the blocks of data. The basic architecture of HDFS is presented in figure 1. The namenode is thus responsible for operations at the file and directory level, such as opening and closing files and managing directories, while the datanodes directly execute the operations of reading and writing data.
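A minimal sketch of how a client works with HDFS through Hadoop's Java FileSystem API follows. The namenode address hdfs://namenode:9000 and the file paths are hypothetical placeholders, not values taken from this text.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical namenode address; in practice this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Write a file: the client contacts the namenode for metadata,
        // while the data blocks themselves are streamed to datanodes.
        Path file = new Path("/user/example/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back; block locations are resolved via the namenode.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine());
        }

        fs.close();
    }
}
```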

HDFS is an important component of the Apache Hadoop project. However, Hadoop can also work with other cloud-based distributed storage systems, for example Amazon S3. On the other hand, HDFS can be used not only to run MapReduce jobs but also as a general-purpose distributed file system.
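The same FileSystem abstraction can point at other storage back ends. The sketch below lists objects in an S3 bucket through the s3a connector; the bucket name and credentials are placeholders, and the hadoop-aws module is assumed to be on the classpath.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3ListingExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder credentials; normally supplied via configuration files or the environment.
        conf.set("fs.s3a.access.key", "ACCESS_KEY");
        conf.set("fs.s3a.secret.key", "SECRET_KEY");

        // The same FileSystem API, backed by Amazon S3 instead of HDFS.
        FileSystem s3 = FileSystem.get(URI.create("s3a://example-bucket/"), conf);

        for (FileStatus status : s3.listStatus(new Path("s3a://example-bucket/data/"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        s3.close();
    }
}
```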

Hadoop MapReduce

Hadoop MapReduce is the engine that allows parallel data processing within the MapReduce programming model. Formally, MapReduce is a programming model for distributed data processing that was first proposed by Google and established as an effective technology for processing huge amounts of data. The main idea of this approach is to divide the data into multiple partitions and apply simple map and reduce operations to each portion of the data. The basic scheme of a MapReduce job is demonstrated in figure 3.2. A MapReduce job consists of the following steps (the word-count sketch after the list shows how these stages map onto code):

  • Map Stage. In this stage the input data is preprocessed by applying a map function. This is very similar to the map operation in functional programming languages: the input data is transformed into key-value pairs according to a user-defined function.
  • Shuffle Stage. This stage collects and sorts all key-value pairs by key, and then sends each set of pairs sharing the same key to a reducer.
  • Reduce Stage. This stage takes as input a set of key-value pairs with the same key and, according to a user-defined function, generates a single value for each set.
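The classic word-count example illustrates these stages. The following sketch uses Hadoop's Java MapReduce API; the class and field names are illustrative rather than taken from this text.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map stage: transform each input line into (word, 1) key-value pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: the shuffle groups pairs by word, so each call receives one
    // word together with all of its counts; summing them produces the final value.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
```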

The Hadoop MapReduce engine is based on a master-worker architecture, where the master is a single instance of the control process, the JobTracker, usually running on the namenode, and the workers are an arbitrary set of TaskTracker processes running on the datanodes. The JobTracker coordinates MapReduce jobs on the cluster, detects failures, and restarts tasks. Each TaskTracker is responsible for executing individual map and reduce tasks on its node.
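The sketch below shows how a client describes and submits a job to this engine, reusing the WordCount classes from the previous example; the input and output HDFS paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Describe the job; the framework schedules the resulting map and
        // reduce tasks across the worker nodes of the cluster.
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Placeholder HDFS paths for the input data and the job output.
        FileInputFormat.addInputPath(job, new Path("/user/example/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/example/output"));

        // Submit the job and wait for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```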

Apache Hadoop has been regarded as a good big data processing technology for many years. However, the architecture of Hadoop was designed in such a way that it is tightly coupled to HDFS, which means that a processing workflow requires many read/write operations. As a result, Hadoop tends to be slow because of repeated interaction with storage. In response, Apache Spark was proposed as an efficient in-memory technology for distributed data processing, offering much higher processing performance.
