This section describes the open source world of Hadoop related big data tools. The tools are grouped according to their function: for instance, HDFS provides storage while Flume handles data movement.
Big Data describes a data set so large that storing, processing and accessing it are beyond the capabilities of most commercially available tools. This means data sets in the high Terabyte to Exabyte range.
Gartner refers to big data in terms of the 3Vs model: Velocity, Volume and Variety. That is, how fast the data changes, how big it is, and how complex and varied the sources that provide it are.
In this section of the site we group the tools in the wider Hadoop ecosystem by the functionality that they provide. We concentrate on open source solutions like Apache Hadoop. Where we stray into commercial offerings, it is because those companies contribute to Apache Hadoop or have an offering that can be used with Hadoop.
This section describes the software currently available that relates to Hadoop or its use. We will add to this list frequently. By all means contact us if there is something that you would like us to investigate.
Ambari - Monitoring
Avro - Data serialization
Cassandra - NoSQL database
Chukwa - Monitoring
Drill - Data analysis
Flume - Data collection
Giraph - Graph processing
Hadoop - Distributed parallel processing
Hama - BSP large scale calculations
HBase - NoSQL database
HCatalog - Data access abstraction
Hive - Data warehouse
Hue - GUI monitoring
Impala - Data query
Nutch - Web crawling
Oozie - Scheduler
Pig - High level MapReduce programming
Mahout - Machine learning
Solr - Search platform
Spark - Large scale data analysis
Sqoop - Bulk data transfer
ZooKeeper - Configuration
HDFS ( Hadoop Distributed File System ) provides storage for Hadoop. Using a master Name Node and slave Data Nodes, a distributed storage system is created across the Data Nodes. An optional Secondary Name Node periodically checkpoints the Name Node's file system metadata; it is not a hot standby.
HDFS is fault tolerant and resilient: data is split into blocks that are replicated across Data Nodes, so node failures are anticipated and managed. Extremely large cluster and storage sizes are achievable, more so in Hadoop V2.
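As a rough illustration of the sharding idea, the Python sketch below splits a file into fixed size blocks and assigns each block to several Data Nodes. The block size, replication factor, node names and round robin placement here are simplifications for illustration, not HDFS's actual rack aware algorithm:

```python
# Simplified sketch of HDFS-style block placement (illustrative only).
# Block size and replication factor are configurable in HDFS; 128 MB
# blocks and 3 replicas are common defaults.
BLOCK_SIZE = 128 * 1024 * 1024   # bytes per block
REPLICATION = 3                  # copies kept of each block

def place_blocks(file_size, data_nodes):
    """Split a file into blocks and assign each block to REPLICATION nodes."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Simple round-robin placement; real HDFS is rack-aware.
        placement[b] = [data_nodes[(b + r) % len(data_nodes)]
                        for r in range(REPLICATION)]
    return placement

placement = place_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
# A 300 MB file becomes 3 blocks, each stored on 3 of the 4 Data Nodes.
```

Because each block lives on several nodes, losing any single Data Node leaves every block readable elsewhere, which is what makes node failures manageable.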
The diagram above describes the processing architecture of YARN. Each Data Node runs a Node Manager process, which receives task requests from the Resource Manager process on the Name Node. A client process makes a job request to the Resource Manager, which accepts and tracks the job, while an Application Master (A.M.) process manages the job's sub tasks. This architecture allows YARN to achieve greater cluster sizes and higher transaction volumes than Hadoop V1.
The image above shows an example word count Map Reduce cycle. The data on the left is sharded and passed to map tasks, which create key value pairs from it; in this case each word becomes a key with the value 1.
The key value pairs are then shuffled and sorted so that all pairs with the same key reach the same reducer, which aggregates the values into a count per key. The output is then a count of each word in the data.
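The word count cycle above can be sketched as a single process Python simulation of the map, shuffle and reduce phases (illustrative only, not actual Hadoop code):

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in the input."""
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    return pairs

def shuffle_phase(pairs):
    """Shuffle: group all values by key, so each key's values land together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: aggregate the values for each key into a count."""
    return {key: sum(values) for key, values in grouped.items()}

counts = reduce_phase(shuffle_phase(map_phase(["the cat sat", "the cat ran"])))
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

In real Hadoop the map and reduce phases run as distributed tasks across the cluster and the shuffle moves data between nodes; the data flow, however, is the same.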
Nutch is a web crawler that can be used along with Hadoop to gather and store data. Seed URLs are used to initiate the crawl cycle. If Hadoop is available, crawl data can be stored on HDFS, or, via a component like Gora, in other stores such as HBase.
The Gora system allows Hadoop based storage to be defined when web crawling with Nutch. As an example, HBase may be used to store vast amounts of crawled data. Once stored, the data may then be queried from HBase via SQL layers or other Hadoop tools like Pig.
HBase offers a very large scale columnar storage database for the unstructured crawled data.
Solr can then be used to both index and search the data. Using a combination of the tools described here and Hadoop based tools like Pig and Hive, vast amounts of unstructured web based data may be sourced and processed.
Map Reduce algorithms may be developed in Java using the APIs provided with Hadoop.
They may also be developed at a higher level using Pig's language, Pig Latin.
Hadoop streaming functionality may be used to write map and reduce tasks as scripts in languages like Python and Perl, which read input from stdin and write key value pairs to stdout.
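A streaming mapper and reducer are ordinary scripts that read lines from stdin and write tab separated key value pairs to stdout. The Python sketch below models that contract for a word count; in real use the two functions would be separate scripts passed to the streaming jar, and Hadoop would deliver the mapper output to the reducer sorted by key (the function names here are illustrative):

```python
def stream_map(stdin_lines):
    """Mapper: emit one 'word<TAB>1' line per word, streaming-style."""
    for line in stdin_lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reduce(sorted_lines):
    """Reducer: sum the counts for each key.
    Relies on the input being sorted by key, as Hadoop streaming guarantees."""
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

mapped = list(stream_map(["the cat sat", "the cat ran"]))
reduced = list(stream_reduce(sorted(mapped)))
# reduced == ["cat\t2", "ran\t1", "sat\t1", "the\t2"]
```

Because the scripts only see text lines, any language that can read stdin and write stdout can participate, which is the point of the streaming interface.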
Hive may also be used with external tables to process HDFS based data.
Scheduling of resources on a Hadoop cluster can be achieved with the Fair and Capacity schedulers. The Fair Scheduler shares resources across multiple projects, whilst the Capacity Scheduler shares resources across multiple customers, each of which may have a hierarchy of task queues.
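As a simplified illustration of the fair scheduling idea (not YARN's actual implementation), the sketch below divides a cluster's capacity equally among pools, caps each pool at its demand and redistributes any surplus to pools that still want more:

```python
def fair_shares(capacity, demands):
    """Simplified fair-share calculation (illustration only, not YARN's code):
    give each hungry pool an equal slice, capped at its remaining demand,
    and redistribute whatever is left over."""
    shares = {pool: 0.0 for pool in demands}
    hungry = set(demands)
    remaining = float(capacity)
    while remaining > 1e-9 and hungry:
        slice_ = remaining / len(hungry)
        remaining = 0.0
        for pool in list(hungry):
            give = min(slice_, demands[pool] - shares[pool])
            shares[pool] += give
            remaining += slice_ - give  # surplus goes back into the pot
            if shares[pool] >= demands[pool]:
                hungry.discard(pool)
    return shares

# Three projects compete for 100 units: the small job gets all it asks for,
# and the remainder is split evenly between the two larger jobs.
shares = fair_shares(100, {"small": 10, "medium": 50, "large": 60})
```

The effect is that lightly loaded projects are fully satisfied while heavily loaded ones split the rest evenly, which is the intuition behind fair scheduling.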
Tasks can also be scheduled using the Oozie tool. ETL chains of sub tasks can be created with Oozie and scheduled by time or event; sub tasks can include the use of applications like Sqoop, Pig and Hive.
Sqoop may be used to move data between RDBMS systems and Hadoop.
Flume may be used to move log based data via streaming to Hadoop.
Hadoop commands may also be used to move data between the local file system and HDFS; data can also be copied between clusters, for instance with the distcp command.
ZooKeeper is used by many Hadoop based applications to manage distributed configuration.
Hue may be used to access and monitor many of the Hadoop ecosystem tools from a single web based interface.
Ganglia can be used to monitor cluster based resources and provide historic graphs of resource usage.
Nagios can be used to monitor cluster resources and provide graphs of resource usage and usage based alerts.
Ambari is provided by Apache for cluster installation and management.
Cloudera provides a Hadoop cluster manager, Cloudera Manager, from CDH4 onwards.
The Mesos cluster manager scales to very large cluster sizes.
MapR also provide the Bright cluster manager for Hadoop cluster management.
Although Oozie is available to create Hadoop based ETL chains, this section concentrates on tools that provide drag and drop functionality, allowing ETL chains to be created visually as objects.
Pentaho's data integration tool Spoon offers a big data plugin for ETL chain creation and execution.
Talend's Open Studio provides visual drag and drop functionality for big data manipulation.
Its profiling functionality allows big data based reports to be created that concentrate on data quality.
The Splunk Hadoop based reporting tool, called Hunk, can be used to create reports and dashboards to examine big data.
Apache Spark is an in memory distributed processing platform that offers low latency processing and integrates with Hadoop.
Apache Spark SQL offers the ability to create a schema on top of Spark RDD data and process that data using SQL.
Apache Spark Streaming offers the ability to process streamed data in real time.
Apache Spark MLlib contains machine learning functionality, allowing machine learning algorithms to be applied to Spark based data.
Apache Spark GraphX brings graph processing to Spark based data for graph based analysis.
The Sparkling Water product from 0xData allows H2O to be used with Apache Spark, providing Spark with extended machine learning functionality such as deep learning.
The team behind Spark have developed databricks.com to provide an integrated and simplified Spark based data processing environment.