Over time, Hadoop has become the nucleus of the Big Data ecosystem: many new technologies have emerged around it and been integrated with it. So it is important that we first understand and appreciate this nucleus of modern Big Data architecture.
The
Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers, using simple programming models. It is designed to scale up
from single servers to thousands of machines, each offering local computation
and storage.
Components of the
Hadoop Ecosystem
Let's
begin by looking at some of the components of the Hadoop ecosystem:
Hadoop Distributed
File System (HDFS™):
This is a distributed file system that provides high-throughput access to application data. Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed throughout the cluster. Because of this, the map and reduce functions can be executed on smaller subsets of your larger data sets, which provides the scalability needed for Big Data processing.
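You can see this block structure from the command line by loading a file into HDFS and asking the NameNode how it was split. A minimal sketch in Python, assuming a running cluster, the Hadoop binaries on the PATH, and a hypothetical weblogs.txt file:

    import subprocess

    # Copy a local file into HDFS; the NameNode splits it into blocks
    # (128 MB each by default in Hadoop 2.x) and spreads them over DataNodes.
    subprocess.run(["hdfs", "dfs", "-put", "weblogs.txt", "/data/weblogs.txt"],
                   check=True)

    # Report how the file was broken up and where each block replica lives.
    subprocess.run(["hdfs", "fsck", "/data/weblogs.txt",
                    "-files", "-blocks", "-locations"], check=True)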
MapReduce
MapReduce is a programming model specifically implemented for processing large data sets on a Hadoop cluster. It is the core component of the Hadoop framework, and the only execution engine available in Hadoop 1.0.
The
MapReduce framework consists of two parts:
A function called ‘Map’, which distributes work out to the different nodes in the cluster.
A function called ‘Reduce’, which collates the nodes’ results and reduces them into a single output.
The main advantage of the MapReduce framework is its fault tolerance: each node is expected to report back periodically with status updates and completed work, and if a node stays silent for longer than expected, the master reassigns its portion of the work to other nodes.
The MapReduce framework is inspired by the ‘Map’ and ‘Reduce’ functions used in functional programming. The computational processing occurs on data stored in a file system or within a database: the framework takes a set of input key/value pairs and produces a set of output key/value pairs.
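To make this concrete, here is the canonical word-count example written as a Hadoop Streaming job in Python; this is an illustrative sketch, not any particular production implementation, and the file and path names are hypothetical:

    #!/usr/bin/env python
    # mapper.py -- the 'Map' step: read raw lines, emit (word, 1) pairs.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- the 'Reduce' step: the framework delivers the pairs
    # sorted by key, so counts can be summed until the word changes.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

A job like this is launched with the streaming jar that ships with Hadoop, along the lines of: hadoop jar hadoop-streaming.jar -input /data/weblogs.txt -output /data/wordcounts -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py.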
Each day, numerous MapReduce programs
and MapReduce jobs are executed on
Google's clusters. Programs are
automatically parallelized and executed on a large
cluster of commodity machines.
MapReduce is used for distributed grep, distributed sort, Web link-graph reversal, Web access-log statistics, document clustering, machine learning, and statistical machine translation.
Pig
Pig is a data-flow language that allows users to write complex MapReduce operations in a simple scripting language called Pig Latin. Pig then transforms those scripts into MapReduce jobs.
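A minimal sketch of that flow, assuming Pig is installed and on the PATH and reusing the hypothetical weblogs file from above; the Pig Latin word count is written out as a script and handed to the pig launcher:

    import subprocess

    # Canonical word count in Pig Latin; Pig compiles this script into
    # MapReduce jobs and submits them to the cluster.
    script = """
    lines   = LOAD '/data/weblogs.txt' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grouped = GROUP words BY word;
    counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
    STORE counts INTO '/data/wordcounts_pig';
    """

    with open("wordcount.pig", "w") as f:
        f.write(script)

    subprocess.run(["pig", "wordcount.pig"], check=True)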
Hive
Apache Hive data warehouse software
facilitates querying and managing large datasets residing in distributed
storage. Hive provides a mechanism for querying the data using a SQL-like
language called HiveQL. At the same time, this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient
or inefficient to express this logic in HiveQL.
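As a small illustration, here is a HiveQL query issued from Python through the PyHive client; it assumes a HiveServer2 instance on localhost:10000 and a hypothetical weblogs table:

    from pyhive import hive

    # Connect to HiveServer2 and run a HiveQL aggregation; Hive compiles
    # the query down to cluster jobs behind the scenes.
    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()
    cursor.execute("SELECT ip, COUNT(*) AS hits FROM weblogs GROUP BY ip")
    for ip, hits in cursor.fetchall():
        print(ip, hits)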
Sqoop
Enterprises that use
Hadoop often find it necessary to transfer some of their data from traditional
relational database management systems (RDBMSs) to the Hadoop ecosystem.
Sqoop, a dedicated tool in the Hadoop ecosystem, can perform this transfer in an automated fashion. Moreover, the data imported into Hadoop can be transformed with MapReduce before it is exported back to the RDBMS. Sqoop can also generate Java classes for programmatically interacting with imported data.
Sqoop uses a connector-based architecture that allows it to use plugins
to connect with external databases.
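A hedged sketch of such a transfer, invoking the standard sqoop import tool from Python; the connection string, credentials, and table name are hypothetical:

    import subprocess

    # Import a MySQL table into HDFS with four parallel map tasks; Sqoop
    # also generates a Java class representing the imported rows.
    subprocess.run([
        "sqoop", "import",
        "--connect", "jdbc:mysql://dbhost/shop",
        "--username", "etl", "-P",          # -P prompts for the password
        "--table", "orders",
        "--target-dir", "/data/orders",
        "--num-mappers", "4",
    ], check=True)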
Flume
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data, such as logs, into the Hadoop Distributed File System (HDFS).
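Flume agents are driven by a properties file that names a source, a channel, and a sink. A minimal, illustrative configuration (agent and path names are hypothetical), written out from Python:

    config = """
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    # Tail an application log as the event source.
    a1.sources.r1.type = exec
    a1.sources.r1.command = tail -F /var/log/app/app.log
    a1.sources.r1.channels = c1

    # Buffer events in memory between source and sink.
    a1.channels.c1.type = memory

    # Land the events in HDFS.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/applogs
    a1.sinks.k1.channel = c1
    """

    with open("agent.conf", "w") as f:
        f.write(config)

    # The agent is then started with:
    #   flume-ng agent --conf conf --conf-file agent.conf --name a1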
Storm
Storm is a distributed,
real-time computation system for processing large volumes of high-velocity
data. Storm is extremely fast and can process over a million records per second
per node on a cluster of modest size. Enterprises harness this speed and
combine it with other data-access applications in Hadoop to prevent undesirable
events or to optimize positive outcomes.
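Storm topologies are usually defined in Java, but individual bolts can be written in Python through Storm's multilang protocol and run as ShellBolts. This sketch is modeled on the split-sentence example that ships with the Storm distribution and assumes the storm.py multilang module is on the path:

    import storm  # storm.py ships in Storm's multilang resources

    class SplitSentenceBolt(storm.BasicBolt):
        def process(self, tup):
            # tup.values[0] is the sentence emitted by the upstream spout.
            for word in tup.values[0].split(" "):
                storm.emit([word])

    SplitSentenceBolt().run()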
Kafka
Apache Kafka is a general-purpose messaging system that supports a wide range of use cases where high throughput, reliable delivery, and horizontal scalability are important. Apache Storm and Apache HBase both work very well in combination with Kafka.
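A short producer/consumer round trip with the kafka-python client; it assumes a broker on localhost:9092 and a hypothetical weblogs topic:

    from kafka import KafkaProducer, KafkaConsumer

    # Publish one message to the topic.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("weblogs", b'{"ip": "10.0.0.1", "path": "/index.html"}')
    producer.flush()

    # Read it back; a Storm bolt or an HBase writer would sit here instead.
    consumer = KafkaConsumer("weblogs",
                             bootstrap_servers="localhost:9092",
                             auto_offset_reset="earliest",
                             consumer_timeout_ms=5000)  # stop when idle
    for message in consumer:
        print(message.value)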
Oozie
Oozie is a workflow scheduler system to manage
Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions, whereas Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
Oozie is integrated with the rest of the Hadoop
stack and supports several types of Hadoop jobs out of the box (such as Java
map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as
system-specific jobs (such as Java programs and shell scripts).
Oozie is a scalable, reliable and extensible system.
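To give a feel for what a workflow definition looks like, here is a minimal, illustrative one-action DAG that runs a Pig script and then either succeeds or fails; the element names follow the uri:oozie:workflow:0.5 schema, and all properties and paths are hypothetical:

    workflow = """
    <workflow-app xmlns="uri:oozie:workflow:0.5" name="wordcount-wf">
      <start to="count"/>
      <action name="count">
        <pig>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <script>wordcount.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Pig action failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>
    """

    with open("workflow.xml", "w") as f:
        f.write(workflow)

    # Submitted with something like:
    #   oozie job -oozie http://host:11000/oozie -config job.properties -run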
Spark
Apache Spark is a fast, in-memory data-processing engine for distributed computing. It runs on top of existing Hadoop clusters and accesses the Hadoop data store (HDFS). Spark can be integrated with Hadoop 2’s YARN architecture, but cannot be used with Hadoop 1.0.
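The word count from the MapReduce section shrinks to a few lines in PySpark; this sketch assumes a Spark installation that can see the cluster's HDFS, and the paths are hypothetical:

    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount")
    counts = (sc.textFile("hdfs:///data/weblogs.txt")    # read from HDFS
                .flatMap(lambda line: line.split())       # 'map' side
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))         # 'reduce' side
    counts.saveAsTextFile("hdfs:///data/wordcounts_spark")

Because the intermediate RDDs can be kept in memory, iterative jobs avoid the disk round trips that a chain of MapReduce jobs would pay.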
NoSQL
NoSQL, also read as ‘Not Only SQL’, is an approach to data management and database design that is useful for very large sets of distributed data. NoSQL database systems are typically non-relational, distributed, open-source, and horizontally scalable. NoSQL seeks to solve the scalability and big-data performance issues that relational databases weren’t designed to address.
Apache
Cassandra
Apache Cassandra is an open-source distributed
database system designed for storing and managing large amounts of data across
commodity servers. Cassandra can serve as both a real-time operational data
store for online transactional applications and a read-intensive database for
large-scale business intelligence (BI) systems.
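A short CQL session through the DataStax Python driver; the node address, keyspace, and table are hypothetical, and the single-node replication settings are for illustration only:

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS shop WITH replication =
        {'class': 'SimpleStrategy', 'replication_factor': 1}
    """)
    session.execute("""
        CREATE TABLE IF NOT EXISTS shop.orders (
            order_id uuid PRIMARY KEY, customer text, total double)
    """)
    session.execute(
        "INSERT INTO shop.orders (order_id, customer, total) "
        "VALUES (uuid(), %s, %s)", ("alice", 19.99))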
Google’s
BigTable
Google’s BigTable is a distributed, column-oriented
data store created by Google Inc. to handle very large amounts of structured
data associated with the company's Internet search and Web services operations.
BigTable was designed to support applications
requiring massive scalability; from its first iteration, the technology was
intended to be used with petabytes of data. The database was designed to be
deployed on clustered systems and uses a simple data model that Google has
described as "a sparse, distributed, persistent multidimensional sorted
map."
Data is stored in sorted order by row key, and the map is indexed by row key, column key, and timestamp. Compression algorithms help achieve high capacity.
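That phrase is easier to grasp with a toy model: the sketch below imitates the shape of the index with an ordinary Python dict keyed by (row key, column key, timestamp); the row and column values echo the web-page example from the BigTable paper, and a real BigTable of course distributes and persists this map rather than holding it in one process:

    import time

    bigtable = {}  # (row_key, column_key, timestamp) -> value

    def put(row, column, value):
        # Every write is a new cell version, stamped with its own time.
        bigtable[(row, column, time.time())] = value

    put("com.cnn.www", "contents:html", "<html>...</html>")
    put("com.cnn.www", "anchor:cnnsi.com", "CNN")

    # Keys are kept sorted by row, so scans over a row range are cheap.
    for key in sorted(bigtable):
        print(key, "->", bigtable[key])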
MongoDB
MongoDB is a cross-platform, document-oriented
database. Classified as a NoSQL database, MongoDB shuns the traditional
table-based relational database structure in favor of JSON-like documents with
dynamic schemas (MongoDB calls the format BSON), making the integration of data
in certain types of applications easier and faster.
MongoDB is developed by MongoDB Inc. and is
published as free and open-source software under a combination of the GNU
Affero General Public License and the Apache License. As of July 2015, MongoDB
is the fourth most popular type of database management system, and the most
popular for document stores.
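Two documents with different shapes living in the same collection shows the dynamic-schema idea directly. A PyMongo sketch, assuming a local mongod on the default port and hypothetical database and collection names:

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    posts = client.blog.posts

    # No schema migration needed: each document carries its own shape.
    posts.insert_one({"title": "Hadoop 101", "tags": ["hadoop", "hdfs"]})
    posts.insert_one({"title": "Hello", "author": {"name": "Ann", "karma": 42}})

    for post in posts.find({"tags": "hadoop"}):
        print(post["title"])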
CouchDB
CouchDB is a database that completely embraces the
web. It stores your data with JSON documents. It accesses your documents and
queries your indexes with your web browser, via HTTP. It indexes, combines, and
transforms your documents with JavaScript.
CouchDB works well with modern web and mobile apps.
You can even serve web apps directly out of CouchDB. You can distribute your
data, or your apps, efficiently using CouchDB’s incremental replication.
CouchDB supports master-master setups with automatic conflict detection.
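Because CouchDB speaks plain HTTP and JSON, the standard requests library is already a complete client. A sketch against a local, unsecured development instance (database name and fields are hypothetical):

    import requests

    base = "http://localhost:5984"
    requests.put(base + "/recipes")                   # create a database
    r = requests.post(base + "/recipes",              # create a document
                      json={"title": "Pancakes", "servings": 4})
    doc_id = r.json()["id"]
    print(requests.get(base + "/recipes/" + doc_id).json())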
HBase
Apache HBase (Hadoop DataBase) is an open-source NoSQL database that runs on top of HDFS and provides real-time read/write access to large data sets.
HBase scales linearly
to handle huge data sets with billions of rows and millions of columns, and it
easily combines data sources that use a wide variety of different structures and
schemas. HBase is natively integrated with Hadoop and works seamlessly
alongside other data access engines through YARN.
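Row-keyed reads and writes are the heart of the API. A sketch using the HappyBase client over HBase's Thrift gateway; it assumes a Thrift server on localhost and a pre-created table named weblogs with a column family cf:

    import happybase

    connection = happybase.Connection("localhost")
    table = connection.table("weblogs")

    # Cells are addressed by row key and 'family:qualifier' column names.
    table.put(b"row-20150701-0001",
              {b"cf:ip": b"10.0.0.1", b"cf:path": b"/index.html"})

    # Real-time point read by row key.
    print(table.row(b"row-20150701-0001"))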
Neo4j
Neo4j is a graph
database management system developed by Neo Technology, Inc. Neo4j is described
by its developers as an ACID-compliant transactional database with native graph
storage and processing. According to db-engines.com, Neo4j is the most popular graph
database.
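A tiny graph created and queried with the official Neo4j Python driver over the Bolt protocol; the address and credentials are hypothetical:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687",
                                  auth=("neo4j", "secret"))
    with driver.session() as session:
        # Nodes and relationships are created and matched with Cypher.
        session.run("CREATE (:Person {name: 'Ann'})-[:KNOWS]->"
                    "(:Person {name: 'Bob'})")
        result = session.run("MATCH (a:Person)-[:KNOWS]->(b:Person) "
                             "RETURN a.name AS a, b.name AS b")
        for record in result:
            print(record["a"], "knows", record["b"])
    driver.close()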