Different Components Used in Hadoop Ecosystem



Hey Everyone. Have you checked the previous article on Best Movies on Data Science and Machine Learning? If not, then please check it.

So, today we will look at an important topic in Big Data: the components used in the Hadoop Ecosystem. First we will define what the Hadoop Ecosystem is, then list its components, and finally give a detailed overview of each.

The Hadoop Ecosystem is a suite of services that work together to solve big data problems. The four core components are MapReduce, YARN, HDFS, and Common. Let's get into a detailed conversation on these topics. Before that, let's list out all the components used in the Hadoop Ecosystem.

Components of Hadoop Ecosystem

1. Hadoop Yarn
2. Hadoop HDFS
3. MapReduce
4. Pig
5. Hive
6. Apache HBase
7. HUE
8. ZooKeeper
9. Ambari
10. Sqoop
11. Oozie
12. Flume
13. HCatalog
14. Thrift
15. Drill
16. Mahout
17. Avro
18. Chukwa and More..

• Hadoop Core Components

Starting with the first component.

1. Hadoop Distributed File System (HDFS):

Features of HDFS

- One of the most reliable storage systems available

- It is Scalable and reliable

- It is Highly fault-tolerant 
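
To make this concrete, here is a minimal sketch of writing and listing a file through the HDFS Java API; the cluster URI (hdfs://localhost:9000) and the paths are placeholder assumptions for a local setup.

```java
// Minimal sketch: write and list a file via the HDFS Java API.
// The NameNode URI and paths are placeholder assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the NameNode; the URI assumes a local single-node cluster.
        FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);

        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // List the directory to confirm the file is stored.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
        fs.close();
    }
}
```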

2. MapReduce: 

Features of MapReduce are

- It is Simple, massively scalable, and fault tolerant 

- It provides a programming model for processing huge amounts of data in parallel

- It moves computation to the data rather than data to the computation
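
As a quick illustration of the programming model, here is the classic word-count job sketched in Java; the input and output paths come from the command line and are assumptions of this example.

```java
// Classic word-count sketch: the map phase emits (word, 1), the reduce phase sums counts.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in a line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```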

3. Yet Another Resource Negotiator (YARN): 

- It provides stable, reliable, and shared operational services across multiple workloads 

- Distributed resource management layer 

- It enables Hadoop to provide a general processing platform

• Hadoop High-level Data Processing Components 

There are only two components classified under this category: Hive and Pig.

1. Hive: 

Features of HIVE

- It enables users to perform ad-hoc analysis over huge volumes of data 

- Data warehousing on top of Hadoop

- It has a SQL-like interface (HiveQL) to query data 

- Hive is designed for easy data summarization 
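
As a small illustration of that SQL-like interface, here is a hedged sketch that queries HiveServer2 over JDBC; the connection URL, credentials, and the sales table are placeholder assumptions.

```java
// Minimal sketch: run an ad-hoc HiveQL query over JDBC (HiveServer2).
// Requires the Hive JDBC driver on the classpath; URL and table are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Assumes HiveServer2 is listening on the default port 10000.
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // HiveQL looks like SQL; the 'sales' table is a hypothetical example.
            ResultSet rs = stmt.executeQuery(
                "SELECT category, COUNT(*) AS cnt FROM sales GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString("category") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}
```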

2. Pig:

Features of PIG

- It's a platform for analyzing large data sets using a high-level language

- Top-level data processing engine  

- Compiles down to MapReduce jobs 

- It Uses the Pig Latin language 
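
Below is a minimal sketch of running Pig Latin statements from Java with PigServer in local mode; the input file, its field layout, and the output path are assumptions of the example.

```java
// Minimal sketch: a Pig Latin word count driven from Java via PigServer (local mode).
// Input file, schema, and output path are placeholder assumptions.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Each registerQuery call adds one Pig Latin statement to the logical plan.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");
        // store() compiles the plan down to (local or MapReduce) jobs and runs them.
        pig.store("counts", "word_counts");
        pig.shutdown();
    }
}
```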

• Hadoop NoSQL Components 

HBase is the only component that comes under this category. Features of HBase are:

- A distributed NoSQL database modelled after Google's Bigtable 

- It is a column-oriented NoSQL database 

- It handles Big Data with random reads and writes 

- It is also Scalable and fault-tolerant 
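
Here is a minimal sketch of a random write followed by a random read using the HBase Java client; the users table, the info column family, and the ZooKeeper quorum are placeholder assumptions.

```java
// Minimal sketch: random write and read with the HBase Java client.
// Table 'users', column family 'info', and the quorum address are assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "localhost"); // assumption: local cluster
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Random write: one row keyed by user id, one column in the 'info' family.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            // Random read: fetch the row back by key.
            Result result = table.get(new Get(Bytes.toBytes("user-42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```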

• Hadoop Data Analysis Components 

There are five components listed under this category: Hama, Drill, Crunch, Mahout, and Lucene.

1. Hama: 

Features of Hama are: 

- It provides an SQL-like query interface & vertex/neuron-centric programming models

- It's a framework for Big Data analytics

- It is based on the Bulk Synchronous Parallel (BSP) computing model 

- It's a cross-platform, distributed computing framework 

2. Drill: 

Unique features of Drill:

- Drill provides faster insights without the overhead of data loading or schema creation

- It is a schema-free SQL query engine for Hadoop

- Interactive analysis of large-scale datasets

- It analyzes multi-structured and nested data in non-relational datastores
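
As a small illustration of schema-free querying, here is a hedged sketch using Drill's JDBC driver against a raw JSON file; the connection string and file path are assumptions.

```java
// Minimal sketch: query nested JSON through Drill's JDBC driver, no schema declared up front.
// Assumes a Drillbit on localhost and a JSON file at the given path.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:drill:drillbit=localhost";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            // Drill infers the schema on the fly and can reach into nested fields.
            ResultSet rs = stmt.executeQuery(
                "SELECT t.name, t.address.city FROM dfs.`/data/customers.json` t LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString(1) + ", " + rs.getString(2));
            }
        }
    }
}
```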

3. Crunch: 

Unique Features of Crunch

- It's a Framework to write, test, and run MapReduce pipelines 

- Crunch simplifies complex tasks like joins and data aggregation 

- It runs on top of MapReduce and Spark 
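
Here is a minimal sketch of a Crunch pipeline that reads a text file and counts identical lines; the input and output paths are placeholder assumptions.

```java
// Minimal sketch: a Crunch pipeline that counts identical lines in a text file.
// Input and output paths are placeholder assumptions.
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

public class CrunchExample {
    public static void main(String[] args) throws Exception {
        // MRPipeline compiles the logical plan into one or more MapReduce jobs.
        Pipeline pipeline = new MRPipeline(CrunchExample.class, new Configuration());

        PCollection<String> lines = pipeline.readTextFile("input.txt");
        // count() groups identical elements and tallies them.
        PTable<String, Long> counts = lines.count();

        pipeline.writeTextFile(counts, "line_counts");
        pipeline.done(); // triggers execution
    }
}
```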

4. Mahout: 

Mahout features include:

- It's a scalable machine learning library built on top of Hadoop and one of the most widely used such libraries 

- A popular data science tool that automatically finds meaningful patterns in big data

- Distributed linear algebra framework

- It supports multiple distributed backends like Spark 
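
As an illustration, here is a hedged sketch of a user-based recommender built with Mahout's classic Taste API (a legacy but widely cited part of the library); the ratings.csv file and its userID,itemID,rating layout are assumptions.

```java
// Minimal sketch: user-based collaborative filtering with Mahout's classic Taste API.
// The ratings.csv file (userID,itemID,rating per line) is a placeholder assumption.
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutExample {
    public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Recommend 3 items for user 1 based on similar users' ratings.
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " (score " + item.getValue() + ")");
        }
    }
}
```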

5. Lucene: 

- Lucene is a High-performance text search engine 

- Information-retrieval software library

- It is used for searching and indexing

- It offers accurate and efficient search algorithms

- Cross-platform, Scalable, powerful, and accurate. 
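
Here is a minimal sketch of indexing one document and searching it with Lucene, assuming a recent (8.x/9.x) release; the field name and document text are placeholder assumptions.

```java
// Minimal sketch: index one document in memory and search it with Lucene.
// Assumes a recent Lucene release (ByteBuffersDirectory exists from 8.0 onward).
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory();

        // Index a single document with one analyzed text field.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body",
                "Lucene makes full-text search fast and accurate", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search the index for the term "search".
        Query query = new QueryParser("body", analyzer).parse("search");
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```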

• Hadoop Data Serialization Components 

Avro and Thrift are classified under this category 

1. Avro: 

- Avro is a Data serialization framework 

- It serializes data in a compact, fast, binary data format 

- It uses JSON to define types and protocols 

- It also provides a container file to store persistent data 
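
Here is a minimal sketch showing a JSON-defined schema, binary serialization, and an Avro container file; the User schema and field values are assumptions of the example.

```java
// Minimal sketch: define an Avro schema in JSON, write a record to a container file, read it back.
// The User schema and field values are placeholder assumptions.
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
    public static void main(String[] args) throws Exception {
        // Avro types are defined in JSON.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Ada");
        user.put("age", 36);

        // Write the record to an Avro container file in compact binary form.
        File file = new File("users.avro");
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(user);
        }

        // Read it back; the schema travels with the container file.
        try (DataFileReader<GenericRecord> reader =
                     new DataFileReader<>(file, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}
```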

2. Thrift:

Features of Thrift are.. 

- Thrift provides a language agnostic framework 

- Interface definition language and binary communication protocol 

- It's a Remote Procedure Call (RPC) framework
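
The sketch below hints at what a Thrift RPC call looks like from Java. Note that UserService.Client and getUserName are hypothetical names standing in for classes the Thrift compiler would generate from an IDL file, so this is illustrative only; the host and port are also assumptions.

```java
// Illustrative sketch of a Thrift RPC client call.
// UserService.Client is HYPOTHETICAL - it would be generated by the Thrift compiler
// from a .thrift interface definition; this will not compile without that generated code.
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ThriftClientExample {
    public static void main(String[] args) throws Exception {
        // Assumes a Thrift server listening on port 9090.
        TTransport transport = new TSocket("localhost", 9090);
        transport.open();
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        // Generated client stub speaks the binary protocol over the transport.
        UserService.Client client = new UserService.Client(protocol);
        System.out.println(client.getUserName(42));
        transport.close();
    }
}
```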

• Hadoop Data Transfer Components 

Chukwa, Sqoop, and Flume come under this category. 

1. Sqoop:
 
- This tool is designed for efficiently transferring bulk data between Hadoop and RDBMS 

- Sqoop Parallelizes data transfer 

- It Allows data imports from external datastores 

- It Uses MapReduce to import and export the data
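
Here is a hedged sketch of kicking off a Sqoop import from Java using the Sqoop 1 client entry point; the MySQL URL, credentials, table, and target directory are placeholder assumptions.

```java
// Minimal sketch: trigger a Sqoop import programmatically (Sqoop 1 client API).
// The JDBC URL, credentials, table, and target directory are placeholder assumptions.
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        String[] importArgs = new String[] {
            "import",
            "--connect", "jdbc:mysql://localhost/shop",
            "--username", "etl",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/demo/orders",
            "--num-mappers", "4"   // parallel transfer via 4 map tasks
        };
        // Sqoop translates these options into a MapReduce import job.
        int exitCode = Sqoop.runTool(importArgs);
        System.exit(exitCode);
    }
}
```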

2. Chukwa: 

- Chukwa is a Data collection system for monitoring large distributed systems 

- It provides a scalable and robust toolkit to analyse logs 

- It is designed for log collection and analysis 

3. Flume: Data collection & aggregation system 

- Flume is a service for streaming event data. It is reliable, scalable, fault-tolerant, and customizable 

- It has a distributed pipeline architecture 

• Management Components used in Hadoop

1. HCatalog: 

- It's a table and storage management layer 

- It's an interface between Hive, Pig, and MapReduce 

- It gives access to Hive metastore tables. HCatalog has a shared schema and data types

2. Oozie: Server-based workflow scheduling system 

- It provides workflow management and coordination, and runs workflows based on predefined schedules
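
Here is a minimal sketch of submitting a workflow through the Oozie Java client; the Oozie URL, application path, and NameNode address are placeholder assumptions, and the workflow.xml itself is assumed to already exist in HDFS.

```java
// Minimal sketch: submit and check a workflow with the Oozie Java client.
// Oozie URL, HDFS application path, and job properties are placeholder assumptions.
import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmitExample {
    public static void main(String[] args) throws Exception {
        OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://localhost:9000/user/demo/my-workflow");
        conf.setProperty("nameNode", "hdfs://localhost:9000");
        conf.setProperty("queueName", "default");

        // run() submits and starts the workflow defined in workflow.xml under APP_PATH.
        String jobId = oozie.run(conf);
        System.out.println("Submitted workflow " + jobId);

        WorkflowJob job = oozie.getJobInfo(jobId);
        System.out.println("Status: " + job.getStatus());
    }
}
```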

• Hadoop Monitoring Components 

1. Ambari: Hadoop deployment, management & monitoring tool 

- It provides a wizard for installing Hadoop across any number of hosts 

- Ambari provides central management for starting, stopping, and reconfiguring Hadoop services. It also contains a dashboard for monitoring the health and status of the Hadoop cluster 

2. ZooKeeper: Highly reliable distributed coordination system 
 
- It is a centralized service for maintaining configuration information and allows distributed processes to coordinate with each other. It is a reliable, fast, simple, and scalable component.
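
As a small illustration of centralized configuration, here is a hedged sketch using the ZooKeeper Java client to store and read a shared value; the connect string and znode path are assumptions.

```java
// Minimal sketch: store and read a shared configuration value with the ZooKeeper client.
// Connect string and znode path are placeholder assumptions.
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Create a znode holding a shared configuration value.
        String path = "/demo-config";
        zk.create(path, "128".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any distributed process can now read (and watch) the same value.
        byte[] data = zk.getData(path, false, null);
        System.out.println("batch-size = " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```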

Anything missing? If yes, then please share it with us on our social media. And for more informative articles on AI, ML, Data Science, and Programming, stay tuned with us.