
Hadoop is an open-source distributed storage and processing framework administered by the Apache Software Foundation. Its contributors work for some of the world's biggest technology companies, such as Yahoo, and its design was inspired by storage and processing systems papers published by Google. The open-source community, together with vendors offering commercial distributions, has produced a genuinely innovative platform for consolidating, combining, and understanding data.

Enterprises today collect and generate more data than ever before. Relational database and data warehouse products excel at OLTP and OLAP workloads over structured data; Hadoop was designed for what those systems handle poorly: scalable, reliable storage and analysis of both structured and complex, unstructured data. As a result, many enterprises deploy Hadoop alongside their legacy IT systems, allowing them to combine old and new data sets in powerful new ways.

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce. Hadoop runs on a collection of commodity, shared-nothing servers.

Unlike HPC architectures, which pair centralized storage with distributed compute, Hadoop distributes both storage and compute. It breaks your data into blocks and spreads them across servers that handle both the storage and the processing jobs. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It automatically stores multiple copies of data, which increases reliability, and it can deliver data – and run large-scale, high-performance processing jobs – in spite of system changes or failures. You can scale a Hadoop cluster simply by adding nodes. This tutorial does a good job of helping you grasp the map/reduce concept: http://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
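The "multiple copies of data" behavior described above is governed by HDFS's replication factor. As an illustration, in a standard HDFS deployment this is set per cluster in hdfs-site.xml; 3 is the HDFS default, meaning every block lives on three different servers:

```xml
<!-- hdfs-site.xml: number of copies HDFS keeps of each data block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

With this setting, losing any single node leaves at least two live replicas of every block, and HDFS re-replicates in the background to restore the configured count.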


The core Hadoop project comprises four modules:

  • Hadoop Common – utilities and libraries referenced by other Hadoop software
  • Hadoop Distributed File System (HDFS) – a Java-based file system that stores data across multiple machines without requiring it to be structured in advance
  • MapReduce – a processing model with two parts: map and reduce. Map converts a set of input data into an intermediate dataset in which individual elements are broken into tuples (key/value pairs); reduce takes the output of the map step and combines those tuples into smaller sets
  • YARN (Yet Another Resource Negotiator) – a framework for job scheduling and cluster resource management
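The map and reduce steps described above can be sketched in plain Java. This is a minimal, single-process word-count illustration of the concept using only the standard library – not the Hadoop MapReduce API, which adds job configuration, input splitting, and distribution across the cluster:

```java
import java.util.*;
import java.util.stream.*;

// Single-process sketch of the map/reduce idea: count word occurrences.
public class WordCountSketch {
    // Map phase: turn one input line into (word, 1) key/value pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.toLowerCase().split("\\s+"))
                     .filter(w -> !w.isEmpty())
                     .map(w -> Map.entry(w, 1));
    }

    // Reduce phase: sum all values that share the same key.
    static Map<String, Integer> reduce(Stream<Map.Entry<String, Integer>> pairs) {
        return pairs.collect(Collectors.toMap(
                Map.Entry::getKey, Map.Entry::getValue, Integer::sum));
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data big storage", "data at scale");
        Map<String, Integer> counts =
                reduce(input.stream().flatMap(WordCountSketch::map));
        System.out.println(counts); // e.g. big=2, data=2, storage=1, ...
    }
}
```

In real Hadoop, the map calls run in parallel on the nodes holding each data block, and the framework shuffles all pairs with the same key to the same reducer – the logic, however, is exactly this shape.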


Hadoop's main strengths:

  • Low cost per byte
  • Excellent for processing unstructured data
  • Provides storage close to compute resources
  • Scales to massive compute and storage sizes


Its main drawback:

  • Can be complex to implement and manage


Consider Hadoop when:

  • You need high availability (HA), which Hadoop can provide in some configurations
  • Your data volumes are in the TB or PB range
  • Your data sets contain mixed data types
  • Your organization has Java programming skills (the language Hadoop is written in)
  • You expect significant data growth in the future