
Hadoop is an open-source framework for distributed storage and processing of large data sets, administered by the Apache Software Foundation. Hadoop's contributors work for some of the world's biggest technology companies, Yahoo among them, and its design draws on Google's published papers on the Google File System and MapReduce. The open-source community, together with companies that sell commercial distributions of Hadoop, has produced a genuinely innovative platform for consolidating, combining and understanding data. Enterprises today collect and generate more data than ever before. Relational and data warehouse products excel at OLAP and OLTP workloads over structured data. Hadoop was designed for a different problem: scalable, reliable storage and analysis of both structured and complex, unstructured data. As a result, many enterprises deploy Hadoop alongside their legacy IT systems, allowing them to combine old and new data sets in powerful new ways.

Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce. Hadoop runs on a collection of commodity, shared-nothing servers.

Unlike HPC, which typically pairs centralized storage with distributed compute, Hadoop distributes both storage and compute. Hadoop breaks your data into blocks that are stored across the servers in the cluster, and those same servers handle both the storage and the compute jobs. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It automatically stores multiple copies of each block, which increases reliability, so it can deliver data and run large-scale, high-performance processing jobs in spite of system changes or failures. You can scale out your Hadoop system simply by adding nodes. This tutorial does a good job of explaining the map/reduce concept: http://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
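The block-and-replica idea described above can be illustrated with a small, pure-Python simulation. This is only a sketch and does not use any Hadoop API: the function names, the node names, the tiny block size and the round-robin placement are all assumptions made for illustration (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the input into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Hypothetical round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def re_replicate(placement: dict[int, list[str]], failed: str,
                 nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Simulate self-healing: blocks that lost a copy are copied to a healthy node."""
    healthy = [n for n in nodes if n != failed]
    target = min(replication, len(healthy))
    healed = {}
    for block, holders in placement.items():
        survivors = [n for n in holders if n != failed]
        candidates = [n for n in healthy if n not in survivors]
        while len(survivors) < target:
            survivors.append(candidates.pop(0))
        healed[block] = survivors
    return healed


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3", "node4"]    # hypothetical cluster
    blocks = split_into_blocks(b"x" * 10, block_size=4)
    placement = place_replicas(len(blocks), nodes, replication=3)
    print(placement)
    print(re_replicate(placement, "node2", nodes, replication=3))
```

After the simulated failure of node2, every block still has three copies on the remaining nodes, which is the behaviour the text calls self-healing.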

MAIN ELEMENTS OF HADOOP

  • Hadoop Common – utilities and libraries referenced by other Hadoop software
  • Hadoop Distributed File System (HDFS) – a Java-based file system that stores data across multiple machines without requiring prior structuring
  • MapReduce – a programming model with two parts, map and reduce. Map converts a set of data into a different dataset in which individual elements are broken down into tuples (key/value pairs). Reduce takes the output of the map operation and combines those tuples into a smaller set of tuples
  • YARN (Yet Another Resource Negotiator) – a resource manager that handles job scheduling and cluster resource management
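
The map and reduce steps listed above can be sketched with the classic word-count example. The code below is plain Python that runs on a single machine and does not call Hadoop; the function names and the in-memory shuffle step are assumptions chosen for illustration. In a real cluster the framework runs the map and reduce functions in parallel across many servers and handles the grouping between them.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: turn one line of text into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all values that share a key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the values for one key into a single, smaller result."""
    return key, sum(values)


def word_count(lines: list[str]) -> dict[str, int]:
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

Each input line is mapped independently of the others, which is why the map step can be spread across the servers that already hold the data blocks.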

ADVANTAGES

  • Low cost per byte
  • Excellent for processing unstructured data
  • Provides storage close to compute resources
  • Scales to massive compute and storage sizes

DISADVANTAGES

  • Can be complex to implement and manage

BEST USES

  • When you need high availability (HA), which Hadoop can provide in some cases
  • When your data volumes are in the TB or PB range
  • When your data contains mixed data types
  • When your organization has Java programming skills (Hadoop is written in Java)
  • When you expect your data to grow substantially in the future