COT 4401 Top 10 Algorithms
Chris Lacher
Big Data: MapReduce

Notice: These notes will be undergoing upgrade (additions & corrections only) - return frequently until this notice disappears.

Background & Resources


MapReduce


Hadoop

Hadoop is a framework to support big data science. Hardware-failure tolerance and distributed-computing management are handled below the level at which the user needs to be aware of them. It is intended to be a "google-like" system with unknown and retargetable applications: roughly "template < Application > Google;".

Quote from Wikipedia:

Apache Hadoop is an open-source software framework (written in Java) for distributed storage and distributed processing of very large data sets ("Big Data") on computer clusters built from commodity hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are commonplace and thus should be automatically handled in software by the framework.

The core of Apache Hadoop consists of a storage part (Hadoop Distributed File System (HDFS)) and a processing part (MapReduce). Hadoop splits files into large blocks (default 64MB or 128MB) and distributes the blocks amongst the nodes in the cluster. To process the data, Hadoop Map/Reduce transfers code (specifically Jar files) to nodes that have the required data, which the nodes then process in parallel. This approach takes advantage of data locality to allow the data to be processed faster and more efficiently via distributed processing than by using a more conventional supercomputer architecture that relies on a parallel file system where computation and data are connected via high-speed networking.
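The map/shuffle/reduce pipeline just described can be sketched as an in-memory toy in Python (this is not Hadoop itself; the word-count task and the function names are illustrative assumptions):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all intermediate values by key, as the framework
    # does automatically between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a final result.
    return (key, sum(values))

documents = ["the map step", "the reduce step"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])  # 2
```

In real Hadoop, each phase runs in parallel on many nodes, with the map tasks scheduled on the nodes that already hold the relevant HDFS blocks.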

The base Apache Hadoop framework is composed of the following modules:

- Hadoop Common - the libraries and utilities needed by the other Hadoop modules;
- Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines;
- Hadoop YARN - a resource-management platform responsible for managing computing resources in clusters and using them for scheduling users' applications;
- Hadoop MapReduce - a programming model for large-scale data processing.

Since 2012, the term "Hadoop" often refers not just to the base modules above but also to the collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Spark, and others.

Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on their MapReduce and Google File System. [See references at top of these notes.]

The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell-scripts. For end-users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program. Other related projects expose other higher level user interfaces.
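Hadoop Streaming lets the "map" and "reduce" parts be ordinary programs that read lines from standard input and write lines to standard output. A minimal word-count sketch in Python (the file name, the two-phase command-line convention, and the tab-separated "word\t1" format are conventional choices, not anything this course requires):

```python
import sys

def mapper(lines):
    # Map: emit "word<TAB>1" for every word on every input line.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Hadoop sorts the mapper output by key before the reducer sees it,
    # so all lines for one word arrive consecutively.
    current, total = None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Hypothetical usage: pass "map" or "reduce" as the streaming command,
    # e.g.  -mapper "python wc.py map"  -reducer "python wc.py reduce".
    phase = mapper if sys.argv[1] == "map" else reducer
    for out in phase(sys.stdin):
        print(out)
```

The same stdin/stdout contract works in any language, which is what makes Hadoop Streaming language-agnostic.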

Prominent corporate users of Hadoop include Facebook and Yahoo. It can be deployed in traditional onsite datacenters as well as via the cloud; e.g., it is available on Microsoft Azure, Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3), Google App Engine and IBM Bluemix cloud services.

Apache Hadoop is a registered trademark of the Apache Software Foundation.

Note that a patent for the MapReduce algorithm has been issued to Google, Inc. That issuance has been disputed on several grounds, including existence of prior art and lack of novelty.


Exercise 1 History.
How did Hadoop get its name? Who is Doug Cutting? What was the first big task of (and possibly the reason for inventing) MapReduce?
