by Faraz Ahmad, Seyong Lee, Mithuna Thottethodi, T. N. Vijaykumar
1. Word-Count
counts the
occurrences of each word in a large collection of documents. Map emits <word,1> tuples.
Reduce adds up the counts for a given word from all map tasks and outputs the final
count.
Input format: any
document (usually a web document in text/xml format)
Output format: <word> <count>
Dataset:
Downloaded from http://dumps.wikimedia.org/enwiki/
. Due to HDFS (Hadoop File System) limitations, the datasets needed some
processing such as (i) copying all files from multiple hierarchical directories
to one directory, (ii) merging multiple files together to create small number
of large-sized files rather than large number of small-sized files, and (iii)
eliminating special character file names.
Command-line execution: $ bin/hadoop jar
hadoop-*-examples.jar wordcount –r <num-reduces> <input-dir>
<output-dir
2. Inverted-Index
takes a list
of documents as input and generates word-to-document indexing. Map emits <word, docId> tuples with each word emitted once per docId. Reduce
combines all tuples on key <word> and emits <word,list(docId)> tuples after removing duplicates.
Input format: any
document (usually a web document in text/xml format)
Output format: <word> <docId>
Dataset: Same
as for Word-Count.
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar invertedindex –m<num-maps> -r <num-reduces>
<input-dir> <output-dir>
Note: To let Hadoop figure out the number of map tasks, supply –m 1 here.
3. Term-Vector
determines the
most frequent words in a set of documents and is useful in the analyses of a
host’s relevance to a search. Map emits <host,termvector> tuples
where termvector is itself a tuple of the form <word, 1>. Reduce
discards the words whose frequency is below some cut-off, sorts the rest of the
list per key in a descending order with respect to count and emits tuples of
the form <host, list(termvector)>.
Input format: any document (usually a web document in text/xml format)
Output format: <host> <termvector>
Dataset: Same as
for Word-Count.
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar termvectorperhost –m<num-maps> -r <num-reduces>
<input-dir> <output-dir>
4. Self-Join
is similar to the candidate generation part of the a priori data mining algorithm to
generate association among k+1 fields
given the set of k-field
associations. Map receives candidate lists of the form {element1, element2, ...., elementk}, each list in alphanumerically sorted order. Map breaks these lists into
<{element1, element2,
....,elementk-1}, {elementk}> tuples. Reduce prepares a sorted list of all the Map values for a given
key by building <{element1, element2, ....,
elementk-1}, {val1,val2, ...., valj}> tuples.
From these tuples, (k+1)-sized candidates
can be obtained by appending consecutive pairs of map values vali, vali+1 to the (k-1)-sized key. By avoiding
repetition of (k-1)-sized keys
for every pair of values in the list, tuples are a compact representation of
the (k+1)-sized candidates set.
Input format: {e1,e2,
….., ek}
Output format: <e1,e2,
….. ek-1>< ek,
ek+1 >
Dataset:
Synthetic data (details here)
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar selfjoin –m<num-maps> -r <num-reduces>
<input-dir> <output-dir>
5. Adjacency-List
is similar
to search-engine computation to generate adjacency and reverse adjacency lists
of nodes of a graph for use by PageRank-like algorithms. Map receives as inputs
graph edges <p, q> of a directed graph that follows the power law of the World-wide Web.
For the input, we assume the probability, that a node has an out-degree of i, is proportional to 1/iskew
with an average out-degree of 7.2. Map emits tuples of the form <q, from_list{p}:to_list{}> and <p, from_list{}:to_list{q}>. For a
given key, Reduce generates unions of the respective lists in the from_list and to_list fields,
sorts the items within the union lists, and emits <p(and
q), from_list{sorted union of all individual from_list}:to_list{sorted union
of all individual to_list}> tuples.
Input format: {p,q}
Output format: <p><from{list_of_in_degree}:to{list_of_out_degree}>
, <q><from{list_of_in_degree}:to{list_of_out_degree}>
Dataset:
Synthetic data (details here)
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar adjlist –m<num-maps> -r <num-reduces>
<input-dir> <output-dir>
6. K-Means
Input Format: {movie_id:
userid1_rating1, userid2_rating2, ...}
Output Format: kmeans produces two types of outputs:
(a) <centroid_num><{movie_id: userid1_rating1,
userid2_rating2, ...}> (list of
all movies associated with a particular centroid)
(b) <centroid_num><{similarity_value}{centroid_movie_id}{num_members}{userid1_rating1,
userid2_rating2, …}> (new centroid}
Datasets: movie
ratings dataset.
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar kmeans –m <num-maps> -r <num-reduces>
<input-dir> <output-dir>
Note: To run multiple iterations of kmeans, a
wrapper script should be executed. The details can be found here.
7. Classification
classifies the
movies into one of k pre-determined
clusters. Similar to k-means, classification uses anonymized movies rating data which is of the form <movie_id: list{rater_id, rating}>. Similar to k-means, Map
computes the cosine vector similarity of a given movie with the centroids, and
determines the centroid to which the movie is closest (i.e., the cluster to
which it belongs). Map emits <centroid_id, movie_id)>. Unlike k-means, the
details of movie ratings are not emitted because there are no further
iterations which may need the details. Reduce collects all the movies in a
cluster and emits <centroid_id,
movie_id>.
Input Format: {movie_id:
userid1_rating1, userid2_rating2, ….}
Output Format: <centroid_num><movieid>
Datasets: movie
ratings dataset.
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar classification –m <num-maps>
-r <num-reduces> <input-dir>
<output-dir>
8. Histogram-Movies
generates a
histogram of input data and is a generic tool used in many data analyses. We
use the movie rating data and the input is of the form <movie_id: list{rater_id, rating}>. Based on the average ratings of movies (ratings range from 1 to 5) we
bin the movies into 8 bins each with a range of 0.5. Map computes the average
rating for a movie, determines the bin, and emits <bin_value, 1> tuples. Reduce
collects all the tuples for a bin and outputs a <bin_value, n> tuple.
Input Format: {movie_id:
userid1_rating1, userid2_rating2, ….}
Output Format: <bin_value><num_of_movies>
Datasets: movie
ratings dataset.
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar histogram_movies –m <num-maps> -r <num-reduces>
<input-dir> <output-dir>
9. Histogram-Ratings
generates a
histogram of the user ratings trend. The input is of the form <movie_id: list{rater_id, rating}>. Here, we bin the user ratings of 1-5 into 5 bins and Map emits <rating, 1> tuple
for each review. Reduce collects all the tuples for a rating and emits a <rating, n> tuple.
Input Format: {movie_id:
userid1_rating1, userid2_rating2, ….}
Output Format: <rating ><num_of_user_reviews>
Datasets: movie
ratings dataset.
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar histogram_ratings –m <num-maps> -r <num-reduces>
<input-dir> <output-dir>
10. Sequence-Count
generates a count
of all unique sets of three consecutive words per document in the input data.
Map emits <word1|word2|word3|filename, 1> tuples.
Reduce adds up the counts for the multi-words from all map tasks and outputs the final
count.
Input format: any document (usually a web document in text/xml format)
Output format: <word1|word2|word3|filename>
<count>
Dataset: Same as
for Word-Count.
Command-line execution: $ bin/hadoop jar hadoop-*-examples.jar sequencecount –m <num-maps>
–r <num-reduces> <input-dir>
<output-dir>
11. Ranked-Inverted-Index
takes list of
words and their frequencies per document and generates lists of documents
containing the given words in decreasing order of frequency. Map takes
sequence-count benchmark’s output <word-sequence|filename,n> as its input and separates counts from the rest of the data in the
input. Map output format is <word-sequence,
{filename,n}>. Reduce takes all map outputs and produces a list per word-sequence in
decreasing order of occurrence in the respective documents <word-sequence><{count1,
file1},{count2, file2}, …>.
Input format: <word-sequence|filename><count>
Output format: <word-sequence> <count | file>
Dataset: Output
of Sequence-Count
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar rankedinvertedindex –m <num-maps> –r <num-reduces>
<input-dir> <output-dir>
12. Tera-Sort
Input format: {10-bytes}{90-bytes}
Output format: <10-bytes><90-bytes>
Dataset:
Generated through TeraGen in Hadoop. Here is a sample
3GB dataset.
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar terasort <input-dir> <output-dir> <num-reduces>
13. Grep
Input format: any document (usually a web document in text/xml format)
Output format: <regex> <count>
Dataset: Same as
for Word-Count.
Command-line execution: $ bin/hadoop
jar hadoop-*-examples.jar grep <input-dir>
<output-dir> <num-reduces>
<regex> [<group>]