Infrastructure

Linux Cluster hosted at Amazon EC2
Cloudera Cluster (CDH 5.1) running on Ubuntu 12.04
Access via SSH (Windows User can use Putty)
Python 2.7.3, Python Documentation: http://docs.python.org/
Hostname: cloud.luckow-hm.de (54.83.41.227)
HDFS Web: http://cloud.luckow-hm.de:50070/
YARN Web: http://cloud.luckow-hm.de:8088/cluster
Cloudera Manager: http://cloud.luckow-hm.de:7180/

Please change your initial password with passwd!

1. Using SSH and Linux

Data/Tools:

Use an SSH client of your choice (e.g. Putty for Windows or SSH in your Linux/Mac OS Terminal)
Data: cloud.luckow-hm.de:/data/NASA_access_log_Jul95

Please login into the Hadoop cluster on Amazon!
Answer the following questions using the command (hadoop dfsadmin -report):
- How big is the Hadoop cluster?
- How many data nodes are used?
Upload the file cloud.luckow-hm.de:/data/NASA_access_log_Jul95 to your HDFS home directory! How many blocks does HDFS allocate for this file? On what host are these blocks?

2. MapReduce Hello World

Data/Tools:

MapReduce Application: hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount

Run the WordCount example of Hadoop:

Create two test files containing text and upload them to HDFS!
Use the MapReduce program WordCount for processing these files!

3. Commandline Data Analytics

Data/Tools:

http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
Commandline data tools https://github.com/bitly/data_hacks
Data: cloud.luckow-hm.de:/data/NASA_access_log_Jul95

Use the commands head, cat, uniq, wc, sort, find, xargs, awk to evaluate the NASA log file:
- Which page was called the most?
- What was the most frequent return code?
- How many errors occurred? What is the percentage of errors?
Implement a Python version of this Unix Shell script using this script as template!

Run the Python script inside an Hadoop Streaming job.

 hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -info

4. Spark

Data/Tools:

Spark Programming Guide: https://spark.apache.org/docs/1.1.0/programming-guide.html (use Python API recommended)
Spark API: https://spark.apache.org/docs/1.1.0/api/python/index.html

Implement a wordcount using Spark. Make sure that you only allocate 1 core for the interactive Spark shell:
```
 pyspark --total-executor-cores 1
```
Implement the NASA log file analysis using Spark!

Solution

5. Hadoop SQL Engines

Data/Tools:

Hive User Guide: https://cwiki.apache.org/confluence/display/Hive/GettingStarted
Hive ORC: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.0.2/ds_Hive/orcfile.html
Hive Parquet: http://www.cloudera.com/content/cloudera/en/documentation/cdh5/v5-0-0/CDH5-Installation-Guide/cdh5ig_parquet.html

Create a Hive table for the NASA Log files! Use either python or awk to convert the log file to a structured format (CSV) that is manageable by Hive! Use the text format for the table definition!
```
 cat /data/NASA_access_log_Jul95 |awk -F' ' '{print "\""$4 $5"\","$(NF-1)","$(NF)}' > nasa.csv
```
Run an SQL query that outputs the number of occurrences of each HTTP response code!
Based on the initially created table define an ORC and Parquet-based table. Repeat the query!
Run the same query with Impala!

Solution

6. Data Analytics

Data/Tools:

Spark MLLib KMeans Example: https://spark.apache.org/docs/1.1.0/mllib-clustering.html

Run KMeans on the provided example dataset!
Validate the quality of the model using the sum of the squared error for each point!

Solution

7. Hadoop Benchmarking

Run the program Terasort on 1 GB of data - each record that TeraGen generates is 100 Bytes in size:

 hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen <number_of_records> <output_directory>

 hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort <input_directory> <output_directory>

How many containers are consumed during which phase of the application: teragen, terasort (map phase, reduce phase)? Please explain! See blog post.

In-Class Exercises

Infrastructure

1. Using SSH and Linux

2. MapReduce Hello World

3. Commandline Data Analytics

4. Spark

5. Hadoop SQL Engines

6. Data Analytics

7. Hadoop Benchmarking