Infrastructure

Please change your initial password with passwd!


1. Using SSH and Linux


Data/Tools:

  • Use an SSH client of your choice (e.g. PuTTY on Windows or ssh in your Linux/macOS terminal)
  • Data: cloud.luckow-hm.de:/data/NASA_access_log_Jul95
  1. Please log in to the Hadoop cluster on Amazon!

  2. Answer the following questions using the command hadoop dfsadmin -report (on current Hadoop versions: hdfs dfsadmin -report):
    • How big is the Hadoop cluster (configured and remaining HDFS capacity)?
    • How many data nodes are used?
  3. Upload the file cloud.luckow-hm.de:/data/NASA_access_log_Jul95 to your HDFS home directory! How many blocks does HDFS allocate for this file? On which hosts are these blocks stored?


2. MapReduce Hello World


Data/Tools:

  • MapReduce Application: hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount

Run the WordCount example of Hadoop:

  1. Create two test files containing text and upload them to HDFS!
  2. Use the MapReduce program WordCount to process these files!


3. Commandline Data Analytics


Data/Tools:


  1. Use the commands head, cat, uniq, wc, sort, find, xargs, awk to evaluate the NASA log file:

    • Which page was requested most often?
    • What was the most frequent HTTP return code?
    • How many errors occurred? What is the percentage of errors?
  2. Implement a Python version of this Unix shell script, using this script as a template (a minimal sketch follows after this list)!

  3. Run the Python script inside a Hadoop Streaming job (a mapper/reducer sketch also follows after this list). The command below prints the available streaming options:

     hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar -info
    
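A minimal Python sketch for steps 1-2, assuming the log file is available locally as NASA_access_log_Jul95 and follows the Common Log Format; here, 4xx and 5xx responses are counted as errors (the linked template script is not reproduced):

     #!/usr/bin/env python
     # Minimal sketch: count requested pages and HTTP response codes in the NASA log.
     from collections import Counter

     pages = Counter()
     codes = Counter()

     with open("NASA_access_log_Jul95", errors="ignore") as log:
         for line in log:
             fields = line.split()
             if len(fields) < 7:
                 continue                  # skip malformed lines
             pages[fields[6]] += 1         # requested URL (token after the HTTP method)
             codes[fields[-2]] += 1        # HTTP response code (second-to-last field)

     total = sum(codes.values())
     errors = sum(n for code, n in codes.items() if code.startswith(("4", "5")))

     print("Most requested page:", pages.most_common(1))
     print("Most frequent response code:", codes.most_common(1))
     print("Errors: %d (%.2f%%)" % (errors, 100.0 * errors / total))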

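For the streaming job in step 3, a mapper/reducer pair could look like the following sketch; the file names mapper.py and reducer.py are assumptions, and they would be passed to hadoop-streaming.jar via the -files, -mapper, -reducer, -input and -output options. This variant counts HTTP response codes:

     #!/usr/bin/env python
     # mapper.py - emit "<response_code>\t1" for every request line read from stdin.
     import sys

     for line in sys.stdin:
         fields = line.split()
         if len(fields) >= 7:
             print("%s\t1" % fields[-2])

and the matching reducer:

     #!/usr/bin/env python
     # reducer.py - sum the counts per key; streaming delivers the mapper
     # output sorted by key, so a running total per key is sufficient.
     import sys

     current_key, count = None, 0
     for line in sys.stdin:
         key, value = line.rstrip("\n").split("\t", 1)
         if key != current_key:
             if current_key is not None:
                 print("%s\t%d" % (current_key, count))
             current_key, count = key, 0
         count += int(value)
     if current_key is not None:
         print("%s\t%d" % (current_key, count))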

4. Spark



Data/Tools:

  1. Implement a wordcount using Spark (a sketch follows after this list). Make sure that you allocate only one core for the interactive Spark shell:

     pyspark --total-executor-cores 1
    
  2. Implement the NASA log file analysis using Spark (see the second sketch below)!
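
A minimal wordcount sketch (step 1) for the interactive pyspark shell, where the SparkContext sc is predefined; the input and output paths are placeholders:

     # Word count on a text file in HDFS; "input.txt" and "wordcount_out" are placeholders.
     counts = (sc.textFile("input.txt")
                 .flatMap(lambda line: line.split())
                 .map(lambda word: (word, 1))
                 .reduceByKey(lambda a, b: a + b))

     counts.saveAsTextFile("wordcount_out")   # or: counts.take(10) to inspect a few results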

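For the NASA log analysis in step 2, a sketch along the same lines; it assumes the log file was uploaded to your HDFS home directory under its original name:

     # Analyze the NASA access log in the pyspark shell (`sc` is predefined).
     fields = (sc.textFile("NASA_access_log_Jul95")
                 .map(lambda line: line.split())
                 .filter(lambda f: len(f) >= 7)
                 .cache())

     # Most requested page (the URL is the token after the HTTP method).
     pages = fields.map(lambda f: (f[6], 1)).reduceByKey(lambda a, b: a + b)
     print(pages.takeOrdered(1, key=lambda kv: -kv[1]))

     # Response code distribution and share of 4xx/5xx errors.
     codes = fields.map(lambda f: (f[-2], 1)).reduceByKey(lambda a, b: a + b)
     total = fields.count()
     errors = codes.filter(lambda kv: kv[0].startswith(("4", "5"))).values().sum()
     print(codes.collect())
     print("Errors: %d (%.2f%%)" % (errors, 100.0 * errors / total))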
Solution


5. Hadoop SQL Engines



Data/Tools:

  1. Create a Hive table for the NASA log files! Use either Python or awk to convert the log file to a structured format (CSV) that Hive can read (a Python variant is sketched after this list)! Use the text format (STORED AS TEXTFILE) for the table definition!

     cat /data/NASA_access_log_Jul95 |awk -F' ' '{print "\""$4 $5"\","$(NF-1)","$(NF)}' > nasa.csv
    
  2. Run an SQL query that outputs the number of occurrences of each HTTP response code (the query is illustrated after this list)!

  3. Based on the initially created table, define an ORC-based and a Parquet-based table. Repeat the query!

  4. Run the same query with Impala!
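
If you prefer Python over awk for the conversion in step 1, here is a sketch that produces the same three CSV columns (timestamp, response code, transferred bytes); the input and output paths match the awk example:

     #!/usr/bin/env python
     # Convert the NASA access log into a simple CSV (timestamp, response code, bytes).
     with open("/data/NASA_access_log_Jul95", errors="ignore") as log, \
          open("nasa.csv", "w") as csv:
         for line in log:
             fields = line.split()
             if len(fields) < 7:
                 continue
             # fields[3] + fields[4] is the bracketed timestamp incl. timezone
             csv.write('"%s",%s,%s\n' % (fields[3] + fields[4], fields[-2], fields[-1]))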

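The query for steps 2 and 4 is plain SQL; for illustration, the following sketch runs it over the generated CSV with Spark SQL (assuming a Spark 2.x environment; the table and column names are assumptions). The SELECT statement itself is what you would run in Hive or Impala against your table:

     # Illustration of the response-code count with Spark SQL.
     from pyspark.sql import SparkSession

     spark = SparkSession.builder.appName("nasa-log-sql").getOrCreate()

     df = spark.read.csv("nasa.csv").toDF("ts", "code", "bytes")
     df.createOrReplaceTempView("nasa_log")

     spark.sql("""
         SELECT code, COUNT(*) AS occurrences
         FROM nasa_log
         GROUP BY code
         ORDER BY occurrences DESC
     """).show()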
Solution

6. Data Analytics


Data/Tools:

  1. Run KMeans on the provided example dataset!

  2. Validate the quality of the model using the sum of the squared errors over all points (a sketch covering both steps follows below)!
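
A sketch for both steps in the interactive pyspark shell (`sc` predefined), using the RDD-based MLlib KMeans; the dataset path kmeans_data.txt, its space-separated numeric format, and k=2 are assumptions to adapt to the provided example dataset:

     # Train KMeans and validate it via the sum of squared errors (SSE).
     from numpy import array
     from pyspark.mllib.clustering import KMeans

     points = (sc.textFile("kmeans_data.txt")
                 .map(lambda line: array([float(x) for x in line.split()])))

     model = KMeans.train(points, k=2, maxIterations=10)

     def squared_error(point):
         # squared distance from a point to its closest cluster center
         center = model.clusterCenters[model.predict(point)]
         return sum([float(x) ** 2 for x in (point - center)])

     sse = points.map(squared_error).sum()
     print("Sum of squared errors = %f" % sse)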

Solution

7. Hadoop Benchmarking


  1. Run the TeraSort benchmark on 1 GB of data. Each record that TeraGen generates is 100 bytes in size, so 1 GB corresponds to 10,000,000 records:

     hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen <number_of_records> <output_directory>
    
     hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort <input_directory> <output_directory>
    
  2. How many YARN containers are consumed during each phase of the applications teragen and terasort (map phase, reduce phase)? Please explain! See the blog post.