
Commit 8fc6ed4

Author: Saverio Proto
Added paragraph on how to use the googlebooks-ngrams
1 parent e816db3 commit 8fc6ed4

File tree

3 files changed: +221 -0 lines changed

Readme.md

+101
@@ -201,3 +201,104 @@ Remember to put an empty space before the command (`hadoop` in our case) so that
```

In general, with this `-D` flag you can override any configuration setting from `/etc/hadoop/core-site.xml`.

## Use Openstack SWIFT to access a public dataset on SWITCHengines

To get started with Public Datasets hosted on SWITCHengines we loaded the googlebooks-ngrams dataset: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
The dataset is about 5 TB of gzip-compressed files.

To write this part of the tutorial, I first read this blog post:
https://dbaumgartel.wordpress.com/2014/04/10/an-elastic-mapreduce-streaming-example-with-python-and-ngrams-on-aws/

We are going to do something similar, but using Openstack instead of Amazon EC2.

We will analyze part of the dataset to find out how many words that start with the letter X appeared for the first time in the year 1999.
Check the code in the files `mapper-ngrams.py` and `reducer-ngrams.py`; a quick local sanity check is sketched below.
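
The following check is my addition for illustration: it runs the two scripts on a single invented 1-gram line (not a real dataset entry). A 1-gram line has the form `word<TAB>year<TAB>match_count<TAB>volume_count`, and both scripts must be executable:

```
chmod +x mapper-ngrams.py reducer-ngrams.py
# Feed one made-up 1-gram line through the mapper:
printf 'xylocopa\t1999\t7\t4\n' | ./mapper-ngrams.py
# Expected output: xylocopa<TAB>1999<TAB>7 (word, year, occurrences)

# And through the whole pipeline: a word seen only in 1999 survives the reducer
printf 'xylocopa\t1999\t7\t4\n' | ./mapper-ngrams.py | sort -k1,1 | ./reducer-ngrams.py
# Expected output: xylocopa<TAB>7
```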

To configure Hadoop to access the dataset, add a new block of properties to the `core-site.xml` config file.
```
<property>
  <name>fs.swift.service.datasets.auth.url</name>
  <value>https://keystone.cloud.switch.ch:5000/v2.0/tokens</value>
</property>
<property>
  <name>fs.swift.service.datasets.auth.endpoint.prefix</name>
  <value>/AUTH_</value>
</property>
<property>
  <name>fs.swift.service.datasets.http.port</name>
  <value>443</value>
</property>
<property>
  <name>fs.swift.service.datasets.region</name>
  <value>LS</value>
</property>
<property>
  <name>fs.swift.service.datasets.public</name>
  <value>true</value>
</property>
<property>
  <name>fs.swift.service.datasets.tenant</name>
  <value>datasets_readonly</value>
</property>
<property>
  <name>fs.swift.service.switchengines.username</name>
  <value>SWITCHengines-username</value>
</property>
<property>
  <name>fs.swift.service.switchengines.password</name>
  <value>secret</value>
</property>
```

*Make sure the SWITCHengines admins have added your user to the tenant `datasets_readonly` before trying the next steps. Contact support if you are unsure about this.*
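
Note that the Swift URLs used later have the form `swift://<container>.<service>/<path>`, where the service name (`datasets` here, and `switchengines` for the output, configured by the last two properties) matches the property names above. As a quick check I find useful (my addition; it assumes the `hadoop-openstack` driver is already on your Hadoop classpath), you can try listing the container:

```
hadoop fs -ls swift://googlebooks-ngrams-gz-swift.datasets/eng/ | head
```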

Now we should be able to download this file:

```
export OS_USERNAME=SWITCHengines-username
export OS_PASSWORD=secret
export OS_TENANT_NAME=datasets_readonly
export OS_AUTH_URL=https://keystone.cloud.switch.ch:5000/v2.0
export OS_REGION_NAME=LS
swift download googlebooks-ngrams-gz-swift eng/googlebooks-eng-all-1gram-20120701-x.gz
```
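
Once the download finishes you can peek at the first lines to confirm the four-column format (a small optional check I'm adding; `swift download` recreates the `eng/` directory locally):

```
zcat eng/googlebooks-eng-all-1gram-20120701-x.gz | head -3
```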

Let's check that our data pipeline works before using Hadoop for this map reduce example. The `sort -k1,1` step between the mapper and the reducer emulates the shuffle phase that Hadoop performs between the map and reduce stages:

```
time zcat eng/googlebooks-eng-all-1gram-20120701-x.gz | ./mapper-ngrams.py | sort -k1,1 | ./reducer-ngrams.py | sort -k2,2n
```

The result should be the set of words that start with the letter X and appeared for the first time in the year 1999, together with their number of occurrences.

Because we limited our analysis to a single file of about 14 MB, we are still able to check the pipeline without using Hadoop.
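
If you want to check the size of an object before downloading it, `swift stat` prints its metadata, including the content length (an optional step added here; it reuses the credentials exported above):

```
swift stat googlebooks-ngrams-gz-swift eng/googlebooks-eng-all-1gram-20120701-x.gz
```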

Now we will do the same using Hadoop, reading the single file `googlebooks-eng-all-1gram-20120701-x.gz` from the Swift container with the googlebooks-ngrams dataset, and writing the result to a Swift container in our own tenant.
Note that Hadoop automatically detects from the `.gz` extension that the input file is gzip-compressed, and decompresses it without any special configuration.

```
hadoop jar /usr/lib/hadoop/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D fs.swift.service.switchengines.password=mysecretsecretpassword \
-input swift://googlebooks-ngrams-gz-swift.datasets/eng/googlebooks-eng-all-1gram-20120701-x.gz \
-output swift://results.switchengines/testnumber1 \
-mapper mapper-ngrams.py \
-reducer reducer-ngrams.py \
-numReduceTasks 1
```
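
The command above assumes that `mapper-ngrams.py` and `reducer-ngrams.py` are already present at the same path on every node of the cluster. If that is not the case, Hadoop streaming can ship them with the job via the `-file` option; this variant is a sketch of the same job:

```
hadoop jar /usr/lib/hadoop/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D fs.swift.service.switchengines.password=mysecretsecretpassword \
-file mapper-ngrams.py \
-file reducer-ngrams.py \
-input swift://googlebooks-ngrams-gz-swift.datasets/eng/googlebooks-eng-all-1gram-20120701-x.gz \
-output swift://results.switchengines/testnumber1 \
-mapper mapper-ngrams.py \
-reducer reducer-ngrams.py \
-numReduceTasks 1
```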

When Hadoop finishes the processing you can download the results:

```
swift download results testnumber1/part-00000
```

The result should be the same as the one you observed when testing the data pipeline locally, up to ordering: the local pipeline sorted by number of occurrences, while Hadoop's output is sorted by word.
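
If you kept the local output in a file, an order-insensitive comparison makes the check explicit. The filename `local-result.txt` is hypothetical; substitute wherever you redirected the local pipeline output:

```
diff <(sort part-00000) <(sort local-result.txt) && echo "results match"
```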

Now let's try the file `eng/googlebooks-eng-all-0gram-20120701-a.gz`, which is about 15 GB:

```
hadoop jar /usr/lib/hadoop/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D fs.swift.service.switchengines.password=mysecretsecretpassword \
-D fs.swift.service.datasets.password=mysecretsecretpassword \
-input swift://googlebooks-ngrams-gz-swift.datasets/eng/googlebooks-eng-all-0gram-20120701-a.gz \
-output swift://results.switchengines/testnumber2 \
-mapper mapper-ngrams.py \
-reducer reducer-ngrams.py \
-numReduceTasks 1
```
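
As before, when the job completes you can fetch the output from the results container:

```
swift download results testnumber2/part-00000
```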

mapper-ngrams.py

+45
@@ -0,0 +1,45 @@
#!/usr/bin/env python
# Code imported from
# https://dbaumgartel.wordpress.com/2014/04/10/an-elastic-mapreduce-streaming-example-with-python-and-ngrams-on-aws/

import sys

def CleanWord(aword):
    """
    Function input: A string which is meant to be
    interpreted as a single word.
    Output: a clean, lower-case version of the word
    """
    # Make lower case
    aword = aword.lower()
    # Remove special characters from word
    for character in '.,;:\'?':
        aword = aword.replace(character, '')
    # No empty words
    if len(aword) == 0:
        return None
    # Restrict word to the standard English alphabet
    for character in aword:
        if character not in 'abcdefghijklmnopqrstuvwxyz':
            return None
    # Return the word
    return aword

# Now we loop over lines in the system input
for line in sys.stdin:
    # Strip the line of whitespace and split into a list
    line = line.strip().split()
    # Use CleanWord function to clean up the word
    word = CleanWord(line[0])

    # If CleanWord didn't return a string, move on
    if word is None:
        continue

    # Get the year and the number of occurrences from
    # the ngram line
    year = int(line[1])
    occurrences = int(line[2])

    # Print the output: word, year, and number of occurrences
    print '%s\t%s\t%s' % (word, year, occurrences)

reducer-ngrams.py

+75
@@ -0,0 +1,75 @@
#!/usr/bin/env python
# Code imported from
# https://dbaumgartel.wordpress.com/2014/04/10/an-elastic-mapreduce-streaming-example-with-python-and-ngrams-on-aws/

import sys

# current_word will be the word in each loop iteration
current_word = ''
# word_in_progress will be the word we have been working
# on for the last few iterations
word_in_progress = ''

# target_year_count is the number of word occurrences
# in the target year
target_year_count = 0
# prior_year_count is the number of word occurrences
# in the years prior to the target year
prior_year_count = 0

# Define the target year, in our case 1999
target_year = 1999

# Loop over lines of input from STDIN
for line in sys.stdin:

    # Get the items in the line as a list
    line = line.strip().split('\t')

    # If for some reason there are not 3 items,
    # then move on to the next line
    if len(line) != 3:
        continue

    # The line consists of a word, a year, and
    # a number of occurrences
    current_word, year, occurrences = line

    # If we are on a new word, check the info of the last word
    # Print if it is a newly minted word, and zero our counters
    if current_word != word_in_progress:
        # Word exists in the target year...
        if target_year_count > 0:
            # ...and in no year prior to the target year
            if prior_year_count == 0:
                # Print the cool new word and its occurrences
                print '%s\t%s' % (word_in_progress, target_year_count)

        # Zero our counters
        target_year_count = 0
        prior_year_count = 0
        word_in_progress = current_word

    # Get the year and occurrences as integers
    # Continue if there is a problem
    try:
        year = int(year)
    except ValueError:
        continue
    try:
        occurrences = int(occurrences)
    except ValueError:
        continue

    # Update our variables
    if year == target_year:
        target_year_count += occurrences
    if year < target_year:
        prior_year_count += occurrences

# Since the loop is over, print the last word if applicable
if target_year_count > 0:
    # Word does not appear in any year prior to the target year
    if prior_year_count == 0:
        # Print the cool new word and its occurrences
        print '%s\t%s' % (word_in_progress, target_year_count)
