
Commit 8fc6ed4

Author: Saverio Proto
Added paragraph on how to use the googlebooks-ngrams
1 parent e816db3 commit 8fc6ed4

File tree

3 files changed: +221 -0 lines changed

Readme.md

+101
@@ -201,3 +201,104 @@ Remember to put an empty space before the command (`hadoop` in our case) so that
```

In general, with this `-D` flag you can override any configuration setting from `/etc/hadoop/core-site.xml`.

## Use Openstack SWIFT to access a public dataset on SWITCHengines

To get started with Public Datasets hosted on SWITCHengines we loaded the googlebooks-ngrams dataset: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
The dataset is about 5 TB of gzip-compressed files.

To write this part of the tutorial, I first read this blog post:
https://dbaumgartel.wordpress.com/2014/04/10/an-elastic-mapreduce-streaming-example-with-python-and-ngrams-on-aws/

We are going to do something similar, but using Openstack instead of Amazon EC2.

We will analyze part of the dataset to find out how many words that start with the letter X appeared for the first time in the year 1999.
Check the code in the files `mapper-ngrams.py` and `reducer-ngrams.py`; a quick local sanity check is sketched below.
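
The following check is my addition for illustration: it runs the two scripts on a single invented 1-gram line (not a real dataset entry). A 1-gram line has the form `word<TAB>year<TAB>match_count<TAB>volume_count`, and both scripts must be executable:

```
chmod +x mapper-ngrams.py reducer-ngrams.py
# Feed one made-up 1-gram line through the mapper:
printf 'xylocopa\t1999\t7\t4\n' | ./mapper-ngrams.py
# Expected output: xylocopa<TAB>1999<TAB>7 (word, year, occurrences)

# And through the whole pipeline: a word seen only in 1999 survives the reducer
printf 'xylocopa\t1999\t7\t4\n' | ./mapper-ngrams.py | sort -k1,1 | ./reducer-ngrams.py
# Expected output: xylocopa<TAB>7
```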

To configure Hadoop to access the dataset, add a new block of properties to the `core-site.xml` config file.
```
<property>
  <name>fs.swift.service.datasets.auth.url</name>
  <value>https://keystone.cloud.switch.ch:5000/v2.0/tokens</value>
</property>
<property>
  <name>fs.swift.service.datasets.auth.endpoint.prefix</name>
  <value>/AUTH_</value>
</property>
<property>
  <name>fs.swift.service.datasets.http.port</name>
  <value>443</value>
</property>
<property>
  <name>fs.swift.service.datasets.region</name>
  <value>LS</value>
</property>
<property>
  <name>fs.swift.service.datasets.public</name>
  <value>true</value>
</property>
<property>
  <name>fs.swift.service.datasets.tenant</name>
  <value>datasets_readonly</value>
</property>
<property>
  <name>fs.swift.service.switchengines.username</name>
  <value>SWITCHengines-username</value>
</property>
<property>
  <name>fs.swift.service.switchengines.password</name>
  <value>secret</value>
</property>
```

*Make sure the SWITCHengines admins have added your user to the tenant `datasets_readonly` before trying the next steps. Contact support if you are unsure about this.*
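
Note that the Swift URLs used later have the form `swift://<container>.<service>/<path>`, where the service name (`datasets` here, and `switchengines` for the output, configured by the last two properties) matches the property names above. As a quick check I find useful (my addition; it assumes the `hadoop-openstack` driver is already on your Hadoop classpath), you can try listing the container:

```
hadoop fs -ls swift://googlebooks-ngrams-gz-swift.datasets/eng/ | head
```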

Now we should be able to download this file:

```
export OS_USERNAME=SWITCHengines-username
export OS_PASSWORD=secret
export OS_TENANT_NAME=datasets_readonly
export OS_AUTH_URL=https://keystone.cloud.switch.ch:5000/v2.0
export OS_REGION_NAME=LS
swift download googlebooks-ngrams-gz-swift eng/googlebooks-eng-all-1gram-20120701-x.gz
```
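
Once the download finishes you can peek at the first lines to confirm the four-column format (a small optional check I'm adding; `swift download` recreates the `eng/` directory locally):

```
zcat eng/googlebooks-eng-all-1gram-20120701-x.gz | head -3
```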

Let's check that our data pipeline works before using Hadoop for this map reduce example. The `sort -k1,1` step between the mapper and the reducer emulates the shuffle phase that Hadoop performs between the map and reduce stages:

```
time zcat eng/googlebooks-eng-all-1gram-20120701-x.gz | ./mapper-ngrams.py | sort -k1,1 | ./reducer-ngrams.py | sort -k2,2n
```

The result should be the set of words that start with the letter X and appeared for the first time in the year 1999, together with their number of occurrences.

Because we limited our analysis to a single file of about 14 MB, we are still able to check the pipeline without using Hadoop.
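
If you want to check the size of an object before downloading it, `swift stat` prints its metadata, including the content length (an optional step added here; it reuses the credentials exported above):

```
swift stat googlebooks-ngrams-gz-swift eng/googlebooks-eng-all-1gram-20120701-x.gz
```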

Now we will do the same using Hadoop, reading the single file `googlebooks-eng-all-1gram-20120701-x.gz` from the Swift container with the googlebooks-ngrams dataset, and writing the result to a Swift container in our own tenant.
Note that Hadoop automatically detects from the `.gz` extension that the input file is gzip-compressed, and decompresses it without any special configuration.

```
hadoop jar /usr/lib/hadoop/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D fs.swift.service.switchengines.password=mysecretsecretpassword \
-input swift://googlebooks-ngrams-gz-swift.datasets/eng/googlebooks-eng-all-1gram-20120701-x.gz \
-output swift://results.switchengines/testnumber1 \
-mapper mapper-ngrams.py \
-reducer reducer-ngrams.py \
-numReduceTasks 1
```
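
The command above assumes that `mapper-ngrams.py` and `reducer-ngrams.py` are already present at the same path on every node of the cluster. If that is not the case, Hadoop streaming can ship them with the job via the `-file` option; this variant is a sketch of the same job:

```
hadoop jar /usr/lib/hadoop/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D fs.swift.service.switchengines.password=mysecretsecretpassword \
-file mapper-ngrams.py \
-file reducer-ngrams.py \
-input swift://googlebooks-ngrams-gz-swift.datasets/eng/googlebooks-eng-all-1gram-20120701-x.gz \
-output swift://results.switchengines/testnumber1 \
-mapper mapper-ngrams.py \
-reducer reducer-ngrams.py \
-numReduceTasks 1
```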

When Hadoop finishes the processing you can download the results:

```
swift download results testnumber1/part-00000
```

The result should be the same as the one you observed when testing the data pipeline locally, up to ordering: the local pipeline sorted by number of occurrences, while Hadoop's output is sorted by word.
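
If you kept the local output in a file, an order-insensitive comparison makes the check explicit. The filename `local-result.txt` is hypothetical; substitute wherever you redirected the local pipeline output:

```
diff <(sort part-00000) <(sort local-result.txt) && echo "results match"
```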

Now let's try the file `eng/googlebooks-eng-all-0gram-20120701-a.gz`, which is about 15 GB:

```
hadoop jar /usr/lib/hadoop/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
-D fs.swift.service.switchengines.password=mysecretsecretpassword \
-D fs.swift.service.datasets.password=mysecretsecretpassword \
-input swift://googlebooks-ngrams-gz-swift.datasets/eng/googlebooks-eng-all-0gram-20120701-a.gz \
-output swift://results.switchengines/testnumber2 \
-mapper mapper-ngrams.py \
-reducer reducer-ngrams.py \
-numReduceTasks 1
```
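
As before, when the job completes you can fetch the output from the results container:

```
swift download results testnumber2/part-00000
```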

mapper-ngrams.py

+45
@@ -0,0 +1,45 @@
#!/usr/bin/env python
# Code imported from
# https://dbaumgartel.wordpress.com/2014/04/10/an-elastic-mapreduce-streaming-example-with-python-and-ngrams-on-aws/

import sys

def CleanWord(aword):
    """
    Function input: A string which is meant to be
    interpreted as a single word.
    Output: a clean, lower-case version of the word
    """
    # Make lower case
    aword = aword.lower()
    # Remove special characters from word
    for character in '.,;:\'?':
        aword = aword.replace(character, '')
    # No empty words
    if len(aword) == 0:
        return None
    # Restrict word to the standard English alphabet
    for character in aword:
        if character not in 'abcdefghijklmnopqrstuvwxyz':
            return None
    # Return the word
    return aword

# Now we loop over lines in the system input
for line in sys.stdin:
    # Strip the line of whitespace and split into a list
    line = line.strip().split()
    # Use CleanWord function to clean up the word
    word = CleanWord(line[0])

    # If CleanWord didn't return a string, move on
    if word is None:
        continue

    # Get the year and the number of occurrences from
    # the ngram line
    year = int(line[1])
    occurrences = int(line[2])

    # Print the output: word, year, and number of occurrences
    print '%s\t%s\t%s' % (word, year, occurrences)

reducer-ngrams.py

+75
@@ -0,0 +1,75 @@
#!/usr/bin/env python
# Code imported from
# https://dbaumgartel.wordpress.com/2014/04/10/an-elastic-mapreduce-streaming-example-with-python-and-ngrams-on-aws/

import sys

# current_word will be the word in each loop iteration
current_word = ''
# word_in_progress will be the word we have been working
# on for the last few iterations
word_in_progress = ''

# target_year_count is the number of word occurrences
# in the target year
target_year_count = 0
# prior_year_count is the number of word occurrences
# in the years prior to the target year
prior_year_count = 0

# Define the target year, in our case 1999
target_year = 1999

# Loop over lines of input from STDIN
for line in sys.stdin:

    # Get the items in the line as a list
    line = line.strip().split('\t')

    # If for some reason there are not 3 items,
    # then move on to the next line
    if len(line) != 3:
        continue

    # The line consists of a word, a year, and
    # a number of occurrences
    current_word, year, occurrences = line

    # If we are on a new word, check the info of the last word
    # Print if it is a newly minted word, and zero our counters
    if current_word != word_in_progress:
        # Word exists in the target year...
        if target_year_count > 0:
            # ...and in no year prior to the target year
            if prior_year_count == 0:
                # Print the cool new word and its occurrences
                print '%s\t%s' % (word_in_progress, target_year_count)

        # Zero our counters
        target_year_count = 0
        prior_year_count = 0
        word_in_progress = current_word

    # Get the year and occurrences as integers
    # Continue if there is a problem
    try:
        year = int(year)
    except ValueError:
        continue
    try:
        occurrences = int(occurrences)
    except ValueError:
        continue

    # Update our variables
    if year == target_year:
        target_year_count += occurrences
    if year < target_year:
        prior_year_count += occurrences

# Since the loop is over, print the last word if applicable
if target_year_count > 0:
    # Word does not appear in any year prior to the target year
    if prior_year_count == 0:
        # Print the cool new word and its occurrences
        print '%s\t%s' % (word_in_progress, target_year_count)
