Readme.md
@@ -201,3 +201,104 @@ Remeber to put an empty space before the command ( `hadoop` in our case) so that
```
In general, with the `-D` flag you can override any configuration property set in `/etc/hadoop/core-site.xml`.

## Use OpenStack Swift to access a public dataset on SWITCHengines

To get started with public datasets hosted on SWITCHengines, we loaded the googlebooks-ngrams dataset: http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
The dataset is about 5 TB of gzip-compressed files.

To write this part of the tutorial, I first read this blog post:
The result should be the set of words that start with the letter X and appeared for the first time in the year 1999, together with their number of occurrences.

Because we limited our analysis to a single 14 MB file, we can still check the pipeline without using Hadoop.
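
As a rough illustration of such a local check, the following shell sketch filters the uncompressed 1-gram records with `awk`. It rests on several assumptions that are not confirmed by this tutorial: the records are tab-separated as `ngram`, `year`, `match_count`, `volume_count`; "starts with X" is treated as case-insensitive; and the number of occurrences is taken to be the `match_count` of the first year. The function name `first_seen_in_1999` is made up for this example, and this is not necessarily the pipeline used in the blog post.

```shell
# Hypothetical helper: reads 1-gram records on stdin and prints
# "word<TAB>count" for words whose earliest year is 1999.
# Assumed record layout: ngram TAB year TAB match_count TAB volume_count
first_seen_in_1999() {
  awk -F'\t' '
    # Track the earliest year seen for each word and the match count of that year.
    !($1 in first) || $2 + 0 < first[$1] {
      first[$1] = $2 + 0
      count[$1] = $3 + 0
    }
    END {
      # Keep only words first seen in 1999 that start with X (either case).
      for (w in first)
        if (first[w] == 1999 && w ~ /^[Xx]/)
          print w "\t" count[w]
    }
  ' | sort
}
```

With the file downloaded locally, it could be run as `zcat googlebooks-eng-all-1gram-20120701-x.gz | first_seen_in_1999`.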

Now we will do the same with Hadoop, reading the single file `googlebooks-eng-all-1gram-20120701-x.gz` from the Swift container that holds the googlebooks-ngrams dataset and writing the result to a Swift container in our own tenant.
Note that Hadoop detects from the `.gz` extension that the input file is gzip-compressed and decompresses it without any special configuration.
```
hadoop jar /usr/lib/hadoop/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \