{
"paragraphs": [
{
"text": "%md\n\n\nOur boss, [PHP CEO](https://twitter.com/php_ceo) has just found out about a website called Wikipedia and wants us to help explain it to the board of directors. To quote our illustrious CEO, \"I found out that those Wikimedia idiots publish their traffic data _publically_, can you believe that? I\u0027m sure there\u0027s a way to make $50M doing data science on it so figure out a way to Hadoop it and make it happen! The company is counting on you!\"\n\nNever wanting to disappoint our CEO (or miss a paycheque), you decide to oblige.\n\nTurns out, Wikimedia does in fact publish clickstream data for users of Wikipedia.org and you find it available for download here https://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82. The entire data set is 5.7GB which is starting to feel like \"big data\" so you decide to analyze it using that thing you heard about on Hacker News - Apache Spark.\n\n[Quoting Wikipedia directly](https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream), we tell the boss that the dataset contains:\n\n\u003e counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. To give an example, consider the figure below, which shows incoming and outgoing traffic to the \"London\" article on English Wikipedia during January 2015.\n\nHis eyes gloss over after the first sentence, but he fires back with a few questions for us:\n\n1. Is wikipedia.org even big? How many hits does it get?\n2. What do people even read on that site?\n\nAnd with that, we\u0027re off. Let\u0027s load in the data into a Spark [Resilient Distributed Dataset (RDD)](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds) figure out what we\u0027re dealing with and try go get this monkey off our back.\n",
"dateUpdated": "Nov 12, 2016 9:31:49 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478288491564_-439546265",
"id": "20161104-154131_1723437402",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003e\u003cimg src\u003d\"https://www.evernote.com/l/AAHWzj78DlFKfYJJ0YPrxoYvb2zSe1w9h7kB/image.jpg\" alt\u003d\"PHP CEO\" /\u003e\u003c/p\u003e\n\u003cp\u003eOur boss, \u003ca href\u003d\"https://twitter.com/php_ceo\"\u003ePHP CEO\u003c/a\u003e has just found out about a website called Wikipedia and wants us to help explain it to the board of directors. To quote our illustrious CEO, \u0026ldquo;I found out that those Wikimedia idiots publish their traffic data \u003cem\u003epublically\u003c/em\u003e, can you believe that? I\u0027m sure there\u0027s a way to make $50M doing data science on it so figure out a way to Hadoop it and make it happen! The company is counting on you!\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eNever wanting to disappoint our CEO (or miss a paycheque), you decide to oblige.\u003c/p\u003e\n\u003cp\u003eTurns out, Wikimedia does in fact publish clickstream data for users of Wikipedia.org and you find it available for download here https://datahub.io/dataset/wikipedia-clickstream/resource/be85cc68-d1e6-4134-804a-fd36b94dbb82. The entire data set is 5.7GB which is starting to feel like \u0026ldquo;big data\u0026rdquo; so you decide to analyze it using that thing you heard about on Hacker News - Apache Spark.\u003c/p\u003e\n\u003cp\u003e\u003ca href\u003d\"https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream\"\u003eQuoting Wikipedia directly\u003c/a\u003e, we tell the boss that the dataset contains:\u003c/p\u003e\n\u003cblockquote\u003e\u003cp\u003ecounts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on. In other words, it gives a weighted network of articles, where each edge weight corresponds to how often people navigate from one page to another. To give an example, consider the figure below, which shows incoming and outgoing traffic to the \u0026ldquo;London\u0026rdquo; article on English Wikipedia during January 2015.\u003c/p\u003e\n\u003c/blockquote\u003e\n\u003cp\u003eHis eyes gloss over after the first sentence, but he fires back with a few questions for us:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eIs wikipedia.org even big? How many hits does it get?\u003c/li\u003e\n\u003cli\u003eWhat do people even read on that site?\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eAnd with that, we\u0027re off. Let\u0027s load in the data into a Spark \u003ca href\u003d\"http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds\"\u003eResilient Distributed Dataset (RDD)\u003c/a\u003e figure out what we\u0027re dealing with and try go get this monkey off our back.\u003c/p\u003e\n"
},
"dateCreated": "Nov 4, 2016 3:41:31 AM",
"dateStarted": "Nov 12, 2016 9:31:49 PM",
"dateFinished": "Nov 12, 2016 9:31:49 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\nfrom __future__ import print_function\nfrom pprint import pprint\n\nCLICKSTREAM_FILE \u003d \u0027/Users/mikesukmanowsky/Downloads/1305770/2015_01_en_clickstream.tsv.gz\u0027\nclickstream \u003d sc.textFile(CLICKSTREAM_FILE) # sc \u003d SparkContext which is available for us automatically since we\u0027re using a \"%pyspark\" paragraph in Zeppelin \nclickstream",
"dateUpdated": "Nov 12, 2016 9:01:31 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/scala"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478288525992_873647704",
"id": "20161104-154205_1326198179",
"dateCreated": "Nov 4, 2016 3:42:05 AM",
"dateStarted": "Nov 10, 2016 10:18:11 PM",
"dateFinished": "Nov 10, 2016 10:18:39 PM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\nSo what the heck is this `clickstream` thing that we just loaded anyway? Spark says `MapPartitionsRDD[40]`. Let\u0027s ignore the `MapPartitions` part and focus on what\u0027s important: **RDD**.\n\n**Resilient Distributed Dataset or RDD** is Spark\u0027s core unit of abstraction. If you _really_ understand RDDs, then you\u0027ll understand a lot of what Spark is about. An RDD is just a collection of things which is partitioned so that Spark can perform computation on it in parallel. This is a core concept of \"big data\" processing. Take big data, and split into smaller parts so they can be operated on independently and in parallel.\n\nIn this case, we created an RDD out of a gzip file which begs the question, if an RDD is a collection of things, what is in our RDD?",
"dateUpdated": "Nov 12, 2016 9:31:46 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478567658068_-1920890636",
"id": "20161107-201418_1676567479",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eSo what the heck is this \u003ccode\u003eclickstream\u003c/code\u003e thing that we just loaded anyway? Spark says \u003ccode\u003eMapPartitionsRDD[40]\u003c/code\u003e. Let\u0027s ignore the \u003ccode\u003eMapPartitions\u003c/code\u003e part and focus on what\u0027s important: \u003cstrong\u003eRDD\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResilient Distributed Dataset or RDD\u003c/strong\u003e is Spark\u0027s core unit of abstraction. If you \u003cem\u003ereally\u003c/em\u003e understand RDDs, then you\u0027ll understand a lot of what Spark is about. An RDD is just a collection of things which is partitioned so that Spark can perform computation on it in parallel. This is a core concept of \u0026ldquo;big data\u0026rdquo; processing. Take big data, and split into smaller parts so they can be operated on independently and in parallel.\u003c/p\u003e\n\u003cp\u003eIn this case, we created an RDD out of a gzip file which begs the question, if an RDD is a collection of things, what is in our RDD?\u003c/p\u003e\n"
},
"dateCreated": "Nov 7, 2016 8:14:18 AM",
"dateStarted": "Nov 12, 2016 9:31:46 PM",
"dateFinished": "Nov 12, 2016 9:31:46 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
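{
"text": "%pyspark\n# A quick aside with toy data (this list is made up, not part of the clickstream file):\n# sc.parallelize() builds an RDD from an in-memory collection and splits it into partitions,\n# which is exactly what lets Spark work on each chunk in parallel.\ntoy = sc.parallelize(['a', 'b', 'c', 'd', 'e', 'f'], 3)  # ask for 3 partitions\ntoy.glom().collect()  # glom() groups elements by partition: [['a', 'b'], ['c', 'd'], ['e', 'f']]",
"config": {
"colWidth": 12.0,
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"status": "READY",
"progressUpdateIntervalMs": 500
},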
{
"text": "%pyspark\n\nclickstream.first()",
"dateUpdated": "Nov 10, 2016 9:32:52 AM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478567933184_-1355711471",
"id": "20161107-201853_1840213788",
"dateCreated": "Nov 7, 2016 8:18:53 AM",
"dateStarted": "Nov 10, 2016 9:32:54 AM",
"dateFinished": "Nov 10, 2016 9:33:20 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\nTurns out RDDs have a handy `first()` method which returns the first element in the collection which turns out to be the first line in the gzip file we loaded up. So `textFile` returns an RDD which is a collection of the lines in the file.\n\nLet\u0027s see how many lines are in that file by using the `count()` method on the RDD.",
"dateUpdated": "Nov 12, 2016 9:31:53 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478567943934_294491886",
"id": "20161107-201903_1247149665",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eTurns out RDDs have a handy \u003ccode\u003efirst()\u003c/code\u003e method which returns the first element in the collection which turns out to be the first line in the gzip file we loaded up. So \u003ccode\u003etextFile\u003c/code\u003e returns an RDD which is a collection of the lines in the file.\u003c/p\u003e\n\u003cp\u003eLet\u0027s see how many lines are in that file by using the \u003ccode\u003ecount()\u003c/code\u003e method on the RDD.\u003c/p\u003e\n"
},
"dateCreated": "Nov 7, 2016 8:19:03 AM",
"dateStarted": "Nov 12, 2016 9:31:53 PM",
"dateFinished": "Nov 12, 2016 9:31:53 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\n\u0027{:,}\u0027.format(clickstream.count())",
"dateUpdated": "Nov 10, 2016 9:32:52 AM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478568174925_-2062309665",
"id": "20161107-202254_709908107",
"dateCreated": "Nov 7, 2016 8:22:54 AM",
"dateStarted": "Nov 10, 2016 9:33:20 AM",
"dateFinished": "Nov 10, 2016 9:34:17 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\nAlmost 22 million lines. Not exactly big data, more like a little large but that\u0027s ok since we\u0027re just working locally.\n\nOne problem we have immediately is that the `clickstream` RDD isn\u0027t very useful since the data isn\u0027t parsed at all.\n\nThe clickstream data source in question actually has 5 fields delimited by a tab:\n\n* `prev_id`: if the referer doesn\u0027t correspond to an article in English Wikipedia, this is empty. Otherwise, it\u0027ll contain the unique MediaWiki page ID of the article correspodning to the referer.\n* `curr_id` the MediaWiki page ID of the article requested, this is always present\n* `n` the number of occurances (page views) of the `(referer, resource)` pair\n* `prev_title`: if referer was a Wikipedia article, this is the title of that article. Otherwise, this gets renamed to something like `other-\u003ewikipedia` (outside the English Wikipedia namespace), `other-empty` (empty referrer), `other-internal` (any other Wikimedia project, `other-google` (any Google site), `other-yahoo` (any Yahoo! site), `other-bing` (any Bing site), `other-facebook`, `other-twitter` and `other-other` (any other site).\n* `curr_title` the title of the Wikipedia article (more helpful than looking at the ID.\n\n\nOk, so we need to do a pretty simple Python string split on a tab delimiter. How do we do that? We can use one of Spark\u0027s **transformation** functions, `map()` which applies a function to all elements in a collection and returns a new collection.",
"dateUpdated": "Nov 12, 2016 9:31:55 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478289568712_1470082354",
"id": "20161104-155928_1642833557",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eAlmost 22 million lines. Not exactly big data, more like a little large but that\u0027s ok since we\u0027re just working locally.\u003c/p\u003e\n\u003cp\u003eOne problem we have immediately is that the \u003ccode\u003eclickstream\u003c/code\u003e RDD isn\u0027t very useful since the data isn\u0027t parsed at all.\u003c/p\u003e\n\u003cp\u003eThe clickstream data source in question actually has 5 fields delimited by a tab:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003eprev_id\u003c/code\u003e: if the referer doesn\u0027t correspond to an article in English Wikipedia, this is empty. Otherwise, it\u0027ll contain the unique MediaWiki page ID of the article correspodning to the referer.\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003ecurr_id\u003c/code\u003e the MediaWiki page ID of the article requested, this is always present\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003en\u003c/code\u003e the number of occurances (page views) of the \u003ccode\u003e(referer, resource)\u003c/code\u003e pair\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eprev_title\u003c/code\u003e: if referer was a Wikipedia article, this is the title of that article. Otherwise, this gets renamed to something like \u003ccode\u003eother-\u0026gt;wikipedia\u003c/code\u003e (outside the English Wikipedia namespace), \u003ccode\u003eother-empty\u003c/code\u003e (empty referrer), \u003ccode\u003eother-internal\u003c/code\u003e (any other Wikimedia project, \u003ccode\u003eother-google\u003c/code\u003e (any Google site), \u003ccode\u003eother-yahoo\u003c/code\u003e (any Yahoo! site), \u003ccode\u003eother-bing\u003c/code\u003e (any Bing site), \u003ccode\u003eother-facebook\u003c/code\u003e, \u003ccode\u003eother-twitter\u003c/code\u003e and \u003ccode\u003eother-other\u003c/code\u003e (any other site).\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003ecurr_title\u003c/code\u003e the title of the Wikipedia article (more helpful than looking at the ID.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eOk, so we need to do a pretty simple Python string split on a tab delimiter. How do we do that? We can use one of Spark\u0027s \u003cstrong\u003etransformation\u003c/strong\u003e functions, \u003ccode\u003emap()\u003c/code\u003e which applies a function to all elements in a collection and returns a new collection.\u003c/p\u003e\n"
},
"dateCreated": "Nov 4, 2016 3:59:28 AM",
"dateStarted": "Nov 12, 2016 9:31:55 PM",
"dateFinished": "Nov 12, 2016 9:31:55 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\nparsed \u003d clickstream.map(lambda line: line.split(\u0027\\t\u0027))\nparsed",
"dateUpdated": "Nov 10, 2016 9:32:53 AM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478396818605_712890923",
"id": "20161105-214658_1458525638",
"dateCreated": "Nov 5, 2016 9:46:58 AM",
"dateStarted": "Nov 10, 2016 9:33:20 AM",
"dateFinished": "Nov 10, 2016 9:34:17 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\nUh wait, why didn\u0027t that give us some output that we could see? That\u0027s because all transformations in Spark are lazy.\n\nWhen we called `map()` on our `clickstream` RDD, Spark returned a new RDD. Under the hood, RDDs are basically just directed acyclic graphs (DAGs) that outline the transformations required to get to a final result. This concept is powerful and I want to drive it home a bit. It\u0027s important to know that **all transformations return new RDDs**. Or, put another way, **RDDs are immutable**. You never have to be worried about modifying an RDD \"in-place\", because any time you modify an RDD, you\u0027ll get a new RDD.\n\nLet\u0027s say that we needed a version of the RDD with strings uppercased and in their original form.\n\n```\nclickstream_parsed \u003d clickstream.map(lambda line: line.split(\u0027\\t\u0027)\nclickstream_upper \u003d clickstream_parsed.map(lambda parts: [x.upper() for x in parts])\n```\n\nWe now have a parsed uppercased RDD and the parsed RDD in the original case. Since under the hood, RDDs are just DAGs assigning these two variables is not expensive at all. It\u0027s just storing instructions on how to produce data:\n\n* `clickstream_parsed`: `Load textFile() -\u003e apply line.split()`\n* `clickstream_upper`: `Load textFile() -\u003e apply line.split() -\u003e apply [x.upper() for x in parts]`\n\nSpark will also try to be clever when a part two children in a DAG have a common ancestor and will not compute the same transformations twice.\n\nAnything other transformations we do later just adds more steps to the DAG. If parts of your program refer to `clickstream_upper`, later on, you don\u0027t have to be worried about what another part did to this RDD, `clickstream_upper` is immutable and will always refer to the original transformation.\n\nSo how do we get to see the result of our `map()`? We can use `first()` again, but for fun, let\u0027s use another RDD method `take()` which takes the first _n_ elements from an RDD.",
"dateUpdated": "Nov 12, 2016 9:31:58 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478398272610_-964710637",
"id": "20161105-221112_1697982578",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eUh wait, why didn\u0027t that give us some output that we could see? That\u0027s because all transformations in Spark are lazy.\u003c/p\u003e\n\u003cp\u003eWhen we called \u003ccode\u003emap()\u003c/code\u003e on our \u003ccode\u003eclickstream\u003c/code\u003e RDD, Spark returned a new RDD. Under the hood, RDDs are basically just directed acyclic graphs (DAGs) that outline the transformations required to get to a final result. This concept is powerful and I want to drive it home a bit. It\u0027s important to know that \u003cstrong\u003eall transformations return new RDDs\u003c/strong\u003e. Or, put another way, \u003cstrong\u003eRDDs are immutable\u003c/strong\u003e. You never have to be worried about modifying an RDD \u0026ldquo;in-place\u0026rdquo;, because any time you modify an RDD, you\u0027ll get a new RDD.\u003c/p\u003e\n\u003cp\u003eLet\u0027s say that we needed a version of the RDD with strings uppercased and in their original form.\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003eclickstream_parsed \u003d clickstream.map(lambda line: line.split(\u0027\\t\u0027)\nclickstream_upper \u003d clickstream_parsed.map(lambda parts: [x.upper() for x in parts])\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003eWe now have a parsed uppercased RDD and the parsed RDD in the original case. Since under the hood, RDDs are just DAGs assigning these two variables is not expensive at all. It\u0027s just storing instructions on how to produce data:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003eclickstream_parsed\u003c/code\u003e: \u003ccode\u003eLoad textFile() -\u0026gt; apply line.split()\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eclickstream_upper\u003c/code\u003e: \u003ccode\u003eLoad textFile() -\u0026gt; apply line.split() -\u0026gt; apply [x.upper() for x in parts]\u003c/code\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eSpark will also try to be clever when a part two children in a DAG have a common ancestor and will not compute the same transformations twice.\u003c/p\u003e\n\u003cp\u003eAnything other transformations we do later just adds more steps to the DAG. If parts of your program refer to \u003ccode\u003eclickstream_upper\u003c/code\u003e, later on, you don\u0027t have to be worried about what another part did to this RDD, \u003ccode\u003eclickstream_upper\u003c/code\u003e is immutable and will always refer to the original transformation.\u003c/p\u003e\n\u003cp\u003eSo how do we get to see the result of our \u003ccode\u003emap()\u003c/code\u003e? We can use \u003ccode\u003efirst()\u003c/code\u003e again, but for fun, let\u0027s use another RDD method \u003ccode\u003etake()\u003c/code\u003e which takes the first \u003cem\u003en\u003c/em\u003e elements from an RDD.\u003c/p\u003e\n"
},
"dateCreated": "Nov 5, 2016 10:11:12 AM",
"dateStarted": "Nov 12, 2016 9:31:58 PM",
"dateFinished": "Nov 12, 2016 9:31:58 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
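{
"text": "%pyspark\n# A small illustration of laziness: chaining transformations only builds up the DAG, so it\n# returns almost instantly - no data is read until an action runs. toDebugString() prints the\n# lineage (the chain of transformations) Spark has recorded. The 'lazy' name is just for this sketch.\nimport time\n\nstart = time.time()\nlazy = clickstream.map(lambda line: line.split('\\t')).map(lambda parts: [x.upper() for x in parts])\nelapsed = time.time() - start  # effectively zero: nothing has been computed yet\n\nprint(lazy.toDebugString())\n'{:.4f}s to build the DAG'.format(elapsed)",
"config": {
"colWidth": 12.0,
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"status": "READY",
"progressUpdateIntervalMs": 500
},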
{
"text": "%pyspark\npprint(parsed.take(5))",
"dateUpdated": "Nov 10, 2016 9:32:53 AM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478569438111_-1629077470",
"id": "20161107-204358_2060716368",
"dateCreated": "Nov 7, 2016 8:43:58 AM",
"dateStarted": "Nov 10, 2016 9:34:17 AM",
"dateFinished": "Nov 10, 2016 9:34:17 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\nThat\u0027s better!\n\nBoth `first()` and `take()` are examples of a second class of functions on RDDs: **actions**. **Actions** trigger Spark to actually execute the steps in the DAG you\u0027ve been building up and perform some action. In this case, `first()` and `take()` load a sample of the data, evaluate the `split()` function, and return results to the driver script, this notebook, for us to print out.\n\nRDDs have quite a few of [**transformations**](http://spark.apache.org/docs/latest/programming-guide.html#transformations) and [**actions**](http://spark.apache.org/docs/latest/programming-guide.html#actions) but here are a few of the more popular ones you\u0027ll use:\n\n**Transformations**:\n\n* `map(func)`: return a new data set with `func()` applied to all elements\n* `filter(func)`: return a new dataset formed by selecting those elements of the source on which `func()` returns `true`\n* `sortByKey([ascending], [numTasks])`: sort an RDD by key instead of value (we\u0027ll get to the key vs. value distinction in a second)\n* `reduceByKey(func, [numTasks])`: similar to the Python `reduce()` function\n* `aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])`: also similar to Python\u0027s `reduce()`, but with customization options (we\u0027ll come back to this)\n* `repartition()`: reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them\n\n**Actions**:\n\n* `first()`, `take()`\n* `collect()`: return all the elements of the dataset as an array at the driver program.\n* `count()`: count the number of elements in the RDD\n* `saveAsTextFile(path)`: write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system\n\nBefore we keep going, something is bugging me. That `count()` that we did seemed to take a _really_ long time, wouldn\u0027t pure Python be faster at that?",
"dateUpdated": "Nov 12, 2016 9:32:02 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478569458083_-1878764182",
"id": "20161107-204418_803807675",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eThat\u0027s better!\u003c/p\u003e\n\u003cp\u003eBoth \u003ccode\u003efirst()\u003c/code\u003e and \u003ccode\u003etake()\u003c/code\u003e are examples of a second class of functions on RDDs: \u003cstrong\u003eactions\u003c/strong\u003e. \u003cstrong\u003eActions\u003c/strong\u003e trigger Spark to actually execute the steps in the DAG you\u0027ve been building up and perform some action. In this case, \u003ccode\u003efirst()\u003c/code\u003e and \u003ccode\u003etake()\u003c/code\u003e load a sample of the data, evaluate the \u003ccode\u003esplit()\u003c/code\u003e function, and return results to the driver script, this notebook, for us to print out.\u003c/p\u003e\n\u003cp\u003eRDDs have quite a few of \u003ca href\u003d\"http://spark.apache.org/docs/latest/programming-guide.html#transformations\"\u003e\u003cstrong\u003etransformations\u003c/strong\u003e\u003c/a\u003e and \u003ca href\u003d\"http://spark.apache.org/docs/latest/programming-guide.html#actions\"\u003e\u003cstrong\u003eactions\u003c/strong\u003e\u003c/a\u003e but here are a few of the more popular ones you\u0027ll use:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTransformations\u003c/strong\u003e:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003emap(func)\u003c/code\u003e: return a new data set with \u003ccode\u003efunc()\u003c/code\u003e applied to all elements\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003efilter(func)\u003c/code\u003e: return a new dataset formed by selecting those elements of the source on which \u003ccode\u003efunc()\u003c/code\u003e returns \u003ccode\u003etrue\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003esortByKey([ascending], [numTasks])\u003c/code\u003e: sort an RDD by key instead of value (we\u0027ll get to the key vs. value distinction in a second)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003ereduceByKey(func, [numTasks])\u003c/code\u003e: similar to the Python \u003ccode\u003ereduce()\u003c/code\u003e function\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003eaggregateByKey(zeroValue)(seqOp, combOp, [numTasks])\u003c/code\u003e: also similar to Python\u0027s \u003ccode\u003ereduce()\u003c/code\u003e, but with customization options (we\u0027ll come back to this)\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003erepartition()\u003c/code\u003e: reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eActions\u003c/strong\u003e:\u003c/p\u003e\n\u003cul\u003e\n\u003cli\u003e\u003ccode\u003efirst()\u003c/code\u003e, \u003ccode\u003etake()\u003c/code\u003e\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003ecollect()\u003c/code\u003e: return all the elements of the dataset as an array at the driver program.\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003ecount()\u003c/code\u003e: count the number of elements in the RDD\u003c/li\u003e\n\u003cli\u003e\u003ccode\u003esaveAsTextFile(path)\u003c/code\u003e: write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eBefore we keep going, something is bugging me. That \u003ccode\u003ecount()\u003c/code\u003e that we did seemed to take a \u003cem\u003ereally\u003c/em\u003e long time, wouldn\u0027t pure Python be faster at that?\u003c/p\u003e\n"
},
"dateCreated": "Nov 7, 2016 8:44:18 AM",
"dateStarted": "Nov 12, 2016 9:32:02 PM",
"dateFinished": "Nov 12, 2016 9:32:02 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
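{
"text": "%pyspark\n# Quick toy illustrations of the transformations and actions listed above (made-up numbers),\n# before we tackle that count() question: filter() is lazy and just extends the DAG, while\n# count() and collect() are actions that actually run it.\nnums = sc.parallelize(range(10))\nevens = nums.filter(lambda n: n % 2 == 0)  # transformation: nothing computed yet\n(evens.count(), evens.collect())  # actions: (5, [0, 2, 4, 6, 8])",
"config": {
"colWidth": 12.0,
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"status": "READY",
"progressUpdateIntervalMs": 500
},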
{
"text": "%pyspark\nimport gzip\n\nwith gzip.open(CLICKSTREAM_FILE) as fp:\n for count, _ in enumerate(fp):\n pass\n count +\u003d 1\n\n\u0027{:,}\u0027.format(count)",
"dateUpdated": "Nov 10, 2016 9:32:53 AM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478570540787_-1443862773",
"id": "20161107-210220_1849083685",
"dateCreated": "Nov 7, 2016 9:02:20 AM",
"dateStarted": "Nov 10, 2016 9:34:17 AM",
"dateFinished": "Nov 10, 2016 9:35:01 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\n54s versus 42s...Alright guys, tutorial is over. Python clearly rocks and Spark sucks.\n\n...or does it?\n\nI mentioned before that Spark gets most of its magic from the fact that it can operate on chunks of data in parallel. This is something that we can do in Python (think `threading` or `multiprocessing`), but it\u0027s non-trivial. Let\u0027s ask Spark how many partitions or chunks it\u0027s currently using for our dataset.",
"dateUpdated": "Nov 12, 2016 9:32:06 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478570722555_-1982451581",
"id": "20161107-210522_35023188",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003e54s versus 42s\u0026hellip;Alright guys, tutorial is over. Python clearly rocks and Spark sucks.\u003c/p\u003e\n\u003cp\u003e\u0026hellip;or does it?\u003c/p\u003e\n\u003cp\u003eI mentioned before that Spark gets most of its magic from the fact that it can operate on chunks of data in parallel. This is something that we can do in Python (think \u003ccode\u003ethreading\u003c/code\u003e or \u003ccode\u003emultiprocessing\u003c/code\u003e), but it\u0027s non-trivial. Let\u0027s ask Spark how many partitions or chunks it\u0027s currently using for our dataset.\u003c/p\u003e\n"
},
"dateCreated": "Nov 7, 2016 9:05:22 AM",
"dateStarted": "Nov 12, 2016 9:32:06 PM",
"dateFinished": "Nov 12, 2016 9:32:06 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\n\nclickstream.getNumPartitions()",
"dateUpdated": "Nov 10, 2016 9:32:53 AM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478570757658_2059213021",
"id": "20161107-210557_1180159027",
"dateCreated": "Nov 7, 2016 9:05:57 AM",
"dateStarted": "Nov 10, 2016 9:34:18 AM",
"dateFinished": "Nov 10, 2016 9:35:01 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\nWhat the hell Spark!? You have all this computing power at your disposal and you\u0027re treating my data as if it\u0027s one contigious blob?\n\nIf we ran this on a 200 cluster node, we wouldn\u0027t even keep a single node busy!\n\nWhat happened behind the scenes was dependent on two things:\n\n1. We\u0027re only reading in one file\n2. That one file happens to be gzipped which is an encoding that cannot be \"split\" - only one process can decompresses the file completely (there are compression algos that allow multiple processes to decompress a chunk like [LZO](http://www.oberhumer.com/opensource/lzo/))\n\nSince we\u0027re reading in exactly one file that can only be processed by a single core, Spark did the only thing it could and left it all in a single partition.\n\nIf we want to take advantage of the multiple cores on our laptops or, eventually, the multiple nodes in our cluster, we need to **repartition** our data and redistribute it. Let\u0027s do that now and see if we can speed things up.",
"dateUpdated": "Nov 12, 2016 9:32:09 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478570975923_-1258189398",
"id": "20161107-210935_8134385",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eWhat the hell Spark!? You have all this computing power at your disposal and you\u0027re treating my data as if it\u0027s one contigious blob?\u003c/p\u003e\n\u003cp\u003eIf we ran this on a 200 cluster node, we wouldn\u0027t even keep a single node busy!\u003c/p\u003e\n\u003cp\u003eWhat happened behind the scenes was dependent on two things:\u003c/p\u003e\n\u003col\u003e\n\u003cli\u003eWe\u0027re only reading in one file\u003c/li\u003e\n\u003cli\u003eThat one file happens to be gzipped which is an encoding that cannot be \u0026ldquo;split\u0026rdquo; - only one process can decompresses the file completely (there are compression algos that allow multiple processes to decompress a chunk like \u003ca href\u003d\"http://www.oberhumer.com/opensource/lzo/\"\u003eLZO\u003c/a\u003e)\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eSince we\u0027re reading in exactly one file that can only be processed by a single core, Spark did the only thing it could and left it all in a single partition.\u003c/p\u003e\n\u003cp\u003eIf we want to take advantage of the multiple cores on our laptops or, eventually, the multiple nodes in our cluster, we need to \u003cstrong\u003erepartition\u003c/strong\u003e our data and redistribute it. Let\u0027s do that now and see if we can speed things up.\u003c/p\u003e\n"
},
"dateCreated": "Nov 7, 2016 9:09:35 AM",
"dateStarted": "Nov 12, 2016 9:32:09 PM",
"dateFinished": "Nov 12, 2016 9:32:09 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\n\nclickstream \u003d clickstream.repartition(sc.defaultParallelism)\n\u0027{:,}\u0027.format(clickstream.count())",
"dateUpdated": "Nov 10, 2016 10:18:55 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478571389447_-481616165",
"id": "20161107-211629_85220792",
"dateCreated": "Nov 7, 2016 9:16:29 AM",
"dateStarted": "Nov 10, 2016 10:18:41 PM",
"dateFinished": "Nov 10, 2016 10:18:41 PM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\n`sc.defaultParallelism` usually corresponds to the number of cores your CPU has when running Spark locally.\n\nI have 8 cores so I effectively asked Spark to create 8 partitions with equal amounts of data. Spark then counted up the elements in each partition in parallel threads and then summed the result.\n\n44s isn\u0027t amazing, but at least we\u0027re on par with pure Python now. Repartitioning isn\u0027t free and Spark programs do have some overhead so in this trivial example, it\u0027s pretty rough to beat a pure Python line count. But now that the data is partitioned, all subsequent operations we do on it are that much faster. Put another way, with one call to `repartition()` we just achieved a theoretical 8x increase in performance.\n\n**Data partitioning is very important in Spark**, it\u0027s usually the key reason why your Spark job either runs fast and utilizes all resources or runs slow as your cores sit idle.\n\nTo get a better idea of how partitioning helped us, let\u0027s look at a slightly harder example: summing up the total number of pageviews in our dataset.",
"dateUpdated": "Nov 12, 2016 9:32:12 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478571483128_1792019972",
"id": "20161107-211803_1176928010",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003e\u003ccode\u003esc.defaultParallelism\u003c/code\u003e usually corresponds to the number of cores your CPU has when running Spark locally.\u003c/p\u003e\n\u003cp\u003eI have 8 cores so I effectively asked Spark to create 8 partitions with equal amounts of data. Spark then counted up the elements in each partition in parallel threads and then summed the result.\u003c/p\u003e\n\u003cp\u003e44s isn\u0027t amazing, but at least we\u0027re on par with pure Python now. Repartitioning isn\u0027t free and Spark programs do have some overhead so in this trivial example, it\u0027s pretty rough to beat a pure Python line count. But now that the data is partitioned, all subsequent operations we do on it are that much faster. Put another way, with one call to \u003ccode\u003erepartition()\u003c/code\u003e we just achieved a theoretical 8x increase in performance.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData partitioning is very important in Spark\u003c/strong\u003e, it\u0027s usually the key reason why your Spark job either runs fast and utilizes all resources or runs slow as your cores sit idle.\u003c/p\u003e\n\u003cp\u003eTo get a better idea of how partitioning helped us, let\u0027s look at a slightly harder example: summing up the total number of pageviews in our dataset.\u003c/p\u003e\n"
},
"dateCreated": "Nov 7, 2016 9:18:03 AM",
"dateStarted": "Nov 12, 2016 9:32:12 PM",
"dateFinished": "Nov 12, 2016 9:32:12 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
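{
"text": "%pyspark\n# A quick sanity check on the repartitioning above: count the elements in each partition to\n# see how evenly the data was spread. Note that this triggers a full pass over the data, so\n# it is only worth running out of curiosity.\nper_partition = clickstream.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()\n'{} partitions, sizes: {}'.format(len(per_partition), per_partition)",
"config": {
"colWidth": 12.0,
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"status": "READY",
"progressUpdateIntervalMs": 500
},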
{
"text": "%pyspark\ndef parse_line(line):\n \u0027\u0027\u0027Parse a line in the log file, but ensure that the \u0027n\u0027 or \u0027views\u0027 column is\n convereted to an integer.\u0027\u0027\u0027\n parts \u003d line.split(\u0027\\t\u0027)\n parts[2] \u003d int(parts[2])\n return parts",
"dateUpdated": "Nov 10, 2016 10:19:30 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/scala"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478572100247_994143222",
"id": "20161107-212820_1637898955",
"dateCreated": "Nov 7, 2016 9:28:20 AM",
"dateStarted": "Nov 10, 2016 10:19:30 PM",
"dateFinished": "Nov 10, 2016 10:19:30 PM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\nwith gzip.open(CLICKSTREAM_FILE) as fp:\n views \u003d 0\n for line in fp:\n parsed \u003d parse_line(line.strip(\u0027\\n\u0027))\n views +\u003d parsed[2]\n\n\u0027{:,}\u0027.format(views)",
"dateUpdated": "Nov 10, 2016 10:19:39 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478834357179_435356742",
"id": "20161110-221917_2093806364",
"dateCreated": "Nov 10, 2016 10:19:17 PM",
"status": "READY",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\nfrom operator import add\n\n(clickstream\n .map(parse_line)\n .map(lambda parts: parts[2]) # just get views\n .reduce(add)) # identical to Python\u0027s reduce(add, [1, 2, 3, 4])",
"dateUpdated": "Nov 12, 2016 9:32:31 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478572303140_-140791057",
"id": "20161107-213143_497530321",
"dateCreated": "Nov 7, 2016 9:31:43 AM",
"dateStarted": "Nov 10, 2016 9:35:51 AM",
"dateFinished": "Nov 10, 2016 9:37:55 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\nWe went from 1m18s in pure Python to 32s in Spark - a 59% reduction. Maybe this Spark thing isn\u0027t so bad after all!\n\nAlthough this is only running on our local laptops, what you\u0027ve just demonstrated is the key to what allows Spark to process petabyte datasets in minutes (or even seconds). It gives you a horizontal scaling escape hatch for your data. No matter how much data you have, Spark can handle it so long as you can provide it with the hardware. If you can\u0027t, you may have to wait a bit longer, but you still get the benefits of massively parallel processing.\n\nThere\u0027s one last trick I want to show before we review what we\u0027ve learned which ultimately made people pay a lot of attention to Spark: **caching**.\n\nOur clickstream dataset would probably be a lot faster if we could hold most of it in-memory. Turns out, Spark has an easy way to do that:",
"dateUpdated": "Nov 12, 2016 9:32:44 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478572422149_-2100196256",
"id": "20161107-213342_125686747",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eWe went from 1m18s in pure Python to 32s in Spark - a 59% reduction. Maybe this Spark thing isn\u0027t so bad after all!\u003c/p\u003e\n\u003cp\u003eAlthough this is only running on our local laptops, what you\u0027ve just demonstrated is the key to what allows Spark to process petabyte datasets in minutes (or even seconds). It gives you a horizontal scaling escape hatch for your data. No matter how much data you have, Spark can handle it so long as you can provide it with the hardware. If you can\u0027t, you may have to wait a bit longer, but you still get the benefits of massively parallel processing.\u003c/p\u003e\n\u003cp\u003eThere\u0027s one last trick I want to show before we review what we\u0027ve learned which ultimately made people pay a lot of attention to Spark: \u003cstrong\u003ecaching\u003c/strong\u003e.\u003c/p\u003e\n\u003cp\u003eOur clickstream dataset would probably be a lot faster if we could hold most of it in-memory. Turns out, Spark has an easy way to do that:\u003c/p\u003e\n"
},
"dateCreated": "Nov 7, 2016 9:33:42 AM",
"dateStarted": "Nov 12, 2016 9:32:43 PM",
"dateFinished": "Nov 12, 2016 9:32:43 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\nfrom pyspark import StorageLevel\n\nclickstream_parsed \u003d clickstream.map(parse_line)\n# Store whatever you can in memory, but spill anything that doesn\u0027t fit onto disk\nclickstream_parsed.persist(StorageLevel.MEMORY_AND_DISK)",
"dateUpdated": "Nov 12, 2016 5:03:00 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478573792808_-1289556038",
"id": "20161107-215632_61079791",
"dateCreated": "Nov 7, 2016 9:56:32 AM",
"dateStarted": "Nov 10, 2016 10:19:38 PM",
"dateFinished": "Nov 10, 2016 10:19:38 PM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
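{
"text": "%pyspark\n# A quick optional check: persist() only marks the RDD for caching - the data itself is\n# materialized the first time an action runs over it (the sum in the next paragraph).\n# is_cached and getStorageLevel() simply report how the RDD has been flagged.\n(clickstream_parsed.is_cached, clickstream_parsed.getStorageLevel())",
"config": {
"colWidth": 12.0,
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"status": "READY",
"progressUpdateIntervalMs": 500
},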
{
"text": "%pyspark\n\n(clickstream_parsed\n .map(lambda parts: parts[2])\n .reduce(add))",
"dateUpdated": "Nov 10, 2016 10:19:39 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478834343787_-1536896768",
"id": "20161110-221903_458344037",
"dateCreated": "Nov 10, 2016 10:19:03 PM",
"status": "READY",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\nThis took a bit longer than our last run, and the reason is that Spark loaded up the data in each partition into JVM heap. When it hit memory limits, it spilled data to disk.\n\nNow that our cache is warmed though, let\u0027s see how fast we can get a sum.",
"dateUpdated": "Nov 12, 2016 9:32:27 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478574023768_199562291",
"id": "20161107-220023_9554624",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eThis took a bit longer than our last run, and the reason is that Spark loaded up the data in each partition into JVM heap. When it hit memory limits, it spilled data to disk.\u003c/p\u003e\n\u003cp\u003eNow that our cache is warmed though, let\u0027s see how fast we can get a sum.\u003c/p\u003e\n"
},
"dateCreated": "Nov 7, 2016 10:00:23 AM",
"dateStarted": "Nov 12, 2016 9:32:27 PM",
"dateFinished": "Nov 12, 2016 9:32:27 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\n(clickstream_parsed\n .map(lambda parts: parts[2])\n .reduce(add))",
"dateUpdated": "Nov 10, 2016 9:32:54 AM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/scala"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478574092696_1786944308",
"id": "20161107-220132_1816285552",
"dateCreated": "Nov 7, 2016 10:01:32 AM",
"dateStarted": "Nov 10, 2016 9:37:56 AM",
"dateFinished": "Nov 10, 2016 9:39:38 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\nHECK YEAH. From 1m18s to 9s! To be clear, this is a good speed up, but isn\u0027t a fair comparison to the original Python version.\n\nTo compare apples-to-apples there, we would need to load the entire data file into something like a list and do a sum which would also be ridiculously fast.\n\nHopefully the main point is coming across though. Although we\u0027re working with only a single gzip file locally, everything you\u0027ve done so far could instead process millions of gzip files across hundreds of servers **without changing a single line of code**. That\u0027s the power of Spark.\n\nOk, let\u0027s rehash what we\u0027ve learned so far.",
"dateUpdated": "Nov 12, 2016 9:33:02 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorHide": true,
"editorMode": "ace/mode/markdown"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478574147569_-2112208315",
"id": "20161107-220227_1043819962",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eHECK YEAH. From 1m18s to 9s! To be clear, this is a good speed up, but isn\u0027t a fair comparison to the original Python version.\u003c/p\u003e\n\u003cp\u003eTo compare apples-to-apples there, we would need to load the entire data file into something like a list and do a sum which would also be ridiculously fast.\u003c/p\u003e\n\u003cp\u003eHopefully the main point is coming across though. Although we\u0027re working with only a single gzip file locally, everything you\u0027ve done so far could instead process millions of gzip files across hundreds of servers \u003cstrong\u003ewithout changing a single line of code\u003c/strong\u003e. That\u0027s the power of Spark.\u003c/p\u003e\n\u003cp\u003eOk, let\u0027s rehash what we\u0027ve learned so far.\u003c/p\u003e\n"
},
"dateCreated": "Nov 7, 2016 10:02:27 AM",
"dateStarted": "Nov 12, 2016 9:33:01 PM",
"dateFinished": "Nov 12, 2016 9:33:01 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\nFiguring out the boss\u0027 second question \"What do people read on the site?\" can be translated as \"What are the top en.wikipedia.org articles?\".\n\nTo answer that, we\u0027ll need to explore another core Spark concept which is the key-value or pair RDD.\n\nPairRDDs differ from regular RDDs in that all elements are assumed to be two-element tuples or lists of `(key, value)`.\n\nLet\u0027s create a PairRDD that maps `curr_title -\u003e views` for our dataset.",
"dateUpdated": "Nov 12, 2016 9:32:51 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478742542334_1099773527",
"id": "20161109-204902_599151766",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eFiguring out the boss\u0027 second question \u0026ldquo;What do people read on the site?\u0026rdquo; can be translated as \u0026ldquo;What are the top en.wikipedia.org articles?\u0026ldquo;.\u003c/p\u003e\n\u003cp\u003eTo answer that, we\u0027ll need to explore another core Spark concept which is the key-value or pair RDD.\u003c/p\u003e\n\u003cp\u003ePairRDDs differ from regular RDDs in that all elements are assumed to be two-element tuples or lists of \u003ccode\u003e(key, value)\u003c/code\u003e.\u003c/p\u003e\n\u003cp\u003eLet\u0027s create a PairRDD that maps \u003ccode\u003ecurr_title -\u0026gt; views\u003c/code\u003e for our dataset.\u003c/p\u003e\n"
},
"dateCreated": "Nov 9, 2016 8:49:02 AM",
"dateStarted": "Nov 12, 2016 9:32:51 PM",
"dateFinished": "Nov 12, 2016 9:32:51 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
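{
"text": "%pyspark\n# A toy pair RDD first (made-up word counts), before building the curr_title -> views one\n# below: the first element of each tuple is treated as the key and the second as the value,\n# which is what key-based transformations like reduceByKey() rely on.\npairs = sc.parallelize([('spark', 1), ('wiki', 2), ('spark', 3)])\npairs.reduceByKey(lambda a, b: a + b).collect()  # [('spark', 4), ('wiki', 2)] (order may vary)",
"config": {
"colWidth": 12.0,
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"status": "READY",
"progressUpdateIntervalMs": 500
},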
{
"text": "%pyspark\n\n(clickstream_parsed\n .map(lambda parsed: (parsed[4], parsed[2]))\n .first())",
"dateUpdated": "Nov 10, 2016 9:32:54 AM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478743064843_1240739409",
"id": "20161109-205744_1312605228",
"dateCreated": "Nov 9, 2016 8:57:44 AM",
"dateStarted": "Nov 10, 2016 9:39:24 AM",
"dateFinished": "Nov 10, 2016 9:39:38 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\nCool, so we have a mapping of `curr_title -\u003e views`, now we know we want to do the equivalent of:\n\n```\nSELECT curr_title, SUM(views)\nFROM clickstream_parsed\nGROUP BY curr_title\nORDER BY SUM(views) DESC\nLIMIT 25\n```\n\nHow can we do that in Spark?",
"dateUpdated": "Nov 12, 2016 9:33:05 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478812703235_1794295241",
"id": "20161110-161823_651379",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eCool, so we have a mapping of \u003ccode\u003ecurr_title -\u0026gt; views\u003c/code\u003e, now we know we want to do the equivalent of:\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003eSELECT curr_title, SUM(views)\nFROM clickstream_parsed\nGROUP BY curr_title\nORDER BY SUM(views) DESC\nLIMIT 25\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003eHow can we do that in Spark?\u003c/p\u003e\n"
},
"dateCreated": "Nov 10, 2016 4:18:23 AM",
"dateStarted": "Nov 12, 2016 9:33:05 PM",
"dateFinished": "Nov 12, 2016 9:33:05 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark \nfrom operator import add\n\npprint(clickstream_parsed\n .map(lambda parsed: (parsed[4], parsed[2]))\n .reduceByKey(add)\n .top(25, key\u003dlambda (k, v): v))",
"dateUpdated": "Nov 12, 2016 5:11:31 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/python"
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478812782348_1793415185",
"id": "20161110-161942_227148540",
"dateCreated": "Nov 10, 2016 4:19:42 AM",
"dateStarted": "Nov 10, 2016 9:39:38 AM",
"dateFinished": "Nov 10, 2016 9:40:16 AM",
"status": "FINISHED",
"errorMessage": "",
"progressUpdateIntervalMs": 500
},
{
"text": "%md\n\nSweet! So no surprise that the top article was the `Main_Page`, but `Chris_Kyle` is a little surprising if you\u0027re not aware of who he is. \n\nChris Kyle was a US Navy Seal who died in Feb 2013 but was immortalized in the film American Sniper which premiered worldwide in mid-late January.\n\nThe interest in Charlie Hebdo correlates with the terrorist attacks against the satirical french magazine of the same name in January of 2015\n\nIn fact, most of the top pages, minus `Main_Page` seem to be correlated with news stories. People hear about a news event, maybe do a Google search for it and land at Wikipedia to learn more.\n\nOf course we haven\u0027t really validated the theory so let\u0027s try to do that for the `Chris_Kyle` article.\n\nIn SQL, we\u0027d want something like this:\n\n```\nSELECT curr_title, trafic_source, SUM(views)\nFROM clickstream\nWHERE curr_title \u003d \u0027Chris_Kyle\u0027\nGROUP BY curr_title, traffic_source\nORDER BY SUM(views) DESC\nLIMIT 100\n```\n\nReally all we\u0027re doing is adding another column to our group key. How would that translate to Spark?",
"dateUpdated": "Nov 12, 2016 9:33:10 PM",
"config": {
"colWidth": 12.0,
"graph": {
"mode": "table",
"height": 300.0,
"optionOpen": false,
"keys": [],
"values": [],
"groups": [],
"scatter": {}
},
"enabled": true,
"editorMode": "ace/mode/markdown",
"editorHide": true
},
"settings": {
"params": {},
"forms": {}
},
"jobName": "paragraph_1478813057260_-1996960245",
"id": "20161110-162417_1179182725",
"result": {
"code": "SUCCESS",
"type": "HTML",
"msg": "\u003cp\u003eSweet! So no surprise that the top article was the \u003ccode\u003eMain_Page\u003c/code\u003e, but \u003ccode\u003eChris_Kyle\u003c/code\u003e is a little surprising if you\u0027re not aware of who he is.\u003c/p\u003e\n\u003cp\u003eChris Kyle was a US Navy Seal who died in Feb 2013 but was immortalized in the film American Sniper which premiered worldwide in mid-late January.\u003c/p\u003e\n\u003cp\u003eThe interest in Charlie Hebdo correlates with the terrorist attacks against the satirical french magazine of the same name in January of 2015\u003c/p\u003e\n\u003cp\u003eIn fact, most of the top pages, minus \u003ccode\u003eMain_Page\u003c/code\u003e seem to be correlated with news stories. People hear about a news event, maybe do a Google search for it and land at Wikipedia to learn more.\u003c/p\u003e\n\u003cp\u003eOf course we haven\u0027t really validated the theory so let\u0027s try to do that for the \u003ccode\u003eChris_Kyle\u003c/code\u003e article.\u003c/p\u003e\n\u003cp\u003eIn SQL, we\u0027d want something like this:\u003c/p\u003e\n\u003cpre\u003e\u003ccode\u003eSELECT curr_title, trafic_source, SUM(views)\nFROM clickstream\nWHERE curr_title \u003d \u0027Chris_Kyle\u0027\nGROUP BY curr_title, traffic_source\nORDER BY SUM(views) DESC\nLIMIT 100\n\u003c/code\u003e\u003c/pre\u003e\n\u003cp\u003eReally all we\u0027re doing is adding another column to our group key. How would that translate to Spark?\u003c/p\u003e\n"
},
"dateCreated": "Nov 10, 2016 4:24:17 AM",
"dateStarted": "Nov 12, 2016 9:33:10 PM",
"dateFinished": "Nov 12, 2016 9:33:10 PM",
"status": "FINISHED",
"progressUpdateIntervalMs": 500
},
{
"text": "%pyspark\n\npprint(clickstream_parsed\n .filter(lambda parsed: parsed[4] \u003d\u003d \u0027Chris_Kyle\u0027)\n .map(lambda parsed: ((parsed[4], parsed[3]), parsed[2]))\n .reduceByKey(add)\n .top(10, key\u003dlambda (k, v): v))",