Hadoop-Wikistats

Project 1 for CS5621 at the University of Minnesota, Duluth

Deadlines

Project complete by 3/9

Architecture

Three MapReduce jobs:

Job 0: Identify the top M languages, ranked by number of unique pages (Bai)
Job 1: Identify the largest spike for each page over O days (David, Eric)
Job 2: Identify the N pages with the largest spikes among the top M languages (Bai, Stephen)
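
As a rough illustration of what Job 0 computes, here is a minimal sketch in plain Python rather than the Hadoop API. The function name top_languages, the (language, page) input shape, and the tie-breaking rule are assumptions made for this example, not part of the project code.

```python
from collections import defaultdict

def top_languages(records, m):
    """Return the m languages with the most unique pages.

    records: iterable of (language, page) pairs; duplicates are allowed
    and are counted once per language.
    """
    pages_by_language = defaultdict(set)
    for language, page in records:
        pages_by_language[language].add(page)
    counts = {lang: len(pages) for lang, pages in pages_by_language.items()}
    # Highest unique-page count first; ties broken by language code
    # (the tie-break is an assumption, the README does not define one).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:m]

sample = [
    ("en", "Main_Page"), ("en", "Hadoop"), ("en", "Main_Page"),
    ("de", "Hauptseite"),
    ("fr", "Accueil"), ("fr", "Paris"), ("fr", "Lyon"),
]
print(top_languages(sample, 2))
```

In a MapReduce setting the set-building step would be spread across reducers keyed by language; the sketch only shows the intended result.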

Inputs

(N) Number of pages to return
(O) Length, in days, of the period over which a spike is measured
(M) Number of top languages to consider
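
The README does not define "spike" precisely. The sketch below assumes one possible reading: the largest increase in daily pageviews between two days that are at most O days apart. That definition, the function name largest_spike, and the choice to ignore days with no data are all assumptions for illustration. The input strings follow the "date hour pageviews" value format described under Interfaces.

```python
from collections import defaultdict
from datetime import datetime

def largest_spike(values, o_days):
    """Largest rise in daily pageviews between two days at most o_days apart.

    values: iterable of "YYYYMMDD HH pageviews" strings for a single page.
    Returns 0 if pageviews never increase within the window.
    """
    daily = defaultdict(int)
    for value in values:
        date, _hour, pageviews = value.split(" ")
        # Sum the hourly counts into one total per calendar day.
        daily[datetime.strptime(date, "%Y%m%d").date()] += int(pageviews)

    days = sorted(daily)
    best = 0
    for i, start in enumerate(days):
        for end in days[i + 1:]:
            if (end - start).days > o_days:
                break  # later days are even further away
            best = max(best, daily[end] - daily[start])
    return best

hourly = ["20140601 00 100", "20140601 01 50", "20140602 00 400", "20140605 00 900"]
print(largest_spike(hourly, 1))
print(largest_spike(hourly, 4))
```

Days missing from the input are skipped rather than treated as zero views; a different spike definition would change this choice.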

Outputs

For each of the N pages:

Page language
Page name
Size of spike
Number of unique pages in the page's language

Results should be sorted by the unique-page count of each page's language.
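
One way to read the output requirements is sketched below in plain Python: keep only pages from the top M languages, take the N largest spikes overall, then order the rows by language page count. The function name select_pages, the descending sort direction, and the use of spike size as a secondary sort key are assumptions; the README does not specify them.

```python
import heapq

def select_pages(spikes, language_counts, n):
    """Pick the n largest spikes among pages in the top languages.

    spikes: iterable of (language, page, spike) tuples.
    language_counts: dict mapping each top-M language to its unique-page count.
    Returns rows of (language, page, spike, language_page_count).
    """
    candidates = [row for row in spikes if row[0] in language_counts]
    top = heapq.nlargest(n, candidates, key=lambda row: row[2])
    # Largest languages first, then larger spikes first within a language
    # (both directions are assumptions).
    top.sort(key=lambda row: (-language_counts[row[0]], -row[2]))
    return [(lang, page, spike, language_counts[lang]) for lang, page, spike in top]

counts = {"en": 3, "fr": 5}
rows = [("en", "A", 100), ("fr", "B", 50), ("de", "C", 999), ("en", "D", 10), ("fr", "E", 70)]
for row in select_pages(rows, counts, 3):
    print(row)
```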

Interfaces

Job 0
Output of Map (Input of Reduce): Key: TBD. Value: TBD.
Output of Job (Input of Next Job): Key: TBD. Value: TBD.

Job 1
Output of Map (Input of Reduce):
Key: language + page. The language is a two-character string. The page name is a string of characters. The language and page are not separated by a space. Example: "enMain_Page"
Value: date + hour + pageviews. The date is an 8-character string in the form YYYYMMDD. The hour is a two-character string. Pageviews is a string of characters. The date, hour, and pageviews are separated by spaces. Example: "20140601 00 156"
Output of Job (Input of Next Job):
Key: language + page
Value: spike, in the form "valueOfSpike"

Job 2
Output of Map (Input of Reduce): Key: language. Value: page + spike.
Output of Job (Input of Next Job): TBD
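
The Job 1 key and value formats described above can be built and parsed with a few small helpers. This is a sketch in plain Python, not the project's actual code; the helper names are hypothetical, and the fixed two-character split relies entirely on the statement that the language is always two characters long.

```python
def make_key(language, page):
    """Concatenate language and page with no separator, e.g. "enMain_Page"."""
    return language + page

def parse_key(key):
    """Split a key into (language, page), assuming a two-character language."""
    return key[:2], key[2:]

def make_value(date, hour, pageviews):
    """Build "YYYYMMDD HH pageviews"; hour is zero-padded to two characters."""
    return f"{date} {int(hour):02d} {pageviews}"

def parse_value(value):
    """Split a value into (date, hour, pageviews), with pageviews as an int."""
    date, hour, pageviews = value.split(" ")
    return date, hour, int(pageviews)

key = make_key("en", "Main_Page")
value = make_value("20140601", 0, 156)
print(key, "->", parse_key(key))
print(value, "->", parse_value(value))
```

Because the key has no separator, a language code longer than two characters would be split incorrectly by parse_key; a delimiter would be needed if such codes appear in the data.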

Data Access

Shared WikiStats data location:

/panfs/roc/scratch/vanlo013/project1/inputdata

All data from June-August 2014 is in the folder inputdata. I've added permissions for the group to traverse the directories and read the data and can add additional permissions as required. Email me at vanlo013@d.umn.edu with questions.
