Project 1 for CS5621 at the University of Minnesota, Duluth
- Project complete by 3/9
Three MapReduce jobs:
- Job 0: Identify the top M languages (largest number of unique pages) (Bai)
- Job 1: Identify largest spike for each page over O days (David, Eric)
- Job 2: Identify N pages from the top M languages with the largest spikes (Bai, Stephen)
Parameters:
- (N) Number of pages to return
- (O) Length of the spike window, in days
- (M) Number of top languages to return
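To make Job 0 concrete, here is a minimal single-machine sketch in Python of what it computes: the M languages with the most unique pages. It is not the MapReduce implementation. The function name `top_languages` and the input shape, an iterable of (language, page) pairs, are assumptions made for illustration, since the Job 0 keys and values are still TBD.

```python
from collections import defaultdict
import heapq


def top_languages(records, m):
    """Return the m languages with the most unique pages.

    ASSUMPTION: `records` is an iterable of (language, page) pairs.
    The real Job 0 key/value format has not been decided yet.
    """
    pages_by_language = defaultdict(set)
    for language, page in records:
        # A set keeps each page once, however often it appears in the logs.
        pages_by_language[language].add(page)

    counts = {lang: len(pages) for lang, pages in pages_by_language.items()}
    # Pairs of (language, unique page count), largest count first.
    return heapq.nlargest(m, counts.items(), key=lambda item: item[1])
```

In a real job the per-language sets would likely be too large for one reducer, so deduplicating on a language + page key first and counting afterwards may be a better fit.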
Output for each of the N pages:
- Page language
- Page name
- Size of spike
- Number of unique pages in the page's language
Results should be sorted by the number of unique pages in each page's language.
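The final selection and ordering could look roughly like the Python sketch below. It rests on several assumptions that the spec does not settle: N is taken as a total across all top languages (not N per language), the sort is descending by language page count, and ties within a language are ordered by spike size. The function name `select_pages` and its input shapes are hypothetical.

```python
import heapq


def select_pages(spikes, language_counts, n):
    """Pick the n pages with the largest spikes among the top languages.

    ASSUMPTIONS (not from the project spec):
      - `spikes` is an iterable of (language, page, spike) tuples.
      - `language_counts` maps each top-M language to its unique page count.
      - n is a total across languages, and the sort is descending.
    """
    # Keep only pages whose language made the top-M list.
    candidates = [row for row in spikes if row[0] in language_counts]
    top = heapq.nlargest(n, candidates, key=lambda row: row[2])
    # Order by the language's page count, then by spike size.
    top.sort(key=lambda row: (-language_counts[row[0]], -row[2]))
    return [(lang, page, spike, language_counts[lang])
            for lang, page, spike in top]
```

Each returned tuple carries the four output fields listed above: language, page name, spike size, and the language's unique page count.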
| Job | Output of Map (Input of Reduce) | Output of Job (Input of Next Job) |
| --- | --- | --- |
| 0 | Key: TBD. Value: TBD. | Key: TBD. Value: TBD. |
| 1 | Key: language + page. The language is a two-character string. The page name is a string of characters. The language and page are not separated by a space. Example: "enMain_Page". Value: date + hour + pageviews. The date is an 8-character string in the form YYYYMMDD. The hour is a two-character string. Pageviews is a string of characters. The date, hour, and pageviews are separated by spaces. Example: "20140601 00 156". | Key: language + page. Value: spike, in the form "valueOfSpike". |
| 2 | Key: language. Value: page + spike. | TBD |
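As a rough illustration of the Job 1 formats, the Python sketch below parses a key and a value and then computes a spike for one page. The spike definition is an assumption, not something the project has fixed: here it is the largest rise in total daily pageviews from one day to a later day inside a window of O days. All function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def parse_key(key):
    """Split a key such as 'enMain_Page' into (language, page)."""
    return key[:2], key[2:]


def parse_value(value):
    """Split a value such as '20140601 00 156' into (date, hour, pageviews)."""
    day, hour, views = value.split(" ")
    parsed_day = date(int(day[:4]), int(day[4:6]), int(day[6:8]))
    return parsed_day, int(hour), int(views)


def largest_spike(values, o_days):
    """Largest rise in daily pageviews within any window of o_days days.

    ASSUMPTION: this definition of a spike is ours, not the project's.
    """
    daily = defaultdict(int)
    for value in values:
        day, _hour, views = parse_value(value)
        daily[day] += views  # sum the hourly counts into a daily total

    days = sorted(daily)
    best = 0
    for i, start in enumerate(days):
        for end in days[i + 1:]:
            if (end - start).days >= o_days:
                break  # later days fall outside the window
            best = max(best, daily[end] - daily[start])
    return best
```

A reducer for Job 1 would receive all values for one language + page key, so `largest_spike` corresponds to the work done per key, and its result would be written out as the "valueOfSpike" string.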
Shared WikiStats data location:
/panfs/roc/scratch/vanlo013/project1/inputdata
All data from June-August 2014 is in the folder inputdata. I've added permissions for the group to traverse the directories and read the data and can add additional permissions as required. Email me at [email protected] with questions.