This repository was archived by the owner on Dec 12, 2018. It is now read-only.

Project 1 for CS5621 at the University of Minnesota, Duluth


d1vanloon/Hadoop-Wikistats

 
 


Hadoop-Wikistats

Project 1 for CS5621 at the University of Minnesota, Duluth

Deadlines

  • Project complete by 3/9

Architecture

Three MapReduce jobs:

  • Job 0: Identify the top M languages, ranked by number of unique pages (Bai)
  • Job 1: Identify the largest spike for each page over O days (David, Eric)
  • Job 2: Identify the N pages from the top M languages with the largest spikes (Bai, Stephen)
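
The README does not define "spike", so the following is only a sketch of what the Job 1 reducer might compute under one possible reading: the largest increase in a page's daily pageviews from one day to a later day at most O days afterward. The class name `SpikeSketch`, the method `largestSpike`, and the assumption that hourly counts have already been summed into one total per consecutive day are all hypothetical. It is plain Java with no Hadoop dependency.

```java
// Sketch only: "spike" is not defined in this README. This class ASSUMES it
// means the largest increase in daily pageviews from one day to a later day
// at most O days afterward. All names here are hypothetical.
final class SpikeSketch {

    /**
     * @param dailyViews total pageviews per day, in date order,
     *                   one entry per consecutive day (assumed input shape)
     * @param windowDays the O input: maximum distance in days between the
     *                   two days being compared
     * @return the largest increase found, or 0 if views never rise
     */
    static long largestSpike(long[] dailyViews, int windowDays) {
        if (windowDays < 1) {
            throw new IllegalArgumentException("windowDays must be at least 1");
        }
        long best = 0;
        for (int later = 1; later < dailyViews.length; later++) {
            // Only look back as far as the window allows.
            int earliest = Math.max(0, later - windowDays);
            for (int earlier = earliest; earlier < later; earlier++) {
                best = Math.max(best, dailyViews[later] - dailyViews[earlier]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        long[] views = {10, 50, 20, 100, 30};
        System.out.println(largestSpike(views, 1)); // prints 80 (20 -> 100)
        System.out.println(largestSpike(views, 3)); // prints 90 (10 -> 100)
    }
}
```

The nested loop costs O(days × O) per page, which is small for three months of data. A different definition of spike (for example, a ratio rather than a difference) would only change the expression inside the inner loop.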

Inputs

  • (N) Number of pages to return
  • (O) Length, in days, of the window over which a spike is measured
  • (M) Number of top languages to return

Outputs

For each of the N pages:

  • Page language
  • Page name
  • Size of spike
  • Number of unique pages in the page's language

Results should be sorted by the unique-page count of each page's language.
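
As a rough illustration of how the final output could be assembled, the sketch below selects the N largest spikes among pages in the top M languages and then orders the rows by language page count. It is plain Java, not the project's MapReduce code. The class and field names are hypothetical, and the descending sort order and the tie-break by spike size are assumptions, since the README does not state either.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch only: all names are hypothetical. Descending order and the
// tie-break on spike size are ASSUMPTIONS not stated in the README.
final class TopSpikesSketch {

    /** One page with its largest spike (assumed shape of Job 1 output). */
    static final class PageSpike {
        final String language;
        final String page;
        final long spike;

        PageSpike(String language, String page, long spike) {
            this.language = language;
            this.page = page;
            this.spike = spike;
        }
    }

    /** One output row: language, page, spike, unique pages in that language. */
    static final class Row {
        final String language;
        final String page;
        final long spike;
        final long languagePages;

        Row(String language, String page, long spike, long languagePages) {
            this.language = language;
            this.page = page;
            this.spike = spike;
            this.languagePages = languagePages;
        }
    }

    static List<Row> select(Map<String, Long> pagesPerLanguage,
                            List<PageSpike> spikes, int n, int m) {
        // Top M languages by unique page count.
        Set<String> topLanguages = pagesPerLanguage.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(m)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());

        // N largest spikes within those languages, then sort for output.
        return spikes.stream()
                .filter(s -> topLanguages.contains(s.language))
                .sorted(Comparator.comparingLong((PageSpike s) -> s.spike).reversed())
                .limit(n)
                .map(s -> new Row(s.language, s.page, s.spike,
                        pagesPerLanguage.get(s.language)))
                .sorted(Comparator.comparingLong((Row r) -> r.languagePages).reversed()
                        .thenComparing(Comparator.comparingLong((Row r) -> r.spike).reversed()))
                .collect(Collectors.toList());
    }
}
```

Sorting twice keeps the two requirements separate: the first sort decides which N pages survive, and the second decides only the order in which they are reported.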

Interfaces

| Job | Output of Map (Input of Reduce) | Output of Job (Input of Next Job) |
| --- | --- | --- |
| 0 | Key: TBD. Value: TBD. | Key: TBD. Value: TBD. |
| 1 | Key: language + page. The language is a two-character string. The page name is a string of characters. The language and page are not separated by a space. Example: "enMain_Page". Value: date + hour + pageviews. The date is an 8-character string in the form YYYYMMDD. The hour is a two-character string. Pageviews is a string of characters. The date, hour, and pageviews are separated by spaces. Example: "20140601 00 156". | Key: language + page. Value: spike, in the form "valueOfSpike". |
| 2 | Key: language. Value: page + spike. | TBD |
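
The Job 1 map output format described above can be decoded with a few string operations. The following is a minimal sketch in plain Java. The class `Job1Records`, its nested types, and the validation rules are hypothetical and are not taken from the project source.

```java
// Sketch only: decodes the Job 1 map output strings described in the
// Interfaces section. Class and method names are hypothetical.
final class Job1Records {

    /** Key: two-character language followed directly by the page name. */
    static final class PageKey {
        final String language;
        final String page;

        PageKey(String language, String page) {
            this.language = language;
            this.page = page;
        }
    }

    /** Value: "YYYYMMDD HH pageviews", separated by spaces. */
    static final class HourlyViews {
        final String date;
        final int hour;
        final long pageviews;

        HourlyViews(String date, int hour, long pageviews) {
            this.date = date;
            this.hour = hour;
            this.pageviews = pageviews;
        }
    }

    static PageKey parseKey(String key) {
        if (key.length() < 3) {
            throw new IllegalArgumentException("Key too short: " + key);
        }
        // The first two characters are the language; the rest is the page.
        return new PageKey(key.substring(0, 2), key.substring(2));
    }

    static HourlyViews parseValue(String value) {
        String[] parts = value.trim().split("\\s+");
        if (parts.length != 3 || parts[0].length() != 8 || parts[1].length() != 2) {
            throw new IllegalArgumentException("Malformed value: " + value);
        }
        return new HourlyViews(parts[0],
                Integer.parseInt(parts[1]),
                Long.parseLong(parts[2]));
    }

    public static void main(String[] args) {
        PageKey key = parseKey("enMain_Page");
        HourlyViews views = parseValue("20140601 00 156");
        System.out.println(key.language + " / " + key.page);
        System.out.println(views.date + " hour " + views.hour + ": " + views.pageviews);
    }
}
```

Because the key has no separator, the split relies entirely on the language being exactly two characters, as the table specifies.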

Data Access

Shared WikiStats data location:

/panfs/roc/scratch/vanlo013/project1/inputdata

All data from June-August 2014 is in the folder inputdata. I've added permissions for the group to traverse the directories and read the data and can add additional permissions as required. Email me at [email protected] with questions.
