forked from jacobseunglee/information_retrieval_system
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathTEST.txt
113 lines (104 loc) · 6.15 KB
/
TEST.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
THIS IS THE FILE WHERE WE DOCUMENT OUR QUERY TESTS
CATEGORY 1 - Bad Ranking Performance or Bad Runtime Performance
Query 1 - "machine learning"
Improved (Ranking):
- Added bigrams index to ensure machine and learning were together
Explanation:
- By rewarding document ranks when there is an equivalent bigram, the query showed results
related to the topic of machine learning rather than machine and learning separately.
Query 2 - "binary search trees"
Improved (Ranking):
- Added trigrams index to ensure words like binary search and trees were all together
Explanation:
- Many results had binary, search and trees separately in the document, not properly
catching the purpose of the query. Trigrams ensure that these words are in succession
Query 3 - "learning machinery in computation"
Improved (Ranking):
- Stemmed query and tokens in inverted indexes
Explanation:
- Stemming allows for past tense, present tense, future tense, to all map to the same stem
- All of the possible tenses of a word should be relavant when querying
Query 4 - "to be or not to be with extra words" :
Improved (Runtime):
- Added pre-query caching of common stop words
Explanation:
- Stop words are very common and have larger postings
- Most queries will include stop words so there is merit in preprocessing the stop words
Query 5 - "how to train your dragon with computer science principles in order to do binary search tree problems for a programming class"
Improved (Runtime):
- Added a threshold for max documents processed when considering candidate documents
Explanation:
- Since we order terms by rarity, the most relavant documents will already be in the accumulated candidate documents by
a certain threshold
Query 6 - "master of software engineering"
Improved (Runtime/Ranking):
- Added champion lists to pick out the best documents in a token
Explanation:
- Speeds up query processing because there are less documents to go through
- Doesn't really affect quality because we take the top documents from each token
Query 7 - "i want to be the very best software engineer to be the very but to be the very best like the to be"
Improved (Runtime):
- Made the query into a set before processing
Explanation:
- Removed the duplicates so there were less calls to find the token in the index
- Less iterations in the processing for loops
- Still kept integrity of positions for positional processing so no changes there
Query 8 - "how to do the computer science project where i need to follow basic instructions because i cannot read"
Improved (Runtime):
- Build index along with the processing
Explanation:
- Looking at the time, much of the preprocessing, building the local indexes, took a lot of time
- Changing so that building the index as you go allowed for less computation in the case of early termination
Query 9 - "mission the mission to mississippi state misspel word"
Improved (Runtime/Ranking):
- Changed bigrams and trigrams to compute multipliers instead of intersection
Explanation:
- Computing if all bigrams/trigrams appeared in a document was inefficient
- Very rare for all n-gram to appear in document
- Faster to compute multipliers for the documents
Query 10 - "is there a medical field in computer science when cows can fly and classes have papers to write with confusing instructions students struggle"
Improved (Runtime):
- Capped positional processing to maximum of 5 checks
Explanation:
- Too many combinations were being generated with little added benefit
- Changed to only take 5 and results stayed similar
CATEGORY 2 - Good Query Results
Query 1 - "video games as a tool"
Explanation:
- This returned good results but took some time to process
- After optimizations we were able to keep the good results but make the query processing faster
Query 2 - "acm"
Explanation:
- We get results fast and links are good
- After optimizations the pages we get are the same but the order changes and this new order makes more sense
Query 3 - "sports research"
Explanation:
- Still good after bigram bias
Query 4 - "cristina lopes"
Explanation:
- Pages include faculty pages and other related pages to Cristina Lopes
- After adding champion lists, we modified some parameters and the query stayed similar
Query 5 - "cs178"
Explanation:
- Pages included classes related to the course
Query 6 - "dragon tree"
Explanation:
- Not many pages with dragon in the corpus
- Was consistent throughtout changes with bigrams and rarity checking
- Kept dragon relavant without including trees from binary search trees
Query 7 - "big o notation"
Explanation:
- Always had related links to data structure class pages
- Positional index improved results a bit by rewarding big notation as o is referred to commonly as omega
Query 8 - "computer science internship"
Explanation:
- Shows related career websites and courses related to internships for computer science
- Continued to have similar results even when removing full conjunctive queries and adding positional/ngrams
Query 9 - "traveling salesman"
Explanation:
- Consistently had pages mentioning traveling and salesman together which were all related to graphs
- Did not change much when adding champion lists
Query 10 - "information retrieval"
Explanation:
- Shows all pages with computer science 121 course references
- None of the modifications with the exception of bigrams impacted this query