Bigram Indexer

Indexing Documents

When indexing documents with the bigram_indexer.py script, the following parameters are required:

--new-docs
--type Used primarily to categorize the type of document being indexed.
-d The path to the documents being indexed. These should be plain-text files, with each file name matching the ID of the document.
-m The location of the mapping file, which maps the external ID (i.e. MongoDB object ID) to the document's ID.

The process that runs over each batch of documents in the incoming documents directory is as follows:

Read the mapping file and the document IDs from the file names to create an in-memory map.
Insert records into the documents table, which contains the external ID of each document.
Read through each document and insert its terms into the term table. This creates a unique ID for each term.
Finally, use the IDs generated from the term rows in the cross-reference tables gram, tf, and idf, in which the terms are mapped back to the documents.
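
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in bigram_indexer.py: the table columns, the tab-separated mapping format (external ID, then document ID), whitespace tokenization, and the helper names create_schema, load_mapping, and index_batch are all assumptions made for the example. SQLite stands in for the real database, and the idf table is left out for brevity.

```python
import sqlite3
from collections import Counter


def create_schema(conn):
    # Hypothetical schema; the real column names are not given in this page.
    conn.executescript("""
        CREATE TABLE documents (id INTEGER PRIMARY KEY, external_id TEXT);
        CREATE TABLE term (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
        CREATE TABLE gram (first_term_id INTEGER, second_term_id INTEGER,
                           doc_id INTEGER, occurrences INTEGER);
        CREATE TABLE tf (term_id INTEGER, doc_id INTEGER, occurrences INTEGER);
    """)


def load_mapping(lines):
    # Step 1: build an in-memory map of document ID -> external ID.
    # Assumes one "external_id<TAB>document_id" pair per line.
    mapping = {}
    for line in lines:
        line = line.strip()
        if line:
            external_id, doc_name = line.split("\t")
            mapping[doc_name] = external_id
    return mapping


def index_batch(conn, mapping, docs):
    # docs maps a document ID (the file name) to its plain-text content.
    cur = conn.cursor()

    # Step 2: one documents row per file, holding the external ID.
    doc_ids = {}
    for name in docs:
        cur.execute("INSERT INTO documents (external_id) VALUES (?)",
                    (mapping[name],))
        doc_ids[name] = cur.lastrowid

    # Step 3: insert each distinct term once and remember its row ID.
    term_ids = {}
    for text in docs.values():
        for term in text.lower().split():
            if term not in term_ids:
                cur.execute("INSERT OR IGNORE INTO term (term) VALUES (?)",
                            (term,))
                cur.execute("SELECT id FROM term WHERE term = ?", (term,))
                term_ids[term] = cur.fetchone()[0]

    # Step 4: cross-reference rows that map terms back to documents.
    for name, text in docs.items():
        tokens = text.lower().split()
        for (a, b), n in Counter(zip(tokens, tokens[1:])).items():
            cur.execute("INSERT INTO gram VALUES (?, ?, ?, ?)",
                        (term_ids[a], term_ids[b], doc_ids[name], n))
        for term, n in Counter(tokens).items():
            cur.execute("INSERT INTO tf VALUES (?, ?, ?)",
                        (term_ids[term], doc_ids[name], n))
    conn.commit()
```

In this sketch the term table is the only place a term string is stored; gram and tf refer to terms purely by ID, which is why the term rows have to exist before the cross-reference rows are written.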

Notes

Currently there are deadlock-related issues when running this process with multiple threads or processes. To fix this, the inserts into the term table should be separated from the inserts into the gram, tf, and idf tables. This will allow inserts to succeed or fail when appropriate. Afterwards, the inserts into gram should be able to run in parallel. This will likely require less memory in both the database process and this Python script.
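
One way the proposed separation might look is sketched below. It is only an illustration of the two-phase idea, not an implemented fix: the schema, the whitespace tokenization, and the function names make_db, insert_terms, and insert_grams are hypothetical, and SQLite is used as a stand-in so the example is self-contained. Phase 1 commits all term rows in their own transaction; phase 2 only reads term IDs, so each per-document call is independent of the others and could in principle be handed to a separate worker with its own database connection.

```python
import sqlite3


def make_db():
    # Hypothetical minimal schema for the example.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE term (id INTEGER PRIMARY KEY, term TEXT UNIQUE);
        CREATE TABLE gram (first_term_id INTEGER, second_term_id INTEGER,
                           doc_id INTEGER);
    """)
    return conn


def insert_terms(conn, docs):
    # Phase 1: insert every distinct term in a single transaction,
    # separate from any gram, tf, or idf inserts.
    terms = sorted({t for text in docs.values() for t in text.lower().split()})
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT OR IGNORE INTO term (term) VALUES (?)",
                         [(t,) for t in terms])
    # Return a term -> ID lookup for phase 2.
    return dict(conn.execute("SELECT term, id FROM term"))


def insert_grams(conn, doc_id, text, term_ids):
    # Phase 2: insert bigram rows for one document. This never writes to
    # the term table, so it does not compete for term rows.
    tokens = text.lower().split()
    rows = [(term_ids[a], term_ids[b], doc_id)
            for a, b in zip(tokens, tokens[1:])]
    with conn:
        conn.executemany("INSERT INTO gram VALUES (?, ?, ?)", rows)
    return len(rows)
```

A batch would then call insert_terms once, followed by insert_grams once per document. Because the term lookup is built a single time and passed in, each phase 2 call needs only its own document's text in memory.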
