Programmer Documentation: Web API v1
All of the visualizations and web pages on the Topical Guide use AJAX calls to retrieve data from the server. Data is accessed through the Web API. This framework allows for flexible data retrieval.
For publicly available data, a GET request to the following URL is used:
https://<domain>/api?<query parts>
For private data, a POST request to the following URL is used:
https://<domain>/user_api
As a courtesy, we ask that all GET requests to the API be normalized to take advantage of caching: sort the query keys, and the values within each comma separated list, alphanumerically in ascending order. If you are building on our JavaScript framework for visualizations, this is done for you when the submitQueryByHash function is used.
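For clients that do not use the JavaScript framework, the normalization described above could be sketched as follows. This is only an illustration: the helper name normalize_query is hypothetical, and it assumes plain string sorting is what "alphanumerically" means (the text does not say whether integer values such as topic numbers should be ordered numerically instead).

```python
from urllib.parse import urlencode


def normalize_query(params):
    """Return a query string with keys and value lists sorted ascending.

    `params` maps each query key to a list of string values.
    """
    pairs = []
    for key in sorted(params):
        # Sort the values inside each comma separated list as well.
        values = sorted(str(value) for value in params[key])
        pairs.append((key, ",".join(values)))
    # Keep '*' and ',' unescaped so the result looks like the documented URLs.
    return urlencode(pairs, safe="*,")


query = normalize_query({
    "datasets": ["*"],
    "dataset_attr": ["metrics", "metadata"],
})
print(query)
```

Two requests that ask for the same data in a different order then produce the same query string, which is what allows a cache to treat them as one request.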
Requests to the above URLs yield a JSON string containing the requested information. No other formats are available at this time, but XML could be supported at a later date.
Each part, or CGI Parameter, of the query specifies what data to collect. In some cases the key:value pairs will specify constraints on some of the desired attributes.
Let's examine the following request:
https://localhost:8000/api?datasets=*&dataset_attr=metadata,metrics&analyses=*&analysis_attr=metadata,metrics
The datasets=* part asks the server to query over all datasets. Alternatively, the user could specify the URL friendly name of a dataset to restrict the query to a single dataset, or a comma separated list of dataset names for multiple datasets. Querying with * is useful for gathering the names of available datasets. The dataset_attr key specifies what information should be gathered for each dataset; the values metadata and metrics return the available metadata and metrics for each dataset. The same rules apply to analyses and analysis_attr.
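As a rough sketch, the request above could be assembled and sent from Python using only the standard library. The helper names build_url and fetch_json are hypothetical, and fetch_json assumes a Topical Guide server is actually reachable at the given address, so it is defined here but not called.

```python
import json
from urllib.request import urlopen


def build_url(base, parts):
    """Join query parts onto the public /api endpoint, keeping their order."""
    query = "&".join(f"{key}={','.join(values)}" for key, values in parts.items())
    return f"{base}/api?{query}"


def fetch_json(url):
    """Send the GET request and decode the JSON body (needs a running server)."""
    with urlopen(url) as response:
        return json.load(response)


url = build_url("https://localhost:8000", {
    "datasets": ["*"],
    "dataset_attr": ["metadata", "metrics"],
    "analyses": ["*"],
    "analysis_attr": ["metadata", "metrics"],
})
print(url)
```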
Example Output in the form of a JSON string:
{
    "datasets": {
        "state_of_the_union": {
            "metrics": {
                "Document Count": 223.0
            },
            "analyses": {
                "lda100topics": {
                    "metrics": {
                        "Token Count": 552564.0,
                        "Stopword Count": 0.0,
                        "Excluded Word Count": 0.0,
                        "Entropy": 6.638011436412113,
                        "Type Count": 117144.0
                    },
                    "metadata": {
                        "num_topics": 100,
                        "num_iterations": 10,
                        "readable_name": "LDA with 100 Topics",
                        "optimize_interval": 10,
                        "token_regex": "(([^\\W])+([-'\u2019,])?)+([^\\W])+",
                        "description": "Mallet LDA with 100 topics."
                    }
                }
            },
            "metadata": {
                "source": "WikiSource",
                "readable_name": "State of the Union",
                "description": "This dataset consists of State of the Union messages delivered by the U.S. President to Congress as mandated by the Constitution for the years 1790-2010."
            }
        }
    }
}
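Once decoded, a response of this shape is an ordinary nested dictionary. The following sketch walks it to list every analysis metric; the function name analysis_metrics is hypothetical, and the sample data is a trimmed copy of the example output.

```python
def analysis_metrics(response):
    """Yield (dataset, analysis, metric, value) for every analysis metric."""
    for dataset_name, dataset in response.get("datasets", {}).items():
        for analysis_name, analysis in dataset.get("analyses", {}).items():
            for metric, value in analysis.get("metrics", {}).items():
                yield dataset_name, analysis_name, metric, value


# A trimmed copy of the example output.
response = {
    "datasets": {
        "state_of_the_union": {
            "metrics": {"Document Count": 223.0},
            "analyses": {
                "lda100topics": {
                    "metrics": {"Token Count": 552564.0, "Type Count": 117144.0}
                }
            },
        }
    }
}
rows = list(analysis_metrics(response))
for row in rows:
    print(row)
```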
Context matters. Each analysis is in the context of its associated dataset, and each topic and document is in the context of both a dataset and an analysis. Specifying a document without a dataset or analysis will yield an empty result.
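A client could check for missing context before sending a request, rather than discovering it through an empty result. The sketch below is a client-side convenience only, not part of the API; the name missing_context and the REQUIRED_CONTEXT table are assumptions derived from the paragraph above.

```python
# Assumed nesting, taken from the paragraph above; where `words` fits is
# not covered here.
REQUIRED_CONTEXT = {
    "analyses": ["datasets"],
    "topics": ["datasets", "analyses"],
    "documents": ["datasets", "analyses"],
}


def missing_context(query):
    """Return the query keys that must be added to give the query context."""
    missing = []
    for key, required in REQUIRED_CONTEXT.items():
        if key in query:
            for name in required:
                if name not in query and name not in missing:
                    missing.append(name)
    return missing


print(missing_context({"documents": ["*"]}))
```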
For each attribute of a dataset, analysis, topic, document, or word, the name of the attribute will be a key nested under the corresponding dataset, analysis, topic, document, or word. The available query keys and allowed values are as follows:
The datasets key takes a comma separated list of dataset names or '*' for all datasets.
The analyses key takes a comma separated list of analysis names or '*' for all analyses.
The topics key takes a comma separated list of integers identifying the topics or '*' for all topics. Note that topic numbers start at zero (0).
The documents key takes a comma separated list of document file names or '*' for all documents. There are limits on the number of documents that can be retrieved, since many datasets contain large numbers of documents.
The words key takes a comma separated list of words or '*' for all words. This key allows word counts to be retrieved, but only in the context of a topic or document (not across the entire corpus).
The dataset_attr key takes a comma separated list of attributes. The allowed attributes include metadata, metrics, document_count, and analysis_count.
- metadata: return a dictionary of key:value pairs
- metrics: return a dictionary of key:value pairs
The analysis_attr key takes a comma separated list of attributes. The allowed attributes include metadata, metrics, and topic_count.
- See dataset_attr for metadata and metrics.
- topic_count: return an integer greater than or equal to zero
The topic_attr key takes a comma separated list of attributes. The allowed attributes include metadata, metrics, names, pairwise, top_n_words, top_n_documents, word_tokens, and word_token_documents_and_locations.
- See dataset_attr for metadata and metrics.
- names: return a dictionary of key:value pairs where the key is the naming scheme used and value is the name for the topic
- pairwise: return a dictionary of key:value pairs where the key is the pairwise metric name and the value is an array of floats indicating the topic-topic metric value, the index position of the array is the number of the other topic
- top_n_words: requires the query keys top_n_words and words to also be specified; return a dictionary of maximum size n, where each word is a key and the associated count of the word in the scope of the topic is the value
- top_n_documents: requires the query key top_n_documents to also be specified; return a dictionary of maximum size n, where each document is mapped to a token count of the number of tokens the corresponding topic contains in that document
- word_tokens: requires that the query key words is set; return a dictionary of words mapping to an array of 2-tuples; the first element in the tuple is the document, the second the index of the token in that document
- word_token_documents_and_locations: requires that the query key documents is set; return a dictionary mapping the document names to an array of 2-tuples; the first element in the tuple is the start index as a character offset and the second is the ending index of the token
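To illustrate how a pairwise array is read, the sketch below finds the other topic with the highest metric value for a given topic. The function name closest_topic and the sample values are hypothetical, and treating a higher value as "closer" is an assumption that may not hold for every metric.

```python
def closest_topic(pairwise_values, topic):
    """Return the number of the other topic with the highest metric value."""
    candidates = [
        (value, other)
        for other, value in enumerate(pairwise_values)
        if other != topic  # skip the topic's value against itself
    ]
    value, other = max(candidates)
    return other


# Hypothetical values for topic 0 against topics 0..3.
document_correlation = [1.0, 0.12, 0.57, 0.31]
print(closest_topic(document_correlation, 0))
```

The index position in the array is the number of the other topic, so no separate lookup table is needed.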
The document_attr key takes a comma separated list of attributes. The allowed attributes include metadata, metrics, text, top_n_topics, top_n_words, kwic, and word_token_topics_and_locations.
- See dataset_attr for metadata and metrics.
- text: return the document's raw text as given to the import system
- top_n_topics: return a dictionary mapping topic numbers to their token counts in the current document
- top_n_words: return a dictionary of words mapping each word to the token count
- kwic: stands for "key word in context"; returns a mapping from the specified token_indices to a 4-tuple of the following format: (text_before, word, text_after, word_type), where text_before is the text before the word, word is the word as it appears in the text, text_after is the text after the word, and word_type is the word token (this is different than the word since it could be in lowercase or a different form than the word in context)
- word_token_topics_and_locations: requires the desired words to be specified in the words list; return a dictionary mapping each word to a 2-tuple where the first element is the topic number and the second element is the character offset (from the beginning of the document text) of the word in the document
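As an illustration of the kwic format, the sketch below unpacks one 4-tuple into a display line. The function name format_kwic, the width parameter, and the sample entry are all hypothetical; since JSON has no tuple type, the entry is assumed to arrive as a four element list.

```python
def format_kwic(entry, width=20):
    """Render one kwic entry as 'before[word]after (word_type)'."""
    text_before, word, text_after, word_type = entry
    # Trim the surrounding context to at most `width` characters per side.
    return f"{text_before[-width:]}[{word}]{text_after[:width]} ({word_type})"


# Hypothetical kwic result keyed by token index.
kwic = {"3": ["The state of the ", "Union", " is strong.", "union"]}
lines = [format_kwic(entry) for entry in kwic.values()]
print(lines[0])
```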
Requires that the pairwise keyword is included in topic_attr. Takes a list of the desired metrics. Currently the available metrics include Word Correlation and Document Correlation.
Specify an integer greater than zero.
Specify an integer greater than zero.
Specify an integer greater than zero. This integer will represent the document index to start at for sequential document retrieval.
Optionally specify an integer greater than zero and no greater than 1000. The default value is 1000.
Specify any integer to seed the random sampler. This will override sequential document retrieval.
Specify the number of top words to retrieve. Goes with the top_n_words attribute specifier.
Requires kwic to be included in document_attr. Takes a list of indices of tokens desired for the specified document.