
Programmer Documentation: Web API v1

Ethan Garofolo edited this page Oct 13, 2015 · 2 revisions

Web API

All of the visualizations and web pages on the Topical Guide use AJAX calls to retrieve data from the server. Data is accessed through the Web API. This framework allows for flexible data retrieval.

URL

For publicly available data, a GET request to the following URL is used:

https://<domain>/api?<query parts>

For private data, a POST request to the following URL is used:

https://<domain>/user_api

Etiquette and Normalization

As a courtesy, we ask that all GET requests to the API be normalized so that they can take advantage of caching. Sort the query keys, and the values within each list, alphanumerically in ascending order. If you are building on our JavaScript framework for visualizations, this is done for you when the submitQueryByHash function is used.
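The normalization rule above can be sketched in Python roughly as follows. This is a minimal illustration rather than the project's own code: the function name normalize_query is hypothetical, and the sketch assumes that "alphanumerically" means plain string ordering (so 10 sorts before 2).

```python
from urllib.parse import quote


def normalize_query(params):
    """Build a query string with sorted keys and sorted value lists.

    `params` maps each query key to a list of values.
    Assumption: values are compared as plain strings.
    """
    parts = []
    for key in sorted(params):
        values = sorted(str(value) for value in params[key])
        # Keep '*' readable; percent-encode any other unsafe character.
        encoded = ",".join(quote(value, safe="*") for value in values)
        parts.append(f"{quote(key, safe='')}={encoded}")
    return "&".join(parts)


query = normalize_query({
    "datasets": ["*"],
    "dataset_attr": ["metrics", "metadata"],
    "analyses": ["*"],
    "analysis_attr": ["metrics", "metadata"],
})
print(query)
# analyses=*&analysis_attr=metadata,metrics&dataset_attr=metadata,metrics&datasets=*
```

Two requests that differ only in the order of their keys or values then produce the same URL, which is what makes them cacheable as one entry.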

Formats

Requests to the above URLs return a JSON string containing the requested information. No other formats are currently available, but XML could be supported at a later date.

Example

Each part, or CGI Parameter, of the query specifies what data to collect. In some cases the key:value pairs will specify constraints on some of the desired attributes.

Let's examine the following request:

https://localhost:8000/api?datasets=*&dataset_attr=metadata,metrics&analyses=*&analysis_attr=metadata,metrics

The datasets=* part asks to query over all datasets, which is useful for gathering the names of available datasets. Alternatively, the user could specify the URL-friendly name of a dataset to restrict the query to that dataset, or a comma-separated list of dataset names for multiple datasets. The dataset_attr key specifies what information should be gathered for each dataset; the values metadata and metrics return the available metadata and metrics for each dataset. The same rules apply to analyses and analysis_attr.

Example Output in the form of a JSON string:

{
    "datasets": {
        "state_of_the_union": {
            "metrics": {
                "Document Count": 223.0
            }, 
            "analyses": {
                "lda100topics": {
                    "metrics": {
                        "Token Count": 552564.0, 
                        "Stopword Count": 0.0, 
                        "Excluded Word Count": 0.0, 
                        "Entropy": 6.638011436412113, 
                        "Type Count": 117144.0
                    }, 
                    "metadata": {
                        "num_topics": 100, 
                        "num_iterations": 10, 
                        "readable_name": "LDA with 100 Topics", 
                        "optimize_interval": 10, 
                        "token_regex": "(([^\\W])+([-'\u2019,])?)+([^\\W])+", 
                        "description": "Mallet LDA with 100 topics."
                    }
                }
            }, 
            "metadata": {
                "source": "WikiSource", 
                "readable_name": "State of the Union", 
                "description": "This dataset consists of State of the Union messages delivered by the U.S. President to Congress as mandated by the Constitution for the years 1790-2010."
            }
        }
    }
}
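To illustrate how this nesting might be consumed, here is a small Python sketch that walks a trimmed copy of the example output and lists each analysis under its dataset. The helper name list_analyses is hypothetical, and falling back to the analysis key when readable_name is missing is an assumption made for this example.

```python
import json

# Trimmed copy of the example output above.
response_text = """
{
    "datasets": {
        "state_of_the_union": {
            "metrics": {"Document Count": 223.0},
            "analyses": {
                "lda100topics": {
                    "metrics": {"Token Count": 552564.0},
                    "metadata": {"readable_name": "LDA with 100 Topics"}
                }
            },
            "metadata": {"readable_name": "State of the Union"}
        }
    }
}
"""


def list_analyses(response):
    """Return (dataset name, analysis name, readable analysis name) tuples."""
    rows = []
    for dataset_name, dataset in response.get("datasets", {}).items():
        for analysis_name, analysis in dataset.get("analyses", {}).items():
            # Assumption: use the analysis key if no readable name is present.
            readable = analysis.get("metadata", {}).get("readable_name", analysis_name)
            rows.append((dataset_name, analysis_name, readable))
    return rows


rows = list_analyses(json.loads(response_text))
print(rows)
# [('state_of_the_union', 'lda100topics', 'LDA with 100 Topics')]
```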

Context

Context matters. Each analysis is in the context of its associated dataset, and each topic and document is in the context of both a dataset and an analysis. Specifying a document without a dataset or analysis will produce an empty result.
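A client could check these context rules before sending a request, since a missing parent key yields an empty result rather than an error. The sketch below is only an illustration: the table REQUIRED_CONTEXT and the function missing_context are hypothetical names, and the table encodes only the rules stated in this section.

```python
# Required parent keys, following the context rules described above.
REQUIRED_CONTEXT = {
    "analyses": ["datasets"],
    "topics": ["datasets", "analyses"],
    "documents": ["datasets", "analyses"],
}


def missing_context(params):
    """Return the sorted parent keys that the query in `params` still needs."""
    missing = set()
    for key, parents in REQUIRED_CONTEXT.items():
        if key in params:
            missing.update(parent for parent in parents if parent not in params)
    return sorted(missing)


print(missing_context({"documents": ["*"]}))
# ['analyses', 'datasets']
print(missing_context({"datasets": ["*"], "analyses": ["*"], "topics": ["0"]}))
# []
```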

Other Options

For each requested attribute of a dataset, analysis, topic, document, or word, the attribute name appears as a key nested under the corresponding dataset, analysis, topic, document, or word. The available query keys and allowed values are as follows:

datasets

Takes a comma separated list of dataset names or '*' for all datasets.

analyses

Takes a comma separated list of analysis names or '*' for all analyses.

topics

Takes a comma-separated list of integers identifying the topics or '*' for all topics. Note that topic numbers start at zero (0).

documents

Takes a comma-separated list of document file names or '*' for all documents. There are limits on the number of documents that can be retrieved, since many datasets contain large numbers of documents.

words

Takes a comma-separated list of words or '*' for all words. This key allows word counts to be retrieved, but it is used in the context of a topic or document (not across the entire corpus).

dataset_attr

Takes a comma separated list of attributes. The allowed attributes include metadata, metrics, document_count, and analysis_count.

  • metadata: return a dictionary of key:value pairs
  • metrics: return a dictionary of key:value pairs

analysis_attr

Takes a comma separated list of attributes. The allowed attributes include metadata, metrics, and topic_count.

  • See dataset_attr for metadata and metrics.
  • topic_count: return an integer greater than or equal to zero

topic_attr

Takes a comma separated list of attributes. The allowed attributes include metadata, metrics, names, pairwise, top_n_words, top_n_documents, word_tokens, word_token_documents_and_locations.

  • See dataset_attr for metadata and metrics.
  • names: return a dictionary of key:value pairs where the key is the naming scheme used and value is the name for the topic
  • pairwise: return a dictionary of key:value pairs where the key is the pairwise metric name and the value is an array of floats indicating the topic-topic metric value, the index position of the array is the number of the other topic
  • top_n_words: requires the query keys top_n_words and words to also be specified; return a dictionary of maximum size n, where each word is a key and the associated count of the word in the scope of the topic is the value
  • top_n_documents: requires the query key top_n_documents to also be specified; return a dictionary of maximum size n, where each document is mapped to a token count of the number of tokens the corresponding topic contains in that document
  • word_tokens: requires that the query key words is set; return a dictionary of words mapping to an array of 2-tuples; the first element in the tuple is the document, the second the index of the token in that document
  • word_token_documents_and_locations: requires that the query key documents is set; return a dictionary mapping the document names to an array of 2-tuples; the first element in the tuple is the start index as a character offset and the second is the ending index of the token
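As an illustration of the pairwise layout described above, the following Python sketch picks the other topic with the highest metric value from one pairwise array. The function name closest_topic and the sample values are made up for this example, and treating a higher value as "closer" is an assumption about the metric.

```python
def closest_topic(pairwise_values, own_topic):
    """Return the number of the other topic with the highest metric value.

    The index of each value in `pairwise_values` is the number of the
    other topic, as described for the pairwise attribute.
    """
    candidates = [
        (value, topic)
        for topic, value in enumerate(pairwise_values)
        if topic != own_topic
    ]
    if not candidates:
        raise ValueError("no other topics to compare against")
    # On ties, max() keeps the first (lowest-numbered) topic.
    _, topic = max(candidates, key=lambda pair: pair[0])
    return topic


# Hypothetical pairwise entry for topic 0; the numbers are illustrative only.
pairwise = {"Word Correlation": [1.0, 0.12, 0.57, 0.03]}
print(closest_topic(pairwise["Word Correlation"], own_topic=0))
# 2
```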

document_attr

Takes a comma separated list of attributes. The allowed attributes include metadata, metrics, text, top_n_topics, top_n_words, kwic, and word_token_topics_and_locations.

  • See dataset_attr for metadata and metrics.
  • text: return the document's raw text as given to the import system
  • top_n_topics: return a dictionary mapping topic numbers to their token counts in the current document
  • top_n_words: return a dictionary of words mapping each word to the token count
  • kwic: stands for "key word in context"; returns a mapping from the specified token_indices to a 4-tuple of the form (text_before, word, text_after, word_type), where text_before is the text before the word, word is the word as it appears in the text, text_after is the text after the word, and word_type is the word token (which may differ from word, since it could be in lowercase or a different form than the word in context).
  • word_token_topics_and_locations: requires the desired words to be specified in the words list; return a dictionary mapping each word to a 2-tuple where the first element is the topic number and the second element is the character offset (from the beginning of the document text) of the word in the document
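A kwic entry could be turned into a one-line concordance display along these lines. This is a sketch only: format_kwic is a hypothetical helper, the sample entry is invented for illustration, and the 4-tuple is assumed to arrive as a 4-element JSON array.

```python
def format_kwic(entry, width=30):
    """Format one kwic entry as 'before[word]after (word_type)'.

    `entry` is the 4-element sequence
    (text_before, word, text_after, word_type).
    """
    if width <= 0:
        raise ValueError("width must be greater than zero")
    text_before, word, text_after, word_type = entry
    # Keep at most `width` characters of context on each side.
    return f"{text_before[-width:]}[{word}]{text_after[:width]} ({word_type})"


# Invented sample entry, not taken from a real response.
entry = ["the state of the ", "Union", " is strong", "union"]
print(format_kwic(entry))
# the state of the [Union] is strong (union)
```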

topic_pairwise

Requires that the pairwise keyword is included in topic_attr. Takes a list of the desired metrics. Currently the available metrics include Word Correlation and Document Correlation.

top_n_documents

Specify an integer greater than zero.

top_n_words

Specify an integer greater than zero.

document_continue

Specify an integer greater than zero. This integer will represent the document index to start at for sequential document retrieval.

document_limit

Optionally specify an integer greater than zero but less than 1000. The default value is 1000.
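Sequential retrieval with document_continue and document_limit might be driven from the client as sketched below. Several things here are assumptions and not statements about the server: that document_continue is a zero-based offset of the first document to return, that omitting it starts from the beginning, and that the total number of documents is already known (for example from the document_count attribute). The function name page_queries is hypothetical.

```python
def page_queries(total_documents, limit=500):
    """Return one dict of paging keys per request needed to cover all documents."""
    if limit <= 0:
        raise ValueError("limit must be greater than zero")
    pages = []
    for start in range(0, total_documents, limit):
        page = {"document_limit": limit}
        # Assumption: the first page omits document_continue, since the
        # key is documented as an integer greater than zero.
        if start > 0:
            page["document_continue"] = start
        pages.append(page)
    return pages


for page in page_queries(1200, limit=500):
    print(page)
# {'document_limit': 500}
# {'document_limit': 500, 'document_continue': 500}
# {'document_limit': 500, 'document_continue': 1000}
```

Each dict would be merged into the rest of the query (datasets, analyses, documents, and so on) before the request is normalized and sent.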

document_seed

Specify any integer to seed the random sampler. This will override sequential document retrieval.

document_n_words

Specify the number of top words to retrieve. Goes with the top_n_words attribute specifier.

token_indices

Requires kwic to be included in document_attr. Takes a list of indices of tokens desired for the specified document.
