Use Amazon CloudSearch to power archive search.
python cloudsearch-process-and-upload.py
How to set up the contents of this directory
- Run `sh setup.sh` in the `cloudsearch/` directory to clone the archives text.
- Make sure you have AWS credentials set up
- Create conda environment
conda create -n cloudsearch_env python=3.7.4
- Activate conda environment
conda activate cloudsearch_env
- Install packages
pip install -r requirements.txt
Never check AWS credentials into git. Instead:
- Install AWS CLI
- Run `aws configure` in a shell. Set your `AWS Access Key ID` and `AWS Secret Access Key`, set `Default region name` to `us-east-1`, and set `Default output format` to `text`
  - You can always run `aws configure` again to update these values
  - Note: your data will be saved to the `~/.aws` directory.
A test file to get myself acquainted with the CloudSearch SDK. This file shows how to use the SDK to
- create a domain
- get domain status
- configure and show access policies for domain
- configure and show index fields in domain
- force domain to index documents
- define a suggester
Note: you can do many other things with the CloudSearch SDK, like updating scaling parameters. Check out the boto3 docs (linked in resources).
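The steps listed above can be sketched with the boto3 `cloudsearch` client roughly as follows. This is a minimal sketch, not the contents of the test file: the function names, the example domain/field/suggester names, and the mapping from `describe_domains` flags to a status label are all assumptions made for illustration.

```python
import json


def make_client(region="us-east-1"):
    # boto3 is imported lazily so the helpers below can be tried without AWS access.
    import boto3
    return boto3.client("cloudsearch", region_name=region)


def describe_status(status):
    """Map one DomainStatusList entry to a coarse label (assumed mapping)."""
    if status.get("Deleted"):
        return "BEING DELETED"
    if status.get("Processing"):
        return "PROCESSING"
    if status.get("RequiresIndexDocuments"):
        return "NEEDS INDEXING"
    return "ACTIVE"


def configure_domain(client, domain_name, policy, text_field, suggester_name):
    """Create a domain, set its access policy, add one text field and a suggester."""
    client.create_domain(DomainName=domain_name)
    # Access policies are passed as a JSON string.
    client.update_service_access_policies(
        DomainName=domain_name, AccessPolicies=json.dumps(policy))
    client.define_index_field(
        DomainName=domain_name,
        IndexField={"IndexFieldName": text_field, "IndexFieldType": "text"})
    client.define_suggester(
        DomainName=domain_name,
        Suggester={
            "SuggesterName": suggester_name,
            "DocumentSuggesterOptions": {
                "SourceField": text_field,
                "FuzzyMatching": "low",
            },
        })
    # Configuration changes only take effect after (re)indexing.
    client.index_documents(DomainName=domain_name)
    domains = client.describe_domains(DomainNames=[domain_name])
    return describe_status(domains["DomainStatusList"][0])
```

With real credentials this would be called as `configure_domain(make_client(), "archive-search", policy, "body", "body_suggester")`, where every argument value is a placeholder. `describe_service_access_policies` and `describe_index_fields` can be used in the same way to show what is currently configured.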
A test file to get myself acquainted with the CloudSearchDomain SDK
Processes the .txt files in archives-text into CloudSearch-readable JSON.
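As a rough illustration of what "CloudSearch-readable JSON" means: a document batch is a JSON array of operations, each with a `type`, an `id`, and (for adds) a `fields` object. The sketch below is not the actual processing script; the `title` and `body` field names, the ID sanitization rule, and the helper names are assumptions.

```python
import json
import re
from pathlib import Path


def text_to_document(doc_id, title, text):
    """Build one 'add' operation for a CloudSearch document batch."""
    return {
        "type": "add",
        "id": doc_id,
        # Field names are hypothetical; they must match the domain's index fields.
        "fields": {"title": title, "body": " ".join(text.split())},
    }


def build_batch(directory):
    """Convert every .txt file in a directory into one JSON batch string."""
    operations = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        # Replace characters outside a conservative ID alphabet with underscores.
        doc_id = re.sub(r"[^A-Za-z0-9_\-]", "_", path.stem)[:128]
        operations.append(text_to_document(doc_id, path.stem, text))
    return json.dumps(operations)
```

A real batch also has to respect the service's size limits per batch and per document, so a large archive would need to be split across several batches.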
Script to find possible author titles.
An object-oriented program which does the same thing as process-archives-text.py, except neater. Designed to be run in parallel (one object/process per year, or year range). Check out the docs in the file for a little more detail.
Gives the schema of columns in CloudSearch.
- make estimates on size / costs of everything -- figure out optimal scaling option (e.g. search.m1.small)
- Seems like Amazon does this automatically for us?
- CloudSearch vs CloudSearch 2?
- Use CloudSearch2. CloudSearch is the older version (I think pre 2013/4?)
- CloudSearch vs CloudSearchDomain
- The CloudSearchDomain client is used to submit search requests and document requests.
- CloudSearch allows you to create and modify domains and to define details of those domains, like index fields (i.e. searchable fields), analysis schemes, and custom result suggesters. This is more administrative stuff / metadata. CloudSearch is basically the API for the CloudSearch dashboard sidebar, under 'configure domain'.
- CloudSearchDomain is used to upload documents, search for documents, suggest documents. Used for interacting with the actual data.
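To make the split between the two clients concrete, here is a hedged sketch: the `cloudsearch` client is asked for the domain's endpoints, and a separate `cloudsearchdomain` client, pointed at one of those endpoints, uploads and queries documents. Helper names and the example values are assumptions; the boto3 calls used are `describe_domains`, `upload_documents`, `search`, and `suggest`.

```python
def domain_endpoints(admin_client, domain_name):
    """Use the administrative client to look up the document and search endpoints."""
    status = admin_client.describe_domains(DomainNames=[domain_name])["DomainStatusList"][0]
    return status["DocService"]["Endpoint"], status["SearchService"]["Endpoint"]


def make_domain_client(endpoint, region="us-east-1"):
    # Imported lazily so the helpers below can be tried without AWS access.
    import boto3
    return boto3.client(
        "cloudsearchdomain", endpoint_url="https://" + endpoint, region_name=region)


def upload_batch(domain_client, batch_json):
    """Send a JSON document batch (a string) to the domain's document endpoint."""
    return domain_client.upload_documents(
        documents=batch_json.encode("utf-8"), contentType="application/json")


def search_ids(domain_client, text, size=10):
    """Return the IDs of the first `size` hits for a simple text query."""
    response = domain_client.search(query=text, queryParser="simple", size=size)
    return [hit["id"] for hit in response["hits"]["hit"]]


def suggestions(domain_client, prefix, suggester_name, size=5):
    """Return autocomplete suggestions from a suggester defined on the domain."""
    response = domain_client.suggest(query=prefix, suggester=suggester_name, size=size)
    return [item["suggestion"] for item in response["suggest"]["suggestions"]]
```

In this sketch uploads would go through a client built from the document endpoint, and searches and suggestions through one built from the search endpoint.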
- Expression, more - Expressions can be evaluated dynamically at search time, for sorting search results or returning computed information about search results.
- Analysis Scheme - allows custom text field analysis, to customize search results on text. Probably don't need to use this--default analysis scheme seems legit enough.
- Paginators - a layer of abstraction for pagination. Pagination is the process of sending subsequent requests that continue where a previous request left off, because a single response only returns part of a large result set (e.g. the first request returns the first page of search hits, and each following request fetches the next page).
- Domain Statuses & Meanings - Note: you can still upload documents while domain status is PROCESSING, NEEDS INDEXING or ACTIVE.
- Index Field - index fields contain the data which can be searched/returned. Like the schema of the data.
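For search results specifically, paging can also be done by hand with the `cursor` parameter of the CloudSearchDomain `search` call: the first request passes `cursor="initial"`, and each later request passes the cursor returned in the previous response. The generator below is a minimal sketch under that assumption; `iter_hits` and its parameters are hypothetical names, and it assumes an empty page marks the end of the results.

```python
def iter_hits(domain_client, text, page_size=100):
    """Yield every hit for a query, fetching one page per request."""
    cursor = "initial"
    while True:
        response = domain_client.search(
            query=text, queryParser="simple", cursor=cursor, size=page_size)
        hits = response["hits"]["hit"]
        if not hits:
            # Assumed stop condition: an empty page means nothing is left.
            return
        yield from hits
        # Continue from where this page ended.
        cursor = response["hits"]["cursor"]
```

Note that `cursor` and `start` are mutually exclusive in a search request; `start` with `size` is the simpler option for small result sets.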
- 5 min introductory video from Amazon
- 45 min introductory video from Amazon - talks about data types, queries, highlighting, autocomplete etc.
- Good starting point - go down each of the topics in the left sidebar. One notable topic in the sidebar is the API reference.
- CloudSearch Developer Resources - note the Python doc is out of date. Use boto3.
- Python CloudSearch SDK/API Ref
- Python CloudSearchDomain SDK/API Ref
- Some more doc
- Outdated guide, but still useful conceptually