CloudSearch

Description

Use Amazon CloudSearch to power archive search.

Run

python cloudsearch-process-and-upload.py

Startup

How to set up the contents of this directory:

1. Run `sh setup.sh` in the `cloudsearch/` directory to clone the archives text.
2. Make sure your AWS credentials are set up (see AWS Credentials Setup).
3. Create a conda environment: `conda create -n cloudsearch_env python=3.7.4`
4. Activate the conda environment: `conda activate cloudsearch_env`
5. Install packages: `pip install -r requirements.txt`

AWS Credentials Setup

Never check AWS credentials into git. Instead:

1. Install the AWS CLI.
2. Run `aws configure` in a shell and set your AWS Access Key ID and AWS Secret Access Key, set the default region name to `us-east-1`, and set the default output format to `text`.
- You can always run `aws configure` again to update these values.
- Note: your settings are saved to the `~/.aws` directory.
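
As a quick sanity check before running the scripts, the sketch below looks for a profile in the credentials file that `aws configure` writes. It is only an illustration: the helper name `has_aws_profile` is hypothetical, and it assumes the standard INI layout of `~/.aws/credentials` with `aws_access_key_id` and `aws_secret_access_key` keys.

```python
import configparser
from pathlib import Path


def has_aws_profile(credentials_path, profile="default"):
    """Return True if the credentials file defines both keys for `profile`.

    Hypothetical helper for illustration; not part of this repository.
    """
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist, so a missing file
    # simply yields a parser with no sections.
    parser.read(credentials_path)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


# Default location used by the AWS CLI on Linux and macOS.
DEFAULT_CREDENTIALS_PATH = Path.home() / ".aws" / "credentials"
```

Calling `has_aws_profile(DEFAULT_CREDENTIALS_PATH)` returns `False` when `aws configure` has not been run yet. The function only checks that the keys are present; it does not validate them against AWS.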

Files

`cloudsearch-test.py`

A test file to get myself acquainted with the CloudSearch SDK. This file shows how to use the SDK to:

create a domain
get domain status
configure and show access policies for domain
configure and show index fields in domain
force domain to index documents
define a suggester

Note: you can do many other things with the CloudSearch SDK, such as updating scaling parameters. Check out the boto3 docs (linked in Resources).
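
The operations listed above map onto boto3 `cloudsearch` client calls roughly as sketched here. This is not the content of `cloudsearch-test.py`. The helper names, the suggester naming scheme, and the choice of text fields are assumptions for illustration, and the client is passed in as a parameter so the logic can be tried without AWS access.

```python
import json


def make_client(region="us-east-1"):
    """Create the configuration client (boto3 is imported lazily)."""
    import boto3
    return boto3.client("cloudsearch", region_name=region)


def configure_domain(client, domain_name, text_fields, suggester_field):
    """Create a domain, define text index fields and a suggester, then reindex."""
    client.create_domain(DomainName=domain_name)

    for name in text_fields:
        client.define_index_field(
            DomainName=domain_name,
            IndexField={
                "IndexFieldName": name,
                "IndexFieldType": "text",
                "TextOptions": {"ReturnEnabled": True, "HighlightEnabled": True},
            },
        )

    client.define_suggester(
        DomainName=domain_name,
        Suggester={
            # Hypothetical naming scheme: "<field>_suggester".
            "SuggesterName": f"{suggester_field}_suggester",
            "DocumentSuggesterOptions": {
                "SourceField": suggester_field,
                "FuzzyMatching": "low",
            },
        },
    )

    # Ask the domain to rebuild its index so the new fields are applied.
    client.index_documents(DomainName=domain_name)


def set_access_policy(client, domain_name, policy):
    """Apply an IAM-style policy document (a dict supplied by the caller)."""
    client.update_service_access_policies(
        DomainName=domain_name, AccessPolicies=json.dumps(policy)
    )


def domain_flags(client, domain_name):
    """Return the status flags reported by describe_domains for one domain."""
    response = client.describe_domains(DomainNames=[domain_name])
    status = response["DomainStatusList"][0]
    return {
        "created": status["Created"],
        "processing": status["Processing"],
        "needs_indexing": status["RequiresIndexDocuments"],
    }
```

For example, `configure_domain(make_client(), "archives", ["title", "body"], "title")` would create a domain named `archives`. The domain and field names here are placeholders, not the schema used by this project.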

`cloudsearchdomain-test.py`

A test file to get myself acquainted with the CloudSearchDomain SDK.
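
For orientation, a minimal sketch of the two main CloudSearchDomain calls (uploading a document batch and running a search) could look like the following. It is not the content of `cloudsearchdomain-test.py`: the helper names are hypothetical, and `endpoint_url` is assumed to be the domain's own document or search endpoint, which has to be looked up per domain.

```python
import json


def make_domain_client(endpoint_url, region="us-east-1"):
    """Create a client bound to one domain endpoint (boto3 imported lazily)."""
    import boto3
    return boto3.client(
        "cloudsearchdomain", endpoint_url=endpoint_url, region_name=region
    )


def upload_batch(client, batch):
    """Upload a list of add/delete operations as one JSON document batch."""
    body = json.dumps(batch).encode("utf-8")
    return client.upload_documents(documents=body, contentType="application/json")


def search_ids(client, text, size=10):
    """Run a simple-parser search and return the IDs of the matching documents."""
    response = client.search(query=text, queryParser="simple", size=size)
    return [hit["id"] for hit in response["hits"]["hit"]]
```

Passing the client in as an argument keeps the helpers easy to exercise with a stand-in object before pointing them at a real domain.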

`process-archives-text.py`

Processes the .txt files in `archives-text` into CloudSearch-readable JSON.
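
To show what "CloudSearch-readable JSON" means, here is a small sketch that turns a directory of `.txt` files into a document batch of `add` operations. It is not the actual script: the single `body` field and the ID scheme derived from the file name are assumptions for illustration, and the real field names are the ones described in `docs/search.md`.

```python
import json
import re
from pathlib import Path


def make_document_id(path):
    """Derive a conservative document ID (lowercase letters, digits, _ and -)."""
    stem = Path(path).stem.lower()
    return re.sub(r"[^a-z0-9_-]+", "-", stem).strip("-")[:128]


def text_to_add_operation(path, text):
    """Wrap one file's text in a CloudSearch 'add' operation."""
    return {
        "type": "add",
        "id": make_document_id(path),
        # Hypothetical schema: a single text field named "body".
        "fields": {"body": text.strip()},
    }


def build_batch(directory):
    """Return a JSON document batch for every .txt file in `directory`."""
    operations = [
        text_to_add_operation(p, p.read_text(encoding="utf-8", errors="replace"))
        for p in sorted(Path(directory).glob("*.txt"))
    ]
    return json.dumps(operations)
```

A batch is a JSON array of operations, each with a `type`, an `id`, and (for `add`) a `fields` object. CloudSearch limits the size of a single batch, so a large archive would need to be split across several uploads.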

`find-author-titles-archives-text.py`

Script to find possible author titles.

`cloudsearch-process-and-upload.py`

An object-oriented program that does the same thing as `process-archives-text.py`, but more cleanly. Designed to be run in parallel (one object/process per year or year range). See the documentation in the file for a little more detail.
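
The "one object/process per year range" idea can be sketched as below. This is an illustration only, not the structure of the real script: `YearRangeJob`, `split_years`, and `run_all` are hypothetical names, and the job body is a placeholder where the processing and upload would go.

```python
from concurrent.futures import ProcessPoolExecutor


def split_years(first, last, chunk):
    """Split the inclusive range [first, last] into inclusive (start, end) chunks."""
    return [
        (start, min(start + chunk - 1, last))
        for start in range(first, last + 1, chunk)
    ]


class YearRangeJob:
    """One unit of work: process and upload the archives for a range of years."""

    def __init__(self, start, end):
        self.start = start
        self.end = end

    def run(self):
        # Placeholder: the real work (read files, build batches, upload)
        # would go here. This sketch just reports the range it covers.
        return f"{self.start}-{self.end}"


def run_job(year_range):
    """Module-level wrapper so the job can be sent to a worker process."""
    return YearRangeJob(*year_range).run()


def run_all(first, last, chunk, workers=4):
    """Run one job per year range across a pool of worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_job, split_years(first, last, chunk)))
```

On platforms that start worker processes by spawning, `run_all` should be called from under an `if __name__ == "__main__":` guard. Giving each process its own year range keeps the jobs independent, so a failed range can be rerun on its own.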

`docs/search.md`

Gives the schema of the columns in CloudSearch.

Todo

Estimate the size and cost of everything, and figure out the optimal scaling option (e.g. search.m1.small).
- It seems like Amazon does this automatically for us?

Questions & answers when found & also terms/things which I found confusing

CloudSearch vs CloudSearch 2?
- Use CloudSearch 2. CloudSearch is the older version (I think pre 2013/4?).
CloudSearch vs CloudSearchDomain
- CloudSearch lets you create and modify domains and define the details of a domain, such as index fields (i.e. searchable fields), analysis schemes, and custom result suggesters. This is the administrative/metadata side. CloudSearch is basically the API for the CloudSearch dashboard sidebar, under 'configure domain'.
- CloudSearchDomain is used to upload documents, search for documents, and get suggestions. It is the client for interacting with the actual data, i.e. for submitting search requests and document requests.
Expression - an expression that can be evaluated dynamically at search time, for sorting search results or returning computed information about search results.
Analysis Scheme - allows custom text field analysis, to customize search results on text. We probably don't need this; the default analysis scheme seems good enough.
Paginators - a layer of abstraction for pagination. Pagination is the process of making subsequent requests that continue from an initial request when the results do not fit in one response (e.g. the initial request returns the first page of search results, and each subsequent request returns the next page). Fetching data for the documents behind a list of returned IDs is a related follow-up-request pattern, though I think it is not pagination as such.
Domain Statuses & Meanings - Note: you can still upload documents while the domain status is PROCESSING, NEEDS INDEXING, or ACTIVE.
Index Field - index fields contain the data that can be searched/returned. They are like the schema of the data.
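
To make the pagination idea concrete, the sketch below walks through all hits of a search by following the cursor that the CloudSearchDomain `search` call returns. It is an illustration, not code from this repository: `iter_hits` is a hypothetical helper, the `simple` query parser is an arbitrary choice, and the loop assumes that an empty `hit` list marks the end of the results.

```python
def iter_hits(client, text, page_size=100):
    """Yield every hit for `text`, requesting one page at a time."""
    # The first request uses the special cursor value "initial"; each
    # response carries the cursor to send with the next request.
    cursor = "initial"
    while True:
        response = client.search(
            query=text, queryParser="simple", cursor=cursor, size=page_size
        )
        hits = response["hits"]
        if not hits["hit"]:
            # Assumption: an empty page means there is nothing left to fetch.
            return
        yield from hits["hit"]
        cursor = hits["cursor"]
```

Because this is a generator, a caller can stop early (for example after the first 500 hits) without requesting the remaining pages.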

Resources

5 min introductory video from Amazon
45 min introductory video from Amazon - talks about data types, queries, highlighting, autocomplete etc.
Good starting point - go down each of the topics in the left sidebar. One notable topic in the sidebar is the API reference.
CloudSearch Developer Resources - note that the Python doc is out of date. Use boto3.
Python CloudSearch SDK/API Ref
Python CloudSearchDomain SDK/API Ref
Some more doc
Outdated guide, but still useful conceptually
