timdex-embeddings

A CLI application for creating embeddings for TIMDEX.

Development

To preview a list of available Makefile commands: make help
To install with dev dependencies: make install
To update dependencies: make update
To run unit tests: make test
To lint the repo: make lint
To run the app: embeddings --help (the command name is defined under project.scripts in pyproject.toml)

Environment Variables

Required

SENTRY_DSN=### If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
WORKSPACE=### Set to `dev` for local development; Terraform sets this to `stage` or `prod` in those environments.

Optional

TE_MODEL_URI=# HuggingFace model URI
TE_MODEL_PATH=# Path where the model will be downloaded to and loaded from
HF_HUB_DISABLE_PROGRESS_BARS=# Boolean that disables progress bars for HuggingFace model downloads; defaults to 'true' in deployed contexts

Configuring an Embedding Model

This CLI application is designed to create embeddings for input texts. To do this, a pre-trained model must be identified and configured for use.

To this end, there is a base embedding class BaseEmbeddingModel that is designed to be extended and customized for a particular embedding model.
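As a rough illustration of the extension pattern, the sketch below defines a stand-in base class and one subclass. The real `BaseEmbeddingModel` interface is not shown in this README, so the class name `BaseEmbeddingModelSketch`, the method names (`load`, `create_embedding`), the constructor arguments, and the `ToyHashModel` subclass are all assumptions made for illustration only.

```python
import hashlib
from abc import ABC, abstractmethod


class BaseEmbeddingModelSketch(ABC):
    """Stand-in for the project's BaseEmbeddingModel (interface is assumed)."""

    def __init__(self, model_uri: str, model_path: str) -> None:
        self.model_uri = model_uri
        self.model_path = model_path

    @abstractmethod
    def load(self) -> None:
        """Load the model from the local path."""

    @abstractmethod
    def create_embedding(self, text: str) -> list[float]:
        """Return an embedding vector for a single input text."""


class ToyHashModel(BaseEmbeddingModelSketch):
    """Hypothetical subclass: derives a fixed-size vector from a SHA-256 digest."""

    DIMENSIONS = 8

    def load(self) -> None:
        # A real subclass would load model weights from self.model_path here.
        self.loaded = True

    def create_embedding(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first DIMENSIONS bytes of the digest into the range [0, 1].
        return [byte / 255 for byte in digest[: self.DIMENSIONS]]
```

A subclass for a real model would replace the hashing with calls into the model library of choice; the point here is only the shape of "one base class, one subclass per model".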

Once an embedding class has been created, the preferred approach is to set env vars TE_MODEL_URI and TE_MODEL_PATH directly in the Dockerfile to a) download a local snapshot of the model during image build, and b) set this model as the default for the CLI.

With these set, the CLI can be invoked without specifying a model URI or local path, and this model serves as the default, e.g.:

uv run --env-file .env embeddings test-model-load
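The fallback from CLI option to environment variable can be pictured with the following minimal sketch. It is not the application's actual code: the helper `resolve_model_setting` is hypothetical, and the precedence shown (explicit CLI value first, then the environment variable) is an assumption about how the defaults behave.

```python
import os
from typing import Mapping, Optional


def resolve_model_setting(
    cli_value: Optional[str],
    env_var: str,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Return the CLI value if given, else the env var value, else raise."""
    if cli_value:
        return cli_value
    env_value = env.get(env_var)
    if env_value:
        return env_value
    raise ValueError(f"Provide a CLI option or set {env_var}")


# Simulated environment, as if the Dockerfile had set these defaults.
docker_env = {"TE_MODEL_URI": "org/model-name", "TE_MODEL_PATH": "/path/to/model"}

# No CLI value given: the env var supplies the default.
print(resolve_model_setting(None, "TE_MODEL_URI", docker_env))  # org/model-name
# CLI value given: it takes precedence over the env var (assumed behavior).
print(resolve_model_setting("/tmp/other-model", "TE_MODEL_PATH", docker_env))  # /tmp/other-model
```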

CLI Commands

For local development, all CLI commands should be invoked in the following format to pick up environment variables from .env:

uv run --env-file .env embeddings <COMMAND> <ARGS>
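When the CLI needs to be driven from a script (for example in a local smoke test), the invocation format above can be assembled programmatically. The helper names `build_cli_command` and `run_cli` below are hypothetical, and the sketch assumes `uv` is on the PATH and a `.env` file exists in the working directory.

```python
import shlex
import subprocess
from typing import Sequence


def build_cli_command(
    command: str, args: Sequence[str] = (), env_file: str = ".env"
) -> list[str]:
    """Build the argv list for `uv run --env-file <env_file> embeddings <command> <args>`."""
    return ["uv", "run", "--env-file", env_file, "embeddings", command, *args]


def run_cli(command: str, args: Sequence[str] = ()) -> subprocess.CompletedProcess:
    """Run the CLI and capture its output; requires uv and the project to be installed."""
    return subprocess.run(
        build_cli_command(command, args), capture_output=True, text=True, check=False
    )


cmd = build_cli_command(
    "download-model",
    ["--model-uri", "org/model-name", "--model-path", "/path/to/model"],
)
# Print the command as a copy-pasteable shell line.
print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.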

`ping`

Usage: embeddings ping [OPTIONS]

  Emit 'pong' to debug logs and stdout.

`download-model`

Usage: embeddings download-model [OPTIONS]

  Download a model from HuggingFace and save locally.

Options:
  --model-uri TEXT   HuggingFace model URI (e.g., 'org/model-name')
                     [required]
  --model-path PATH  Path where the model will be downloaded to and loaded
                     from, e.g. '/path/to/model'.  [required]
  --help             Show this message and exit.

`test-model-load`

Usage: embeddings test-model-load [OPTIONS]

  Test loading of embedding class and local model based on env vars.

  In a deployed context, the following env vars are expected:
    - TE_MODEL_URI
    - TE_MODEL_PATH

  With these set, the embedding class should be registered successfully and
  initialized, and the model loaded from a local copy.

  This CLI command is NOT used during normal workflows.  It is used primarily
  during development and after model downloading/loading changes to ensure the
  model loads correctly.

Options:
  --model-uri TEXT   HuggingFace model URI (e.g., 'org/model-name')
                     [required]
  --model-path PATH  Path where the model will be downloaded to and loaded
                     from, e.g. '/path/to/model'.  [required]
  --help             Show this message and exit.

`create-embeddings`

Usage: embeddings create-embeddings [OPTIONS]

  Create embeddings for TIMDEX records.

Options:
  --model-uri TEXT             HuggingFace model URI (e.g., 'org/model-name')
                               [required]
  --model-path PATH            Path where the model will be downloaded to and
                               loaded from, e.g. '/path/to/model'.  [required]
  -d, --dataset-location PATH  TIMDEX dataset location, e.g.
                               's3://timdex/dataset', to read records from.
                               [required]
  --run-id TEXT                TIMDEX ETL run id.  [required]
  --run-record-offset INTEGER  TIMDEX ETL run record offset to start from,
                               default = 0.  [required]
  --record-limit INTEGER       Limit number of records after --run-record-
                               offset, default = None (unlimited).  [required]
  --strategy [full_record]     Pre-embedding record transformation strategy.
                               Repeatable to apply multiple strategies.
                               [required]
  --output-jsonl TEXT          Optionally write embeddings to local JSONLines
                               file (primarily for testing).
  --help                       Show this message and exit.
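A file written with `--output-jsonl` can be inspected with a few lines of Python. The sketch below only assumes the standard JSON Lines layout (one JSON object per line); the field names inside each record are not documented here, so the code makes no assumption about them. The helper name `read_jsonl` is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `for row in read_jsonl("embeddings.jsonl"): print(sorted(row))` lists the keys present in each record, which is a quick way to discover the actual output schema.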
