A CLI application for creating embeddings for TIMDEX.
- To preview a list of available Makefile commands: `make help`
- To install with dev dependencies: `make install`
- To update dependencies: `make update`
- To run unit tests: `make test`
- To lint the repo: `make lint`
- To run the app: `my-app --help` (Note the hyphen `-` vs underscore `_` that matches the `project.scripts` in `pyproject.toml`)
SENTRY_DSN=### If set to a valid Sentry DSN, enables Sentry exception monitoring. This is not needed for local development.
WORKSPACE=### Set to `dev` for local development; this will be set to `stage` and `prod` in those environments by Terraform.
TE_MODEL_URI=# HuggingFace model URI
TE_MODEL_PATH=# Path where the model will be downloaded to and loaded from
HF_HUB_DISABLE_PROGRESS_BARS=# Boolean to disable progress bars for HuggingFace model downloads; defaults to 'true' in deployed contexts

This CLI application is designed to create embeddings for input texts. To do this, a pre-trained model must be identified and configured for use.
To this end, there is a base embedding class BaseEmbeddingModel that is designed to be extended and customized for a particular embedding model.
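
As a rough illustration of the extension pattern, the sketch below defines a stand-in base class and one subclass. Everything in it is an assumption made for illustration: the real `BaseEmbeddingModel` interface is not shown in this README, so the constructor arguments, the method names `load` and `create_embedding`, and the `ToyHashEmbeddingModel` subclass are all hypothetical.

```python
from __future__ import annotations

import hashlib
from abc import ABC, abstractmethod
from pathlib import Path


class BaseEmbeddingModel(ABC):
    """Stand-in for the project's base class; the real interface may differ."""

    def __init__(self, model_uri: str, model_path: str | Path) -> None:
        self.model_uri = model_uri
        self.model_path = Path(model_path)

    @abstractmethod
    def load(self) -> None:
        """Load the model from the local path."""

    @abstractmethod
    def create_embedding(self, text: str) -> list[float]:
        """Return an embedding vector for one input text."""


class ToyHashEmbeddingModel(BaseEmbeddingModel):
    """Hypothetical subclass that hashes text into a fixed-size vector.

    A real subclass would load pre-trained weights instead of hashing.
    """

    dimensions = 8

    def load(self) -> None:
        # Nothing to load for the toy model; just record that load() ran.
        self.loaded = True

    def create_embedding(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dimensions` bytes into the range [0, 1].
        return [byte / 255 for byte in digest[: self.dimensions]]


model = ToyHashEmbeddingModel("org/model-name", "/path/to/model")
model.load()
vector = model.create_embedding("An example record title")
print(len(vector))
```

The point of the abstract base is that the CLI can work against one interface while each model-specific subclass decides how to load weights and produce vectors.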
Once an embedding class has been created, the preferred approach is to set env vars TE_MODEL_URI and TE_MODEL_PATH directly in the Dockerfile to a) download a local snapshot of the model during image build, and b) set this model as the default for the CLI.
This allows the CLI to be invoked without specifying a model URI or local path, so that this model serves as the default, e.g.:
uv run --env-file .env embeddings test-model-load

For local development, all CLI commands should be invoked with the following format to pick up environment variables from .env:

uv run --env-file .env embeddings <COMMAND> <ARGS>

Usage: embeddings ping [OPTIONS]

  Emit 'pong' to debug logs and stdout.

Usage: embeddings download-model [OPTIONS]

  Download a model from HuggingFace and save locally.

Options:
  --model-uri TEXT   HuggingFace model URI (e.g., 'org/model-name')
                     [required]
  --model-path PATH  Path where the model will be downloaded to and loaded
                     from, e.g. '/path/to/model'.  [required]
  --help             Show this message and exit.

Usage: embeddings test-model-load [OPTIONS]

  Test loading of embedding class and local model based on env vars.

  In a deployed context, the following env vars are expected:
    - TE_MODEL_URI
    - TE_MODEL_PATH

  With these set, the embedding class should be registered successfully and
  initialized, and the model loaded from a local copy.

  This CLI command is NOT used during normal workflows. This is used primarily
  during development and after model downloading/loading changes to ensure the
  model loads correctly.

Options:
  --model-uri TEXT   HuggingFace model URI (e.g., 'org/model-name')
                     [required]
  --model-path PATH  Path where the model will be downloaded to and loaded
                     from, e.g. '/path/to/model'.  [required]
  --help             Show this message and exit.
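
The help text above marks `--model-uri` and `--model-path` as required, yet the command can be run without them once `TE_MODEL_URI` and `TE_MODEL_PATH` are set. One way such a fallback can work is sketched below with the standard library's `argparse`. This is an assumption about the general pattern, not the project's actual implementation; the `build_parser` helper and the `embeddings-sketch` program name are hypothetical.

```python
import argparse
import os
from typing import Mapping


def build_parser(env: Mapping[str, str] = os.environ) -> argparse.ArgumentParser:
    """Build a parser whose options default to env vars when they are set."""
    parser = argparse.ArgumentParser(prog="embeddings-sketch")
    for flag, env_var in (
        ("--model-uri", "TE_MODEL_URI"),
        ("--model-path", "TE_MODEL_PATH"),
    ):
        value = env.get(env_var)
        # The flag is required only when the env var supplies no value.
        parser.add_argument(flag, default=value, required=value is None)
    return parser


# Simulate a deployed context where the Dockerfile set both env vars.
deployed_env = {"TE_MODEL_URI": "org/model-name", "TE_MODEL_PATH": "/path/to/model"}
args = build_parser(deployed_env).parse_args([])
print(args.model_uri, args.model_path)

# An explicit flag still overrides the env var.
override = build_parser(deployed_env).parse_args(["--model-uri", "org/other-model"])
print(override.model_uri)
```

With neither the env var nor the flag present, the parser in this sketch exits with a "required" error, which mirrors the `[required]` markers in the help output.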

Usage: embeddings create-embeddings [OPTIONS]

  Create embeddings for TIMDEX records.

Options:
  --model-uri TEXT             HuggingFace model URI (e.g., 'org/model-name')
                               [required]
  --model-path PATH            Path where the model will be downloaded to and
                               loaded from, e.g. '/path/to/model'.  [required]
  -d, --dataset-location PATH  TIMDEX dataset location, e.g.
                               's3://timdex/dataset', to read records from.
                               [required]
  --run-id TEXT                TIMDEX ETL run id.  [required]
  --run-record-offset INTEGER  TIMDEX ETL run record offset to start from,
                               default = 0.  [required]
  --record-limit INTEGER       Limit number of records after
                               --run-record-offset, default = None
                               (unlimited).  [required]
  --strategy [full_record]     Pre-embedding record transformation strategy.
                               Repeatable to apply multiple strategies.
                               [required]
  --output-jsonl TEXT          Optionally write embeddings to local JSONLines
                               file (primarily for testing).
  --help                       Show this message and exit.
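
To make the interaction of `--run-record-offset`, `--record-limit`, and `--output-jsonl` concrete, here is a minimal sketch of the windowing and JSONLines semantics the help text describes. It is not the project's code: the `select_records` and `write_jsonl` helpers and the record shape (a dict with an `id` key) are hypothetical, and real TIMDEX records carry a richer schema.

```python
import io
import itertools
import json
from typing import Iterable, Iterator, Optional, TextIO


def select_records(
    records: Iterable[dict],
    run_record_offset: int = 0,
    record_limit: Optional[int] = None,
) -> Iterator[dict]:
    """Skip `run_record_offset` records, then yield at most `record_limit`."""
    stop = None if record_limit is None else run_record_offset + record_limit
    return itertools.islice(records, run_record_offset, stop)


def write_jsonl(rows: Iterable[dict], handle: TextIO) -> int:
    """Write one JSON object per line; return the number of lines written."""
    count = 0
    for row in rows:
        handle.write(json.dumps(row) + "\n")
        count += 1
    return count


# Hypothetical records standing in for a TIMDEX ETL run.
records = [{"id": f"rec-{i}"} for i in range(10)]
selected = list(select_records(records, run_record_offset=3, record_limit=4))

# An in-memory buffer stands in for the local JSONLines file.
buffer = io.StringIO()
written = write_jsonl(selected, buffer)
print(written, [row["id"] for row in selected])
```

Leaving `record_limit` as `None` reads everything after the offset, matching the "unlimited" default in the help text.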