feat!: break out CLI for more specific commands/more reasonable defaults (#344)

close #210 

Miscellaneous quality of life improvements and feature additions to CLI:

* Add `metakb` as a console command.
* Break out CLI subcommands:
    * `metakb update` to run a complete harvest/transform/load step
    * `metakb check-normalizers` and `metakb update-normalizers` to check
      and force refreshing of normalizer data. This supports a simple workflow
      like `metakb check-normalizers || metakb load-normalizers` to load if
      unavailable, rather than requiring the user to force normalizer reload
      while loading the MetaKB graph.
    * `metakb harvest` to just perform harvest of source(s)
    * `metakb transform` to just perform transform of source(s), or
      `metakb transform-file` to transform a specific harvested file
    * `metakb load-cdm` to skip harvest/transform and directly load a CDM
      file, either from local (default location), a specific file, or from S3
    * `metakb clear-graph` to wipe the graph. No other CLI command will wipe
      the graph. I thought about calling it when `update` is used without any
      source qualifiers, but it seemed a little odd to include additional
      behavior such that
      `metakb update <source> && metakb update <other source>` is different
      from `metakb update`. Also thought about including it as an option flag
      in some other commands, but at that point, you can just do
      `metakb clear-graph && <other command>`.
* Previously, we had one command that could be used to do a lot of
things. IMO it's cleaner and more intuitive (and easier to
program/maintain) to have subcommands with specific purposes instead of
one big one. It does mean you will often have to chain commands together
but that's pretty normal.
* Support selection of specific sources for commands where it makes
sense. Pass them as arguments. Otherwise default to all sources.
* Support output directory option (`--output_directory`, `-o`) where it
makes sense. Unfortunately, most of these commands produce `n` output
files, so I don't think there's a simple way to specify the name of the
output file.
* Collapse the credentials CLI option to a single `username:password`
option, since you need to provide both at once. (Not sure why Neo4j
*requires* a password).
* When updating normalizers, keep going to the next one if one of them
fails.
* In general, give precedence to explicit args/options over env vars.
Consequently, pass normalizer DB URL as an arg to normalizers rather
than injecting it via env var to the transform step, which required a
small refactor.
* Refactor a few normalizer management functions out of the CLI module.
In general, I think it's good to separate reusable functions out of CLI
modules, and just have `cli.py` act as a gateway/interface to them.
There's probably a little bit more of this that we could do but nothing
else stuck out to me.
* Various changes to make console printouts a bit prettier/more
organized.
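
Three of the behaviors listed above can be sketched in Python: a single `username:password` credentials option, explicit arguments taking precedence over environment variables, and continuing past a failed normalizer update. This is a minimal illustration under stated assumptions, not the code in this commit. The helper names `parse_credentials`, `resolve_option`, and `update_all`, and the `EXAMPLE_DB_URL` variable name, are hypothetical.

```python
import os
from typing import Callable, Dict, Mapping, Optional, Tuple


def parse_credentials(raw: str) -> Tuple[str, str]:
    """Split a ``username:password`` string on the first colon.

    Hypothetical helper: splitting on the first colon only means the
    password itself may contain colons.
    """
    username, sep, password = raw.partition(":")
    if not sep or not username or not password:
        raise ValueError("credentials must look like 'username:password'")
    return username, password


def resolve_option(
    explicit: Optional[str],
    env_name: str,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer an explicitly passed value; fall back to the environment.

    Hypothetical helper; ``env_name`` is whatever variable the caller uses.
    """
    if explicit is not None:
        return explicit
    return env.get(env_name)


def update_all(updaters: Mapping[str, Callable[[], None]]) -> Dict[str, bool]:
    """Run every updater, recording failures instead of stopping at the first."""
    results: Dict[str, bool] = {}
    for name, update in updaters.items():
        try:
            update()
        except Exception as exc:  # broad on purpose: one failure must not block the rest
            print(f"{name}: update failed ({exc})")
            results[name] = False
        else:
            results[name] = True
    return results
```

Passing the environment as a parameter to `resolve_option` keeps the precedence rule easy to test without touching the real process environment. The dictionary returned by `update_all` lets a caller report which updates failed once all of them have been attempted.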
jsstevenson authored May 21, 2024
1 parent 6f5093f commit 5ce50c3
Showing 8 changed files with 747 additions and 268 deletions.
13 changes: 6 additions & 7 deletions README.md
@@ -10,7 +10,7 @@ The intent of the project is to leverage the collective knowledge of the dispara
### Prerequisites

* A newer version of Python 3, preferably 3.8 or greater. To confirm on your system, run:
* A newer version of Python 3, preferably 3.10 or greater. To confirm on your system, run:

```
python3 --version
@@ -114,16 +114,15 @@ MetaKB relies on environment variables to set in order to work.

### Loading data

Once Neo4j and DynamoDB instances are both running, and necessary normalizer data has been placed, run the MetaKB CLI with the `--initialize_normalizers` flag to acquire all other necessary normalizer source data, and execute harvest, transform, and load operations into the graph datastore.
Once all service and data dependencies are available, clear the graph, load normalizer data, and initiate harvest, transform, and data loading operations:

In the MetaKB project root, run the following:

```sh
```shell
pipenv shell
python3 -m metakb.cli --db_url=bolt://localhost:7687 --db_username=neo4j --db_password=<neo4j-password-here> --load_normalizers_db
metakb load-normalizers
metakb update --refresh_source_caches
```

For more information on the different CLI arguments, see the [CLI README](docs/cli/README.md).
The `--help` flag can be provided to any CLI command to bring up additional documentation.

### Starting the server

30 changes: 0 additions & 30 deletions docs/cli/README.md

This file was deleted.

1 change: 1 addition & 0 deletions pyproject.toml
@@ -54,6 +54,7 @@ Source = "https://github.com/cancervariants/metakb"
"Bug Tracker" = "https://github.com/cancervariants/metakb/issues"

[project.scripts]
metakb = "metakb.cli:cli"

[build-system]
requires = ["setuptools>=64"]
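
The `[project.scripts]` entry above is what makes `metakb` available as a console command: on install, a `metakb` executable is generated that imports `metakb.cli` and calls its `cli` object with no arguments. As a rough sketch of how subcommands with optional source arguments and an `--output_directory`/`-o` option could hang off one such entry point, the following uses the standard library's `argparse` as a stand-in. The framework actually used in `metakb/cli.py`, the `build_parser` and `parse_command` helpers, the placeholder source names, and the choice of which subcommands accept `-o` are all assumptions.

```python
import argparse
import sys
from typing import List, Optional, Sequence, Tuple

# Placeholder source names, for illustration only.
ALL_SOURCES = ("source_a", "source_b")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one subparser per pipeline step."""
    parser = argparse.ArgumentParser(prog="metakb")
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name in ("update", "harvest", "transform"):
        sub = subcommands.add_parser(name)
        # Zero or more sources; an empty list is treated as "all sources".
        sub.add_argument("sources", nargs="*")
        sub.add_argument("--output_directory", "-o", default=None)
    return parser


def parse_command(argv: Sequence[str]) -> Tuple[str, List[str], Optional[str]]:
    """Return (subcommand, selected sources, output directory or None)."""
    args = build_parser().parse_args(argv)
    sources = list(args.sources) or list(ALL_SOURCES)
    return args.command, sources, args.output_directory


def cli() -> None:
    """Entry point shape a console script expects: callable with no arguments."""
    command, sources, output_directory = parse_command(sys.argv[1:])
    target = output_directory or "the default location"
    print(f"{command}: {', '.join(sources)} -> {target}")
```

With this layout, `parse_command(["harvest", "source_a", "-o", "out"])` selects a single source and an explicit output directory, while `parse_command(["update"])` falls back to every source and no directory override.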