feat!: break out CLI for more specific commands/more reasonable defaults (#344)

close #210

Miscellaneous quality of life improvements and feature additions to CLI:

* Add `metakb` as a console command.
* Break out CLI subcommands:
  * `metakb update` to run a complete harvest/transform/load step
  * `metakb check-normalizers` and `metakb update-normalizers` to check and force refreshing of normalizer data. This supports a simple workflow like `metakb check-normalizers || metakb load-normalizers` to load if unavailable, rather than requiring the user to force normalizer reload while loading the MetaKB graph.
  * `metakb harvest` to just perform harvest of source(s)
  * `metakb transform` to just perform transform of source(s), or `metakb transform-file` to transform a specific harvested file
  * `metakb load-cdm` to skip harvest/transform and directly load a CDM file, either from local (default location), a specific file, or from S3
  * `metakb clear-graph` to wipe the graph. No other CLI command will wipe the graph. I thought about calling it when `update` is used without any source qualifiers, but it seemed a little odd to include additional behavior such that `metakb update <source> && metakb update <other source>` is different from `metakb update`. Also thought about including it as an option flag in some other commands, but at that point, you can just do `metakb clear-graph && <other command>`.
  * Previously, we had one command that could be used to do a lot of things. IMO it's cleaner and more intuitive (and easier to program/maintain) to have subcommands with specific purposes instead of one big one. It does mean you will often have to chain commands together but that's pretty normal.
* Support selection of specific sources for commands where it makes sense. Pass them as arguments. Otherwise default to all sources.
* Support output directory option (`--output_directory`, `-o`) where it makes sense. Unfortunately, most of these commands all produce `n` output files so I don't think there's a simple way to specify the name of the output file.
* Collapse the credentials CLI option to a single `username:password` option, since you need to provide both at once. (Not sure why Neo4j *requires* a password).
* When updating normalizers, keep going to the next one if one of them fails.
* In general, give precedence to explicit args/options over env vars. Consequently, pass normalizer DB URL as an arg to normalizers rather than injecting it via env var to the transform step, which required a small refactor.
* Refactor a few normalizer management functions out of the CLI module. In general, I think it's good to separate reusable functions out of CLI modules, and just have `cli.py` act as gateways/interfaces to them. There's probably a little bit more of this that we could do but nothing else stuck out to me.
* Various changes to make console printouts a bit prettier/more organized.
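
For illustration only (this is not code from the commit), here is a minimal Python sketch of three behaviors described in the list above: splitting a single `username:password` value, preferring an explicit argument over an environment variable, and continuing past a failed normalizer update. Every name here (`split_credentials`, `resolve_setting`, `update_all`, and the environment variable name in the usage note) is hypothetical and does not reflect the actual `metakb` API.

```python
from __future__ import annotations

import os
from typing import Callable


def split_credentials(value: str) -> tuple[str, str]:
    """Split a ``username:password`` string on the first colon.

    Assumption: the username contains no colon, so any further colons
    belong to the password.
    """
    username, sep, password = value.partition(":")
    if not sep or not username or not password:
        raise ValueError("expected credentials in the form username:password")
    return username, password


def resolve_setting(
    explicit: str | None, env_var: str, default: str | None = None
) -> str | None:
    """Return the explicit value if given, else the env var, else the default."""
    if explicit is not None:
        return explicit
    return os.environ.get(env_var, default)


def update_all(updaters: dict[str, Callable[[], None]]) -> list[str]:
    """Run every updater, recording failures instead of stopping at the first."""
    failed: list[str] = []
    for name, update in updaters.items():
        try:
            update()
        except Exception as exc:  # deliberately broad: one failure must not halt the rest
            print(f"{name}: update failed ({exc})")
            failed.append(name)
    return failed
```

Under these assumptions, `resolve_setting(None, "HYPOTHETICAL_DB_URL")` falls back to the environment, while any non-`None` first argument wins regardless of what the environment holds. The caller of `update_all` can use the returned list of names to decide whether to exit with a nonzero status.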
cancervariants · May 21, 2024 · 5ce50c3
1 parent 6f5093f
commit 5ce50c3
Showing 8 changed files with 747 additions and 268 deletions.
diff --git a/README.md b/README.md
@@ -10,7 +10,7 @@ The intent of the project is to leverage the collective knowledge of the dispara
 
 ### Prerequisites
 
-* A newer version of Python 3, preferably 3.8 or greater. To confirm on your system, run:
+* A newer version of Python 3, preferably 3.10 or greater. To confirm on your system, run:
 
 ```
 python3 --version
@@ -114,16 +114,15 @@ MetaKB relies on environment variables to set in order to work.
 
 ### Loading data
 
-Once Neo4j and DynamoDB instances are both running, and necessary normalizer data has been placed, run the MetaKB CLI with the `--initialize_normalizers` flag to acquire all other necessary normalizer source data, and execute harvest, transform, and load operations into the graph datastore.
+Once all service and data dependencies are available, clear the graph, load normalizer data, and initiate harvest, transform, and data loading operations:
 
-In the MetaKB project root, run the following:
-
-```sh
+```shell
 pipenv shell
-python3 -m metakb.cli --db_url=bolt://localhost:7687 --db_username=neo4j --db_password=<neo4j-password-here> --load_normalizers_db
+metakb metakb load-normalizers
+metakb update --refresh_source_caches
 ```
 
-For more information on the different CLI arguments, see the [CLI README](docs/cli/README.md).
+The `--help` flag can be provided to any CLI command to bring up additional documentation.
 
 ### Starting the server
 

diff --git a/docs/cli/README.md b/docs/cli/README.md
diff --git a/pyproject.toml b/pyproject.toml
@@ -54,6 +54,7 @@ Source = "https://github.com/cancervariants/metakb"
 "Bug Tracker" = "https://github.com/cancervariants/metakb/issues"
 
 [project.scripts]
+metakb = "metakb.cli:cli"
 
 [build-system]
 requires = ["setuptools>=64"]