feat!: break out CLI for more specific commands/more reasonable defaults (#344)

close #210

Miscellaneous quality of life improvements and feature additions to CLI:

* Add `metakb` as a console command.
* Break out CLI subcommands:
  * `metakb update` to run a complete harvest/transform/load step
  * `metakb check-normalizers` and `metakb update-normalizers` to check and force refreshing of normalizer data. This supports a simple workflow like `metakb check-normalizers || metakb update-normalizers` to load if unavailable, rather than requiring the user to force a normalizer reload while loading the MetaKB graph.
  * `metakb harvest` to just perform harvest of source(s)
  * `metakb transform` to just perform transform of source(s), or `metakb transform-file` to transform a specific harvested file
  * `metakb load-cdm` to skip harvest/transform and directly load a CDM file, either from local (default location), a specific file, or from S3
  * `metakb clear-graph` to wipe the graph. No other CLI command will wipe the graph. I thought about calling it when `update` is used without any source qualifiers, but it seemed a little odd to include additional behavior such that `metakb update <source> && metakb update <other source>` is different from `metakb update`. I also thought about including it as an option flag in some other commands, but at that point, you can just do `metakb clear-graph && <other command>`.
  * Previously, we had one command that could be used to do a lot of things. IMO it's cleaner and more intuitive (and easier to program/maintain) to have subcommands with specific purposes instead of one big one. It does mean you will often have to chain commands together, but that's pretty normal.
* Support selection of specific sources for commands where it makes sense. Pass them as arguments. Otherwise, default to all sources.
* Support an output directory option (`--output_directory`, `-o`) where it makes sense. Unfortunately, most of these commands produce `n` output files, so I don't think there's a simple way to specify the name of the output file.
* Collapse the credentials CLI option to a single `username:password` option, since you need to provide both at once. (Not sure why Neo4j *requires* a password.)
* When updating normalizers, keep going to the next one if one of them fails.
* In general, give precedence to explicit args/options over env vars. Consequently, pass the normalizer DB URL as an arg to normalizers rather than injecting it via env var to the transform step, which required a small refactor.
* Refactor a few normalizer management functions out of the CLI module. In general, I think it's good to separate reusable functions out of CLI modules, and just have `cli.py` act as a gateway/interface to them. There's probably a little bit more of this that we could do, but nothing else stuck out to me.
* Various changes to make console printouts a bit prettier/more organized.
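
For illustration only, here is a minimal sketch of three of the behaviors described above: splitting a combined `username:password` credential, preferring an explicit argument over an env var, and continuing past a failed normalizer update. Every name in it (`parse_credentials`, `resolve_db_url`, `update_all`, and the `EXAMPLE_NORMALIZER_DB_URL` variable) is hypothetical and not taken from the MetaKB code; the actual implementation may look quite different.

```python
import os
from typing import Callable, Dict, List, Optional, Tuple


def parse_credentials(value: str) -> Tuple[str, str]:
    """Split a combined ``username:password`` string (hypothetical helper).

    Only the first colon separates the two parts, so a password may
    itself contain colons.
    """
    username, sep, password = value.partition(":")
    if not sep or not username or not password:
        raise ValueError("credentials must look like 'username:password'")
    return username, password


def resolve_db_url(
    explicit: Optional[str],
    env_var: str = "EXAMPLE_NORMALIZER_DB_URL",  # assumed name, illustration only
) -> Optional[str]:
    """Prefer an explicitly passed URL; fall back to the env var if unset."""
    if explicit:
        return explicit
    return os.environ.get(env_var)


def update_all(updaters: Dict[str, Callable[[], None]]) -> List[str]:
    """Run every updater, recording failures rather than stopping early."""
    failed = []
    for name, update in updaters.items():
        try:
            update()
        except Exception as exc:  # keep going to the next normalizer
            print(f"Failed to update {name}: {exc}")
            failed.append(name)
    return failed
```

Keeping helpers like these outside the CLI module is in the same spirit as the refactor described above: the CLI layer only parses arguments and hands them to reusable functions.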