Refactor output directory structure and add TEI-to-JSON conversion #28
- Renamed output directory from `output_xmls` to `output_teis` for clarity and consistency.
- Updated `grobid_service.py` to save processed PDFs to the new output directory.
- Added a new module `tei_to_json.py` for converting TEI XML documents to JSON format. This module includes functions for recursive XML parsing and saving the JSON output.
- Expanded development dependencies in `dev.txt` to include `lxml`.
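For reviewers who want a picture of the recursive conversion described above, here is a minimal sketch. It is not the code in this PR: it uses the standard library `xml.etree.ElementTree` instead of `lxml` so it runs without extra dependencies, and the helper names `strip_namespace` and `element_to_dict`, as well as the output keys (`tag`, `attributes`, `text`, `children`), are assumptions made for illustration. Only the `tei_to_json` signature is taken from the diff.

```python
import json
import xml.etree.ElementTree as ET


def strip_namespace(name: str) -> str:
    """Drop a leading '{namespace-uri}' prefix from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def element_to_dict(element: ET.Element) -> dict:
    """Recursively convert an XML element into a plain dict.

    Hypothetical output shape: 'tag' is always present; 'attributes',
    'text' and 'children' appear only when non-empty. Tail text (text that
    follows a child's closing tag) is ignored in this sketch.
    """
    node: dict = {"tag": strip_namespace(element.tag)}
    if element.attrib:
        node["attributes"] = {
            strip_namespace(key): value for key, value in element.attrib.items()
        }
    text = (element.text or "").strip()
    if text:
        node["text"] = text
    children = [element_to_dict(child) for child in element]
    if children:
        node["children"] = children
    return node


def tei_to_json(input_tei_path: str, output_json_path: str) -> None:
    """Parse a TEI XML file and write its dict representation as JSON."""
    root = ET.parse(input_tei_path).getroot()
    with open(output_json_path, "w", encoding="utf-8") as handle:
        json.dump(element_to_dict(root), handle, indent=2, ensure_ascii=False)
```

Because tail text is dropped, mixed-content paragraphs (for example, prose interleaved with `<ref>` elements) would lose words in this sketch. Handling that correctly is one of the things an existing converter may already do.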
Code looks fine, but I found some libraries that might do the same thing. It might be worth looking at the overall S2ORC project to see whether they have already thought of or solved issues that we haven't run into yet.
Why write your own Grobid class when a client library already exists? https://github.com/kermitt2/grobid_client_python
You can install it from git by putting `git+https://github.com/kermitt2/grobid_client_python` in your requirements file and pip installing it, or you can run `pip install git+https://github.com/kermitt2/grobid_client_python`.
I did try this library. It was a complete pain and seemed to be in need of an update, so I just went with my own. I'm blanking on exactly which problems I ran into, so it may be worth looking into again, but my memory is that this library was not great.
    return child_dict


def tei_to_json(input_tei_path: str, output_json_path: str) -> None:
https://github.com/allenai/s2orc-doc2json
I think this library does exactly what you are trying to do here. Can you reuse their work?