Refactor output directory structure and add TEI-to-JSON conversion #28

Open
wants to merge 1 commit into
base: main

Collaborator

@branhoff branhoff commented Mar 2, 2024

- Renamed output directory from `output_xmls` to `output_teis` for clarity and consistency.
- Updated `grobid_service.py` to save processed PDFs to the new output directory.
- Added a new module `tei_to_json.py` for converting TEI XML documents to JSON format. This module includes functions for recursive XML parsing and saving the JSON output.
- Expanded development dependencies in `dev.txt` to include `lxml`.
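
For readers unfamiliar with the approach, the recursive conversion described in the third bullet could look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: only the `tei_to_json` signature comes from the diff, while `strip_namespace`, `element_to_dict`, and the output layout (`tag`, `attributes`, `text`, `children`) are hypothetical. It uses the standard library's `xml.etree.ElementTree` so that it runs without extra dependencies; the PR itself adds `lxml`, whose `etree` module offers a largely compatible API.

```python
import json
import xml.etree.ElementTree as ET


def strip_namespace(name: str) -> str:
    """Drop a leading '{namespace}' prefix from a tag or attribute name."""
    return name.split("}", 1)[-1]


def element_to_dict(element: ET.Element) -> dict:
    """Recursively convert one XML element into a plain dict (hypothetical layout)."""
    node = {"tag": strip_namespace(element.tag)}
    if element.attrib:
        node["attributes"] = {
            strip_namespace(key): value for key, value in element.attrib.items()
        }
    text = (element.text or "").strip()
    if text:
        node["text"] = text
    # Recurse into child elements; tail text after a child is ignored in this sketch.
    children = [element_to_dict(child) for child in element]
    if children:
        node["children"] = children
    return node


def tei_to_json(input_tei_path: str, output_json_path: str) -> None:
    """Parse a TEI XML file and write its tree as JSON (signature taken from the diff)."""
    root = ET.parse(input_tei_path).getroot()
    with open(output_json_path, "w", encoding="utf-8") as handle:
        json.dump(element_to_dict(root), handle, ensure_ascii=False, indent=2)
```

A generic tree dump like this keeps every element but does not group content into sections, references, or authors, which is the kind of structure a purpose-built converter would add.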
@branhoff branhoff linked an issue Mar 2, 2024 that may be closed by this pull request
Contributor

@markgrube markgrube left a comment

The code looks fine, but I found some libraries that might do the same thing. It might be worth looking into the overall S2ORC project to see whether they have already thought of or solved issues that we haven't run into yet.

Contributor

Why write your own Grobid class when a client library already exists? https://github.com/kermitt2/grobid_client_python
You can install it from git either by adding git+https://github.com/kermitt2/grobid_client_python to your requirements file and pip installing from that file, or by running pip install git+https://github.com/kermitt2/grobid_client_python directly.

Collaborator Author

I did try this library, but it was a complete pain and seemed to be in need of an update, so I just went with my own. I'm blanking on exactly which problems I ran into, so it may be worth looking into again, but my memory is that this library was not great.

return child_dict


def tei_to_json(input_tei_path: str, output_json_path: str) -> None:
Contributor

https://github.com/allenai/s2orc-doc2json

I think this library does exactly what you are trying to do here. Can you reuse their work?

Development

Successfully merging this pull request may close these issues.

TEI/XML to JSONL conversion for enhanced Q&A generation
2 participants