From 4e7fba4c3c1a40d884ae5840c78bd24852e82aa2 Mon Sep 17 00:00:00 2001 From: Luca Foppiano Date: Wed, 23 Oct 2024 18:02:18 +0200 Subject: [PATCH] update documentation --- doc/Deep-Learning-models.md | 6 +++--- doc/Grobid-docker.md | 4 ++-- doc/Grobid-service.md | 30 +++++++++++++++--------------- doc/Principles.md | 6 +++--- doc/css/custom.css | 3 +++ doc/training/General-principles.md | 4 ++-- doc/training/fulltext.md | 2 +- doc/training/header.md | 2 +- doc/training/segmentation.md | 4 ++-- mkdocs.yml | 4 ++-- 10 files changed, 34 insertions(+), 31 deletions(-) create mode 100644 doc/css/custom.css diff --git a/doc/Deep-Learning-models.md b/doc/Deep-Learning-models.md index c3db143110..886597af46 100644 --- a/doc/Deep-Learning-models.md +++ b/doc/Deep-Learning-models.md @@ -18,7 +18,7 @@ Current neural models can be up to 50 times slower than CRF, depending on the ar ## Recommended Deep Learning models -By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](https://grobid.readthedocs.io/en/latest/Configuration/#configuring-the-models) for more details on how to select these models. The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. Note that the full GROBID Docker image is already configured to use Deep Learning models for bibliographical reference and affiliation-address parsing. +By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](Configuration.md#configuring-the-models) for more details on how to select these models. 
The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. Note that the full GROBID Docker image is already configured to use Deep Learning models for bibliographical reference and affiliation-address parsing. For current GROBID version 0.8.1, we recommend considering the usage of the following Deep Learning models: @@ -46,7 +46,7 @@ However, if you need a "local" library installation and build, prepare a lot of #### Classic python and Virtualenv -0. Install GROBID as indicated [here](https://grobid.readthedocs.io/en/latest/Install-Grobid/). +0. Install GROBID as indicated [here](Install-Grobid.md). The following was tested with Java version up to 17. @@ -130,7 +130,7 @@ INFO [2020-10-30 23:04:07,756] org.grobid.core.jni.DeLFTModel: Loading DeLFT mo INFO [2020-10-30 23:04:07,758] org.grobid.core.jni.JEPThreadPool: Creating JEP instance for thread 44 ``` -It is then possible to [benchmark end-to-end](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) the selected Deep Learning models as any usual GROBID benchmarking exercise. In practice, the CRF models should be mixed with Deep Learning models to keep the process reasonably fast and memory-hungry. In addition, note that, currently, due to the limited amount of training data, Deep Learning models perform significantly better than CRF only for a few models (`citation`, `affiliation-address`, `reference-segmenter`). This should of course certainly change in the future! +It is then possible to [benchmark end-to-end](End-to-end-evaluation.md) the selected Deep Learning models as any usual GROBID benchmarking exercise. In practice, the CRF models should be mixed with Deep Learning models to keep the process reasonably fast without being too memory-hungry.
In addition, note that, currently, due to the limited amount of training data, Deep Learning models perform significantly better than CRF only for a few models (`citation`, `affiliation-address`, `reference-segmenter`). This will certainly change in the future! #### Anaconda diff --git a/doc/Grobid-docker.md b/doc/Grobid-docker.md index 7b447a0cec..8771974a67 100644 --- a/doc/Grobid-docker.md +++ b/doc/Grobid-docker.md @@ -57,7 +57,7 @@ Access the service: - open the browser at the address `http://localhost:8080` - the health check will be accessible at the address `http://localhost:8081` -Grobid web services are then available as described in the [service documentation](https://grobid.readthedocs.io/en/latest/Grobid-service/). +Grobid web services are then available as described in the [service documentation](Grobid-service.md). By default, this image runs Deep Learning models for: @@ -113,7 +113,7 @@ Access the service: - open the browser at the address `http://localhost:8080` - the health check will be accessible at the address `http://localhost:8081` -Grobid web services are then available as described in the [service documentation](https://grobid.readthedocs.io/en/latest/Grobid-service/). +Grobid web services are then available as described in the [service documentation](Grobid-service.md). ## Configure using the yaml config file diff --git a/doc/Grobid-service.md b/doc/Grobid-service.md index b28e843a8d..21ced218d9 100644 --- a/doc/Grobid-service.md +++ b/doc/Grobid-service.md @@ -59,7 +59,7 @@ If required, modify the file under `grobid/grobid-home/config/grobid.yaml` for s See the [configuration page](Configuration.md) for details on how to set the different parameters of the `grobid.yaml` configuration file. Service and logging parameters are also set in this configuration file.
-If Docker is used, see [here](https://grobid.readthedocs.io/en/latest/Grobid-docker/#configure-using-the-yaml-config-file) on how to start a Grobid container with a modified configuration file. +If Docker is used, see [here](Grobid-docker.md#configure-using-the-yaml-config-file) on how to start a Grobid container with a modified configuration file. ### Model loading strategy You can choose to load all the models at the start of the service or lazily when a model is used the first time, the latter being the default. @@ -178,20 +178,20 @@ curl -v -H "Accept: application/x-bibtex" --form input=@./thefile.pdf localhost: Convert the complete input document into TEI XML format (header, body and bibliographical section). -| method | request type | response type | parameters | requirement | description | -|--- |--- |--- |--------------------------|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed | -| | | | `consolidateHeader` | optional | `consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the citation and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted). | -| | | | `consolidateCitations` | optional | `consolidateCitations` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). | -| | | | `consolidatFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). 
| -| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). | -| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). | -| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). | -| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details | -| | | | `segmentSentences` | optional | Paragraphs structures in the resulting TEI will be further segmented into sentence elements | -| | | | `generateIDs` | optional | if supplied as a string equal to `1`, it generates uniqe identifiers for each text component | -| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF) | -| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) | +| method | request type | response type | parameters | requirement | description | +|--- |--- |--- |--------------------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| POST, PUT | `multipart/form-data` | 
`application/xml` | `input` | required | PDF file to be processed | +| | | | `consolidateHeader` | optional | `consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the citation and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted). | +| | | | `consolidateCitations` | optional | `consolidateCitations` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). | +| | | | `consolidateFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). | +| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). | +| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). | +| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result).
| +| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details | +| | | | `segmentSentences` | optional | Paragraph structures in the resulting TEI will be further segmented into sentence elements | +| | | | `generateIDs` | optional | if supplied as a string equal to `1`, it generates unique identifiers for each text component | +| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, start from the first page of the PDF) | +| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) | Response status codes: diff --git a/doc/Principles.md b/doc/Principles.md index 0f42d353b6..d0626b78c3 100644 --- a/doc/Principles.md +++ b/doc/Principles.md @@ -12,7 +12,7 @@ In large scale scientific document ingestion tasks, the large majority of docume To process publisher XML, complementary to GROBID, we built [Pub2TEI](https://github.com/kermitt2/Pub2TEI), a collection of style sheets developed over 11 years able to transform a variety of publisher XML formats to the same TEI XML format as produced by GROBID. This common format, which supersedes a dozen of publisher formats and many of their flavors, can centralize further any processing across PDF and heterogeneous XML sources without information loss, and support various applications (see __Fig. 1__). Similarly, LaTeX sources (typically all available arXiv sources) can be processed with our fork of [LaTeXML](https://github.com/kermitt2/LaTeXML) to produce a TEI representation compatible with GROBID and Pub2TEI output, without information loss from LaTeXML XML.
-The rest of this page gives an overview of the main GROBID design principles. Skip it if you are not interested in the technical details. Functionalities are described in the [User Manual](https://grobid.readthedocs.io/en/latest/). Recent benchmarking are available [here](https://grobid.readthedocs.io/en/latest/Benchmarking/). +The rest of this page gives an overview of the main GROBID design principles. Skip it if you are not interested in the technical details. Functionalities are described in the [User Manual](index.md). Recent benchmarks are available [here](Benchmarking.md). ## Document parsing as a cascade of sequence labeling models @@ -79,13 +79,13 @@ GROBID does not use a vast amount of training data derived from existing publish - A lower amount of training data can keep models smaller (e.g. with CRF), faster to train and thus easier for setting hyperparameters. -In practice, the size of GROBID training data is smaller than the ones of CERMINE _(Tkaczyk et al., 2015)_ by a factor 30 to 100, and smaller than ScienceParse 2 by a factor 2500 to 10000. Still GROBID provides comparable or better accuracy scores. To help to ensure high-quality training data, we develop detailed [annotation guidelines](training/General-principles/) to remove as much as possible disagreements/inconsistencies regarding the annotation decision. The training data is reviewed regularly. We do not use double-blind annotation with reconciliation and do not compute Inter Annotator Agreement (as we should), because the average size of the annotation team is under 2 :) +In practice, the size of GROBID training data is smaller than that of CERMINE _(Tkaczyk et al., 2015)_ by a factor of 30 to 100, and smaller than that of ScienceParse 2 by a factor of 2500 to 10000. Still, GROBID provides comparable or better accuracy scores.
To help ensure high-quality training data, we develop detailed [annotation guidelines](training/General-principles.md) to remove as many annotation disagreements/inconsistencies as possible. The training data is reviewed regularly. We do not use double-blind annotation with reconciliation and do not compute Inter Annotator Agreement (as we should), because the average size of the annotation team is under 2 :) ## Evaluation As the training data is crafted for accuracy and coverage, training data is strongly biased by undersampling non-edge cases. Or to rephrase it maybe more clearly: the less "informative" training examples, which are the most common ones, are less represented in our training data. Because of this bias, our manually labeled data cannot be used for evaluation. Evaluations of GROBID models are thus done with separated and stable holdout sets from publishers, which follow more realistic distributions of document variations. -See the current evaluations with [PubMed Central holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-pmc/) (1,943 documents, 90,125 bibliographical references in 139,835 citation contexts), [bioarXiv holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-biorxiv/) (2,000 documents, 98,753 bibliographical references in 142,796 citation contexts), [eLife holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-elife/) (984 documents, 63,664 bibliographical references in 109,022 reference contexts) and [PLOS holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-plos/) (1,000 documents, 48,449 bibliographical references in 69,755 reference contexts).
+See the current evaluations with [PubMed Central holdout set](Benchmarking-pmc.md) (1,943 documents, 90,125 bibliographical references in 139,835 citation contexts), [bioRxiv holdout set](Benchmarking-biorxiv.md) (2,000 documents, 98,753 bibliographical references in 142,796 citation contexts), [eLife holdout set](Benchmarking-elife.md) (984 documents, 63,664 bibliographical references in 109,022 reference contexts) and [PLOS holdout set](Benchmarking-plos.md) (1,000 documents, 48,449 bibliographical references in 69,755 reference contexts). Our evaluation approach, however, raises two main issues: diff --git a/doc/css/custom.css b/doc/css/custom.css new file mode 100644 index 0000000000..3e7617d434 --- /dev/null +++ b/doc/css/custom.css @@ -0,0 +1,3 @@ +.wy-table-responsive table td, .wy-table-responsive table th { + white-space: inherit; +} \ No newline at end of file diff --git a/doc/training/General-principles.md b/doc/training/General-principles.md index dd40f3e3cc..837db61c68 100644 --- a/doc/training/General-principles.md +++ b/doc/training/General-principles.md @@ -8,7 +8,7 @@ This maybe of interest if the current state of the models does not correctly rec The addition of training in Grobid is __not__ done from scratch, but from pre-annotated training data generated by the existing models in Grobid. This ensures that the syntax of the new training data will be (normally) correct and that the stream of text will be easy to align with the text extracted from the PDF. This permits also to take advantage of the existing models which will annotate correctly a certain amount of text, and to focus on the corrections, thus improving the productivity of the annotator. -For generating pre-annotated training files for Grobid based on the existing models, see the instructions for running the software in batch [here](../../Training-the-models-of-Grobid/#generation-of-training-data) and [here](../../Grobid-batch/#createtraining).
+For generating pre-annotated training files for Grobid based on the existing models, see the instructions for running the software in batch [here](../Training-the-models-of-Grobid.md#generation-of-training-data) and [here](../Grobid-batch.md#createtraining). After running the batch `createTraining` on a set of PDF files using methods for creating training data, each article comes with: @@ -35,7 +35,7 @@ The exact list of generated files depends on the structures occurring in the art | `*.training.references.authors.tei.xml` | citation | for all the authors appearing in the bibliographical references of the article | -These files must be reviewed and corrected manually before being added to the training data, taking into account that exploiting any additional training data requires GROBID to re-create its models - by [retraining](../Training-the-models-of-Grobid) them. +These files must be reviewed and corrected manually before being added to the training data, taking into account that exploiting any additional training data requires GROBID to re-create its models - by [retraining](../Training-the-models-of-Grobid.md) them. ## Correcting pre-annotated files diff --git a/doc/training/fulltext.md b/doc/training/fulltext.md index 870b1d7234..5cf8b3466d 100644 --- a/doc/training/fulltext.md +++ b/doc/training/fulltext.md @@ -67,7 +67,7 @@ Paragraphs constitute the main bulk of most typical articles or publications and

``` -> Note: The `<lb/>` (line break) elements are there because they have been recognized as such in the PDF in the text flow. However the fact that they are located within or outside a tagged paragraph or section title has no impact. Just be sure NOT to modify the order of the text flow and `<lb/>` as mentionned [here](General-principles/#correcting-pre-annotated-files). +> Note: The `<lb/>` (line break) elements are there because they have been recognized as such in the PDF in the text flow. However, the fact that they are located within or outside a tagged paragraph or section title has no impact. Just be sure NOT to modify the order of the text flow and `<lb/>` as mentioned [here](General-principles.md#correcting-pre-annotated-files). Following the TEI, formulas should be on the same hierarchical level as paragraphs, and not be contained inside paragraphs: diff --git a/doc/training/header.md b/doc/training/header.md index f87e5d6c6e..6754d7856a 100644 --- a/doc/training/header.md +++ b/doc/training/header.md @@ -2,7 +2,7 @@ ## Introduction -For the following guidelines, it is expected that training data has been generated as explained [here](../Training-the-models-of-Grobid/#generation-of-training-data). +For the following guidelines, it is expected that training data has been generated as explained [here](../Training-the-models-of-Grobid.md#generation-of-training-data). In Grobid, the document "header" corresponds to the bibliographical/metadata information sections about the document. This is typically all the information at the beginning of the article (often called the "front", title, authors, publication information, affiliations, abstrac, keywords, correspondence information, submission information, etc.), before the start of the document body (e.g. typically before the introduction section), but not only. Some of these elements can be located in the footnotes of the first page (e.g.
affiliation of the authors), or at the end of the article (full list of authors, detailed affiliation and contact, how to cite, copyrights/licence and Open Access information). diff --git a/doc/training/segmentation.md b/doc/training/segmentation.md index 55053c81a7..6c174b9ff2 100644 --- a/doc/training/segmentation.md +++ b/doc/training/segmentation.md @@ -2,7 +2,7 @@ ## Introduction -For the following guidelines, it is expected that training data has been generated as explained [here](../Training-the-models-of-Grobid/#generation-of-training-data). +For the following guidelines, it is expected that training data has been generated as explained [here](../Training-the-models-of-Grobid.md#generation-of-training-data). The following TEI elements are used by the segmentation model: @@ -91,7 +91,7 @@ survival ``` -> Note: In general, whether the `<lb/>` (line break) element is inside or outside the `` or other elements is of no importance. However as indicated [here](General-principles/#correcting-pre-annotated-files), the element should not be removed and should follow the stream of text. +> Note: In general, whether the `<lb/>` (line break) element is inside or outside the `` or other elements is of no importance. However, as indicated [here](General-principles.md#correcting-pre-annotated-files), the element should not be removed and should follow the stream of text. The following screenshot shows an example where an article starts mid-page, the end of the preceding one occupying the upper first third of the page. As this content does not belong to the article in question, don't add any elements and remove any `` or `` elements that could appear in the preceding article.
diff --git a/mkdocs.yml b/mkdocs.yml index d6b9b08d51..609bec6e55 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -4,10 +4,10 @@ repo_name: GitHub theme: readthedocs site_description: Documentation for GROBID docs_dir: doc +extra_css: + - css/custom.css plugins: - search -theme: - name: readthedocs nav: - Home: 'index.md' - About:
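A note that could accompany this patch for readers of the updated `Deep-Learning-models.md`: the model selection described there is done per model in `grobid/grobid-home/config/grobid.yaml`, which can then be mounted over the default file when starting the Docker container. A minimal sketch of such a fragment, switching the `citation` model from CRF to DeLFT — the key names (`engine`, `delft`, `architecture`) are assumptions based on recent GROBID releases and should be verified against the shipped configuration file:

```yaml
# Sketch of a grobid.yaml fragment (keys assumed, verify against
# grobid-home/config/grobid.yaml of your GROBID version).
grobid:
  models:
    - name: "citation"
      # "wapiti" selects the CRF engine, "delft" the Deep Learning engine
      engine: "delft"
      delft:
        # DeLFT architecture to load for this model
        architecture: "BidLSTM_CRF_FEATURES"
```

The modified file would then be passed at container launch as described in `Grobid-docker.md` (configure using the yaml config file).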