From 4e7fba4c3c1a40d884ae5840c78bd24852e82aa2 Mon Sep 17 00:00:00 2001
From: Luca Foppiano
-| | | | `generateIDs` | optional | if supplied as a string equal to `1`, it generates uniqe identifiers for each text component |
-| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF) |
-| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) |
+| method | request type | response type | parameters | requirement | description |
+|--- |--- |--- |--------------------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
+| | | | `consolidateHeader` | optional | `consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the citation and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted). |
+| | | | `consolidateCitations` | optional | `consolidateCitations` is a string of value `0` (no consolidation, default value), `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). |
+| | | | `consolidateFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value), `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). |
+| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
+| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
+| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |
+| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
+| | | | `segmentSentences` | optional | Paragraph structures in the resulting TEI will be further segmented into sentence elements |
+| | | | `generateIDs` | optional | if supplied as a string equal to `1`, it generates unique identifiers for each text component |
+| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, start from the first page of the PDF) |
+| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) |
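As a minimal sketch, the request described by the table above can be assembled with only the Python standard library. The host, port, and endpoint path below are assumptions (a default local install); adjust them for your deployment, and note that the form field names are taken directly from the table:

```python
import urllib.request
import uuid

GROBID_URL = "http://localhost:8070/api/processFulltextDocument"  # assumed default local install

def encode_multipart(fields, pdf_name, pdf_bytes):
    """Build a multipart/form-data body: one part per optional form field
    from the table above, plus the required `input` part with the PDF bytes."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (f'--{boundary}\r\n'
             f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
             f'{value}\r\n').encode("utf-8")
        )
    parts.append(
        (f'--{boundary}\r\n'
         f'Content-Disposition: form-data; name="input"; filename="{pdf_name}"\r\n'
         f'Content-Type: application/pdf\r\n\r\n').encode("utf-8")
        + pdf_bytes + b"\r\n"
    )
    body = b"".join(parts) + f"--{boundary}--\r\n".encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"

def process_fulltext(pdf_path, **form_fields):
    """POST a PDF to the service; returns the response body (TEI XML) as a string."""
    with open(pdf_path, "rb") as f:
        body, content_type = encode_multipart(form_fields, pdf_path, f.read())
    req = urllib.request.Request(
        GROBID_URL, data=body, method="POST",
        headers={"Content-Type": content_type, "Accept": "application/xml"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

Typical use would be `tei = process_fulltext("article.pdf", consolidateHeader="1", segmentSentences="1")`. For a field such as `teiCoordinates` that may be supplied several times, add one multipart part per value rather than a single field.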
Response status codes:
diff --git a/doc/Principles.md b/doc/Principles.md
index 0f42d353b6..d0626b78c3 100644
--- a/doc/Principles.md
+++ b/doc/Principles.md
@@ -12,7 +12,7 @@ In large scale scientific document ingestion tasks, the large majority of docume
To process publisher XML, complementary to GROBID, we built [Pub2TEI](https://github.com/kermitt2/Pub2TEI), a collection of style sheets developed over 11 years, able to transform a variety of publisher XML formats into the same TEI XML format as produced by GROBID. This common format, which supersedes a dozen publisher formats and many of their flavors, can further centralize any processing across PDF and heterogeneous XML sources without information loss, and support various applications (see __Fig. 1__). Similarly, LaTeX sources (typically all available arXiv sources) can be processed with our fork of [LaTeXML](https://github.com/kermitt2/LaTeXML) to produce a TEI representation compatible with GROBID and Pub2TEI output, without information loss from LaTeXML XML.
-The rest of this page gives an overview of the main GROBID design principles. Skip it if you are not interested in the technical details. Functionalities are described in the [User Manual](https://grobid.readthedocs.io/en/latest/). Recent benchmarking are available [here](https://grobid.readthedocs.io/en/latest/Benchmarking/).
+The rest of this page gives an overview of the main GROBID design principles. Skip it if you are not interested in the technical details. Functionalities are described in the [User Manual](index.md). Recent benchmarking results are available [here](Benchmarking.md).
## Document parsing as a cascade of sequence labeling models
@@ -79,13 +79,13 @@ GROBID does not use a vast amount of training data derived from existing publish
- A lower amount of training data can keep models smaller (e.g. with CRF), faster to train and thus easier for setting hyperparameters.
-In practice, the size of GROBID training data is smaller than the ones of CERMINE _(Tkaczyk et al., 2015)_ by a factor 30 to 100, and smaller than ScienceParse 2 by a factor 2500 to 10000. Still GROBID provides comparable or better accuracy scores. To help to ensure high-quality training data, we develop detailed [annotation guidelines](training/General-principles/) to remove as much as possible disagreements/inconsistencies regarding the annotation decision. The training data is reviewed regularly. We do not use double-blind annotation with reconciliation and do not compute Inter Annotator Agreement (as we should), because the average size of the annotation team is under 2 :)
+In practice, the size of GROBID's training data is smaller than that of CERMINE _(Tkaczyk et al., 2015)_ by a factor of 30 to 100, and smaller than that of ScienceParse 2 by a factor of 2,500 to 10,000. Still, GROBID provides comparable or better accuracy scores. To help ensure high-quality training data, we develop detailed [annotation guidelines](training/General-principles.md) to remove as many disagreements/inconsistencies as possible regarding annotation decisions. The training data is reviewed regularly. We do not use double-blind annotation with reconciliation and do not compute Inter-Annotator Agreement (as we should), because the average size of the annotation team is under 2 :)
## Evaluation
As the training data is crafted for accuracy and coverage, it is strongly biased by the undersampling of non-edge cases. To put it more clearly: the less "informative" training examples, which are the most common ones, are underrepresented in our training data. Because of this bias, our manually labeled data cannot be used for evaluation. Evaluations of GROBID models are thus done with separate and stable holdout sets from publishers, which follow more realistic distributions of document variations.
-See the current evaluations with [PubMed Central holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-pmc/) (1,943 documents, 90,125 bibliographical references in 139,835 citation contexts), [bioarXiv holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-biorxiv/) (2,000 documents, 98,753 bibliographical references in 142,796 citation contexts), [eLife holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-elife/) (984 documents, 63,664 bibliographical references in 109,022 reference contexts) and [PLOS holdout set](https://grobid.readthedocs.io/en/latest/Benchmarking-plos/) (1,000 documents, 48,449 bibliographical references in 69,755 reference contexts).
+See the current evaluations with [PubMed Central holdout set](Benchmarking-pmc.md) (1,943 documents, 90,125 bibliographical references in 139,835 citation contexts), [bioRxiv holdout set](Benchmarking-biorxiv.md) (2,000 documents, 98,753 bibliographical references in 142,796 citation contexts), [eLife holdout set](Benchmarking-elife.md) (984 documents, 63,664 bibliographical references in 109,022 reference contexts) and [PLOS holdout set](Benchmarking-plos.md) (1,000 documents, 48,449 bibliographical references in 69,755 reference contexts).
Our evaluation approach, however, raises two main issues:
diff --git a/doc/css/custom.css b/doc/css/custom.css
new file mode 100644
index 0000000000..3e7617d434
--- /dev/null
+++ b/doc/css/custom.css
@@ -0,0 +1,3 @@
+.wy-table-responsive table td, .wy-table-responsive table th {
+ white-space: inherit;
+}
\ No newline at end of file
diff --git a/doc/training/General-principles.md b/doc/training/General-principles.md
index dd40f3e3cc..837db61c68 100644
--- a/doc/training/General-principles.md
+++ b/doc/training/General-principles.md
@@ -8,7 +8,7 @@ This may be of interest if the current state of the models does not correctly rec
The addition of training data in Grobid is __not__ done from scratch, but from pre-annotated training data generated by the existing models in Grobid. This ensures that the syntax of the new training data will be (normally) correct and that the stream of text will be easy to align with the text extracted from the PDF. It also takes advantage of the existing models, which already annotate a certain amount of text correctly, so the annotator can focus on the corrections, improving productivity.
-For generating pre-annotated training files for Grobid based on the existing models, see the instructions for running the software in batch [here](../../Training-the-models-of-Grobid/#generation-of-training-data) and [here](../../Grobid-batch/#createtraining).
+For generating pre-annotated training files for Grobid based on the existing models, see the instructions for running the software in batch [here](../Training-the-models-of-Grobid.md#generation-of-training-data) and [here](../Grobid-batch.md#createtraining).
After running the `createTraining` batch on a set of PDF files to create training data, each article comes with:
@@ -35,7 +35,7 @@ The exact list of generated files depends on the structures occurring in the art
| `*.training.references.authors.tei.xml` | citation | for all the authors appearing in the bibliographical references of the article |
-These files must be reviewed and corrected manually before being added to the training data, taking into account that exploiting any additional training data requires GROBID to re-create its models - by [retraining](../Training-the-models-of-Grobid) them.
+These files must be reviewed and corrected manually before being added to the training data, taking into account that exploiting any additional training data requires GROBID to re-create its models - by [retraining](../Training-the-models-of-Grobid.md) them.
## Correcting pre-annotated files
diff --git a/doc/training/fulltext.md b/doc/training/fulltext.md
index 870b1d7234..5cf8b3466d 100644
--- a/doc/training/fulltext.md
+++ b/doc/training/fulltext.md
@@ -67,7 +67,7 @@ Paragraphs constitute the main bulk of most typical articles or publications and