Commit

Merge pull request kermitt2#1189 from kermitt2/update-doc
Fix internal links in the documentation
lfoppiano authored Nov 20, 2024
2 parents 9fd4c77 + 4e7fba4 commit 354132f
Showing 10 changed files with 34 additions and 31 deletions.
6 changes: 3 additions & 3 deletions doc/Deep-Learning-models.md
@@ -18,7 +18,7 @@ Current neural models can be up to 50 times slower than CRF, depending on the ar

## Recommended Deep Learning models

By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](https://grobid.readthedocs.io/en/latest/Configuration/#configuring-the-models) for more details on how to select these models. The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. Note that the full GROBID Docker image is already configured to use Deep Learning models for bibliographical reference and affiliation-address parsing.
By default, only CRF models are used by Grobid. You need to select the Deep Learning models you would like to use in the GROBID configuration yaml file (`grobid/grobid-home/config/grobid.yaml`). See [here](Configuration.md#configuring-the-models) for more details on how to select these models. The most convenient way to use the Deep Learning models is to use the full GROBID Docker image and pass a configuration file at launch of the container describing the selected models to be used instead of the default CRF ones. Note that the full GROBID Docker image is already configured to use Deep Learning models for bibliographical reference and affiliation-address parsing.

For current GROBID version 0.8.1, we recommend considering the usage of the following Deep Learning models:

@@ -46,7 +46,7 @@ However, if you need a "local" library installation and build, prepare a lot of

#### Classic python and Virtualenv

<span>0.</span> Install GROBID as indicated [here](https://grobid.readthedocs.io/en/latest/Install-Grobid/).
<span>0.</span> Install GROBID as indicated [here](Install-Grobid.md).

The following was tested with Java version up to 17.

@@ -130,7 +130,7 @@ INFO [2020-10-30 23:04:07,756] org.grobid.core.jni.DeLFTModel: Loading DeLFT mo
INFO [2020-10-30 23:04:07,758] org.grobid.core.jni.JEPThreadPool: Creating JEP instance for thread 44
```

It is then possible to [benchmark end-to-end](https://grobid.readthedocs.io/en/latest/End-to-end-evaluation/) the selected Deep Learning models as in any usual GROBID benchmarking exercise. In practice, the CRF models should be mixed with Deep Learning models to keep the process reasonably fast and its memory usage moderate. In addition, note that currently, due to the limited amount of training data, Deep Learning models perform significantly better than CRF only for a few models (`citation`, `affiliation-address`, `reference-segmenter`). This should certainly change in the future!
It is then possible to [benchmark end-to-end](End-to-end-evaluation.md) the selected Deep Learning models as in any usual GROBID benchmarking exercise. In practice, the CRF models should be mixed with Deep Learning models to keep the process reasonably fast and its memory usage moderate. In addition, note that currently, due to the limited amount of training data, Deep Learning models perform significantly better than CRF only for a few models (`citation`, `affiliation-address`, `reference-segmenter`). This should certainly change in the future!

#### Anaconda

4 changes: 2 additions & 2 deletions doc/Grobid-docker.md
@@ -57,7 +57,7 @@ Access the service:
- open the browser at the address `http://localhost:8080`
- the health check will be accessible at the address `http://localhost:8081`

Grobid web services are then available as described in the [service documentation](https://grobid.readthedocs.io/en/latest/Grobid-service/).
Grobid web services are then available as described in the [service documentation](Grobid-service.md).

By default, this image runs Deep Learning models for:

@@ -113,7 +113,7 @@ Access the service:
- open the browser at the address `http://localhost:8080`
- the health check will be accessible at the address `http://localhost:8081`

Grobid web services are then available as described in the [service documentation](https://grobid.readthedocs.io/en/latest/Grobid-service/).
Grobid web services are then available as described in the [service documentation](Grobid-service.md).


## Configure using the yaml config file
30 changes: 15 additions & 15 deletions doc/Grobid-service.md
@@ -59,7 +59,7 @@ If required, modify the file under `grobid/grobid-home/config/grobid.yaml` for s

See the [configuration page](Configuration.md) for details on how to set the different parameters of the `grobid.yaml` configuration file. Service and logging parameters are also set in this configuration file.

If Docker is used, see [here](https://grobid.readthedocs.io/en/latest/Grobid-docker/#configure-using-the-yaml-config-file) on how to start a Grobid container with a modified configuration file.
If Docker is used, see [here](Grobid-docker.md#configure-using-the-yaml-config-file) on how to start a Grobid container with a modified configuration file.

### Model loading strategy
You can choose to load all the models at the start of the service or lazily when a model is used the first time, the latter being the default.
@@ -178,20 +178,20 @@ curl -v -H "Accept: application/x-bibtex" --form input=@./thefile.pdf localhost:

Convert the complete input document into TEI XML format (header, body and bibliographical section).

| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |--------------------------|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
| | | | `consolidateHeader` | optional | `consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the citation and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted). |
| | | | `consolidateCitations` | optional | `consolidateCitations` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). |
| | | | `consolidateFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). |
| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |
| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
| | | | `segmentSentences` | optional | Paragraph structures in the resulting TEI will be further segmented into sentence elements `<s>` |
| | | | `generateIDs` | optional | if supplied as a string equal to `1`, it generates unique identifiers for each text component |
| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF) |
| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) |
| method | request type | response type | parameters | requirement | description |
|--- |--- |--- |--------------------------|-----------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| POST, PUT | `multipart/form-data` | `application/xml` | `input` | required | PDF file to be processed |
| | | | `consolidateHeader` | optional | `consolidateHeader` is a string of value `0` (no consolidation), `1` (consolidate and inject all extra metadata, default value), `2` (consolidate the citation and inject DOI only), or `3` (consolidate using only extracted DOI - if extracted). |
| | | | `consolidateCitations` | optional | `consolidateCitations` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the citation and inject DOI only). |
| | | | `consolidateFunders` | optional | `consolidateFunders` is a string of value `0` (no consolidation, default value) or `1` (consolidate and inject all extra metadata), or `2` (consolidate the funder and inject DOI only). |
| | | | `includeRawCitations` | optional | `includeRawCitations` is a boolean value, `0` (default, do not include raw reference string in the result) or `1` (include raw reference string in the result). |
| | | | `includeRawAffiliations` | optional | `includeRawAffiliations` is a boolean value, `0` (default, do not include raw affiliation string in the result) or `1` (include raw affiliation string in the result). |
| | | | `includeRawCopyrights` | optional | `includeRawCopyrights` is a boolean value, `0` (default, do not include raw copyrights/license string in the result) or `1` (include raw copyrights/license string in the result). |
| | | | `teiCoordinates` | optional | list of element names for which coordinates in the PDF document have to be added, see [Coordinates of structures in the original PDF](Coordinates-in-PDF.md) for more details |
| | | | `segmentSentences` | optional | Paragraph structures in the resulting TEI will be further segmented into sentence elements `<s>` |
| | | | `generateIDs` | optional | if supplied as a string equal to `1`, it generates unique identifiers for each text component |
| | | | `start` | optional | Start page number of the PDF to be considered, previous pages will be skipped/ignored, integer with first page starting at `1`, (default `-1`, start from the first page of the PDF) |
| | | | `end` | optional | End page number of the PDF to be considered, next pages will be skipped/ignored, integer with first page starting at `1` (default `-1`, end with the last page of the PDF) |

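As an illustration of how the parameters in the table combine into a request, here is a minimal sketch that assembles a `curl` command line of the same shape as the `--form input=@./thefile.pdf` call shown earlier. The helper names (`build_form_fields`, `build_curl_command`) are hypothetical, and the endpoint URL used in the example (`http://localhost:8070/api/processFulltextDocument`) is an assumption not taken from this excerpt; adjust host, port and path to your deployment. Only a subset of the parameters is validated, using the values listed in the table.

```python
import shlex

# Allowed values for a few parameters, copied from the table above.
# Parameters not listed here (e.g. start, end, teiCoordinates) are passed through as-is.
ALLOWED_VALUES = {
    "consolidateHeader": {"0", "1", "2", "3"},
    "consolidateCitations": {"0", "1", "2"},
    "includeRawCitations": {"0", "1"},
    "includeRawAffiliations": {"0", "1"},
    "includeRawCopyrights": {"0", "1"},
}


def build_form_fields(pdf_path, **options):
    """Return the multipart form fields as a list of (name, value) pairs."""
    # `input` is the only required field; curl reads the file when the value starts with '@'.
    fields = [("input", "@" + pdf_path)]
    for name, value in options.items():
        value = str(value)
        allowed = ALLOWED_VALUES.get(name)
        if allowed is not None and value not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")
        fields.append((name, value))
    return fields


def build_curl_command(endpoint_url, pdf_path, **options):
    """Return the curl invocation as an argument list (hypothetical helper)."""
    cmd = ["curl", "-v"]
    for name, value in build_form_fields(pdf_path, **options):
        cmd += ["--form", f"{name}={value}"]
    cmd.append(endpoint_url)
    return cmd


if __name__ == "__main__":
    # Assumed endpoint URL; not confirmed by this excerpt.
    url = "http://localhost:8070/api/processFulltextDocument"
    command = build_curl_command(url, "./thefile.pdf", consolidateHeader=1, start=2, end=10)
    print(shlex.join(command))
```

The argument list can be passed to `subprocess.run` directly, or printed with `shlex.join` for copy-pasting into a shell.
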
Response status codes:
