Commit 1247de3

Chore/upgrade build (#10)
* Add openaleph-procrastinate. Bump versions to satisfy dependencies (poetry lock).
* 🧑‍💻 Add pre-commit, use requirements.txt, upgrade to python3.13
* 🧑‍💻 Add dev requirements only for test build
* 🔥 (github) Drop daily cache job
* ✅ (tests/test_pdf) Fix whitespace errors from test results
* 🔨 (make) Build before test
* 👷 Inline base build
* 🚧 Tweak builds and tags
* 👷 (github) Skip intermediate arm46 build for tests
* 👷 (github) Skip cache-from [tmp]
* Revert "👷 (github) Skip cache-from [tmp]"
  This reverts commit 03f86fd.
* 👷 (github/docker) Try this
* 🚨 Apply black
* 👷 (github/docker) Don't use registry cache
* 🧪 (test_image) Skip gif test
* 👷 (github/docker) maybe this
* Downgrade TesserOCR to 2.6.2
* Add MacOS flags in Makefile and LD_PRELOAD path in docker-compose.yml
* Add Dockerfile.test which contains dev dependencies
* Add path to Dockerfile.base in .github/workflows/docker-base.yml
* Revert skipping test_tesseract_ocr_regression for GIF images

---------

Co-authored-by: Alex Ștefănescu <[email protected]>
1 parent 62519d6 commit 1247de3

20 files changed, +4369 -1896 lines changed

.bumpversion.cfg

Lines changed: 2 additions & 2 deletions
@@ -4,14 +4,14 @@ tag_name = {new_version}
 commit = True
 tag = True
 parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)([-](?P<release>(pre|rc))(?P<build>\d+))?
-serialize =
+serialize =
     {major}.{minor}.{patch}-{release}{build}
     {major}.{minor}.{patch}
 
 [bumpversion:part:release]
 optional_value = prod
 first_value = prod
-values =
+values =
     rc
     prod
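
To illustrate what the `parse` and `serialize` settings in this file describe, the following Python sketch applies the same regular expression to a few version strings. The helper names `parse_version` and `serialize` are hypothetical, the use of `fullmatch` is an assumption made for strictness, and the serialization logic is a simplified stand-in rather than bumpversion's actual implementation.

```python
import re

# Pattern copied verbatim from the `parse` setting in .bumpversion.cfg.
PARSE = re.compile(
    r"(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"([-](?P<release>(pre|rc))(?P<build>\d+))?"
)


def parse_version(version: str) -> dict:
    """Split a version string into its named parts (hypothetical helper)."""
    match = PARSE.fullmatch(version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return match.groupdict()


def serialize(parts: dict) -> str:
    """Simplified take on the two `serialize` templates.

    Assumption: a missing release part stands for the optional `prod`
    value, so the short template is used in that case.
    """
    base = "{major}.{minor}.{patch}".format(**parts)
    if parts.get("release") is None:
        return base
    return "{base}-{release}{build}".format(base=base, **parts)


if __name__ == "__main__":
    for text in ("1.2.3", "1.2.3-rc1"):
        parts = parse_version(text)
        print(text, parts, serialize(parts))
```

Under these assumptions, a release-candidate string such as `1.2.3-rc1` round-trips through the long template, while a plain `1.2.3` uses the short one.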

.github/workflows/build.yml

Lines changed: 8 additions & 8 deletions
@@ -26,7 +26,8 @@ jobs:
             type=ref,event=branch
             type=semver,pattern={{version}}
             type=sha
-            type=raw,value=latest
+            type=raw,value=cache
+            type=raw,value=latest,enable=${{ startsWith(github.ref, 'refs/tags') }}
 
       - name: Set up Docker Buildx
         uses: docker/setup-buildx-action@v3
@@ -50,10 +51,10 @@ jobs:
         uses: docker/build-push-action@v6
         with:
           context: .
-          platforms: linux/amd64
           load: true
-          cache-from: type=registry,ref=ghcr.io/openaleph/ingest-file:cache
-          cache-to: type=registry,ref=ghcr.io/openaleph/ingest-file:cache,mode=max
+          platforms: linux/amd64
+          cache-from: type=gha
+          cache-to: type=gha,mode=max
 
       - name: Start services
         run: |
@@ -73,12 +74,11 @@ jobs:
 
       - name: Push docker images
         uses: docker/build-push-action@v6
-        if: (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/tags')) && github.actor != 'dependabot[bot]'
         with:
           context: .
-          platforms: linux/amd64, linux/arm64
+          platforms: linux/amd64,linux/arm64
           push: true
           tags: ${{ steps.meta.outputs.tags }}
           labels: ${{ steps.meta.outputs.labels }}
-          cache-from: type=registry,ref=ghcr.io/openaleph/ingest-file:cache
-          cache-to: type=registry,ref=ghcr.io/openaleph/ingest-file:cache,mode=max
+          cache-from: type=gha
+          cache-to: type=gha,mode=max
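
The effect of the two `type=raw` entries in the first hunk can be sketched in Python. This is only a model of the condition in the diff, not the metadata action's own code: the function name `raw_tags` is hypothetical, and the branch, semver and sha tag rules are left out. The lowercase comparison reflects that the `startsWith` expression function in GitHub Actions is case-insensitive.

```python
def raw_tags(ref: str) -> list[str]:
    """Return the raw tag values that would be enabled for a git ref.

    Hypothetical helper: `cache` is always emitted, while `latest` is
    emitted only when the ref starts with `refs/tags`.
    """
    tags = ["cache"]
    # Mirrors enable=${{ startsWith(github.ref, 'refs/tags') }}.
    if ref.lower().startswith("refs/tags"):
        tags.append("latest")
    return tags


if __name__ == "__main__":
    for ref in ("refs/heads/main", "refs/tags/v1.2.3"):
        print(ref, raw_tags(ref))
```

Read this way, a push to a branch no longer moves `latest`; only tag builds do.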

.github/workflows/daily.yml

Lines changed: 0 additions & 23 deletions
This file was deleted.

.github/workflows/docker-base.yml

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+name: Build ingest-file-base
+
+on:
+  workflow_dispatch: {}
+  schedule:
+    - cron: "0 0 * * *"
+  push:
+    paths:
+      - Dockerfile.base
+      - .github/workflows/docker-base.yml
+
+permissions:
+  packages: write
+
+jobs:
+  docker:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v3
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v2
+      - name: Docker meta
+        id: meta
+        uses: docker/metadata-action@v4
+        with:
+          images: ghcr.io/openaleph/ingest-file-base
+          tags: |
+            type=ref,event=branch
+            type=semver,pattern={{version}}
+            type=sha
+            type=raw,value=latest
+      - name: Set up Docker Buildx
+        uses: docker/setup-buildx-action@v2
+        with:
+          install: true
+      - name: Login to GitHub Container Registry
+        uses: docker/login-action@v2
+        with:
+          registry: ghcr.io
+          username: ${{ github.actor }}
+          password: ${{ secrets.GITHUB_TOKEN }}
+      - name: Build and push release
+        uses: docker/build-push-action@v3
+        with:
+          context: .
+          file: ./Dockerfile.base
+          platforms: linux/amd64,linux/arm64
+          push: true
+          tags: ${{ steps.meta.outputs.tags }}
+          labels: ${{ steps.meta.outputs.labels }}
+          cache-from: type=gha
+          cache-to: type=gha,mode=max

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -1,3 +1,4 @@
+data/model_type_prediction.ftz
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]

.pre-commit-config.yaml

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
+# This is the configuration file for pre-commit (https://pre-commit.com/).
+# To use:
+# * Install pre-commit (https://pre-commit.com/#installation)
+# * Copy this file as ".pre-commit-config.yaml"
+# * Run "pre-commit install".
+repos:
+  - repo: https://github.com/pre-commit/pre-commit-hooks
+    rev: v5.0.0
+    hooks:
+      - id: check-added-large-files
+      - id: check-case-conflict
+      - id: check-merge-conflict
+      - id: check-symlinks
+      - id: check-toml
+      - id: check-yaml
+      - id: debug-statements
+      - id: end-of-file-fixer
+      - id: mixed-line-ending
+        args: ["--fix=lf"]
+      - id: trailing-whitespace
+
+  # - repo: https://github.com/asottile/pyupgrade
+  #   rev: v3.10.1
+  #   hooks:
+  #   - id: pyupgrade
+  #     args: [ "--py310-plus" ]
+
+  - repo: https://github.com/MarcoGorelli/absolufy-imports
+    rev: v0.3.1
+    hooks:
+      - id: absolufy-imports
+
+  - repo: https://github.com/pycqa/isort
+    rev: 6.0.1
+    hooks:
+      - id: isort
+        args: ["--profile", "black"]
+
+  - repo: https://github.com/psf/black
+    rev: 25.1.0
+    hooks:
+      - id: black
+
+  - repo: https://github.com/csachs/pyproject-flake8
+    rev: v7.0.0
+    hooks:
+      - id: pyproject-flake8
+        additional_dependencies: [flake8-bugbear]
+        args: ["--extend-ignore", "E203, E501, W503"]
+        exclude: (test_[\w]+\.py|\.csv|\.json|\.lock)$
+
+  - repo: https://github.com/codespell-project/codespell
+    rev: v2.4.1
+    hooks:
+      - id: codespell
+        exclude: (tests/.*|\.lock)$
+
+  - repo: https://github.com/pre-commit/pygrep-hooks
+    rev: v1.10.0
+    hooks:
+      - id: python-check-blanket-noqa
+        exclude: (test_[\w]+\.py)$
+      - id: python-check-blanket-type-ignore
+      - id: python-no-eval
+      - id: python-use-type-annotations
+      - id: rst-backticks
+      - id: rst-directive-colons
+      - id: rst-inline-touching-normal
+
+  - repo: https://github.com/python-poetry/poetry
+    rev: 2.1.3
+    hooks:
+      - id: poetry-check
+      - id: poetry-lock
+
+  - repo: https://github.com/python-poetry/poetry-plugin-export
+    rev: 1.9.0
+    hooks:
+      - id: poetry-export
+        args: ["--without-hashes", "-o", "requirements.txt"]
+      - id: poetry-export
+        args:
+          ["--without-hashes", "--only", "dev", "-o", "requirements-dev.txt"]
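
The `exclude` patterns in this configuration are Python regular expressions that pre-commit matches against repository-relative paths with `re.search`. As a rough check of what the flake8 and codespell patterns skip, the sketch below tests them against a handful of paths. The helper `is_excluded` and every example path are hypothetical and are not taken from the repository.

```python
import re

# Patterns copied from the `exclude` keys of the two hooks.
FLAKE8_EXCLUDE = re.compile(r"(test_[\w]+\.py|\.csv|\.json|\.lock)$")
CODESPELL_EXCLUDE = re.compile(r"(tests/.*|\.lock)$")


def is_excluded(pattern: re.Pattern, path: str) -> bool:
    """Return True when the pattern matches anywhere in the path."""
    return pattern.search(path) is not None


if __name__ == "__main__":
    # Illustrative paths only; the real file layout may differ.
    paths = [
        "tests/test_pdf.py",
        "tests/fixtures/sample.txt",
        "ingestors/manager.py",
        "poetry.lock",
    ]
    for path in paths:
        print(
            path,
            "flake8 skips:", is_excluded(FLAKE8_EXCLUDE, path),
            "codespell skips:", is_excluded(CODESPELL_EXCLUDE, path),
        )
```

Because both patterns end in `$`, they anchor on the end of the path, so a lock file is skipped by both hooks wherever it lives, while flake8 still checks non-test modules under `tests/`.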

Dockerfile

Lines changed: 5 additions & 155 deletions
@@ -1,165 +1,15 @@
-FROM python:3.11-slim
-
-ENV DEBIAN_FRONTEND="noninteractive"
-
-LABEL org.opencontainers.image.title="FollowTheMoney File Ingestors"
-LABEL org.opencontainers.image.licenses="MIT"
-LABEL org.opencontainers.image.source="https://github.com/alephdata/ingest-file"
-
-# Enable non-free archive for `unrar`.
-RUN echo "deb http://http.us.debian.org/debian stable non-free" >/etc/apt/sources.list.d/nonfree.list \
-    && apt-get -qq -y update \
-    && apt-get -qq -y install build-essential locales \
-    # python deps (mostly to install their dependencies)
-    git python3-dev \
-    pkg-config libicu-dev \
-    # tesseract
-    tesseract-ocr libtesseract-dev libleptonica-dev \
-    # libraries
-    libldap2-dev libsasl2-dev \
-    # package tools
-    unrar p7zip-full \
-    # audio & video metadata
-    libmediainfo-dev \
-    # image processing, djvu
-    mdbtools djvulibre-bin \
-    libtiff5-dev \
-    libtiff-tools ghostscript librsvg2-bin jbig2dec \
-    pst-utils libgif-dev \
-    # necessary for python-magic
-    libmagic1 \
-    ### tesseract
-    tesseract-ocr-eng \
-    tesseract-ocr-swa \
-    tesseract-ocr-swe \
-    # tesseract-ocr-tam \
-    # tesseract-ocr-tel \
-    tesseract-ocr-fil \
-    # tesseract-ocr-tha \
-    tesseract-ocr-tur \
-    tesseract-ocr-ukr \
-    # tesseract-ocr-vie \
-    tesseract-ocr-nld \
-    tesseract-ocr-nor \
-    tesseract-ocr-pol \
-    tesseract-ocr-por \
-    tesseract-ocr-ron \
-    tesseract-ocr-rus \
-    tesseract-ocr-slk \
-    tesseract-ocr-slv \
-    tesseract-ocr-spa \
-    # tesseract-ocr-spa_old \
-    tesseract-ocr-sqi \
-    tesseract-ocr-srp \
-    tesseract-ocr-ind \
-    tesseract-ocr-isl \
-    tesseract-ocr-ita \
-    # tesseract-ocr-ita_old \
-    # tesseract-ocr-jpn \
-    tesseract-ocr-kan \
-    tesseract-ocr-kat \
-    # tesseract-ocr-kor \
-    tesseract-ocr-khm \
-    tesseract-ocr-lav \
-    tesseract-ocr-lit \
-    # tesseract-ocr-mal \
-    tesseract-ocr-mkd \
-    tesseract-ocr-mya \
-    tesseract-ocr-mlt \
-    tesseract-ocr-msa \
-    tesseract-ocr-est \
-    # tesseract-ocr-eus \
-    tesseract-ocr-fin \
-    tesseract-ocr-fra \
-    tesseract-ocr-frk \
-    # tesseract-ocr-frm \
-    # tesseract-ocr-glg \
-    # tesseract-ocr-grc \
-    tesseract-ocr-heb \
-    tesseract-ocr-hin \
-    tesseract-ocr-hrv \
-    tesseract-ocr-hye \
-    tesseract-ocr-hun \
-    # tesseract-ocr-ben \
-    tesseract-ocr-bul \
-    tesseract-ocr-cat \
-    tesseract-ocr-ces \
-    tesseract-ocr-nep \
-    # tesseract-ocr-chi_sim \
-    # tesseract-ocr-chi_tra \
-    # tesseract-ocr-chr \
-    tesseract-ocr-dan \
-    tesseract-ocr-deu \
-    tesseract-ocr-ell \
-    # tesseract-ocr-enm \
-    # tesseract-ocr-epo \
-    # tesseract-ocr-equ \
-    tesseract-ocr-afr \
-    tesseract-ocr-ara \
-    tesseract-ocr-aze \
-    tesseract-ocr-bel \
-    tesseract-ocr-uzb \
-    ### pdf convert: libreoffice + a bunch of fonts
-    libreoffice fonts-opensymbol hyphen-fr hyphen-de \
-    hyphen-en-us hyphen-it hyphen-ru fonts-dejavu fonts-dejavu-extra \
-    fonts-droid-fallback fonts-dustin fonts-f500 fonts-fanwood fonts-freefont-ttf \
-    fonts-liberation fonts-lmodern fonts-lyx fonts-sil-gentium fonts-texgyre \
-    fonts-tlwg-purisa \
-    ###
-    && apt-get -qq -y autoremove \
-    && apt-get clean \
-    && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* \
-    && localedef -i en_US -c -f UTF-8 -A /usr/share/locale/locale.alias en_US.UTF-8
-
-# Set up the locale and make sure the system uses unicode for the file system.
-ENV LANG='en_US.UTF-8' \
-    TZ='UTC' \
-    OMP_THREAD_LIMIT='1' \
-    OPENBLAS_NUM_THREADS='1'
-
-RUN groupadd -g 1000 -r app \
-    && useradd -m -u 1000 -s /bin/false -g app app
-
-# Download the ftm-typepredict model
-RUN mkdir /models/ && \
-    curl -o "/models/model_type_prediction.ftz" "https://public.data.occrp.org/develop/models/types/type-08012020-7a69d1b.ftz"
-
-COPY requirements.txt /tmp/
-RUN pip3 install --no-cache-dir -q -U pip setuptools
-RUN pip3 install --no-binary=:pyicu: pyicu
-RUN pip3 install --no-cache-dir --no-binary "tesserocr" -r /tmp/requirements.txt
-
-# Install spaCy models
-RUN python3 -m spacy download en_core_web_sm \
-    && python3 -m spacy download de_core_news_sm \
-    && python3 -m spacy download fr_core_news_sm \
-    && python3 -m spacy download es_core_news_sm
-RUN python3 -m spacy download ru_core_news_sm \
-    && python3 -m spacy download pt_core_news_sm \
-    && python3 -m spacy download ro_core_news_sm \
-    && python3 -m spacy download mk_core_news_sm
-RUN python3 -m spacy download el_core_news_sm \
-    && python3 -m spacy download pl_core_news_sm \
-    && python3 -m spacy download it_core_news_sm \
-    && python3 -m spacy download lt_core_news_sm \
-    && python3 -m spacy download nl_core_news_sm \
-    && python3 -m spacy download nb_core_news_sm \
-    && python3 -m spacy download da_core_news_sm
-# RUN python3 -m spacy download zh_core_web_sm
+FROM ghcr.io/openaleph/ingest-file-base:latest
 
 COPY . /ingestors
 WORKDIR /ingestors
-RUN pip3 install --no-cache-dir --config-settings editable_mode=compat --use-pep517 -e /ingestors
-RUN chown -R app:app /ingestors
-
+RUN pip3 install --no-cache-dir -r /ingestors/requirements.txt
+RUN pip3 install --no-cache-dir /ingestors
 
 ENV ARCHIVE_TYPE=file \
     ARCHIVE_PATH=/data \
     FTM_STORE_URI=postgresql://aleph:aleph@postgres/aleph \
     REDIS_URL=redis://redis:6379/0 \
-    TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
-
-ENV LD_PRELOAD="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
+    TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata
 
-# USER app
+USER app
 CMD ingestors process
