Commit 7b2125b

BREAKING CHANGE: remove legacy detectron2 model; remove layoutparser extras (#350)
### Summary

First step in resolving Unstructured-IO/unstructured#3051. Per [this comment](Unstructured-IO/unstructured#3051 (comment)), we were having trouble running `unstructured` in the Python 3.12 `wolfi-base` container due to issues related to `pycocotools`, which is only used for the legacy `detectron2` model from `layoutparser`. Since we've replaced that model with `detectron2onnx`, this PR removes the `layoutparser` extra dependencies that caused issues with Python 3.12. The `layoutparser` base dependency is still required because we use layout objects from that library. It's likely we could remove these in a future iteration.

Temporarily disabled the ingest tests, because they seem to have been broken for the past six months. The last commit they passed for was [this one](0f0c2be). Opened #352 to re-enable them.

### Testing

If CI passes we should be good to go.
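As an illustration that is not part of this commit, the following minimal sketch checks whether the packages dropped from `requirements/base.txt` can still be located in the current environment. The helper name `still_installed` is hypothetical, and the import names are assumed to match the distribution names shown in the diff.

```python
import importlib.util

# Packages that this commit drops from requirements/base.txt. The import names
# are assumptions based on the distribution names that appear in the diff.
REMOVED_MODULES = ["pycocotools", "effdet", "omegaconf", "pytesseract"]


def still_installed(module_names):
    """Return the subset of module_names that can still be located."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is not None
    ]


if __name__ == "__main__":
    leftovers = still_installed(REMOVED_MODULES)
    if leftovers:
        print("Still importable:", ", ".join(leftovers))
    else:
        print("None of the removed packages are importable.")
```

In a fresh environment built from the new `requirements/base.txt`, one would expect the second message, although other installed packages could still pull some of these modules in.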
1 parent 81549a7 commit 7b2125b

12 files changed: +121 -288 lines changed

.github/workflows/ci.yml

+44 -42
@@ -104,48 +104,50 @@ jobs:
         CI=true make test
         make check-coverage
 
-  test_ingest:
-    strategy:
-      matrix:
-        python-version: ["3.9","3.10"]
-    runs-on: ubuntu-latest
-    env:
-      NLTK_DATA: ${{ github.workspace }}/nltk_data
-    needs: lint
-    steps:
-    - name: Checkout unstructured repo for integration testing
-      uses: actions/checkout@v4
-      with:
-        repository: 'Unstructured-IO/unstructured'
-    - name: Checkout this repo
-      uses: actions/checkout@v4
-      with:
-        path: inference
-    - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v4
-      with:
-        python-version: ${{ matrix.python-version }}
-    - name: Test
-      env:
-        GH_READ_ONLY_ACCESS_TOKEN: ${{ secrets.GH_READ_ONLY_ACCESS_TOKEN }}
-        SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
-        DISCORD_TOKEN: ${{ secrets.DISCORD_TOKEN }}
-      run: |
-        python${{ matrix.python-version }} -m venv .venv
-        source .venv/bin/activate
-        [ ! -d "$NLTK_DATA" ] && mkdir "$NLTK_DATA"
-        make install-ci
-        pip install -e inference/
-        sudo apt-get update
-        sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
-        sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
-        sudo apt-get install -y tesseract-ocr
-        sudo apt-get install -y tesseract-ocr-kor
-        sudo apt-get install -y diffstat
-        tesseract --version
-        make install-all-ingest
-        # only run ingest tests that check expected output diffs.
-        bash inference/scripts/test-unstructured-ingest-helper.sh
+  # NOTE(robinson) - disabling ingest tests for now, as of 5/22/2024 they seem to have been
+  # broken for the past six months
+  # test_ingest:
+  #   strategy:
+  #     matrix:
+  #       python-version: ["3.9","3.10"]
+  #   runs-on: ubuntu-latest
+  #   env:
+  #     NLTK_DATA: ${{ github.workspace }}/nltk_data
+  #   needs: lint
+  #   steps:
+  #   - name: Checkout unstructured repo for integration testing
+  #     uses: actions/checkout@v4
+  #     with:
+  #       repository: 'Unstructured-IO/unstructured'
+  #   - name: Checkout this repo
+  #     uses: actions/checkout@v4
+  #     with:
+  #       path: inference
+  #   - name: Set up Python ${{ matrix.python-version }}
+  #     uses: actions/setup-python@v4
+  #     with:
+  #       python-version: ${{ matrix.python-version }}
+  #   - name: Test
+  #     env:
+  #       GH_READ_ONLY_ACCESS_TOKEN: ${{ secrets.GH_READ_ONLY_ACCESS_TOKEN }}
+  #       SLACK_TOKEN: ${{ secrets.SLACK_TOKEN }}
+  #       DISCORD_TOKEN: ${{ secrets.DISCORD_TOKEN }}
+  #     run: |
+  #       python${{ matrix.python-version }} -m venv .venv
+  #       source .venv/bin/activate
+  #       [ ! -d "$NLTK_DATA" ] && mkdir "$NLTK_DATA"
+  #       make install-ci
+  #       pip install -e inference/
+  #       sudo apt-get update
+  #       sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
+  #       sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
+  #       sudo apt-get install -y tesseract-ocr
+  #       sudo apt-get install -y tesseract-ocr-kor
+  #       sudo apt-get install -y diffstat
+  #       tesseract --version
+  #       make install-all-ingest
+  #       # only run ingest tests that check expected output diffs.
+  #       bash inference/scripts/test-unstructured-ingest-helper.sh
 
   changelog:
     runs-on: ubuntu-latest

CHANGELOG.md

+10 -5
@@ -1,17 +1,22 @@
+## 0.7.33
+
+* BREAKING CHANGE: removes legacy detectron2 model
+* deps: remove layoutparser optional dependencies
+
 ## 0.7.32
 
-* refactor: remove all code related to filling inferred elements text from embedded text (pdfminer).
+* refactor: remove all code related to filling inferred elements text from embedded text (pdfminer).
 * bug: set the Chipper max_length variable
 
 ## 0.7.31
 
-* refactor: remove all `cid` related code that was originally added to filter out invalid `pdfminer` text
+* refactor: remove all `cid` related code that was originally added to filter out invalid `pdfminer` text
 * enhancement: Wrapped hf_hub_download with a function that checks for local file before checking HF
 
 ## 0.7.30
 
-* fix: table transformer doesn't return multiple cells with same coordinates
-*
+* fix: table transformer doesn't return multiple cells with same coordinates
+*
 ## 0.7.29
 
 * fix: table transformer predictions are now removed if confidence is below threshold
@@ -458,4 +463,4 @@ we have the mapping from standard language code to paddle language code.
 
 ## 0.2.0
 
-* Initial release of unstructured-inference
+* Initial release of unstructured-inference

requirements/base.in

+4 -1
@@ -1,10 +1,13 @@
 -c constraints.in
-layoutparser[layoutmodels,tesseract]
+layoutparser
 python-multipart
 huggingface-hub
 opencv-python!=4.7.0.68
 onnx
 onnxruntime>=1.17.0
+matplotlib
+torch
+timm
 # NOTE(alan): Pinned because this is when the most recent module we import appeared
 transformers>=4.25.1
 rapidfuzz
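To make the change to the `layoutparser` line concrete, here is a small sketch that separates a requirement line into its package name and extras. The `split_extras` helper is hypothetical and uses a simplified pattern, not the full PEP 508 grammar, so treat it as an illustration only.

```python
import re

# Simplified pattern for "name[extra1,extra2]" followed by an optional version
# specifier. It is a sketch and does not cover the full PEP 508 grammar.
_REQ = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(?:\[([^\]]*)\])?")


def split_extras(requirement):
    """Return (package_name, sorted list of extras) for one requirement line."""
    match = _REQ.match(requirement)
    if match is None:
        raise ValueError(f"not a requirement line: {requirement!r}")
    name, extras = match.group(1), match.group(2) or ""
    return name, sorted(e.strip() for e in extras.split(",") if e.strip())


old = split_extras("layoutparser[layoutmodels,tesseract]")
new = split_extras("layoutparser")

# Extras requested before the change but not after it.
dropped = sorted(set(old[1]) - set(new[1]))
print(dropped)  # ['layoutmodels', 'tesseract']
```

Dropping those two extras is what removes `effdet`, `pycocotools`, and `pytesseract` from the compiled requirements, which in turn is why `matplotlib`, `torch`, and `timm` are now listed explicitly.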

requirements/base.txt

+22 -40
@@ -4,8 +4,6 @@
 #
 #    pip-compile requirements/base.in
 #
-antlr4-python3-runtime==4.9.3
-    # via omegaconf
 certifi==2024.2.2
     # via requests
 cffi==1.16.0
@@ -18,13 +16,11 @@ coloredlogs==15.0.1
     # via onnxruntime
 contourpy==1.2.1
     # via matplotlib
-cryptography==42.0.5
+cryptography==42.0.7
     # via pdfminer-six
 cycler==0.12.1
     # via matplotlib
-effdet==0.4.1
-    # via layoutparser
-filelock==3.13.4
+filelock==3.14.0
     # via
     #   huggingface-hub
     #   torch
@@ -33,11 +29,11 @@ flatbuffers==24.3.25
     # via onnxruntime
 fonttools==4.51.0
     # via matplotlib
-fsspec==2024.3.1
+fsspec==2024.5.0
     # via
     #   huggingface-hub
     #   torch
-huggingface-hub==0.22.2
+huggingface-hub==0.23.1
     # via
     #   -r requirements/base.in
     #   timm
@@ -51,16 +47,16 @@ importlib-resources==6.4.0
     # via matplotlib
 iopath==0.1.10
     # via layoutparser
-jinja2==3.1.3
+jinja2==3.1.4
     # via torch
 kiwisolver==1.4.5
     # via matplotlib
-layoutparser[layoutmodels,tesseract]==0.3.4
+layoutparser==0.3.4
     # via -r requirements/base.in
 markupsafe==2.1.5
     # via jinja2
-matplotlib==3.8.4
-    # via pycocotools
+matplotlib==3.9.0
+    # via -r requirements/base.in
 mpmath==1.3.0
     # via sympy
 networkx==3.2.1
@@ -74,15 +70,12 @@ numpy==1.26.4
     #   onnxruntime
     #   opencv-python
     #   pandas
-    #   pycocotools
     #   scipy
     #   torchvision
     #   transformers
-omegaconf==2.3.0
-    # via effdet
 onnx==1.16.0
     # via -r requirements/base.in
-onnxruntime==1.17.3
+onnxruntime==1.18.0
     # via -r requirements/base.in
 opencv-python==4.9.0.80
     # via
@@ -93,7 +86,6 @@ packaging==24.0
     #   huggingface-hub
     #   matplotlib
     #   onnxruntime
-    #   pytesseract
     #   transformers
 pandas==2.2.2
     # via layoutparser
@@ -109,24 +101,19 @@ pillow==10.3.0
     #   matplotlib
     #   pdf2image
     #   pdfplumber
-    #   pytesseract
     #   torchvision
 portalocker==2.8.2
     # via iopath
 protobuf==5.26.1
     # via
     #   onnx
     #   onnxruntime
-pycocotools==2.0.7
-    # via effdet
 pycparser==2.22
     # via cffi
 pyparsing==3.1.2
     # via matplotlib
-pypdfium2==4.29.0
+pypdfium2==4.30.0
     # via pdfplumber
-pytesseract==0.3.10
-    # via layoutparser
 python-dateutil==2.9.0.post0
     # via
     #   matplotlib
@@ -139,14 +126,13 @@ pyyaml==6.0.1
     # via
     #   huggingface-hub
     #   layoutparser
-    #   omegaconf
     #   timm
     #   transformers
-rapidfuzz==3.8.1
+rapidfuzz==3.9.1
     # via -r requirements/base.in
-regex==2024.4.16
+regex==2024.5.15
     # via transformers
-requests==2.31.0
+requests==2.32.2
     # via
     #   huggingface-hub
     #   transformers
@@ -162,27 +148,23 @@ sympy==1.12
     # via
     #   onnxruntime
     #   torch
-timm==0.9.16
-    # via effdet
+timm==1.0.3
+    # via -r requirements/base.in
 tokenizers==0.19.1
     # via transformers
-torch==2.2.2
+torch==2.3.0
     # via
-    #   effdet
-    #   layoutparser
+    #   -r requirements/base.in
     #   timm
     #   torchvision
-torchvision==0.17.2
-    # via
-    #   effdet
-    #   layoutparser
-    #   timm
-tqdm==4.66.2
+torchvision==0.18.0
+    # via timm
+tqdm==4.66.4
     # via
     #   huggingface-hub
     #   iopath
     #   transformers
-transformers==4.40.0
+transformers==4.41.0
     # via -r requirements/base.in
 typing-extensions==4.11.0
     # via
@@ -193,5 +175,5 @@ tzdata==2024.1
     # via pandas
 urllib3==2.2.1
     # via requests
-zipp==3.18.1
+zipp==3.18.2
     # via importlib-resources
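The effect of recompiling `requirements/base.txt` can be summarised by comparing the pins before and after. The sketch below is one way to do that, assuming each pin has the form `name==version`; `parse_pins` and `compare_pins` are hypothetical helpers, and the sample text is a small excerpt of the pins shown in the diff above.

```python
def parse_pins(text):
    """Map package name -> pinned version for lines of the form name==version."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, "# via ..." annotations, and anything that is not a pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Drop any extras so "pkg[extra]==1.0" and "pkg==1.0" compare equal.
        pins[name.split("[", 1)[0].strip().lower()] = version.strip()
    return pins


def compare_pins(old_text, new_text):
    """Return (removed, added, changed) between two compiled requirement files."""
    old, new = parse_pins(old_text), parse_pins(new_text)
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = {
        name: (old[name], new[name])
        for name in sorted(set(old) & set(new))
        if old[name] != new[name]
    }
    return removed, added, changed


# Excerpt of the pins from the diff above, not the full files.
OLD = """\
layoutparser[layoutmodels,tesseract]==0.3.4
pycocotools==2.0.7
    # via effdet
timm==0.9.16
"""
NEW = """\
layoutparser==0.3.4
timm==1.0.3
    # via -r requirements/base.in
"""

removed, added, changed = compare_pins(OLD, NEW)
print(removed, added, changed)
```

On this excerpt, `pycocotools` is reported as removed and `timm` as changed from `0.9.16` to `1.0.3`, while `layoutparser` is treated as unchanged because only its extras differ.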
