Commit a49590a

Merge pull request #17 from UMassCDS/repo-merging
Merging all data from MSF-OCR-Streamlit to this repo
2 parents 8a79d47 + e7bc953 commit a49590a

9 files changed: +1291 -0 lines changed

.dockerignore

+27
@@ -0,0 +1,27 @@
+**/__pycache__
+**/.venv
+**/.classpath
+**/.dockerignore
+**/.env
+**/.git
+**/.gitignore
+**/.project
+**/.settings
+**/.toolstarget
+**/.vs
+**/.vscode
+**/*.*proj.user
+**/*.dbmdl
+**/*.jfm
+**/bin
+**/charts
+**/docker-compose*
+**/compose*
+**/Dockerfile*
+**/node_modules
+**/npm-debug.log
+**/obj
+**/secrets.dev.yaml
+**/values.dev.yaml
+LICENSE
+README.md
+26
@@ -0,0 +1,26 @@
+# Creates a new docker image and pushes it to DockerHub when a version tag is created.
+name: create-docker-images
+on:
+  push:
+    tags:
+      - 'v*.*'
+
+jobs:
+  # Based on instructions from https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/
+  build-docker:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v3
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v2
+      - name: Set up Docker Buildx
+        id: buildx
+        uses: docker/setup-buildx-action@v2
+      - name: Login to docker hub
+        run: echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
+      - name: Build and push images
+        run: |
+          docker buildx build --push \
+            --tag umasscds/msf-ocr-streamlit:${{ github.ref_name }} \
+            --platform linux/amd64,linux/arm64 .
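
The workflow's `tags` filter only fires for pushed tags that match `v*.*`, and the pushed image is tagged with `github.ref_name`. The following is a minimal sketch of which tag names that pattern covers. It assumes Python's `fnmatch` is a close enough stand-in for GitHub Actions glob matching when tag names contain no slashes, and the helper name `triggers_build` is hypothetical.

```python
from fnmatch import fnmatchcase

# Pattern copied from the workflow's `on.push.tags` filter.
TAG_PATTERN = "v*.*"


def triggers_build(tag: str) -> bool:
    """Approximate the workflow's tag filter with a case-sensitive glob match."""
    return fnmatchcase(tag, TAG_PATTERN)


if __name__ == "__main__":
    # Print the (approximate) match result for a few candidate tag names.
    for tag in ["v1.1.0", "v2.0", "v1", "1.0.0"]:
        print(f"{tag}: {triggers_build(tag)}")
```

Under this approximation, a tag such as `v1.1.0` would start the build and publish `umasscds/msf-ocr-streamlit:v1.1.0`, while a tag without the `v` prefix or without a dot would not.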

.gitignore

+167
@@ -6,6 +6,173 @@ __pycache__/
 # C extensions
 *.so
 
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+# Usually these files are written by a python script from a template
+# before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+# For a library or package, you might want to ignore these files since the code is
+# intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+# However, in case of collaboration, if having platform-specific dependencies or dependencies
+# having no cross-platform support, pipenv may install dependencies that don't work, or not
+# install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+# This is especially recommended for binary packages to ensure reproducibility, and is more
+# commonly ignored for libraries.
+# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+# in version control.
+# https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+.pdm.toml
+.pdm-python
+.pdm-build/
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+# and can be added to the global gitignore or merged into this file. For a more nuclear
+# option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+.vscode/
+
+.streamlit/secrets.toml
+
+settings.ini# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
 # Distribution / packaging
 .Python
 build/

CHANGELOG.md

+3
@@ -8,6 +8,9 @@ You should also add project tags for each release in Github, see [Managing relea
 
 ## [Unreleased]
 
+### Added
+- Merged the MSF-OCR-Streamlit repository into this repository
+
 ## [1.1.0] - 2024-07-26
 ### Changed
 - Requests to OpenAI are multithreaded to speed up time to get results for multiple images

Dockerfile

+43
@@ -0,0 +1,43 @@
+# For more information, please refer to https://aka.ms/vscode-docker-python
+FROM python:3.10-slim
+
+# Keeps Python from generating .pyc files in the container
+ENV PYTHONDONTWRITEBYTECODE=1
+
+# Turns off buffering for easier container logging
+ENV PYTHONUNBUFFERED=1
+
+# Setting work directory
+WORKDIR /app
+
+# Getting git to clone and system dependencies for DocTR
+RUN apt-get update && apt-get install -y \
+    ffmpeg libsm6 libxext6 libhdf5-dev pkg-config \
+    build-essential \
+    curl \
+    software-properties-common \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copying app files into container
+COPY . .
+
+# Install pip requirements
+RUN pip install --no-cache-dir .[app]
+
+# Streamlit listens on this container port
+EXPOSE 8501
+
+# How to test if a container is still working
+HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
+
+# Run as executable
+ENTRYPOINT ["streamlit", "run", "app_llm.py", "--server.port=8501", "--server.address=0.0.0.0"]
+
+# # Creates a non-root user with an explicit UID and adds permission to access the /app folder
+# # For more info, please refer to https://aka.ms/vscode-docker-python-configure-containers
+# RUN adduser -u 5678 --disabled-password --gecos "" appuser && chown -R appuser /app
+# USER appuser
+
+# # During debugging, this entry point will be overridden. For more information, please refer to https://aka.ms/vscode-docker-python-debug
+# CMD ["python", "app.py"]
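
The `HEALTHCHECK` instruction probes Streamlit's `/_stcore/health` endpoint with `curl --fail`, which exits non-zero when the server returns an HTTP error status. A rough standard-library Python equivalent is sketched below for probing a running container from the host. It assumes port 8501 is published on localhost, and the function name `is_healthy` is hypothetical rather than part of the repository.

```python
import urllib.error
import urllib.request

# Endpoint taken from the Dockerfile's HEALTHCHECK line.
HEALTH_URL = "http://localhost:8501/_stcore/health"

# An empty ProxyHandler keeps a local probe from being routed through a system proxy.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_healthy(url: str = HEALTH_URL, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status, else False."""
    try:
        with _opener.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # HTTP error statuses, refused connections, and timeouts all count as unhealthy.
        return False


if __name__ == "__main__":
    print("healthy" if is_healthy() else "unhealthy")
```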

README.md

+43
@@ -80,3 +80,46 @@ If you are using the OpenAI's GPT model as your OCR engine, you will also need t
 This repository has unit tests in the `tests` directory configured using [pytest](https://pytest.org/) and the Github action defined in `.github/workflows/python_package.yml` will run tests every time you make a pull request to the main branch of the repository.
 
 You can run tests locally using `pytest` or `python -m pytest` from the command line from the root of the repository or configure them to be [run with a debugger in your IDE](https://code.visualstudio.com/docs/python/testing).
+
+
+# MSF-OCR-Streamlit
+
+Uses a Streamlit web app in conjunction with an Optical Character Recognition (OCR) library to allow for uploading documents, scanning them, and correcting information.
+
+This repository contains two versions of the application:
+- `app_llm.py` uses [OpenAI's GPT 4o model](https://platform.openai.com/docs/guides/vision) as an OCR engine to 'read' the tables from images
+- `app_doctr.py` uses the [docTR](https://pypi.org/project/python-doctr/) library as an OCR engine to parse text from the tables in images.
+
+## Necessary environment variables
+In order to use the application, you will need to set the following environment variables for the Streamlit app to access the DHIS2 server:
+```
+DHIS2_USERNAME=<your username>
+DHIS2_PASSWORD=<your password>
+DHIS2_SERVER_URL=<server url>
+```
+
+If you are using the `app_llm.py` version of the application, you will also need to set `OPENAI_API_KEY` with an API key obtained from [OpenAI's online portal](https://platform.openai.com/).
+
+## Running Locally
+1) Set your environment variables. On a Unix system the easiest way to do this is to put them in a `.env` file, then run `set -a && source .env && set +a`. You can also set them in your System Properties or shell environment profile.
+
+2) Install the Python dependencies with `pip install .[app]`.
+
+3) Run your desired Streamlit application with one of the following commands:
+- OpenAI version: `streamlit run app_llm.py`
+- DocTR version: `streamlit run app_doctr.py`
+
+## Docker Instructions
+We have provided a Dockerfile in order to easily build and deploy the OpenAI application version as a Docker container.
+
+1) Build an image named `msf-streamlit`: `docker build -t msf-streamlit .`.
+
+2) Run the `msf-streamlit` image in a container, passing the necessary environment variables:
+```
+docker run -p 8501:8501 -e DHIS2_USERNAME=<your username> -e DHIS2_PASSWORD=<your password> -e DHIS2_SERVER_URL=<server url> -e OPENAI_API_KEY=<your key> msf-streamlit
+```
+
+If you have a `.env` file, you can keep things simple with `docker run -p 8501:8501 --env-file .env msf-streamlit`.
+
+Make sure port 8501 is available, as it is the default for Streamlit.
+
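
The README text added in this commit lists the environment variables the two app versions expect. A minimal sketch of a pre-flight check for them follows. The variable names come from the README, but the helper `missing_env_vars`, its `use_llm` flag, and the grouping into two tuples are hypothetical and not part of the repository.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the README; the grouping is an assumption for illustration.
DHIS2_VARS = ("DHIS2_USERNAME", "DHIS2_PASSWORD", "DHIS2_SERVER_URL")
LLM_VARS = ("OPENAI_API_KEY",)


def missing_env_vars(
    use_llm: bool = True, env: Optional[Mapping[str, str]] = None
) -> list[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    required = DHIS2_VARS + (LLM_VARS if use_llm else ())
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # use_llm=True corresponds to app_llm.py; pass False for app_doctr.py.
    missing = missing_env_vars(use_llm=True)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Running a check like this before `streamlit run` or `docker run` would surface a missing `OPENAI_API_KEY` or DHIS2 credential before the app starts.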
