Merge pull request #17 from UMassCDS/repo-merging

rdziewietin · web-flow · commit a49590aba79b · 2024-07-30T13:12:36.000-04:00
Merging all data from MSF-OCR-Streamlit to this repo
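
Note for reviewers: the README section added in this merge lists the environment variables the Streamlit apps expect (`DHIS2_USERNAME`, `DHIS2_PASSWORD`, `DHIS2_SERVER_URL`, plus `OPENAI_API_KEY` for `app_llm.py`). Below is a minimal sketch of a pre-launch check for them. The helper name `missing_env_vars` and its parameters are hypothetical and not part of this repository; only the variable names come from the README.

```python
import os

# Variable names are taken from the README added in this merge.
REQUIRED_VARS = ["DHIS2_USERNAME", "DHIS2_PASSWORD", "DHIS2_SERVER_URL"]


def missing_env_vars(use_openai=True, environ=None):
    """Return the names of required variables that are unset or empty.

    Hypothetical helper for illustration; it is not defined in the repository.
    """
    env = os.environ if environ is None else environ
    required = list(REQUIRED_VARS)
    if use_openai:
        # Per the README, only the app_llm.py version needs the OpenAI key.
        required.append("OPENAI_API_KEY")
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running this before `streamlit run app_llm.py` (or with `use_openai=False` before `app_doctr.py`) would surface a missing credential early, assuming the apps read these variables from the process environment as the README describes.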
diff --git a/.dockerignore b/.dockerignore
@@ -0,0 +1,27 @@
+**/__pycache__
+**/.venv
+**/.classpath
+**/.dockerignore
+**/.env
+**/.git
+**/.gitignore
+**/.project
+**/.settings
+**/.toolstarget
+**/.vs
+**/.vscode
+**/*.*proj.user
+**/*.dbmdl
+**/*.jfm
+**/bin
+**/charts
+**/docker-compose*
+**/compose*
+**/Dockerfile*
+**/node_modules
+**/npm-debug.log
+**/obj
+**/secrets.dev.yaml
+**/values.dev.yaml
+LICENSE
+README.md
diff --git a/.github/workflows/create-docker-images.yml b/.github/workflows/create-docker-images.yml
@@ -0,0 +1,26 @@
+# Creates a new docker image and pushes it to DockerHub when a version tag is created.
+name: create-docker-images
+on:
+  push:
+    tags:
+      - 'v*.*'
+
+jobs:
+  # Based on instructions from https://www.docker.com/blog/multi-arch-build-and-images-the-simple-way/
+  build-docker:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v3
+      - name: Set up QEMU
+        uses: docker/setup-qemu-action@v2
+      - name: Set up Docker Buildx
+        id: buildx
+        uses: docker/setup-buildx-action@v2
+      - name: Log in to Docker Hub
+        run: echo "${{ secrets.DOCKER_PASSWORD }}" | docker login -u "${{ secrets.DOCKER_USERNAME }}" --password-stdin
+      - name: Build and push images
+        run: |
+          docker buildx build --push \
+            --tag umasscds/msf-ocr-streamlit:${{ github.ref_name }} \
+            --platform linux/amd64,linux/arm64 .
diff --git a/.gitignore b/.gitignore
@@ -6,6 +6,173 @@ __pycache__/
 # C extensions
 *.so
 
+# Distribution / packaging
+.Python
+build/
+develop-eggs/
+dist/
+downloads/
+eggs/
+.eggs/
+lib/
+lib64/
+parts/
+sdist/
+var/
+wheels/
+share/python-wheels/
+*.egg-info/
+.installed.cfg
+*.egg
+MANIFEST
+
+# PyInstaller
+#  Usually these files are written by a python script from a template
+#  before PyInstaller builds the exe, so as to inject date/other infos into it.
+*.manifest
+*.spec
+
+# Installer logs
+pip-log.txt
+pip-delete-this-directory.txt
+
+# Unit test / coverage reports
+htmlcov/
+.tox/
+.nox/
+.coverage
+.coverage.*
+.cache
+nosetests.xml
+coverage.xml
+*.cover
+*.py,cover
+.hypothesis/
+.pytest_cache/
+cover/
+
+# Translations
+*.mo
+*.pot
+
+# Django stuff:
+*.log
+local_settings.py
+db.sqlite3
+db.sqlite3-journal
+
+# Flask stuff:
+instance/
+.webassets-cache
+
+# Scrapy stuff:
+.scrapy
+
+# Sphinx documentation
+docs/_build/
+
+# PyBuilder
+.pybuilder/
+target/
+
+# Jupyter Notebook
+.ipynb_checkpoints
+
+# IPython
+profile_default/
+ipython_config.py
+
+# pyenv
+#   For a library or package, you might want to ignore these files since the code is
+#   intended to run in multiple environments; otherwise, check them in:
+# .python-version
+
+# pipenv
+#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+#   However, in case of collaboration, if having platform-specific dependencies or dependencies
+#   having no cross-platform support, pipenv may install dependencies that don't work, or not
+#   install all needed dependencies.
+#Pipfile.lock
+
+# poetry
+#   Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+#   This is especially recommended for binary packages to ensure reproducibility, and is more
+#   commonly ignored for libraries.
+#   https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+#poetry.lock
+
+# pdm
+#   Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
+#pdm.lock
+#   pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
+#   in version control.
+#   https://pdm.fming.dev/latest/usage/project/#working-with-version-control
+.pdm.toml
+.pdm-python
+.pdm-build/
+
+# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
+__pypackages__/
+
+# Celery stuff
+celerybeat-schedule
+celerybeat.pid
+
+# SageMath parsed files
+*.sage.py
+
+# Environments
+.env
+.venv
+env/
+venv/
+ENV/
+env.bak/
+venv.bak/
+
+# Spyder project settings
+.spyderproject
+.spyproject
+
+# Rope project settings
+.ropeproject
+
+# mkdocs documentation
+/site
+
+# mypy
+.mypy_cache/
+.dmypy.json
+dmypy.json
+
+# Pyre type checker
+.pyre/
+
+# pytype static type analyzer
+.pytype/
+
+# Cython debug symbols
+cython_debug/
+
+# PyCharm
+#  JetBrains specific template is maintained in a separate JetBrains.gitignore that can
+#  be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
+#  and can be added to the global gitignore or merged into this file.  For a more nuclear
+#  option (not recommended) you can uncomment the following to ignore the entire idea folder.
+#.idea/
+
+.vscode/
+
+.streamlit/secrets.toml
+
+settings.ini
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+*$py.class
+
+# C extensions
+*.so
+
 # Distribution / packaging
 .Python
 build/
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,9 @@ You should also add project tags for each release in Github, see [Managing relea
 
 ## [Unreleased]
 
+### Added
+- Merged the MSF-OCR-Streamlit repository into this repository
+
 ## [1.1.0] - 2024-07-26
 ### Changed
 - Requests to OpenAI are multithreaded to speed up time to get results for multiple images
diff --git a/Dockerfile b/Dockerfile
@@ -0,0 +1,43 @@
+# For more information, please refer to https://aka.ms/vscode-docker-python
+FROM python:3.10-slim
+
+# Keeps Python from generating .pyc files in the container
+ENV PYTHONDONTWRITEBYTECODE=1
+
+# Turns off buffering for easier container logging
+ENV PYTHONUNBUFFERED=1
+
+# Setting work directory
+WORKDIR /app
+
+# Install git (for cloning) and the system dependencies for DocTR
+RUN apt-get update && apt-get install -y \
+    ffmpeg libsm6 libxext6 libhdf5-dev pkg-config \
+    build-essential \
+    curl \
+    software-properties-common \
+    git \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copying app files into container
+COPY . .
+
+# Install pip requirements
+RUN pip install --no-cache-dir .[app]
+
+# Streamlit listens on this container port
+EXPOSE 8501
+
+# How to test if a container is still working
+HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
+
+# Run as executable
+ENTRYPOINT ["streamlit", "run", "app_llm.py", "--server.port=8501", "--server.address=0.0.0.0"]
+
+# # Creates a non-root user with an explicit UID and adds permission to access the /app folder
+# # For more info, please refer to https://aka.ms/vscode-docker-python-configure-containers
+# RUN adduser -u 5678 --disabled-password --gecos "" appuser && chown -R appuser /app
+# USER appuser
+
+# # During debugging, this entry point will be overridden. For more information, please refer to https://aka.ms/vscode-docker-python-debug
+# CMD ["python", "app.py"]
diff --git a/README.md b/README.md
@@ -80,3 +80,46 @@ If you are using the OpenAI's GPT model as your OCR engine, you will also need t
 This repository has unit tests in the `tests` directory configured using [pytest](https://pytest.org/) and the Github action defined in `.github/workflows/python_package.yml` will run tests every time you make a pull request to the main branch of the repository. 
 
 You can run tests locally using `pytest` or `python -m pytest` from the command line from the root of the repository or configure them to be [run with a debugger in your IDE](https://code.visualstudio.com/docs/python/testing).
+
+
+# MSF-OCR-Streamlit
+
+Uses a Streamlit web app in conjunction with an Optical Character Recognition (OCR) library to allow for uploading documents, scanning them, and correcting information.
+
+This repository contains two versions of the application:
+- `app_llm.py` uses [OpenAI's GPT 4o model](https://platform.openai.com/docs/guides/vision) as an OCR engine to 'read' the tables from images
+- `app_doctr.py` uses the [docTR](https://pypi.org/project/python-doctr/) library as an OCR engine to parse text from the tables in images.
+
+## Necessary environment variables
+In order to use the application, you will need to set the following environment variables for the Streamlit app to access the DHIS2 server:
+```
+DHIS2_USERNAME=<your username>
+DHIS2_PASSWORD=<your password>
+DHIS2_SERVER_URL=<server url>
+```
+
+If you are using the `app_llm.py` version of the application, you will also need to set `OPENAI_API_KEY` with an API key obtained from [OpenAI's online portal](https://platform.openai.com/).
+
+## Running Locally
+1) Set your environment variables. On a Unix system, the easiest way to do this is to put them in a `.env` file, then run `set -a && source .env && set +a`. You can also set them in your System Properties or shell environment profile.
+
+2) Install the python dependencies with `pip install .[app]`.
+
+3) Run your desired Streamlit application with one of the following commands:
+    - OpenAI version: `streamlit run app_llm.py` 
+    - DocTR version: `streamlit run app_doctr.py` 
+
+## Docker Instructions
+We have provided a Dockerfile so that you can easily build and deploy the OpenAI version of the application as a Docker container.
+
+1) Build an image named `msf-streamlit`: `docker build -t msf-streamlit .`.
+
+2) Run the `msf-streamlit` image in a container, passing the necessary environment variables: 
+    ```
+    docker run -p 8501:8501 -e DHIS2_USERNAME=<your username> -e DHIS2_PASSWORD=<your password> -e DHIS2_SERVER_URL=<server url> -e OPENAI_API_KEY=<your key> msf-streamlit
+    ```
+
+    If you have a `.env` file, you can keep things simple with `docker run -p 8501:8501 --env-file .env msf-streamlit`. 
+
+    Make sure port 8501 is available, as it is the default for Streamlit.
+
diff --git a/app_doctr.py b/app_doctr.py
diff --git a/app_llm.py b/app_llm.py
diff --git a/pyproject.toml b/pyproject.toml