This repository was archived by the owner on Jun 2, 2025. It is now read-only.

Commit b293c00

Merge pull request #19 from UMassCDS/readme_cleanups
Removing settings.ini. Reorg README and add table of contents
2 parents 7937d66 + 30551a4 commit b293c00

2 files changed (+74, -82 lines)

README.md

Lines changed: 74 additions & 78 deletions

# DoctorsWithoutBordersOCR

- [Introduction](#introduction)
- [Getting Started](#getting-started)
- [Installation](#installing-dependencies-and-packages)
- [Streamlit Application](#streamlit-application)
- [Tests](#tests)
- [Extras](#extras)
- [Docker Instructions](#docker-instructions)
- [Downloading Test Data from Azure](#downloading-test-data-from-azure)

# Introduction
Code repository for the 2024 Data Science for the Common Good project with Doctors Without Borders.

Doctors Without Borders collects data from their clinics and field locations using tally sheets. These tally sheets are standardized forms containing aggregate data, ensuring no individual can be identified and no protected health information (PHI) is included. Currently, this data is manually entered into their health system by over 100 employees, a time-consuming and tedious process. Our OCR pipeline is designed to automate this data entry process, improving efficiency and accuracy.

# Getting Started
## Installing Dependencies and Packages
Use these steps for setting up a development environment to install and work with the code in this repository:

1) Set up a Python 3 virtual environment using [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html#) or [Virtualenv](https://virtualenv.pypa.io/en/latest/index.html). Read [Python Virtual Environments: A Primer](https://realpython.com/python-virtual-environments-a-primer/#the-virtualenv-project) for details on how to get started with virtual environments and why you need them. For a _really detailed_ explanation, see [An unbiased evaluation of environment management and packaging tools](https://alpopkes.com/posts/python/packaging_tools/).

2) Activate your virtual environment.

3) Install the package.
   - **Only base package**: If you just want to use the scripts and package features, install the project by running `pip install .` from the root directory.
   - **Package with Streamlit app**: To also install dependencies for the Streamlit frontend application using the main LLM OCR, run `pip install .[app]`.
   - **Package with DocTR app**: To install dependencies for the Streamlit frontend application using the DocTR OCR, run `pip install .[app-doctr]`.
   - **All development dependencies**: If you will be changing the code and running tests, install with `pip install -e '.[app,test,dev]'`. The `-e/--editable` flag means local changes to the project code are always available when the package is imported. You wouldn't use this in production, but it's useful for development. Note: the quotes around `.[app,test,dev]` are required in zsh, the default shell on newer Macs.

For example, if you use the 'venv' Virtualenv module, you would do the following to create an environment named `venv` with Python version 3.10, then activate it and install the package in developer mode (a combined sketch follows this list):
- Make sure you have Python 3.10 or later installed on your system. You can check your Python version by running `python3 --version`.
- Navigate to your project directory in the terminal and run `python3 -m venv venv` to create a virtual environment named `venv`.
- To activate the virtual environment, use `venv\Scripts\activate` on Windows or `source venv/bin/activate` on Unix or macOS.
- To install the package, run `pip install .` for regular use or `pip install -e .[test,dev]` for development.
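
For instance, on a Unix-like system the whole sequence might look like the following (a sketch that assumes Python 3.10+ is available and that you want the development extras):

```bash
# Create and activate a virtual environment named "venv"
python3 -m venv venv
source venv/bin/activate   # on Windows: venv\Scripts\activate

# Install the package in editable mode with the app, test, and dev extras
pip install -e '.[app,test,dev]'
```
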
## Streamlit Application
Uses a Streamlit web app in conjunction with an Optical Character Recognition (OCR) library to allow users to upload documents, scan them, and correct information.

This repository contains two versions of the application:
- `app_llm.py` uses [OpenAI's GPT-4o model](https://platform.openai.com/docs/guides/vision) as an OCR engine to 'read' the tables from images.
- `app_doctr.py` uses the [docTR](https://pypi.org/project/python-doctr/) library as an OCR engine to parse text from the tables in images.

### Application configuration
#### DHIS2 Server
In order to use the application, you will need to set the `DHIS2_SERVER_URL` environment variable. All users will also need a valid username and password for the DHIS2 server in order to authenticate and use the Streamlit application.

#### OpenAI API Key
If you are using the `app_llm.py` version of the application, you will also need to set `OPENAI_API_KEY` with an API key obtained from [OpenAI's online portal](https://platform.openai.com/).
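
For local development, one convenient option is to keep these variables in a `.env` file at the project root. A sketch with placeholder values (substitute your own credentials):

```bash
# .env - placeholder values only; replace with your DHIS2 and OpenAI credentials
DHIS2_USERNAME=<your username>
DHIS2_PASSWORD=<your password>
DHIS2_SERVER_URL=<server url>
OPENAI_API_KEY=<your key>
```
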
### Running Streamlit Locally
1) Set your environment variables as described above. On a Unix system the easiest way to do this is to put them in a `.env` file, then run `set -a && source .env && set +a`. You can also set them in your System Properties or shell environment profile.

2) Install the Python dependencies with `pip install .[app]`.

3) Run your desired Streamlit application with one of the following commands (see the combined sketch after this list):
   - OpenAI version: `streamlit run app_llm.py`
   - DocTR version: `streamlit run app_doctr.py`
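
Putting the three steps together in a Unix shell might look like this (a sketch that assumes a `.env` file in the project root):

```bash
# Load environment variables, install the app dependencies, and start the app
set -a && source .env && set +a
pip install '.[app]'
streamlit run app_llm.py   # or: streamlit run app_doctr.py
```
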
# Tests
This repository has unit tests in the `tests` directory configured using [pytest](https://pytest.org/), and the GitHub Action defined in `.github/workflows/python_package.yml` runs the tests every time you make a pull request to the main branch of the repository.

If you have installed the `test` dependencies, you can run tests locally using `pytest` or `python -m pytest` from the command line at the root of the repository, or configure them to be [run with a debugger in your IDE](https://code.visualstudio.com/docs/python/testing).
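
For example, from the repository root (a sketch that assumes the `test` extra has not been installed yet):

```bash
# Install the test dependencies, then run the unit tests from the repository root
pip install -e '.[test,dev]'
python -m pytest
```
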
# Extras
## Docker Instructions
We have provided a Dockerfile in order to easily build and deploy the OpenAI version of the Streamlit application as a Docker container.

1) Build an image named `msf-streamlit`: `docker build -t msf-streamlit .`.

2) Run the `msf-streamlit` image in a container, passing the necessary environment variables:

```bash
docker run -p 8501:8501 -e DHIS2_USERNAME=<your username> -e DHIS2_PASSWORD=<your password> -e DHIS2_SERVER_URL=<server url> -e OPENAI_API_KEY=<your key> msf-streamlit
```

If you have a `.env` file, you can keep things simple with `docker run -p 8501:8501 --env-file .env msf-streamlit`.

Make sure port 8501 is available, as it is the default for Streamlit.
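
If you keep your credentials in a `.env` file, the full build-and-run cycle might look like this (a sketch):

```bash
# Build the image, then run it with environment variables loaded from .env
docker build -t msf-streamlit .
docker run -p 8501:8501 --env-file .env msf-streamlit
```
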
## Downloading Test Data from Azure
This section demonstrates how to interact with Azure services to download blobs from Azure Blob Storage. Do this if you need to download test images of tally sheets.

First, install the `dev` dependencies, then launch Jupyter Notebook:

```bash
jupyter notebook
```
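
If the `dev` dependencies are not installed yet, a command along these lines should set them up first (a sketch that assumes the `dev` extra provides the Jupyter tooling):

```bash
# Install the package in editable mode with the dev extra (assumed to include Jupyter)
pip install -e '.[dev]'
```
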
In the Jupyter Notebook interface, navigate to the 'notebooks' folder and open the 'ocr_azure_functions' file. You will need to get the values for the following credentials from someone on the team, fill in the variables in the notebook, then run the cells.

- **`keyvault_url`**: Connects to the Key Vault.
- **`secret_name`**: Retrieves the secret from the Key Vault.
- **`storage_account_name`**: Constructs the connection string for Blob Storage.
- **`container_name`**: Specifies which container to list or download blobs from.
- **`storage_account_key = get_keyvault_secret(keyvault_url, secret_name, credential)`**: Retrieves the storage account key from the Key Vault using the specified URL, secret name, and credential.

### Example Usage
Please specify where to store downloaded files on your computer using the `local_path` variable. The notebook then lists and downloads the blobs in the container:

```python
list_blobs_in_container(storage_account_name, storage_account_key, container_name)
download_blobs_in_container(storage_account_name, storage_account_key, container_name)
```


settings.ini

Lines changed: 0 additions & 4 deletions
This file was deleted.
