
Commit 0cc9bc7

Added sandbox for extracting figures and text out of PDFs and writing them out as JSON, ready to import into lambda-feedback.
1 parent 1182134 commit 0cc9bc7
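
The extraction step described in the commit message could be sketched roughly as below. This is a minimal illustration, not the notebook's actual code: it assumes PyMuPDF (imported as `fitz`) for reading the PDF, and the record layout (`page`, `text`, `figures`) is hypothetical rather than the schema lambda-feedback expects. The helper names `build_record`, `extract_pdf`, and `to_json` are likewise invented for this example.

```python
import base64
import json
from typing import Any, Dict, List


def build_record(page_number: int, text: str, images: List[bytes]) -> Dict[str, Any]:
    """Bundle one page's text and figures into a JSON-serialisable dict.

    The keys used here are a hypothetical layout, chosen for illustration.
    """
    return {
        "page": page_number,
        "text": text.strip(),
        # Raw image bytes are not valid JSON, so store them base64-encoded.
        "figures": [base64.b64encode(img).decode("ascii") for img in images],
    }


def extract_pdf(path: str) -> List[Dict[str, Any]]:
    """Read every page of a PDF with PyMuPDF and return one record per page."""
    import fitz  # PyMuPDF; imported lazily so build_record works without it

    records = []
    with fitz.open(path) as doc:
        for number, page in enumerate(doc, start=1):
            images = [
                doc.extract_image(info[0])["image"]  # info[0] is the image xref
                for info in page.get_images(full=True)
            ]
            records.append(build_record(number, page.get_text(), images))
    return records


def to_json(records: List[Dict[str, Any]]) -> str:
    """Serialise the page records as pretty-printed JSON."""
    return json.dumps(records, indent=2, ensure_ascii=False)
```

Calling `to_json(extract_pdf("paper.pdf"))`, where `paper.pdf` is a placeholder path, would give a JSON string with one entry per page.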

3 files changed: +727 -1 lines changed

wizard/README.md

+34 -1
@@ -1,3 +1,36 @@
 This is a folder where we'll generate a 'wizard' to automatically process input documents, ready for in2lambda to create a JSON (or to create a JSON directly).

-We'll work on this branch 'hackathon'
+We'll work on this branch 'hackathon'
+
+
+# README
+
+## Overview
+This Jupyter Notebook (`sandbox.ipynb`) is designed for processing scientific documents, extracting mathematical expressions, and formatting them in Markdown. It leverages Azure OpenAI's LLM capabilities for text transformation.
+
+## Features
+- Loads PDFs and extracts text using `UnstructuredPDFLoader` and `PyMuPDF`.
+- Converts mathematical expressions into properly formatted Markdown.
+- Uses `langchain` and `AzureChatOpenAI` for text processing.
+- Supports structured output parsing using `pydantic`.
+
+## Requirements
+Ensure you have the following:
+- Python 3.8+
+- The dependencies, installed with `pip install -r requirements.txt`
+- The packages the notebook imports: `langchain`, `langchain_openai`, `pydantic`, `dotenv` (installed as `python-dotenv`), `PyMuPDF`, `PIL` (installed as `Pillow`)
+
+## Setup
+1. Create a `.env` file in the root directory and add your Azure OpenAI API key and endpoint:
+   ```env
+   AZURE_OPENAI_API_KEY=<your-api-key>
+   AZURE_OPENAI_ENDPOINT=<your-endpoint>
+   ```
+2. Open `sandbox.ipynb` and execute the cells to process your documents.
+
+## Notes
+- Ensure your API key and endpoint are correct, as they are required for LLM functionality.
+- The notebook is designed for scientific documents, but can be extended to other text formats.
+
+
+
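
The Setup step in the README diff above depends on two environment variables being present. A small check like the following could catch a missing or blank value before any LLM call is made. It is a sketch under assumptions: `missing_settings` and `load_and_check` are hypothetical helpers, not part of the notebook, and `load_dotenv` comes from the `python-dotenv` package.

```python
import os
from typing import List, Mapping

# Variable names taken from the .env example in the README.
REQUIRED_SETTINGS = ("AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT")


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the required variable names that are unset or blank in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


def load_and_check() -> None:
    """Load .env via python-dotenv and fail early if a setting is missing."""
    from dotenv import load_dotenv  # provided by the python-dotenv package

    load_dotenv()  # looks for a .env file and adds its entries to os.environ
    missing = missing_settings(os.environ)
    if missing:
        raise RuntimeError("Missing settings in .env: " + ", ".join(missing))
```

Running `load_and_check()` in the first notebook cell would raise a `RuntimeError` naming the absent variables instead of failing later inside the Azure client.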

wizard/requirements.txt

+13
@@ -0,0 +1,13 @@
+langchain
+pandas
+langchain-openai
+pypdf
+chromadb
+langchain_community
+unstructured
+PyMuPDF
+Pillow
+pdfminer.six
+unstructured[pdf]
+ipywidgets
+jupyter
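
Several packages in this requirements file are imported under a different name than they are installed under (for example `PyMuPDF` is imported as `fitz` and `Pillow` as `PIL`). The sketch below checks which import names can be resolved in the current environment. The mapping is an assumption based on each package's usual import name, `missing_packages` is a hypothetical helper, and the `jupyter` metapackage is left out.

```python
from importlib.util import find_spec
from typing import Dict, List

# pip name -> import name (assumed from each package's usual top-level module)
IMPORT_NAMES: Dict[str, str] = {
    "langchain": "langchain",
    "pandas": "pandas",
    "langchain-openai": "langchain_openai",
    "pypdf": "pypdf",
    "chromadb": "chromadb",
    "langchain_community": "langchain_community",
    "unstructured": "unstructured",
    "PyMuPDF": "fitz",
    "Pillow": "PIL",
    "pdfminer.six": "pdfminer",
    "ipywidgets": "ipywidgets",
}


def missing_packages(import_names: Dict[str, str]) -> List[str]:
    """Return the pip names whose import module cannot be found."""
    return [
        pkg for pkg, module in import_names.items() if find_spec(module) is None
    ]
```

`missing_packages(IMPORT_NAMES)` returns an empty list when every listed module is importable; anything it does return can be passed to `pip install`.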
