PDF Document Text Chunk Classifier

This application provides a FastAPI service that classifies PDF documents by content quality. It combines sentence embeddings, a machine-learning classifier, and Retrieval-Augmented Generation (RAG) to assign each document one of three ordinal labels (bad, neutral, good) that reflect its writing quality, structure, and content.

Project Structure

.
├── README.md
├── data/               # Data directory
│   ├── test/          # Test data
│   └── training/      # Training data
├── src/               # Source code
│   ├── app.py         # Main application
│   └── create_example_pdfs.py
├── tests/             # Test scripts
│   ├── conftest.py    # Test configuration and fixtures
│   ├── test_app.py    # API and functionality tests
│   └── test_model.py
├── requirements.txt
└── setup.sh           # Setup and run script

How It Works

The classifier uses a three-stage approach:

1. Text Processing:
   - Extracts text from PDF documents
   - Splits text into manageable chunks
   - Generates embeddings using SentenceTransformer
2. Vector Store and RAG:
   - Stores document chunks in a FAISS vector store
   - Uses RAG to retrieve relevant examples for classification
   - Maintains metadata about document quality and sources
3. Classification:
   - Uses a Logistic Regression model trained on labeled examples
   - Optionally uses GPT-4 with RAG context for more nuanced classification
   - Aggregates predictions across chunks to determine final document quality
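
The first and last steps of this pipeline can be illustrated without any model. The sketch below is not taken from `src/app.py`: the function names, the character-window chunking, the default sizes, and the majority-vote rule with ties resolved toward the lower label are all assumptions chosen for illustration.

```python
from collections import Counter

LABELS = ("bad", "neutral", "good")  # ordinal order, worst to best


def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def aggregate_labels(chunk_labels):
    """Majority vote over per-chunk labels; ties go to the lower-quality label."""
    if not chunk_labels:
        raise ValueError("no chunk predictions to aggregate")
    unknown = set(chunk_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unsupported labels: {sorted(unknown)}")
    counts = Counter(chunk_labels)
    top = max(counts.values())
    # LABELS is ordered worst to best, so the first tied label is the most cautious.
    return next(label for label in LABELS if counts[label] == top)
```

Resolving ties toward the lower label is only one possible policy; averaging ordinal scores or weighting chunks by length would be equally plausible.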

Setup

Quick Start

The easiest way to get started is to use the provided setup script:

# Make the script executable
chmod +x setup.sh

# Run the setup script
./setup.sh

This script will:

- Check for Python 3 installation
- Create a virtual environment if it doesn't exist
- Install all required dependencies
- Check for OpenAI API key
- Initialize the vector store
- Start the FastAPI server

Manual Setup

If you prefer to set up manually:

Create a Python virtual environment:

python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```
Set up your OpenAI API key (if using LLM classification):
```
export OPENAI_API_KEY=your_api_key_here
```

Initialize the vector store:

python -c "from src.app import initialize_vector_store; initialize_vector_store()"

Running the Application

Using the Setup Script

Simply run:

./setup.sh

Manual Start

Start the FastAPI server:

uvicorn src.app:app --reload

The server will start at http://localhost:8000.

API Endpoints

1. Train the Classifier

Endpoint: /train
Method: POST
Description: Train the classifier with labeled PDF documents
Input:
- pdfs: List of PDF files
- labels: Comma-separated list of labels ('bad', 'neutral', 'good')
Output: Status message
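
A client call to this endpoint might look like the sketch below. It assumes the `requests` package is installed, that the PDFs are sent as repeated multipart parts named `pdfs`, that `labels` is a form field, and that the response is JSON; none of these details are confirmed here, so check the interactive API documentation before relying on them. The helper names are hypothetical.

```python
from pathlib import Path

ALLOWED_LABELS = ("bad", "neutral", "good")  # the three labels listed above


def build_labels_field(labels):
    """Validate labels and join them into the comma-separated value."""
    cleaned = [label.strip().lower() for label in labels]
    unknown = [label for label in cleaned if label not in ALLOWED_LABELS]
    if unknown:
        raise ValueError(f"unsupported labels: {unknown}")
    return ",".join(cleaned)


def train(pdf_paths, labels, base_url="http://localhost:8000"):
    """POST PDFs and their labels to /train and return the decoded response."""
    import requests  # third-party; imported lazily so the helper above stays dependency-free

    if len(pdf_paths) != len(labels):
        raise ValueError("each PDF needs exactly one label")
    # Assumption: one multipart part per file, all under the field name "pdfs".
    files = [
        ("pdfs", (Path(path).name, Path(path).read_bytes(), "application/pdf"))
        for path in pdf_paths
    ]
    response = requests.post(
        f"{base_url}/train",
        files=files,
        data={"labels": build_labels_field(labels)},  # assumed to be a form field
        timeout=60,
    )
    response.raise_for_status()
    return response.json()  # assumes a JSON body
```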

2. Predict Document Quality

Endpoint: /predict
Method: POST
Description: Predict the quality of a PDF document using RAG-enhanced classification
Input: PDF file
Output: Predicted quality label
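
To make "RAG-enhanced classification" concrete, the sketch below shows one way retrieved labeled examples could be formatted into a prompt for an LLM, and how a free-text reply could be mapped back to a label. The prompt wording, the function names, and the fallback to `neutral` are assumptions for illustration, not the prompt used by this service.

```python
VALID_LABELS = ("bad", "neutral", "good")


def build_rag_prompt(chunk, examples, max_example_chars=300):
    """Format retrieved (text, label) examples and the target chunk into a prompt."""
    lines = [
        "Classify the writing quality of the final passage as bad, neutral, or good.",
        "Labeled reference passages:",
    ]
    for index, (text, label) in enumerate(examples, start=1):
        snippet = text[:max_example_chars].strip()  # keep the prompt short
        lines.append(f"{index}. [{label}] {snippet}")
    lines.append("Passage to classify:")
    lines.append(chunk.strip())
    lines.append("Answer with exactly one word: bad, neutral, or good.")
    return "\n".join(lines)


def parse_label(reply, default="neutral"):
    """Return the first valid label found in the reply, else the default."""
    for word in reply.lower().replace(".", " ").split():
        if word in VALID_LABELS:
            return word
    return default
```

Truncating each example keeps the context small, at the cost of hiding structure that only shows up in longer passages.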

3. Health Check

Endpoint: /
Method: GET
Description: Check if the service is running
Output: Service status
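
A script can use this endpoint to wait for the server before sending work. The sketch below uses only the standard library; the function name and the rule that any 2xx status counts as healthy are assumptions, since the response body is not specified here.

```python
import urllib.request


def is_service_up(base_url="http://localhost:8000", timeout=2.0):
    """Return True if GET / answers with a 2xx status, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/", timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError, refused connections, and timeouts
        return False
```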

Testing

The application includes a comprehensive test suite that covers:

- Text extraction from PDFs
- Text splitting functionality
- Vector store operations
- RAG-based classification
- API endpoints

Run the tests with:

python -m pytest tests/ -v

Vector Store Management

The application uses FAISS for efficient vector storage and retrieval. The vector store is automatically initialized when the application starts and persists between sessions. It stores:

- Document chunks and their embeddings
- Metadata about document quality
- Source information for RAG context
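
The relationship between embeddings and their metadata can be sketched with a brute-force, in-memory stand-in. This is deliberately not FAISS: it compares the query against every stored vector by cosine similarity, which is only practical for small collections. The class name and metadata keys are hypothetical.

```python
import math


def _unit(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("zero vectors cannot be normalized")
    return [x / norm for x in vector]


class TinyVectorStore:
    """Brute-force stand-in for a vector index that keeps metadata per chunk."""

    def __init__(self):
        self._vectors = []
        self._metadata = []

    def add(self, vector, text, label, source):
        self._vectors.append(_unit(vector))
        self._metadata.append({"text": text, "label": label, "source": source})

    def search(self, query, k=3):
        """Return up to k (similarity, metadata) pairs, most similar first."""
        unit_query = _unit(query)
        scored = [
            (sum(a * b for a, b in zip(unit_query, vector)), meta)
            for vector, meta in zip(self._vectors, self._metadata)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]
```

The retrieved metadata is what gives RAG its context: each hit carries the example text, its quality label, and the document it came from.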

Rate Limiting

The application includes rate limiting for API calls to prevent overload:

- Token bucket algorithm implementation
- Configurable rate limits
- Automatic token refill
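
For readers unfamiliar with the token bucket algorithm, a minimal sketch follows. It is not the limiter from `src/app.py`; the class name, parameters, and injectable clock are assumptions, and the sketch is not thread-safe.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock  # injectable so tests can control time
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def _refill(self):
        """Add tokens for the time elapsed since the last call, capped at capacity."""
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self, tokens=1):
        """Consume tokens if available and return True; otherwise return False."""
        self._refill()
        if tokens <= self._tokens:
            self._tokens -= tokens
            return True
        return False
```

A caller would check `try_acquire()` before each outbound API call and wait or reject the request when it returns `False`.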

API Documentation

Once the server is running, you can access the interactive API documentation at:

- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc

Generating Example PDFs

To generate example PDFs for training or testing:

Training PDFs:
```
python src/create_example_pdfs.py
```
Test PDFs:
```
python src/create_test_pdfs.py
```
