This application provides a FastAPI service for classifying PDF documents by content quality. It uses sentence embeddings, machine learning, and Retrieval-Augmented Generation (RAG) to assign one of three ordinal labels (bad, neutral, good) to each document according to its writing quality, structure, and content.
.
├── README.md
├── data/ # Data directory
│ ├── test/ # Test data
│ └── training/ # Training data
├── src/ # Source code
│ ├── app.py # Main application
│ └── create_example_pdfs.py
├── tests/ # Test scripts
│ ├── conftest.py # Test configuration and fixtures
│ ├── test_app.py # API and functionality tests
│ └── test_model.py
├── requirements.txt
└── setup.sh # Setup and run script
The classifier uses a three-stage approach:
- Text Processing:
  - Extracts text from PDF documents
  - Splits text into manageable chunks
  - Generates embeddings using SentenceTransformer
- Vector Store and RAG:
  - Stores document chunks in a FAISS vector store
  - Uses RAG to retrieve relevant examples for classification
  - Maintains metadata about document quality and sources
- Classification:
  - Uses a Logistic Regression model trained on labeled examples
  - Optionally uses GPT-4 with RAG context for more nuanced classification
  - Aggregates predictions across chunks to determine final document quality
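The chunking and aggregation steps above can be sketched in a few lines. This is a minimal illustration, not the code in `src/app.py`: the function names, the character-window chunking, the chunk sizes, and the majority-vote rule with ties going to the lower label are all assumptions, since the README does not say how chunks are cut or how per-chunk predictions are combined.

```python
from collections import Counter

# Ordinal order from lowest to highest quality (labels taken from this README).
LABELS = ("bad", "neutral", "good")


def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (sizes are assumed values)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    windows = (text[i:i + chunk_size] for i in range(0, len(text), step))
    # Drop windows that contain only whitespace.
    return [w for w in windows if w.strip()]


def aggregate(chunk_labels: list[str]) -> str:
    """Majority vote over per-chunk labels; ties go to the lower label (assumed rule)."""
    if not chunk_labels:
        raise ValueError("no chunk predictions to aggregate")
    unknown = set(chunk_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(chunk_labels)
    best = max(counts.values())
    # LABELS is ordered, so the first match is the lowest label among the tied ones.
    return next(label for label in LABELS if counts.get(label, 0) == best)
```

Breaking ties toward the lower label is a conservative choice for a quality gate; a mean over ordinal ranks would be an equally plausible alternative.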
The easiest way to get started is to use the provided setup script:
# Make the script executable
chmod +x setup.sh
# Run the setup script
./setup.sh
This script will:
- Check for Python 3 installation
- Create a virtual environment if it doesn't exist
- Install all required dependencies
- Check for OpenAI API key
- Initialize the vector store
- Start the FastAPI server
If you prefer to set up manually:
- Create a Python virtual environment:
  python3 -m venv .venv
  source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
  pip install -r requirements.txt
- Set up your OpenAI API key (if using LLM classification):
  export OPENAI_API_KEY=your_api_key_here
- Initialize the vector store:
  python -c "from src.app import initialize_vector_store; initialize_vector_store()"
To run the application with the setup script, simply run:
./setup.sh
Alternatively, start the FastAPI server manually:
uvicorn src.app:app --reload
The server will start at http://localhost:8000.
- Endpoint: /train
- Method: POST
- Description: Train the classifier with labeled PDF documents
- Input:
  - pdfs: List of PDF files
  - labels: Comma-separated list of labels ('bad', 'neutral', 'good')
- Output: Status message
- Endpoint: /predict
- Method: POST
- Description: Predict the quality of a PDF document using RAG-enhanced classification
- Input: PDF file
- Output: Predicted quality label
- Endpoint: /
- Method: GET
- Description: Check if the service is running
- Output: Service status
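As an illustration of calling the /train endpoint described above, here is a hedged client-side sketch. The form field names `pdfs` and `labels` come from this README. The rest is assumed: the helper names `build_labels_field` and `train` are hypothetical, the third-party `requests` package may or may not be in `requirements.txt`, and one label per PDF in upload order is a guess at how the server pairs them. The response is returned as raw text because the README does not specify the status message format.

```python
from pathlib import Path

# Labels accepted by the service, per this README.
ALLOWED_LABELS = {"bad", "neutral", "good"}


def build_labels_field(labels: list[str], n_pdfs: int) -> str:
    """Validate labels and join them into the comma-separated form value."""
    cleaned = [label.strip().lower() for label in labels]
    if len(cleaned) != n_pdfs:
        # Assumption: the server expects exactly one label per uploaded PDF.
        raise ValueError(f"expected {n_pdfs} labels, got {len(cleaned)}")
    unknown = sorted(set(cleaned) - ALLOWED_LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {unknown}")
    return ",".join(cleaned)


def train(pdf_paths: list[str], labels: list[str],
          base_url: str = "http://localhost:8000") -> str:
    """POST the PDFs and labels to /train and return the raw response body."""
    import requests  # third-party; imported lazily so the helper above has no dependencies

    labels_field = build_labels_field(labels, len(pdf_paths))
    handles = [Path(p).open("rb") for p in pdf_paths]
    try:
        # Repeating the "pdfs" field name sends the files as a list.
        files = [("pdfs", (Path(p).name, fh, "application/pdf"))
                 for p, fh in zip(pdf_paths, handles)]
        response = requests.post(f"{base_url}/train", files=files,
                                 data={"labels": labels_field}, timeout=60)
        response.raise_for_status()
        return response.text
    finally:
        for fh in handles:
            fh.close()
```

With the server running, a call such as `train(["a.pdf", "b.pdf"], ["good", "bad"])` would upload both files; the file names here are placeholders.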
The application includes a comprehensive test suite that covers:
- Text extraction from PDFs
- Text splitting functionality
- Vector store operations
- RAG-based classification
- API endpoints
Run the tests with:
python -m pytest tests/ -v
The application uses FAISS for efficient vector storage and retrieval. The vector store is automatically initialized when the application starts and persists between sessions. It stores:
- Document chunks and their embeddings
- Metadata about document quality
- Source information for RAG context
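To make the stored items and the retrieval step concrete, the sketch below keeps each chunk together with its embedding, quality label, and source, and returns the nearest stored chunks for a query embedding. Note that a brute-force cosine-similarity search in plain Python stands in for the FAISS index here, and the `StoredChunk` type, its field names, and the `retrieve` function are hypothetical, not taken from `src/app.py`.

```python
import math
from dataclasses import dataclass


@dataclass
class StoredChunk:
    text: str
    embedding: list[float]
    quality: str  # "bad", "neutral", or "good"
    source: str   # e.g. the originating PDF file name


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(store: list[StoredChunk], query: list[float], k: int = 3) -> list[StoredChunk]:
    """Return the k stored chunks most similar to the query embedding."""
    ranked = sorted(store, key=lambda c: cosine(c.embedding, query), reverse=True)
    return ranked[:k]
```

The retrieved chunks carry their quality labels and sources, which is the kind of context a RAG prompt could include as labeled examples.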
The application includes rate limiting for API calls to prevent overload:
- Token bucket algorithm implementation
- Configurable rate limits
- Automatic token refill
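For readers unfamiliar with the token bucket algorithm, a minimal sketch follows. It is an illustration under assumed names and parameters (`TokenBucket`, `rate`, `capacity`, `try_acquire`), not the implementation in this repository. The bucket holds up to `capacity` tokens, refills at `rate` tokens per second based on elapsed time, and a call is allowed only if enough tokens are available.

```python
import time


class TokenBucket:
    """Token bucket rate limiter with lazy, time-based refill."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self._clock = clock       # injectable clock, useful for testing
        self._tokens = capacity   # start with a full bucket
        self._last = clock()

    def _refill(self) -> None:
        """Add tokens for the time elapsed since the last refill, capped at capacity."""
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self, tokens: float = 1.0) -> bool:
        """Consume tokens if available; return False instead of blocking."""
        self._refill()
        if tokens <= self._tokens:
            self._tokens -= tokens
            return True
        return False
```

A caller would typically check `bucket.try_acquire()` before each outbound API call and wait or reject the request when it returns `False`. This sketch is not thread-safe; concurrent use would need a lock around `try_acquire`.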
Once the server is running, you can access the interactive API documentation at:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
To generate example PDFs for training or testing:
- Training PDFs:
  python src/create_example_pdfs.py
- Test PDFs:
  python src/create_test_pdfs.py