A RAG-based chatbot system for Australian Taxation Office (ATO) information retrieval and assistance, powered by data sourced from ato.gov.au.
Live demo: https://ato-chat.streamlit.app/
Figure 1: Streamlit Chat Interface with example conversation
This project implements a Retrieval-Augmented Generation (RAG) chatbot system specifically designed for ATO-related queries. It consists of two main components:
- Data Pipeline & Model Training: A modular pipeline built with ZenML for data processing and index creation
- Interactive Interface: A Streamlit-based chat interface for user interactions
- Data Collection & Processing
- Machine Learning & AI
  - OpenAI - Large Language Model API
  - LlamaIndex - RAG framework and indexing
- Backend & Infrastructure
- Frontend
  - Streamlit - Interactive web interface
  - Streamlit-Chat - Chat UI components
The data pipeline is built using ZenML and consists of several key steps:
- Data Collection: Uses Firecrawl to extract content from ATO pages
- Data Cleaning: Processes and filters the collected data
- Index Creation: Creates embeddings and stores them in Qdrant
Figure 2: ZenML Pipeline Workflow showing data processing steps
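The three steps above can be sketched in plain Python. This is only an illustration of the collect, clean, index flow, not the code in the repository: the function names (including `simple_index_pipeline`, borrowed from the module name), the sample pages, and the bag-of-words "embedding" are all assumptions. In the real pipeline each function would presumably be a ZenML step, the pages would come from Firecrawl, and the vectors would be written to Qdrant rather than a list.

```python
import re
from collections import Counter


def collect_pages() -> list:
    # Stand-in for the Firecrawl crawl; URLs and text are made-up placeholders.
    return [
        {"url": "https://example.com/a", "text": "  Sample page about   lodging a tax return.  "},
        {"url": "https://example.com/b", "text": "   "},
        {"url": "https://example.com/c", "text": "Sample page about work-related deductions."},
    ]


def clean_pages(pages: list) -> list:
    # Normalise whitespace and drop pages that contain no usable text.
    cleaned = []
    for page in pages:
        text = re.sub(r"\s+", " ", page["text"]).strip()
        if text:
            cleaned.append({"url": page["url"], "text": text})
    return cleaned


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real pipeline would call an embedding model.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(pages: list) -> list:
    # Stand-in for the vector store: keep each vector next to its source text.
    return [{"url": p["url"], "text": p["text"], "vector": embed(p["text"])} for p in pages]


def simple_index_pipeline() -> list:
    # Chain the steps in the same order as the list above.
    return build_index(clean_pages(collect_pages()))


index = simple_index_pipeline()
print(len(index), "documents indexed")
```

Keeping each stage as a small function with plain inputs and outputs mirrors how a step-based orchestrator passes artifacts between steps, and makes each stage testable on its own.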
Key pipeline components are defined in src/ato_chatbot/pipelines/simple_index_pipeline.py.
The chat interface is built with Streamlit and implements a 3-step RAG process:
- Query Rephrasing: Improves query understanding
- Knowledge Retrieval: Fetches relevant information from Qdrant
- Response Generation: Uses OpenAI to generate contextual responses
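A minimal, offline sketch of the rephrase, retrieve, generate loop described above. Everything here is hypothetical: the in-memory `DOCS` list stands in for the Qdrant collection, word overlap stands in for vector similarity, and `echo_llm` stands in for the OpenAI call so the example runs without an API key. The actual interface may structure these steps differently.

```python
import re
from collections import Counter
from typing import Callable, List

# Hypothetical corpus standing in for the vector store contents.
DOCS = [
    "Sample passage about lodging a tax return.",
    "Sample passage about claiming work-related deductions.",
    "Sample passage about superannuation contributions.",
]


def tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def rephrase(question: str, history: List[str]) -> str:
    # Step 1: fold recent turns into a standalone query (an LLM would do this in practice).
    return " ".join(history[-2:] + [question])


def retrieve(query: str, top_k: int = 2) -> List[str]:
    # Step 2: rank passages by word overlap; a vector search would replace this.
    q = tokens(query)
    ranked = sorted(DOCS, key=lambda d: sum((tokens(d) & q).values()), reverse=True)
    return ranked[:top_k]


def generate(question: str, context: List[str], llm: Callable[[str], str]) -> str:
    # Step 3: build a grounded prompt and hand it to the model.
    prompt = "Answer using only this context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
    return llm(prompt)


def answer(question: str, history: List[str], llm: Callable[[str], str]) -> str:
    query = rephrase(question, history)
    return generate(question, retrieve(query), llm)


def echo_llm(prompt: str) -> str:
    # Placeholder model: returns the top-ranked context line.
    return prompt.splitlines()[1]


reply = answer("What can I claim?", ["Tell me about work-related deductions"], echo_llm)
print(reply)
```

Passing the model in as a callable keeps the three steps independent of any particular provider, which also makes the retrieval logic easy to test without network access.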
Key interface components are defined in src/ato_chatbot/chat_interface.py.
- Python 3.12+
- Docker and Docker Compose
- OpenAI API key
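A quick preflight check for the prerequisites above might look like the following. It is a sketch, not part of the project: `check_prerequisites` is a hypothetical helper, and it assumes the key is supplied through the `OPENAI_API_KEY` environment variable (the name the OpenAI Python SDK reads by default). It only checks that a `docker` binary is on the PATH and does not verify Docker Compose.

```python
import os
import shutil
import sys
from typing import List, Mapping, Optional


def check_prerequisites(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 12):
        problems.append("Python 3.12 or newer is required")
    if shutil.which("docker") is None:
        problems.append("docker was not found on PATH")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    return problems


for problem in check_prerequisites():
    print("missing:", problem)
```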
- Clone the repository
- Install dependencies:
uv sync
- Start required services:
make up
- Build the index (runs the ZenML pipeline):
make zen_run_simple_index
- Start the chat interface:
make streamlit
Key dependencies are listed in pyproject.toml.
Apache License 2.0