This project allows you to:
- Index a folder of source code files (e.g. Python)
- Extract components (functions/classes)
- Create vector embeddings using UniXcoder
- Save and upload FAISS indices to Google Drive
- Search and ask questions about the code using LLMs like CodeLLaMA
- ✅ Extracts classes and functions from source code using asttokens
- ✅ Embeds code and components using microsoft/unixcoder-base
- ✅ Stores FAISS indices for code files and components
- ✅ Uploads index files to Google Drive for persistence
- ✅ Retrieves relevant files and components using similarity search
- ✅ Uses CodeLLaMA (or similar LLM) to answer questions about the code
File Loading & Parsing:
- The system recursively loads code files from a specified folder.
- It uses Python's ast module and the asttokens library to extract functions and classes from each file.
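
The loading and parsing steps can be sketched roughly as follows. This is a minimal illustration that relies only on the standard-library ast module (the project also uses asttokens); the helper names `load_code_files` and `extract_components` are hypothetical and not taken from the project's code.

```python
import ast
from pathlib import Path


def load_code_files(folder, pattern="*.py"):
    """Recursively read every file matching `pattern` under `folder`."""
    return {
        str(path): path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).rglob(pattern))
    }


def extract_components(source):
    """Return the functions and classes defined in a Python source string."""
    tree = ast.parse(source)
    components = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            components.append({
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "lineno": node.lineno,
                # Exact source text of the definition (Python 3.8+).
                "code": ast.get_source_segment(source, node),
            })
    # ast.walk is breadth-first, so sort to get file order.
    return sorted(components, key=lambda c: c["lineno"])


SAMPLE = '''class Greeter:
    def hello(self):
        return "hi"

def add(a, b):
    return a + b
'''

if __name__ == "__main__":
    for component in extract_components(SAMPLE):
        print(component["kind"], component["name"], component["lineno"])
```

Methods are reported as functions alongside their enclosing class, so a class and its methods each become separate components in this sketch.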
Embedding Generation:
- Each file and each extracted component is passed through microsoft/unixcoder-base, which converts it into a vector representation (embedding).
- These embeddings are stored in FAISS indices for fast similarity search.
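
One way the embedding step might look is sketched below. It assumes the model is loaded through Hugging Face transformers, that token vectors are mean-pooled into one vector per input, and that a flat inner-product FAISS index is used; the pooling strategy, index type, and function names are assumptions made for illustration, not details confirmed by the project.

```python
MODEL_NAME = "microsoft/unixcoder-base"


def embed_texts(texts, max_length=512):
    """Return one L2-normalised float32 vector per input string."""
    # Heavy dependencies are imported lazily so the sketch can be imported
    # without them; the model is reloaded on every call for brevity.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    model.eval()

    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state         # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # zero out padding
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling (assumed)
    pooled = torch.nn.functional.normalize(pooled, dim=1)
    return pooled.numpy().astype("float32")


def build_and_save_index(vectors, path):
    """Store vectors in a flat inner-product FAISS index and write it to disk."""
    import faiss

    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)
    faiss.write_index(index, path)
    return index
```

Because the vectors are normalised, the inner product computed by the index equals cosine similarity.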
Google Drive Upload:
- FAISS indices and metadata (e.g., code content, mapping of functions) are uploaded to a designated Google Drive folder.
- This allows persistence and reuse of the index without recomputing.
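
A minimal sketch of the upload step using PyDrive2 is shown below. The function names, the folder ID argument, and the choice of the local-webserver OAuth flow are assumptions for illustration; an environment such as Google Colab may need a different authentication flow.

```python
import os


def authenticate():
    """Run PyDrive2's local-webserver OAuth flow and return a Drive client."""
    # Imported lazily so the upload helper below can be used with any
    # object that offers the same CreateFile interface.
    from pydrive2.auth import GoogleAuth
    from pydrive2.drive import GoogleDrive

    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()
    return GoogleDrive(gauth)


def upload_files(drive, folder_id, paths):
    """Upload each local file into the Drive folder identified by folder_id."""
    uploaded = []
    for path in paths:
        remote = drive.CreateFile({
            "title": os.path.basename(path),
            "parents": [{"id": folder_id}],
        })
        remote.SetContentFile(path)  # attach the local file's content
        remote.Upload()
        uploaded.append(remote["title"])
    return uploaded
```

Passing the Drive client in as an argument keeps the upload logic separate from authentication, which makes it easier to swap the login flow.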
Search & Retrieval:
- When a user asks a question, the query is embedded using UniXcoder.
- The system retrieves the top-k relevant code files and components by comparing the query embedding against stored embeddings.
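
To illustrate what the top-k comparison computes, here is a brute-force sketch in plain Python. The project uses FAISS for this step; the code below is only a small stand-in that ranks stored vectors by cosine similarity, and the names `top_k` and `normalize` are hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def top_k(query, stored, k=3):
    """Return the k (score, label) pairs most similar to the query vector."""
    q = normalize(query)
    scored = []
    for label, vec in stored.items():
        v = normalize(vec)
        # Dot product of unit vectors is their cosine similarity.
        score = sum(a * b for a, b in zip(q, v))
        scored.append((score, label))
    return heapq.nlargest(k, scored)


if __name__ == "__main__":
    stored = {"a.py": [1.0, 0.0], "b.py": [0.0, 1.0], "c.py": [1.0, 1.0]}
    for score, label in top_k([1.0, 0.1], stored, k=2):
        print(f"{label}: {score:.3f}")
```

In practice the query vector would come from the same embedding model as the stored vectors, so that the similarity scores are comparable.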
LLM-Based Answering:
- The most relevant code/context is passed along with the query to a language model (like CodeLLaMA).
- The model generates a human-readable explanation or answer based on the code context.
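
The answering step might be sketched as below. The prompt template, the character budget, the function names, and the specific checkpoint `codellama/CodeLlama-7b-Instruct-hf` are all assumptions chosen for illustration; the project's actual prompt and model loading may differ.

```python
def build_prompt(question, snippets, max_chars=6000):
    """Combine retrieved (name, code) snippets and the question into one prompt."""
    parts = []
    used = 0
    for name, code in snippets:
        block = f"# File: {name}\n{code.strip()}\n"
        if used + len(block) > max_chars:
            break  # stop before the context exceeds the budget
        parts.append(block)
        used += len(block)
    context = "\n".join(parts)
    return (
        "You are a helpful assistant that explains source code.\n\n"
        f"Code context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(question, snippets, model_name="codellama/CodeLlama-7b-Instruct-hf"):
    """Generate an answer with a Hugging Face text-generation pipeline."""
    # Imported lazily: loading a 7B model needs substantial memory.
    from transformers import pipeline

    generator = pipeline("text-generation", model=model_name)
    result = generator(build_prompt(question, snippets),
                       max_new_tokens=256, return_full_text=False)
    return result[0]["generated_text"].strip()
```

Snippets are added in the order given, so passing them sorted by relevance keeps the best matches inside the budget.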
Install all required packages:
    pip install torch transformers numpy faiss-cpu asttokens pathlib accelerate pydrive2 PyGithub sentence-transformers
Or use the requirements.txt:
    pip install -r requirements.txt
Put all your code files (e.g. .py) in a folder.
    # In main.py
    python main.py
This will:
- Authenticate with Google Drive
- Index the code folder
- Upload the FAISS index to Drive
- Start an interactive search console
Notes:
- Run this in an environment that supports Google Drive authentication, such as Google Colab.
- The current LLM is CodeLLaMA-7B; you can switch to another Hugging Face model.