RohanDisa/CodeRAG

CodeRAG - Code Indexer and Search System

This project allows you to:

  • Index a folder of source code files (e.g. Python)
  • Extract components (functions/classes)
  • Create vector embeddings using UniXcoder
  • Save and upload FAISS indices to Google Drive
  • Search and ask questions about the code using LLMs like CodeLLaMA

Features

  • ✅ Extracts classes and functions from source code using asttokens
  • ✅ Embeds code and components using microsoft/unixcoder-base
  • ✅ Stores FAISS indices for code files and components
  • ✅ Uploads index files to Google Drive for persistence
  • ✅ Retrieves relevant files and components using similarity search
  • ✅ Uses CodeLLaMA (or similar LLM) to answer questions about the code
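The extraction feature listed above can be illustrated with a minimal sketch. It uses only the standard-library ast module (Python 3.9+), not asttokens, so it is an approximation of the idea rather than this project's actual code. The function name `extract_components` and the record fields are hypothetical.

```python
import ast


def extract_components(source: str) -> list[dict]:
    """Return one record per function or class definition found in `source`.

    Hypothetical helper: the record layout is an assumption for illustration.
    """
    tree = ast.parse(source)
    components = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            components.append({
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Recover the exact source text of this definition.
                "code": ast.get_source_segment(source, node),
            })
    return components


sample = '''
class Greeter:
    def greet(self, name):
        return f"Hello, {name}"

def main():
    print(Greeter().greet("world"))
'''

for component in extract_components(sample):
    print(component["kind"], component["name"],
          component["start_line"], component["end_line"])
```

Methods nested inside classes are reported as separate components here, which may or may not match how the project groups them.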

🧬 How It Works (Overview)

  1. File Loading & Parsing:

    • The system recursively loads code files from a specified folder.
    • It uses Python's standard-library ast module together with the third-party asttokens package to extract functions and classes from each file.
  2. Embedding Generation:

    • Each file and extracted component is passed through microsoft/unixcoder-base, which converts them into vector representations (embeddings).
    • These embeddings are saved using FAISS for fast similarity-based search.
  3. Google Drive Upload:

    • FAISS indices and metadata (e.g., code content, mapping of functions) are uploaded to a designated Google Drive folder.
    • This allows persistence and reuse of the index without recomputing.
  4. Search & Retrieval:

    • When a user asks a question, the query is embedded using UniXcoder.
    • The system retrieves the top-k relevant code files and components by comparing the query embedding against stored embeddings.
  5. LLM-Based Answering:

    • The most relevant code/context is passed along with the query to a language model (like CodeLLaMA).
    • The model generates a human-readable explanation or answer based on the code context.
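Steps 4 and 5 above can be sketched in a few lines. This is a toy illustration under stated assumptions: a brute-force cosine-similarity ranking in pure Python stands in for the FAISS lookup, hand-written 3-dimensional vectors stand in for UniXcoder embeddings, and the prompt template is invented for the example. The source does not say which similarity metric or prompt format the project uses, and the names `top_k` and `build_prompt` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, stored, k=2):
    """Rank (name, code, vector) entries by similarity to the query; keep k."""
    ranked = sorted(stored, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return ranked[:k]


def build_prompt(question, hits):
    """Join the retrieved snippets and the question into one prompt string.

    The template is an assumption, not the project's actual prompt.
    """
    context = "\n\n".join(f"# {name}\n{code}" for name, code, _ in hits)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy vectors standing in for real embeddings of indexed components.
stored = [
    ("load_files", "def load_files(folder): ...", [0.9, 0.1, 0.0]),
    ("embed", "def embed(text): ...", [0.1, 0.9, 0.1]),
    ("upload", "def upload(path): ...", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.3, 0.0]  # would come from embedding the user's question

hits = top_k(query_vec, stored, k=2)
prompt = build_prompt("How are files loaded?", hits)
print(prompt)
```

In the real pipeline the resulting prompt string would be passed to the language model; the ranking logic is the part a FAISS index replaces with a faster search.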

🛠️ Installation

Install all required packages:

pip install torch transformers numpy faiss-cpu asttokens accelerate pydrive2 PyGithub sentence-transformers

Or use the requirements.txt:

pip install -r requirements.txt


🚀 How to Run

Step 1: Prepare Code Folder

Put all your code files (e.g. .py) in a folder.
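For reference, recursive loading of such a folder might look like the sketch below. The function name `load_code_files` and its `suffixes` parameter are hypothetical and are not taken from this repository.

```python
from pathlib import Path


def load_code_files(folder, suffixes=(".py",)):
    """Map each matching file's path (relative to `folder`) to its text.

    Hypothetical helper for illustration only.
    """
    root = Path(folder)
    files = {}
    # rglob("*") walks the folder and all of its subfolders.
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            relative = path.relative_to(root).as_posix()
            # errors="replace" keeps indexing going if a file is not valid UTF-8.
            files[relative] = path.read_text(encoding="utf-8", errors="replace")
    return files
```

Calling `load_code_files("my_project")` would return a dictionary such as `{"pkg/module.py": "...source text..."}`, where `my_project` is a placeholder folder name.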

Step 2: Run the Main Script

# Run from the directory that contains main.py
python main.py

This will:

  1. Authenticate with Google Drive
  2. Index the code folder
  3. Upload the FAISS index to Drive
  4. Start an interactive search console

📌 Notes

  • Run this in an environment where Google Drive authentication works, such as Google Colab
  • The current LLM used is CodeLLaMA-7B; feel free to switch to another Hugging Face model

About

AI-powered code indexer and search system with embeddings, FAISS, and LLM-based Q&A.
