@LRriver LRriver commented Sep 30, 2025

LLM-based Gremlin QA Synthesis and Generalization in Vertical Scenarios.

πŸ—οΈ Project Structure

Vertical_Text2Gremlin/
├── README.md
├── __pycache__/
├── data/
├── db_data/
├── graph2gremlin.py
├── gremlin_checker.py
├── gremlin_qa_dataset.csv
├── instruct_convert.py
├── llm_handler.py
└── qa_generalize.py
  • ./graph2gremlin.py: Generates the initial Gremlin data from templates and graph data (the templates guarantee correctness), then translates the Gremlin data and questions and applies a preliminary generalization to them.
  • ./gremlin_checker.py: Performs syntax checking using ANTLR4.
  • ./llm_handler.py: LLM interaction module. It feeds QA data to the LLM one batch of seed data at a time (queries already undergo a small-batch generalization during seed data generation) so that the LLM can learn how text2gremlin pairs are written. The LLM first generalizes the Gremlin, then translates and generalizes the query.
  • ./qa_generalize.py: Calls gremlin_checker and llm_handler for seed data generalization.
  • ./instruct_convert.py: Handles instruction format conversion and the division of training and test sets.
  • ./db_data: Contains schema and graph data.
  • ./data/seed_data: Seed data (to be uploaded).
  • ./data/vertical_training_sets: Vertical scenario generalization data (to be uploaded).
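
The conversion and split handled by instruct_convert.py might look roughly like the sketch below. The Alpaca-style field names (instruction, input, output), the prompt text, the 90/10 ratio, and both function names are assumptions made for illustration; the PR description does not specify them.

```python
import random

# Hypothetical prompt; the instruction text actually used by instruct_convert.py is not shown in the PR.
INSTRUCTION = "Translate the question into a Gremlin query."

def to_instruction_records(qa_pairs):
    """Convert (question, gremlin) pairs into instruction-style dicts."""
    return [
        {"instruction": INSTRUCTION, "input": question, "output": gremlin}
        for question, gremlin in qa_pairs
    ]

def split_train_test(records, test_ratio=0.1, seed=42):
    """Shuffle deterministically, then hold out the last `test_ratio` share as the test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[:-n_test], shuffled[-n_test:]

# Toy QA pairs standing in for the seed data.
qa_pairs = [
    (f"Who does person {i} know?", f"g.V().has('person', 'id', {i}).out('knows')")
    for i in range(10)
]
records = to_instruction_records(qa_pairs)
train_set, test_set = split_train_test(records)
print(len(train_set), len(test_set))
```

Fixing the shuffle seed keeps the split reproducible across runs, which matters when the same test set is reused to compare fine-tuned models.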

Gremlin Corpus Generation System Based on Recursive Backtracking in General Scenarios.

📋 Project Overview
This PR adds a complete Text-to-Gremlin corpus generation system built on recipe-guided generation with recursive backtracking. It automatically generates large-scale, diverse training data from Gremlin query templates.

πŸ—οΈ Project Structure

AST_Text2Gremlin/                   # Project root directory
β”œβ”€β”€ base/                           # Core system directory
β”‚   β”œβ”€β”€ generator.py                # Main generator entry point
β”‚   β”œβ”€β”€ GremlinTransVisitor.py      # ANTLR syntax tree visitor
β”‚   β”œβ”€β”€ TraversalGenerator.py       # Recursive backtracking generator
β”‚   β”œβ”€β”€ Schema.py                   # Graph database Schema management
β”‚   β”œβ”€β”€ GremlinBase.py              # Base component library
β”‚   β”œβ”€β”€ Config.py                   # Configuration management
β”‚   β”œβ”€β”€ cypher2gremlin_dataset.csv  # 3514 real query dataset
β”‚   └── test/                       # Test suite
β”œβ”€β”€ config.json                     # Global configuration file
β”œβ”€β”€ db_data/                        # Schema and data files
└── README.md                       # Detailed technical documentation

🎯 Core Features

  1. Recipe-Guided Generation

    • Parse Gremlin queries into Recipes using ANTLR
    • Perform intelligent parameter generalization based on Schema
    • Generate large numbers of valid variants through recursive backtracking
  2. Large-Scale Data Processing

    • Support batch loading of query templates from CSV files
    • Process the 3514 real queries in the cypher2gremlin dataset
    • Global deduplication to ensure corpus quality
  3. Complete Error Handling

    • Support complex query types (g.call(), .with(), etc.)
    • Individual failures don't affect overall processing
    • Detailed statistics and error reporting
  4. Intelligent Constraint Mechanism

    • Schema connectivity validation
    • Syntax validity checking
    • Combinatorial explosion control (320k → 7k valid combinations)
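
The interplay of recursive backtracking, schema connectivity validation, and global deduplication listed above can be sketched on a toy schema. Everything in this block (the schema, the fixed hasLabel/out query shape, and the function name) is a simplified assumption and not the code in TraversalGenerator.py.

```python
# Toy schema (assumed): each edge label maps to (source vertex label, target vertex label).
EDGES = {
    "knows": ("person", "person"),
    "created": ("person", "software"),
    "uses": ("software", "library"),
}
VERTEX_LABELS = ["person", "software", "library"]

def generate_variants(num_hops):
    """Enumerate g.V().hasLabel(x).out(e1)...out(en) queries that the schema allows."""
    results = set()  # a set gives global deduplication for free

    def backtrack(current_label, steps):
        if len(steps) == num_hops + 1:  # hasLabel step plus num_hops out steps
            results.add("g.V()" + "".join(steps))
            return
        for edge, (src, dst) in EDGES.items():
            # Connectivity check: prune edges that cannot start at current_label.
            if src != current_label:
                continue
            steps.append(f".out('{edge}')")
            backtrack(dst, steps)
            steps.pop()  # undo the choice before trying the next edge

    for label in VERTEX_LABELS:
        backtrack(label, [f".hasLabel('{label}')"])
    return sorted(results)

variants = generate_variants(num_hops=2)
# Size of the search space if every label/edge combination were accepted blindly.
unconstrained = len(VERTEX_LABELS) * len(EDGES) ** 2
print(f"{len(variants)} valid of {unconstrained} candidate combinations")
for query in variants:
    print(query)
```

Because invalid branches are cut as soon as an edge cannot leave the current vertex label, the generator never materializes most of the candidate space, which is the same idea behind reducing a large combination count to a much smaller valid set.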

📊 System Capabilities

  • Query type support: V/E traversals, graph algorithm calls, complex filtering, etc.
  • Generation scale: A single complex template can generate 6000+ valid variants
  • Processing efficiency: Batch processing of 3514 templates with robust error handling
  • Output quality: JSON format with query-description pairs and detailed metadata

🧪 Technical Features

  • Recursive backtracking algorithm: Systematically explore parameter combination space
  • Recipe abstraction: Structure queries into generalizable Recipes
  • Constraint optimization: More than 97% of candidate combinations are filtered out as invalid
  • Modular design: Core components can be used and tested independently
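
One way to picture the Recipe abstraction is as an ordered list of steps whose arguments can be swapped for other schema elements. The sketch below is hand-written for illustration: the Step and Recipe classes, the slot field, and the render method are hypothetical names, and the actual system builds its Recipes by parsing queries with ANTLR rather than constructing them manually.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    name: str          # Gremlin step name, e.g. "hasLabel" or "out"
    args: tuple = ()   # literal arguments taken from the template
    slot: str = ""     # hypothetical tag naming the schema element that may replace args

@dataclass(frozen=True)
class Recipe:
    steps: tuple

    def render(self, substitutions=None):
        """Render the recipe as a Gremlin string, optionally replacing step arguments by index."""
        substitutions = substitutions or {}
        parts = ["g"]
        for index, step in enumerate(self.steps):
            args = substitutions.get(index, step.args)
            rendered_args = ", ".join(repr(arg) for arg in args)
            parts.append(f".{step.name}({rendered_args})")
        return "".join(parts)

# Recipe for the template g.V().hasLabel('person').out('knows').
recipe = Recipe(steps=(
    Step("V"),
    Step("hasLabel", ("person",), slot="vertex_label"),
    Step("out", ("knows",), slot="edge_label"),
))
original = recipe.render()
variant = recipe.render({1: ("software",), 2: ("uses",)})
print(original)
print(variant)
```

Keeping the step structure separate from the concrete arguments is what lets a generator substitute schema-compatible values without re-parsing the query each time.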

📈 Application Value

  • Text-to-Gremlin training: Provide large-scale training data for NLP models
  • Query diversity: Generate rich query variants from limited templates
  • Data quality: Ensure syntactic correctness and semantic reasonableness of generated queries
  • Extensibility: Support extension of new schemas and query types

🔧 Usage

# Basic usage
from generator import generate_corpus_from_templates

templates = ["g.V().hasLabel('person')", "g.V().out('knows')"]
result = generate_corpus_from_templates(templates)
print(f"Generated {result['total_unique_queries']} unique queries")

📋 Documentation

  • README.md: Quick start guide

@dosubot dosubot bot added the size:XXL and enhancement labels Sep 30, 2025
@LRriver LRriver changed the title from "Gremlin Corpus Generation System Based on Recursive Backtracking" to "Text2Gremlin Data Generation and Model Fine-Tuning System (Vertical Scenarios and General Scenarios)" Sep 30, 2025