Text2Gremlin Data Generation and Model Fine-Tuning System (Vertical Scenarios and General Scenarios) #303

LRriver · 2025-09-30T14:18:17Z

LLM-based Gremlin QA Synthesis and Generalization in Vertical Scenarios.

🏗️ Project Structure

Vertical_Text2Gremlin/
├── README.md
├── __pycache__/
├── data/
├── db_data/
├── graph2gremlin.py
├── gremlin_checker.py
├── gremlin_qa_dataset.csv
├── instruct_convert.py
├── llm_handler.py
└── qa_generalize.py

./graph2gremlin.py: Generates the initial Gremlin data from templates and graph data. The templates guarantee query correctness; the script also translates the generated Gremlin data and questions and applies a preliminary round of generalization.
./gremlin_checker.py: Checks Gremlin syntax using ANTLR4.
./llm_handler.py: LLM interaction module. For each batch of seed data it feeds QA pairs to the LLM (queries already receive a small-batch generalization during seed data generation) so that the LLM learns how text2gremlin pairs are written. It first generalizes the Gremlin, then translates and generalizes the natural-language query.
./qa_generalize.py: Calls gremlin_checker and llm_handler to generalize the seed data.
./instruct_convert.py: Converts the data to instruction format and splits it into training and test sets.
./db_data: Contains schema and graph data.
./data/seed_data: Seed data (to be uploaded).
./data/vertical_training_sets: Vertical scenario generalization data (to be uploaded).
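
The instruction-format conversion and train/test split described for instruct_convert.py could look roughly like the sketch below. This is not the script from the PR: the record keys (`instruction`, `input`, `output`), the prompt text, the 10% test ratio, and both function names are assumptions chosen for illustration.

```python
import random

# Hypothetical prompt text; the wording used in the PR is not shown.
DEFAULT_INSTRUCTION = "Translate the question into a Gremlin query."


def to_instruction_records(qa_pairs, instruction=DEFAULT_INSTRUCTION):
    """Turn (question, gremlin) pairs into instruction-style records."""
    return [
        {"instruction": instruction, "input": question, "output": gremlin}
        for question, gremlin in qa_pairs
    ]


def split_train_test(records, test_ratio=0.1, seed=42):
    """Shuffle deterministically, then cut off a test slice."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


qa_pairs = [
    ("Find all people", "g.V().hasLabel('person')"),
    ("Who does each vertex know?", "g.V().out('knows')"),
    ("How many vertices are there?", "g.V().count()"),
]
records = to_instruction_records(qa_pairs)
train_set, test_set = split_train_test(records)
```

Fixing the shuffle seed keeps the split reproducible between runs, which makes it easier to compare fine-tuned models on the same test set.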

Gremlin Corpus Generation System Based on Recursive Backtracking in General Scenarios.

📋 Project Overview
This PR adds a complete Text-to-Gremlin corpus generation system built on recipe-guided generation with recursive backtracking. It automatically generates large-scale, diverse training data from Gremlin query templates.

🏗️ Project Structure

AST_Text2Gremlin/                   # Project root directory
├── base/                           # Core system directory
│   ├── generator.py                # Main generator entry point
│   ├── GremlinTransVisitor.py      # ANTLR syntax tree visitor
│   ├── TraversalGenerator.py       # Recursive backtracking generator
│   ├── Schema.py                   # Graph database Schema management
│   ├── GremlinBase.py              # Base component library
│   ├── Config.py                   # Configuration management
│   ├── cypher2gremlin_dataset.csv  # Dataset of 3514 real queries
│   └── test/                       # Test suite
├── config.json                     # Global configuration file
├── db_data/                        # Schema and data files
└── README.md                       # Detailed technical documentation

🎯 Core Features

Recipe-Guided Generation
- Parse Gremlin queries into Recipes using ANTLR
- Perform intelligent parameter generalization based on Schema
- Generate large numbers of valid variants through recursive backtracking
Large-Scale Data Processing
- Support batch loading of query templates from CSV files
- Process the 3514 real queries in the cypher2gremlin dataset
- Global deduplication to ensure corpus quality
Complete Error Handling
- Support complex query types (g.call(), .with(), etc.)
- Individual failures don't affect overall processing
- Detailed statistics and error reporting
Intelligent Constraint Mechanism
- Schema connectivity validation
- Syntax validity checking
- Combinatorial explosion control (320k → 7k valid combinations)
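
As an illustration of the global deduplication step listed above, the following minimal sketch merges the variants produced for several templates and keeps each query once. The function name and the whitespace normalization are assumptions, not the logic in generator.py.

```python
def deduplicate_globally(batches):
    """Merge per-template batches, keeping the first occurrence of each query.

    Queries are compared after collapsing whitespace runs, so variants that
    differ only in spacing count as duplicates (an illustrative choice).
    """
    seen = set()
    unique = []
    for batch in batches:
        for query in batch:
            key = " ".join(query.split())
            if key not in seen:
                seen.add(key)
                unique.append(key)
    return unique


batches = [
    ["g.V().hasLabel('person')", "g.V().out('knows')"],
    ["g.V().hasLabel('person') ", "g.V().count()"],
]
unique_queries = deduplicate_globally(batches)
```

Keeping one `seen` set across all batches is what makes the deduplication global rather than per template.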

📊 System Capabilities

Query type support: V/E traversals, graph algorithm calls, complex filtering, etc.
Generation scale: A single complex template can generate 6000+ valid variants
Processing efficiency: Batch processing of 3514 templates with robust error handling
Output quality: JSON format with query-description pairs and detailed metadata

🧪 Technical Features

Recursive backtracking algorithm: Systematically explore parameter combination space
Recipe abstraction: Structure queries into generalizable Recipes
Constraint optimization: 97%+ of candidate combinations are filtered out as invalid
Modular design: Core components can be used and tested independently
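
The interplay of recursive backtracking and schema connectivity pruning can be sketched on a toy example. Everything below is hypothetical: the two-edge schema, the fixed "start label plus N `out()` hops" recipe shape, and the function name are assumptions for illustration and do not reproduce TraversalGenerator.py.

```python
# Toy schema: edge label -> (source vertex label, target vertex label).
SCHEMA_EDGES = {
    "knows": ("person", "person"),
    "created": ("person", "software"),
}
VERTEX_LABELS = ["person", "software"]


def generate_variants(num_hops):
    """Enumerate hasLabel(...).out(...)... queries that respect the schema."""
    results = []

    def backtrack(start_label, current_label, steps):
        if len(steps) == num_hops:
            hops = "".join(f".out('{edge}')" for edge in steps)
            results.append(f"g.V().hasLabel('{start_label}'){hops}")
            return
        for edge, (source, target) in SCHEMA_EDGES.items():
            if source != current_label:
                continue  # prune: this edge cannot leave the current vertex type
            steps.append(edge)                      # choose
            backtrack(start_label, target, steps)   # explore
            steps.pop()                             # undo the choice

    for label in VERTEX_LABELS:
        backtrack(label, label, [])
    return results


variants = generate_variants(2)
```

Without the connectivity check this toy recipe would yield 2 × 2 × 2 = 8 combinations for two hops; pruning as soon as an edge does not match the current vertex label leaves only the two that the schema allows, which is the same effect, at a much smaller scale, as the combination reduction described above.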

📈 Application Value

Text-to-Gremlin training: Provide large-scale training data for NLP models
Query diversity: Generate rich query variants from limited templates
Data quality: Ensure syntactic correctness and semantic reasonableness of generated queries
Extensibility: Support extension of new schemas and query types

🔧 Usage

# Basic usage
from generator import generate_corpus_from_templates

templates = ["g.V().hasLabel('person')", "g.V().out('knows')"]
result = generate_corpus_from_templates(templates)
print(f"Generated {result['total_unique_queries']} unique queries")

📋 Documentation

README.md: Quick start guide

LRriver added 16 commits September 30, 2025 20:52

feat: add configuration management module with dictionary paths and generation parameters

52f2e01

feat: add Gremlin parsing base classes with Step, Traversal core data structures

fadaaf7

feat: add Gremlin expression processing module with predicates and connectors support

b775d29

feat: add graph database schema management with vertex/edge labels and properties

f0588a1

feat: add Gremlin base component library with synonym replacement and data instances

5f3b039

feat: add ANTLR syntax tree visitor with Gremlin query to Recipe parsing and call/with support

822272f

feat: add recursive backtracking traversal generator for diverse query variants from Recipe

441b32c

feat: add main corpus generator with batch processing, global deduplication and error handling

2de2096

config: add global configuration file with generation parameters and path settings

c92f09a

data: add cypher2gremlin dataset with 3514 real query templates

25ca990

docs: add project README with quick start guide and usage instructions

25a2876

feat: add ANTLR-generated Gremlin grammar package with lexer, parser and visitor classes

541aa20

data: add schema and graph data

eb7eb01

feat: add template directory with schema dictionary and synonym files

f0579e8

test: add gremlin statement generalization generation test module

9c13457

test: add generator unit tests for corpus generation validation

b14ffb3

dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) and enhancement (New feature or request) labels Sep 30, 2025

LRriver added 11 commits September 30, 2025 22:31

Add graph2gremlin.py: Initial template-based Gremlin data generation with correctness guarantee and preliminary question generalization

7cd8427

Add gremlin_checker.py: Syntax checking using Antlr4

4da021c

Add llm_handler.py: LLM interaction model for query generalization and translation

bc10fe2

Add qa_generalize.py: Seed data generalization using gremlin_checker and llm_handler

6ea48d5

Add instruct_convert.py: Instruction format conversion and train/test set division

78f8c2a

Add da_data: Schema and graph data

b7f3f4a

Add data/seed_data: Seed data directory

332b879

Add data/vertical_training_sets: Vertical domain scenario generalized data directory

8a94bad

Add books on Gremlin syntax knowledge to process data.

676d28c

Add a dataset of Gremlin QA pairs synthesized based on LLM.

90f346f

Add README.md

4120356

LRriver changed the title ~~Gremlin Corpus Generation System Based on Recursive Backtracking~~ Text2Gremlin Data Generation and Model Fine-Tuning System (Vertical Scenarios and General Scenarios) Sep 30, 2025

Compatible with OpenAI format

67b523a
