Lang2SQL

We share code and ideas together on an open-source journey toward a better data environment. 🌍💡

A document-learning, read-only SQL analytics agent. Feed it your company's docs → it learns your business context → it keeps a separate set of definitions per team → it answers questions over an incomplete database → it remembers every definition and conversation.

👉 Project big picture (single source of truth): docs/PROJECT.md · Contributor quick guide: docs/ARCHITECTURE.md

This is the v4.1 rebuild (background and design intent: docs/discord_first_redesign_v4_1.md). Where most text-to-SQL projects compete on "generate better SQL," Lang2SQL competes on everything around the query: business-context learning, per-team semantics, robustness to messy databases, and memory. Discord is the Phase 1 interface, not the identity; Slack and Web are adapters on the same core.

The four pillars

Pillar	What it is
① Business-context learning	Documents are the source of truth. Drop in a doc → the agent extracts metric/dimension/rule candidates → you confirm → they land in the semantic layer.
② Two-axis robustness	(2a) DB robustness — works even when columns lack descriptions (auto-enrichment, v1.5). (2b) Semantic robustness — teams hold different definitions of the same term without conflict. This axis is the product/research identity.
③ Hermes memory	Conversations, facts, and preferences persist instead of resetting each session.
④ Multi-interface	Phase 1 Discord today; Slack/Web are future adapters. No platform lock-in.

Extensibility: outlets and appliances

V1 ships the simplest single implementation of each extension point, but the abstraction (port) is already in place, so v1.5/v2 add a new implementation without touching existing code. Like a wall outlet: the V1 socket has one LED bulb plugged in, but because the socket is standard, you later plug in a fan or a smart light without rewiring the wall.

Four ★ extension patterns sit behind core/ports/:

★	Pattern	Port	Grows by
①	Safety pipeline	`ports/safety.py`	adding one layer class to the line (zero `run_sql` changes)
②	Memory service	`ports/memory.py`	swapping any of 3 axes — Store / Recall / Extractor — independently
③	Ingestion pipeline	`ports/ingestion.py`	a Source × Extractor matrix
④	Semantic federation	`ports/semantic_scope.py`	git-like per-team scope branches

Everything outside tenancy/concierge.py depends only on these Protocols, so the concrete classes (OpenAI, Postgres, SQLite) are swappable at the seams.
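
As a rough illustration of the outlet idea for pattern ①, the sketch below shows how a Protocol-based safety pipeline can grow by appending one layer class while the caller stays unchanged. The names `SafetyLayer`, `WhitelistLayer`, `MaxLengthLayer`, `SafetyPipeline`, and `is_allowed` are hypothetical and are not taken from `ports/safety.py`; the real port may be shaped differently. The first-keyword check is deliberately naive (a statement such as `SELECT 1; DROP TABLE x` would pass it), which is the kind of gap the roadmap's AST validation is meant to close.

```python
from dataclasses import dataclass, field
from typing import Protocol


class SafetyLayer(Protocol):
    """Hypothetical port: a layer raises ValueError to reject a query."""

    def check(self, sql: str) -> None: ...


class WhitelistLayer:
    """Allows only statements whose first keyword is on the allow list."""

    def __init__(self, allowed: tuple[str, ...] = ("SELECT",)) -> None:
        self.allowed = allowed

    def check(self, sql: str) -> None:
        words = sql.strip().split(None, 1)
        first_word = words[0].upper() if words else ""
        if first_word not in self.allowed:
            raise ValueError(f"statement not allowed: {first_word or '<empty>'}")


class MaxLengthLayer:
    """Example of a layer added later: rejects overly long queries."""

    def __init__(self, limit: int = 10_000) -> None:
        self.limit = limit

    def check(self, sql: str) -> None:
        if len(sql) > self.limit:
            raise ValueError("query too long")


@dataclass
class SafetyPipeline:
    """Runs every layer in order; the first rejection stops the query."""

    layers: list[SafetyLayer] = field(default_factory=list)

    def check(self, sql: str) -> None:
        for layer in self.layers:
            layer.check(sql)


def is_allowed(pipeline: SafetyPipeline, sql: str) -> bool:
    """Stand-in for the caller: it depends only on the pipeline, not on layers."""
    try:
        pipeline.check(sql)
    except ValueError:
        return False
    return True
```

Adding `MaxLengthLayer` to `pipeline.layers` changes what gets blocked without editing `is_allowed`, which is the "zero `run_sql` changes" property described in the table.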

Quickstart

Requires Python ≥ 3.10 and uv.

uv sync                       # create .venv and install deps

1. Run the offline demo (no token, no database)

.venv/bin/python bench/ecommerce_demo.py

Shows the federation money-shot (one term, two team definitions, no conflict) and the safety gate (DROP/INSERT blocked, SELECT passes). See bench/README.md.

2. Run the CLI (developer driver)

.venv/bin/lang2sql "list the tables"

The CLI assembles a real HarnessContext and runs one turn through the agent loop. With OPENAI_API_KEY set it calls gpt-4.1-mini; otherwise it uses the offline FakeLLM.
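
The fallback described above can be pictured with a small sketch. Both `choose_backend` and this `FakeLLM` class are hypothetical illustrations, not the project's actual code; the only details taken from this README are the `OPENAI_API_KEY` variable, the `gpt-4.1-mini` model name, and the fact that the offline stand-in is deterministic.

```python
import itertools
import os
from typing import Mapping


class FakeLLM:
    """Hypothetical offline stand-in: replays a fixed cycle of canned replies."""

    def __init__(self, canned: list[str]) -> None:
        self._replies = itertools.cycle(canned)

    def complete(self, prompt: str) -> str:
        # The prompt is ignored; the output depends only on call order.
        return next(self._replies)


def choose_backend(env: Mapping[str, str] = os.environ) -> str:
    """Return the model name to call, or "fake" when no key is configured."""
    return "gpt-4.1-mini" if env.get("OPENAI_API_KEY") else "fake"
```

Because the fake ignores its prompt, it is suited to checking that the loop is wired correctly rather than to producing real answers.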

3. Run the Discord bot

export DISCORD_BOT_TOKEN=...        # required
export OPENAI_API_KEY=...           # optional; offline FakeLLM if unset
export LANG2SQL_SECRET_KEY=...      # optional; Fernet key for secret encryption
.venv/bin/lang2sql-bot

The bot exits loudly if DISCORD_BOT_TOKEN is unset. Full setup and hosting: docs/DEPLOY.md. Copy .env.example to start.
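
A fail-fast startup check of this kind might look like the following sketch. The function name `require_discord_token` and the error message are assumptions made for illustration; only the `DISCORD_BOT_TOKEN` variable and `.env.example` come from this README.

```python
import os
from typing import Mapping


def require_discord_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the bot token, or stop the process with a clear message."""
    token = env.get("DISCORD_BOT_TOKEN", "").strip()
    if not token:
        # SystemExit with a string prints it to stderr and exits with status 1.
        raise SystemExit(
            "DISCORD_BOT_TOKEN is not set; copy .env.example and fill it in."
        )
    return token
```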

What V1 does / does NOT do yet (honesty section)

Does:

3-scope semantic federation (guild / channel / member) with most-specific-wins resolution; term_custom registers definitions per scope (KV-backed).
Safety pipeline with the V1 layers (whitelist + timeout), gating every query.
Agent loop with eight tools: run_sql, explore_schema, enrich_schema, term_custom, org_setup, ingest_doc, remember, ask_user.
Memory service (in-memory store + inject-all recall + manual /remember).
Discord frontend (bot, commands, session router, render).
Encrypted-at-rest secrets (Fernet) and SQLite-backed persistence.

Does NOT yet:

Execute against a real database. PostgresExplorer is a V1 stub with canned orders/users schema and sample rows; real psycopg execution is v1.5.
Reason without a key. Without OPENAI_API_KEY, the FakeLLM returns deterministic canned tool cycles — useful for wiring tests, not for answers.
DB metadata auto-enrichment, AST-precise SQL validation, function blocklists, cost gating, /semantic diff / /semantic promote, keyword/vector recall, automatic fact extraction, URL/Notion ingestion — all scoped to v1.5+.
Persist across restarts by default: the V1 SqliteStore defaults to in-memory; point it at a file for durability.

Roadmap at a glance

Area	V1	V1.5	V2	V2.5
Safety	whitelist + timeout	+ AST validation, function blocklist, auto LIMIT, metadata enrichment, rate limit	+ cost gate (EXPLAIN), per-engine pipelines	—
Memory	in-memory + inject-all + manual	SQLite store + keyword recall + auto-extract	+ vector recall + conflict resolution	PostgreSQL + hybrid recall + confidence
Ingestion	file upload + LLM extract	+ URL fetch + DDL parsing	+ Notion/Confluence + hybrid	+ GitHub/Drive + chunked RAG
Federation	3-scope resolution, `/semantic show`	`/semantic diff`, `/semantic promote`, conflict alerts	git sync (semantic-as-code)	branch fork/merge UI, per-scope audit
Interface	Discord	(Anthropic/NIM eval)	Slack	Web

See docs/discord_first_redesign_v4_1.md for the full architecture write-up.

🤝 Contributing

New here? Start with docs/ARCHITECTURE.md, which covers directory and layer responsibilities, the lifecycle of a single message, and where to make changes, all in one place.

git clone https://github.com/CausalInferenceLab/lang2sql.git
cd lang2sql
uv sync
.venv/bin/pytest -q          # 12 safety regressions + full suite must pass

Write tests for new features (tests/test_<layer>.py)
Open PRs against the master branch, and prefix commit messages with feat: / fix: / docs:
Report bugs and feature requests as issues

🙏 Acknowledgements / License

Lang2SQL is developed by the Causal Inference team at 가짜연구소 (Pseudo Lab). Licensed under the MIT License. Community: Discord.

🏆 Our Team

Role	Name	Interests
Project Manager	이동욱	LLM, Open Source, Causal Inference
AI Engineer	문찬국	LLM, Agentic RAG, Open Source
Data Engineer	박경태	LLM-driven Data Engineering
AI Engineer	손봉균	LLM, RAG, AI Planning
Data Scientist	안재일	LLM, Data Analysis, RAG
ML Engineer	이호민	Multi-Agent Systems
AI Engineer	최세영	LLM, RAG, Multi-Agent
Full-Stack Developer	황윤진	LLM Orchestration
AI Engineer	김경서	LLM, FinNLP, FDS, RAG
Data Engineer	홍지영	LLM, Data Engineering
Data Operator	이화림	LLM, Data Engineering
AI Engineer	남경혜	LLM, RAG, Multi-Agent
AI Engineer	심세원	LLM, RAG, Multi-Agent
Business Analyst	서희진	LLM, Data Analysis

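🌍 About 가짜연구소 (Pseudo Lab)

가짜연구소 is a non-profit organization focused on advancing machine learning and AI. Built on the core values of sharing, motivation, and the joy of collaboration, it creates impactful open-source projects.

Together with more than 5,000 researchers worldwide, we are committed to democratizing AI knowledge and fostering innovation through open collaboration.

Community: 💬 Discord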
