Together we are setting out on an open-source journey, sharing code and ideas to build a better data environment. 🌍💡
A document-learning, read-only SQL analytics agent. Feed it your company's docs → it learns your business context → it keeps a separate set of definitions per team → it answers questions over an incomplete database → it remembers every definition and conversation.
👉 Full project picture (single source of truth): docs/PROJECT.md · Contributor quick guide: docs/ARCHITECTURE.md
This is the v4.1 rebuild (background and design intent: docs/discord_first_redesign_v4_1.md).
Where most text-to-SQL projects compete on "generate better SQL," Lang2SQL
competes on everything around the query: business-context learning, per-team
semantics, robustness to messy databases, and memory. Discord is the Phase 1
interface, not the identity — Slack/Web are adapters on the same core.
| Pillar | What it is |
|---|---|
| ① Business-context learning | Documents are the source of truth. Drop in a doc → the agent extracts metric/dimension/rule candidates → you confirm → they land in the semantic layer. |
| ② Two-axis robustness | (2a) DB robustness — works even when columns lack descriptions (auto-enrichment, v1.5). (2b) Semantic robustness — teams hold different definitions of the same term without conflict. This axis is the product/research identity. |
| ③ Hermes memory | Conversations, facts, and preferences persist instead of resetting each session. |
| ④ Multi-interface | Phase 1 Discord today; Slack/Web are future adapters. No platform lock-in. |
V1 ships the simplest single implementation of each extension point, but the abstraction (port) is already in place, so v1.5/v2 add a new implementation without touching existing code. Like a wall outlet: the V1 socket has one LED bulb plugged in, but because the socket is standard, you later plug in a fan or a smart light without rewiring the wall.
Four ★ extension patterns sit behind core/ports/:
| ★ | Pattern | Port | Grows by |
|---|---|---|---|
| ① | Safety pipeline | ports/safety.py | adding one layer class to the line (zero run_sql changes) |
| ② | Memory service | ports/memory.py | swapping any of 3 axes (Store / Recall / Extractor) independently |
| ③ | Ingestion pipeline | ports/ingestion.py | a Source × Extractor matrix |
| ④ | Semantic federation | ports/semantic_scope.py | git-like per-team scope branches |
Everything outside tenancy/concierge.py depends only on these Protocols, so the
concrete classes (OpenAI, Postgres, SQLite) are swappable at the seams.
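To make the "add one layer class to the line" idea concrete, here is a minimal sketch of a Protocol-based safety pipeline. Everything in it (`Verdict`, `SafetyLayer`, `StatementWhitelist`, `SingleStatement`, `run_pipeline`) is a hypothetical name chosen for illustration and is not taken from ports/safety.py; the keyword check is deliberately naive and is no substitute for the AST validation planned for v1.5.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Verdict:
    """Result of one safety check (hypothetical shape)."""
    allowed: bool
    reason: str = ""


class SafetyLayer(Protocol):
    """Hypothetical port: any object with this method can join the pipeline."""

    def check(self, sql: str) -> Verdict: ...


class StatementWhitelist:
    """Allows only statements whose first keyword is whitelisted."""

    def __init__(self, allowed: Sequence[str] = ("SELECT",)) -> None:
        self._allowed = {kw.upper() for kw in allowed}

    def check(self, sql: str) -> Verdict:
        words = sql.strip().split(None, 1)
        keyword = words[0].upper() if words else ""
        if keyword in self._allowed:
            return Verdict(True)
        return Verdict(False, f"statement {keyword or '<empty>'} is not whitelisted")


class SingleStatement:
    """A layer added later: rejects stacked statements.

    Crude on purpose: it also rejects semicolons inside string literals.
    """

    def check(self, sql: str) -> Verdict:
        body = sql.strip().rstrip(";")
        if ";" in body:
            return Verdict(False, "multiple statements are not allowed")
        return Verdict(True)


def run_pipeline(layers: Sequence[SafetyLayer], sql: str) -> Verdict:
    """Runs the layers in order and stops at the first rejection."""
    for layer in layers:
        verdict = layer.check(sql)
        if not verdict.allowed:
            return verdict
    return Verdict(True)
```

Under these assumptions, growing the pipeline means appending `SingleStatement()` to the list passed to `run_pipeline`; the caller that executes the query does not change.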
Requires Python ≥ 3.10 and uv.
uv sync # create .venv and install deps
.venv/bin/python bench/ecommerce_demo.py
Shows the federation money-shot (one term, two team definitions, no conflict) and
the safety gate (DROP/INSERT blocked, SELECT passes). See bench/README.md.
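The "one term, two team definitions, no conflict" behavior can be sketched as a lookup that walks from the most specific scope outward. This is an illustration only: `Scope`, `SemanticRegistry`, and the sample definitions are invented for the example, and the real logic behind ports/semantic_scope.py may be organized differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scope:
    """A guild / channel / thread address (hypothetical shape)."""
    guild: str
    channel: Optional[str] = None
    thread: Optional[str] = None

    def lineage(self) -> list["Scope"]:
        """Scopes to search, most specific first."""
        chain = [self]
        if self.thread is not None:
            chain.append(Scope(self.guild, self.channel))
        if self.channel is not None:
            chain.append(Scope(self.guild))
        return chain


class SemanticRegistry:
    """Stores definitions per scope and resolves with most-specific-wins."""

    def __init__(self) -> None:
        self._defs: dict[tuple[Scope, str], str] = {}

    def define_metric(self, scope: Scope, term: str, definition: str) -> None:
        # Writes only to the given scope; other scopes are untouched.
        self._defs[(scope, term)] = definition

    def resolve(self, scope: Scope, term: str) -> Optional[str]:
        for candidate in scope.lineage():
            if (candidate, term) in self._defs:
                return self._defs[(candidate, term)]
        return None


# Made-up sample data: two teams define the same term differently.
registry = SemanticRegistry()
guild = Scope("acme")
marketing = Scope("acme", "marketing")
finance = Scope("acme", "finance")
registry.define_metric(guild, "active_user", "logged in within 30 days")
registry.define_metric(marketing, "active_user", "opened a campaign email within 7 days")
```

With this data, a question asked in the marketing channel (or any thread under it) resolves to the marketing definition, while the finance channel falls back to the guild-wide one. Neither definition overwrites the other.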
.venv/bin/lang2sql "list the tables"
The CLI assembles a real HarnessContext and runs one turn through the agent
loop. With OPENAI_API_KEY set it calls gpt-4.1-mini; otherwise it uses the
offline FakeLLM.
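The key-based fallback described above might look roughly like the following. Only the names `FakeLLM`, `OPENAI_API_KEY`, and `gpt-4.1-mini` come from this README; the `LLM` protocol, `OpenAIBackedLLM`, `choose_llm`, and all method signatures are assumptions made for the sketch, and the network call is intentionally left out.

```python
import os
from typing import Mapping, Protocol


class LLM(Protocol):
    """Hypothetical port for a chat model."""

    def complete(self, prompt: str) -> str: ...


class FakeLLM:
    """Offline stand-in: always returns the same canned reply for a prompt."""

    def complete(self, prompt: str) -> str:
        return f"[canned reply to: {prompt}]"


class OpenAIBackedLLM:
    """Placeholder for a real client; the API call is omitted in this sketch."""

    def __init__(self, api_key: str, model: str = "gpt-4.1-mini") -> None:
        self.api_key = api_key
        self.model = model

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("network call omitted in this sketch")


def choose_llm(env: Mapping[str, str] = os.environ) -> LLM:
    """Picks the real model when a key is present, the offline fake otherwise."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if key:
        return OpenAIBackedLLM(key)
    return FakeLLM()
```

Passing the environment in as a mapping keeps the choice easy to exercise in tests without touching the real process environment.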
export DISCORD_BOT_TOKEN=... # required
export OPENAI_API_KEY=... # optional; offline FakeLLM if unset
export LANG2SQL_SECRET_KEY=... # optional; Fernet key for secret encryption
.venv/bin/lang2sql-bot
The bot exits loudly if DISCORD_BOT_TOKEN is unset. Full setup and hosting:
docs/DEPLOY.md. Copy .env.example to start.
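A fail-fast startup check for the three variables above could be sketched as follows. `BotConfig`, `ConfigError`, and `load_config` are hypothetical names, not the bot's actual entry point; the sketch only mirrors the required/optional split stated in the comments above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


class ConfigError(RuntimeError):
    """Raised when a required setting is missing."""


@dataclass(frozen=True)
class BotConfig:
    discord_bot_token: str
    openai_api_key: Optional[str]  # None means: fall back to the offline FakeLLM
    secret_key: Optional[str]      # None means: no Fernet key was supplied


def load_config(env: Mapping[str, str] = os.environ) -> BotConfig:
    """Reads settings once at startup and refuses to continue without a token."""
    token = env.get("DISCORD_BOT_TOKEN", "").strip()
    if not token:
        raise ConfigError("DISCORD_BOT_TOKEN is required but not set")
    return BotConfig(
        discord_bot_token=token,
        openai_api_key=env.get("OPENAI_API_KEY") or None,
        secret_key=env.get("LANG2SQL_SECRET_KEY") or None,
    )
```

If the project relies on the `cryptography` package for Fernet, a key is typically produced with `Fernet.generate_key()`; how Lang2SQL expects the value to be encoded is documented in docs/DEPLOY.md rather than assumed here.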
Does:
- 3-scope semantic federation (guild / channel / thread) with most-specific-wins resolution; define_metric writes to the current scope.
- Safety pipeline with the V1 layers (whitelist + timeout), gating every query.
- Agent loop with six tools: run_sql, explore_schema, define_metric, ingest_doc, remember, ask_user.
- Memory service (in-memory store + inject-all recall + manual /remember).
- Discord frontend (bot, commands, session router, render).
- Encrypted-at-rest secrets (Fernet) and SQLite-backed persistence.
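As an illustration of the memory service item in the list above, the V1 combination (in-memory store, inject-all recall, manual remember) can be sketched with independently swappable parts. The class and method names are hypothetical and do not reflect ports/memory.py; `KeywordRecall` is included only to show a later recall strategy dropping in without changes to the store, and the Extractor axis is left out because V1 facts are added by hand.

```python
from typing import Protocol


class Store(Protocol):
    def add(self, fact: str) -> None: ...
    def all(self) -> list[str]: ...


class Recall(Protocol):
    def select(self, facts: list[str], question: str) -> list[str]: ...


class InMemoryStore:
    """V1-style store: facts live in a list and vanish on restart."""

    def __init__(self) -> None:
        self._facts: list[str] = []

    def add(self, fact: str) -> None:
        self._facts.append(fact)

    def all(self) -> list[str]:
        return list(self._facts)


class InjectAllRecall:
    """V1-style recall: every stored fact is injected into the prompt."""

    def select(self, facts: list[str], question: str) -> list[str]:
        return list(facts)


class KeywordRecall:
    """A possible later recall: keep facts sharing a word with the question."""

    def select(self, facts: list[str], question: str) -> list[str]:
        wanted = {word.lower() for word in question.split()}
        return [f for f in facts if wanted & {word.lower() for word in f.split()}]


class MemoryService:
    """Combines one Store and one Recall; either can be swapped on its own."""

    def __init__(self, store: Store, recall: Recall) -> None:
        self.store = store
        self.recall = recall

    def remember(self, fact: str) -> None:
        self.store.add(fact.strip())

    def build_context(self, question: str) -> str:
        return "\n".join(self.recall.select(self.store.all(), question))
```

Swapping `InjectAllRecall()` for `KeywordRecall()` changes which facts reach the prompt while the store and the `remember` path stay exactly as they were.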
Does NOT yet:
- Execute against a real database. PostgresExplorer is a V1 stub with canned orders/users schema and sample rows; real psycopg execution is v1.5.
- Reason without a key. Without OPENAI_API_KEY, the FakeLLM returns deterministic canned tool cycles: useful for wiring tests, not for answers.
- DB metadata auto-enrichment, AST-precise SQL validation, function blocklists, cost gating, /semantic diff, /semantic promote, keyword/vector recall, automatic fact extraction, URL/Notion ingestion: all scoped to v1.5+.
- Persist across restarts by default: the V1 SqliteStore defaults to in-memory; point it at a file for durability.
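The in-memory versus file distinction in the last item comes down to the path handed to SQLite. The sketch below uses Python's standard `sqlite3` module with a hypothetical `TinySqliteStore` class; the real SqliteStore constructor and schema are not shown in this README and may differ.

```python
import sqlite3
from typing import Optional


class TinySqliteStore:
    """Hypothetical key-value store; ':memory:' data is lost when closed."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT NOT NULL)"
        )
        self._conn.commit()

    def put(self, key: str, value: str) -> None:
        # The connection context manager commits on success, rolls back on error.
        with self._conn:
            self._conn.execute(
                "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", (key, value)
            )

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute(
            "SELECT value FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def close(self) -> None:
        self._conn.close()


def survives_restart(path: str) -> bool:
    """Writes a value, reopens the store, and reports whether it is still there."""
    first = TinySqliteStore(path)
    first.put("revenue", "sum(orders.amount)")
    first.close()
    second = TinySqliteStore(path)
    found = second.get("revenue") is not None
    second.close()
    return found
```

Called with `":memory:"`, `survives_restart` reports that the value is gone after reopening; called with a file path, the value is still there, which is the durability the bullet above refers to.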
| Area | V1 | V1.5 | V2 | V2.5 |
|---|---|---|---|---|
| Safety | whitelist + timeout | + AST validation, function blocklist, auto LIMIT, metadata enrichment, rate limit | + cost gate (EXPLAIN), per-engine pipelines | — |
| Memory | in-memory + inject-all + manual | SQLite store + keyword recall + auto-extract | + vector recall + conflict resolution | PostgreSQL + hybrid recall + confidence |
| Ingestion | file upload + LLM extract | + URL fetch + DDL parsing | + Notion/Confluence + hybrid | + GitHub/Drive + chunked RAG |
| Federation | 3-scope resolution, /semantic show | /semantic diff, /semantic promote, conflict alerts | git sync (semantic-as-code) | branch fork/merge UI, per-scope audit |
| Interface | Discord | (Anthropic/NIM eval) | Slack | Web |
See docs/discord_first_redesign_v4_1.md
for the full architecture write-up.
New here? Start with docs/ARCHITECTURE.md, which collects directory and layer responsibilities, the lifecycle of a single message, and where to make changes in one place.
git clone https://github.com/CausalInferenceLab/lang2sql.git
cd lang2sql
uv sync
.venv/bin/pytest -q # 12 safety regressions + full suite must pass
- Write tests for new features (tests/test_<layer>.py)
- Open PRs against the master branch and prefix commit messages with feat: / fix: / docs:
- File bugs and feature requests as issues
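Lang2SQL is developed by the Causal Inference team at 가짜연구소 (Pseudo Lab). Licensed under the MIT License. Community: Discord.
가짜연구소 is a non-profit organization focused on advancing machine learning and AI. Guided by its core values of sharing, motivation, and the joy of collaboration, it builds impactful open-source projects.
Together with more than 5,000 researchers worldwide, we are committed to democratizing AI knowledge and fostering innovation through open collaboration.
Community: 💬 Discord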