TableMagnifier

Welcome to the TableMagnifier repository! The project aims to build and evaluate highly reliable data for Korean TableQA.

🚀 TableMagnifier: a project of the NLx Crew, PseudoLab (가짜연구소) 11th cohort

“A serendipitous revolution we build together (Serendipity Revolution).” Grounded in sincerity and trust, we run technical experiments with the AI/DS innovation community.

🌟 Project Vision

"From theory to practice: an AI lab where we grow together"

  • Create synergy between individual growth and collective wisdom
  • Foster a knowledge-sharing culture grounded in the open-source spirit
  • Take an experimental approach that turns failure into a stepping stone to success
  • Paper review project: analysis, discussion, and experimental reproduction of recent AI papers
  • Open-source project: developing and improving libraries for AI and data processing
  • Conference paper submission: conducting current research and preparing submissions to international conferences

🧑 Team (Dynamic Team)

| Role | Name | Tech stack | Main interests |
|---|---|---|---|
| Project Manager | 박세연 | Python, PyTorch | VLM, Image Captioning, SLT |
| Member | 최재혁 | Python, PyTorch | LLM, RAG, Agent |
| Member | 이명진 | Python, PyTorch | LLM, SLT |
| Member | 김진아 | Python, PyTorch | NLP, Data Science |
| Member | 서석현 | Python, PyTorch | VLM, Data Statistics |
| Member | 임예원 | Python, PyTorch | Compiler, Ontology |

🚀 Project Roadmap

gantt
    title 2025 TableMagnifier Roadmap
    section Core milestones
    Data composition discussion       :a1, 2025-09-09, 28d
    Construction and augmentation        :a2, after a1, 35d
    Applying evaluation    :a3, after a2, 35d
    Paper writing    :a4, after a3, 21d
    section Additional activities
    Magical Week         :2025-09-21, 7d
    Magical Week         :2025-10-26, 7d

💻 Activity History (by week)

| Week | Date | Activity | Deliverable | Format | Notes |
|---|---|---|---|---|---|
| 1 | 9/9 | Introduction | | Online | |
| 2 | 9/16 | Paper review (1 assigned, 1 individual) | Paper review document | Online | |
| | 9/23 | Magical Week | | Offline | |
| 3 | 9/30 | Paper review (1 assigned, 1 individual) | Paper review document | Online | |
| 4 | 10/14 | Dataset composition discussion | | Online | |
| 5 | 10/21 | Raw-data Collection | | Online | |
| | 10/28 | Magical Week | | Offline | |
| 6 | 11/4 | Data refinement / Augmentation | | Online | |
| 7 | 11/11 | Data refinement / Augmentation | Final dataset | Online | |
| 8 | 11/18 | Paper Remind / Evaluation Idea Discussion | | Offline | |
| 9 | 11/25 | Pipeline setting | | Online | |
| 10 | 12/2 | Evaluation (Basic) | Idea implementation, results | Online | |
| 11 | 12/9 | Evaluation (Advanced) | Idea implementation, results | Online | |
| 12 | 12/16 | Evaluation (Advanced) | Idea implementation, results | Online | |
| 13 | 12/23 | Evaluation (Advanced) | Idea implementation, results | Online | |
| 14 | 12/30 | GitHub maintenance, Paper Writing | | Online | |
| 15 | 1/6 | GitHub maintenance, Paper Writing | | Online | |
| 16 | 1/13 | Retrospective and archiving, submission preparation | Paper | Offline | |

🌱 How to Engage

  • Join as a builder: lead project planning and operations
  • Join as a runner: carry out research, development, testing, and other hands-on work
  • Audit: open sessions are available to attend

❗️Participation link: PseudoLab Discord ❗️Communication channel: KakaoTalk

Anyone can join the meetings as an auditor.

  1. No sign-up is needed: enter the Discord #Room-CS channel at the regular meeting time
  2. Attend an event during Magical Week {{ ... }}

Acknowledgement 🙏

This project is run as part of PseudoLab Open Academy. Your participation and contributions make the 'Serendipity Revolution' possible. Our deep thanks to everyone.


TableMagnifier (테이블 매그니파이어)

TableMagnifier is a tool that analyzes Korean table images to generate structured synthetic data, and lets you validate and correct that data. Using a LangGraph-based multi-agent workflow, it extracts the HTML table structure from an image and generates new synthetic data from it.

Key features

  • Image-to-HTML conversion: converts a table image into an HTML structure.
  • Synthetic data generation: generates new synthetic data while preserving the structure of the original table.
  • Self-validation and revision (Self-Reflection): checks that the generated synthetic data matches the original structure and revises it automatically when needed.
  • QA data generation: generates QA pairs for RAG training from the generated synthetic data.
  • API key pooling: rotates automatically across multiple Gemini API keys to use the free quota efficiently.
  • Web-based validation tool: lets you inspect the generated data visually in a web interface and edit it directly.

Project structure

TableMagnifier/
├── generate_synthetic_table/   # Core logic (LangGraph workflow)
│   ├── flow.py                 # Graph definition and node implementations
│   ├── runner.py               # Execution utilities
│   ├── cli.py                  # CLI entry point
│   ├── llm_factory.py          # LLM factory (OpenAI, Gemini, Gemini Pool, vLLM)
│   ├── validators.py           # Data validation logic
│   ├── html_to_image.py        # HTML → image conversion
│   └── prompts/                # LLM prompt templates
│
├── polling_gemini/             # Gemini API key pooling system
│   ├── api_pool.py             # API key rotation manager
│   ├── langgraph_integration.py # LangChain/LangGraph-compatible wrapper
│   └── README.md               # Usage guide
│
├── annotate_tools/             # Web-based validation tool
│   ├── server.py               # FastAPI backend
│   ├── App.tsx                 # React frontend
│   └── components/             # UI components
│
├── apis/                       # API key configuration
│   └── gemini_keys.yaml        # Gemini API key list (for gemini_pool)
│
├── tests/                      # Test code
├── main.py                     # CLI entry point
├── pyproject.toml              # Project dependencies
└── README.md                   # Documentation

LangGraph workflow

flowchart TD
    START((START)) --> RouteStart{Check input type}
    
    RouteStart -->|HTML file| LoadHTML[load_html_input]
    RouteStart -->|Image + openai/gemini/gemini_pool| DirectGen[generate_synthetic_table_from_image]
    RouteStart -->|Image + other models| PyMuPDF[pymupdf_parse]
    
    LoadHTML --> Analyze[analyze_table]
    
    PyMuPDF --> Validate[validate_parsed_table]
    Validate -->|Valid| Analyze
    Validate -->|Invalid| ImageToHTML[image_to_html]
    ImageToHTML --> Analyze
    
    Analyze --> GenSynthetic[generate_synthetic_table]
    GenSynthetic --> SelfReflection[self_reflection]
    
    DirectGen --> SelfReflection
    
    SelfReflection --> RouteReflection{Validation result}
    RouteReflection -->|Passed or max attempts reached| Parse[parse_synthetic_table]
    RouteReflection -->|Needs revision| Revise[revise_synthetic_table]
    
    Revise --> SelfReflection
    
    Parse --> GenerateQA[generate_qa]
    GenerateQA --> END((END))
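The self-reflection loop in the diagram above can be read as a small routing function that decides which node runs next. The following is a minimal sketch, not the code in flow.py: the state fields `passed` and `attempts` and the cap `MAX_ATTEMPTS` are assumptions made for illustration, and only the node names come from the diagram.

```python
from typing import TypedDict

MAX_ATTEMPTS = 3  # assumed cap; this README does not state the real limit


class ReflectionState(TypedDict):
    passed: bool   # hypothetical field: did self_reflection accept the table?
    attempts: int  # hypothetical field: reflection rounds completed so far


def route_reflection(state: ReflectionState) -> str:
    """Return the name of the next node, mirroring the diagram's decision."""
    if state["passed"] or state["attempts"] >= MAX_ATTEMPTS:
        return "parse_synthetic_table"
    return "revise_synthetic_table"


print(route_reflection({"passed": False, "attempts": 1}))
```

Bounding the number of attempts is what keeps the revise/reflect cycle from looping forever when the model never produces a table that passes.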

Installation

Prerequisites

  • Python 3.10 or later
  • Node.js (required to run the validation tool)
  • An OpenAI API key or a Google Gemini API key

1. Clone the project and install dependencies

git clone https://github.com/Pseudo-Lab/TableMagnifier.git
cd TableMagnifier

# Using uv (recommended)
uv sync

# Or using pip
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

2. Set environment variables

Create a .env file and enter your API key.

OPENAI_API_KEY=sk-...
# or
GOOGLE_API_KEY=AIza...

3. Configure Gemini API key pooling (optional)

To make efficient use of the free quota with multiple Gemini API keys, create the file apis/gemini_keys.yaml:

api_keys:
  - name: key1
    key: AIza...your-first-key
    enabled: true
  - name: key2
    key: AIza...your-second-key
    enabled: true
  # ๋” ๋งŽ์€ ํ‚ค ์ถ”๊ฐ€ ๊ฐ€๋Šฅ

settings:
  model: gemini-2.5-flash
  temperature: 0.2
  max_retries: 3
  retry_delay: 2

Usage

Using the UI

### backend
cd pipeline_ui/backend && uv run python main.py

### frontend
cd pipeline_ui/frontend && npm run dev

1. Synthetic data generation (CLI)

Takes an image file or an HTML file as input and generates synthetic data.

# Using OpenAI (default)
uv run python -m generate_synthetic_table.cli path/to/table_image.png --save-json output.json

# Using a Gemini model (single API key)
uv run python -m generate_synthetic_table.cli path/to/table_image.png \
  --provider gemini --model gemini-1.5-flash --save-json output.json

# Using Gemini Pool (automatic rotation across multiple API keys) ⭐ recommended
uv run python -m generate_synthetic_table.cli path/to/table_image.png \
  --provider gemini_pool --save-json output.json

# Using a custom configuration file
uv run python -m generate_synthetic_table.cli path/to/table_image.png \
  --provider gemini_pool --config-path /path/to/gemini_keys.yaml

Options:

| Option | Description | Default |
|---|---|---|
| image | Path to the input image, HTML file, or folder | (required) |
| --provider | LLM provider (openai, gemini, gemini_pool, claude, vllm) | openai |
| --model | Model name to use | gpt-4o-mini |
| --temperature | Controls generation diversity | 0.2 |
| --config-path | Path to the configuration file for gemini_pool | apis/gemini_keys.yaml |
| --base-url | URL of a vLLM or OpenAI-compatible endpoint | (optional) |
| --save-json | Path where the result JSON is saved | (optional) |
| --domain | Prompt domain (medical, public, insurance, finance, academic, business) | auto-detected |
| --qa-only | Generate QA directly from the image, without generating synthetic data | false |
| --output-dir | Directory where batch results are saved | ./qa_output |
| --max-workers | Number of parallel workers for batch processing | 3 |
| --sampling | Enable random image sampling per table | false |
| --min-k | Minimum number of images sampled per table | 2 |
| --max-k | Maximum number of images sampled per table | 3 |
| --num-samples | Number of random batches generated per table | 1 |
| --pair-mode | Sequential pair processing for Public data (0-1, 2-3...) | false |
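The "(0-1, 2-3...)" note on --pair-mode suggests that images are grouped into consecutive, non-overlapping pairs. A minimal sketch of that grouping follows; the function name `sequential_pairs` is hypothetical, the input is assumed to be already ordered, and keeping a trailing unpaired image as a one-element group is an assumption, since the README does not say how an odd count is handled.

```python
from typing import List, Sequence, Tuple


def sequential_pairs(paths: Sequence[str]) -> List[Tuple[str, ...]]:
    """Group an already-ordered list into (0, 1), (2, 3), ... pairs.

    A trailing unpaired item is kept as a one-element tuple (an assumption).
    """
    return [tuple(paths[i:i + 2]) for i in range(0, len(paths), 2)]


images = ["P_0.png", "P_1.png", "P_2.png", "P_3.png"]
print(sequential_pairs(images))
```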

๋„๋ฉ”์ธ๋ณ„ ํ”„๋กฌํ”„ํŠธ

--domain ์˜ต์…˜์„ ์‚ฌ์šฉํ•˜์—ฌ ํŠน์ • ๋„๋ฉ”์ธ์— ๋งž๋Š” ํ”„๋กฌํ”„ํŠธ๋ฅผ ์ ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ฐ ๋„๋ฉ”์ธ๋ณ„ ํ”„๋กฌํ”„ํŠธ๋Š” generate_synthetic_table/prompts/ ํด๋”์— ์ •์˜๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค.

๋„๋ฉ”์ธ ์„ค๋ช… ์ž๋™ ๊ฐ์ง€ ์กฐ๊ฑด
medical ์˜๋ฃŒ ๋ฐ์ดํ„ฐ (ํ™˜์ž ๋ฐ”์ดํƒˆ, ์ง„๋‹จ๋ช…, ์•ฝ๋ฌผ ์šฉ๋Ÿ‰, ๊ฒ€์‚ฌ ์ˆ˜์น˜ ๋“ฑ) โœ… ํŒŒ์ผ/ํด๋”๋ช…์ด M_๋กœ ์‹œ์ž‘
public ๊ณต๊ณต ๋ฐ์ดํ„ฐ โœ… ํŒŒ์ผ/ํด๋”๋ช…์ด P_๋กœ ์‹œ์ž‘
insurance ๋ณดํ—˜ ๋ฐ์ดํ„ฐ โœ… ํŒŒ์ผ/ํด๋”๋ช…์ด I_๋กœ ์‹œ์ž‘
finance ๊ธˆ์œต ๋ฐ์ดํ„ฐ โœ… ํŒŒ์ผ/ํด๋”๋ช…์ด F_๋กœ ์‹œ์ž‘
academic ํ•™์ˆ  ๋ฐ์ดํ„ฐ โœ… ํŒŒ์ผ/ํด๋”๋ช…์ด A_๋กœ ์‹œ์ž‘
business ๋น„์ฆˆ๋‹ˆ์Šค ๋ฐ์ดํ„ฐ โœ… ํŒŒ์ผ/ํด๋”๋ช…์ด B_๋กœ ์‹œ์ž‘
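The auto-detection rule in the table above amounts to a prefix lookup on file and folder names. The sketch below illustrates that idea under stated assumptions: the function `detect_domain` and the search order (file name first, then parent folders from nearest to farthest) are hypothetical, and the project's actual detection code may differ.

```python
from pathlib import Path
from typing import Optional

# Prefix-to-domain mapping taken from the table above.
PREFIX_TO_DOMAIN = {
    "M_": "medical",
    "P_": "public",
    "I_": "insurance",
    "F_": "finance",
    "A_": "academic",
    "B_": "business",
}


def detect_domain(path: str) -> Optional[str]:
    """Return the domain implied by the first matching name, or None."""
    p = Path(path)
    # Assumed order: the file name, then each parent folder, nearest first.
    names = [p.name] + [parent.name for parent in p.parents]
    for name in names:
        for prefix, domain in PREFIX_TO_DOMAIN.items():
            if name.startswith(prefix):
                return domain
    return None


print(detect_domain("data/Finance/Table/F_table_1/F_table_1_0.png"))
```

Returning None when nothing matches leaves the caller free to fall back to an explicit --domain value.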

๋„๋ฉ”์ธ ์‚ฌ์šฉ ์˜ˆ์‹œ:

# Medical ๋„๋ฉ”์ธ์œผ๋กœ ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ
uv run python -m generate_synthetic_table.cli path/to/medical_table.png \
  --domain medical --provider gemini_pool --save-json output.json

# ํด๋” ๋ฐฐ์น˜ ์ฒ˜๋ฆฌ (Medical ๋„๋ฉ”์ธ)
uv run python -m generate_synthetic_table.cli ./Medical/Table/ \
  --domain medical --max-workers 5 --output-dir ./output

# QA๋งŒ ์ƒ์„ฑ (ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ ์ƒ์„ฑ ๋‹จ๊ณ„ ์Šคํ‚ต)
uv run python -m generate_synthetic_table.cli path/to/table.png \
  --domain medical --qa-only --save-json output.json

2. Batch processing from JSON input (run_pipeline_json.py)

You can define pairs of image paths in a JSON file and process several pairs in one run. This approach is useful when handling large volumes of data in a structured form.

2.1 JSON input format

Structured format (recommended):

[
  {
    "pair_id": "P_origin_0_1",
    "image_paths": [
      "data/Public/Table/P_origin_0/P_origin_0_1_0.png",
      "data/Public/Table/P_origin_0/P_origin_0_1_1.png"
    ],
    "domain": "public"
  },
  {
    "pair_id": "F_table_1",
    "image_paths": [
      "data/Finance/Table/F_table_1/F_table_1_0.png",
      "data/Finance/Table/F_table_1/F_table_1_1.png"
    ],
    "domain": "finance"
  }
]

Legacy format (arrays; supported for backward compatibility):

[
  [
    "data/Public/Table/P_origin_0/P_origin_0_1_0.png",
    "data/Public/Table/P_origin_0/P_origin_0_1_1.png"
  ],
  [
    "data/Finance/Table/F_table_1/F_table_1_0.png",
    "data/Finance/Table/F_table_1/F_table_1_1.png"
  ]
]

๊ตฌ์กฐํ™”๋œ ํ˜•์‹์˜ ์žฅ์ :

  • pair_id: ์›ํ•˜๋Š” ์‹๋ณ„์ž๋ฅผ ์ง์ ‘ ์ง€์ • ๊ฐ€๋Šฅ
  • domain: ๊ฐ pair๋งˆ๋‹ค ๋‹ค๋ฅธ domain ์ง€์ • ๊ฐ€๋Šฅ (CLI --domain๋ณด๋‹ค ์šฐ์„ )
  • ์ถœ๋ ฅ ๊ฒฐ๊ณผ์™€ ์ž…๋ ฅ ํ˜•์‹์˜ ์ผ๊ด€์„ฑ
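Since both input formats describe the same thing, one way to handle them is to normalize everything into the structured form before processing. The sketch below illustrates this; the function `normalize_pairs` and the fallback identifier `pair_<index>` for legacy entries are assumptions, not the behavior of run_pipeline_json.py.

```python
from typing import Any, Dict, List, Optional


def normalize_pairs(raw: List[Any], cli_domain: Optional[str] = None) -> List[Dict[str, Any]]:
    """Convert structured or legacy entries into structured records."""
    records = []
    for index, item in enumerate(raw):
        if isinstance(item, dict):
            # Structured entry: a per-pair domain wins over the CLI value.
            records.append({
                "pair_id": item.get("pair_id", f"pair_{index}"),
                "image_paths": list(item["image_paths"]),
                "domain": item.get("domain") or cli_domain,
            })
        else:
            # Legacy entry: a bare list of image paths (the id is an assumed fallback).
            records.append({
                "pair_id": f"pair_{index}",
                "image_paths": list(item),
                "domain": cli_domain,
            })
    return records


mixed = [
    {"pair_id": "F_table_1", "image_paths": ["a_0.png", "a_1.png"], "domain": "finance"},
    ["b_0.png", "b_1.png"],
]
print(normalize_pairs(mixed, cli_domain="public"))
```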

2.2 Basic usage

# Basic run (structured output)
uv run python run_pipeline_json.py \
  --input test_input.json \
  --output-dir output_results \
  --domain public

# Generate QA only (skip table generation)
uv run python run_pipeline_json.py \
  --input test_input.json \
  --output-dir output_qa_only \
  --qa-only

# Upload automatically to a Notion database
uv run python run_pipeline_json.py \
  --input test_input.json \
  --output-dir output_with_notion \
  --domain public \
  --upload-to-notion

2.3 Options

| Option | Description | Default |
|---|---|---|
| --input | Path to the JSON input file | (required) |
| --data-root | Base directory used to locate image files | data |
| --output-dir | Directory where the result JSON is saved | output_json |
| --provider | LLM provider | gemini_pool |
| --model | Model name to use | gemini-1.5-flash |
| --config-path | Path to the Gemini Pool configuration file | apis/gemini_keys.yaml |
| --domain | Force a specific domain | auto-detected |
| --qa-only | Skip table generation and generate QA only | false |
| --upload-to-notion | Upload QA results to a Notion DB | false |

2.4 Output format

Results are saved to {output-dir}/pipeline_output.json with the following structure:

[
  {
    "pair_id": "P_origin_0_1_0",
    "image_paths": [
      "data/Public/Table/P_origin_0/P_origin_0_1_0.png",
      "data/Public/Table/P_origin_0/P_origin_0_1_1.png"
    ],
    "domain": "public",
    "tables": [null, null],
    "qa_results": [
      {
        "question": "필기 과목명 '디지털 전자회로'의 문제수는 몇 문제인가요?",
        "answer": "20문제",
        "type": "lookup",
        "reasoning_annotation": "...",
        "context": null
      }
    ],
    "metadata": {
      "provider": "gemini_pool",
      "model": "gemini-1.5-flash",
      "qa_only": true
    },
    "notion_upload": {
      "success": true,
      "created_count": 10
    }
  }
]
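Because the output is plain JSON, it is easy to inspect with a few lines of Python. The following sketch counts generated questions by their "type" field; the helper name `count_qa_types` is hypothetical, and it assumes only the fields shown in the sample above.

```python
import json
from collections import Counter
from pathlib import Path


def count_qa_types(output_path: str) -> Counter:
    """Count QA items per "type" across all pairs in pipeline_output.json."""
    records = json.loads(Path(output_path).read_text(encoding="utf-8"))
    counts: Counter = Counter()
    for record in records:
        # "qa_results" may be missing or null for a failed pair.
        for qa in record.get("qa_results") or []:
            counts[qa.get("type", "unknown")] += 1
    return counts


if __name__ == "__main__":
    path = Path("output_results/pipeline_output.json")
    if path.exists():
        print(count_qa_types(str(path)))
```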

2.5 Notion upload configuration

To use the --upload-to-notion flag, apis/gemini_keys.yaml must contain the Notion settings:

# Notion API key
notion_key: secret_...

# Database ID per domain
notion_databases:
  public: your_database_id_here
  finance: another_database_id
  insurance: yet_another_database_id

Notes on uploading to Notion:

  • The required properties (Domain, Image, Question, Answer, Type, etc.) are created automatically in the Notion database
  • If an upload fails, the pipeline does not stop; the error information is recorded in the result JSON
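The "fail without stopping" behavior described above is typically achieved by wrapping the upload call and recording the outcome. The sketch below shows that pattern under stated assumptions: `upload_with_fallback` and the `upload` callable are hypothetical, the `success` and `created_count` keys follow the sample output in section 2.4, and the `error` key name is a guess.

```python
from typing import Any, Callable, Dict, List


def upload_with_fallback(
    qa_results: List[dict],
    upload: Callable[[List[dict]], int],
) -> Dict[str, Any]:
    """Run an upload and return a status record instead of raising."""
    try:
        created = upload(qa_results)  # hypothetical: returns the number of rows created
        return {"success": True, "created_count": created}
    except Exception as exc:  # any failure is recorded, not propagated
        return {"success": False, "error": str(exc)}


def failing_upload(items: List[dict]) -> int:
    raise RuntimeError("Notion API unavailable")


print(upload_with_fallback([{"question": "q"}], failing_upload))
```

The returned dictionary can be stored under the pair's "notion_upload" key so that failures remain visible in the saved results.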

3. Example run and results

$ uv run python -m generate_synthetic_table.cli ./image.png --provider gemini_pool

# Output log
2025-12-15 10:47:42 - polling_gemini.api_pool - INFO - 총 6개의 API 키를 로드했습니다.
2025-12-15 10:47:42 - polling_gemini.api_pool - INFO - API 키 'key1' 사용 중 (모델: gemini-2.5-flash)
2025-12-15 10:47:42 - generate_synthetic_table.flow - INFO - Entering node: generate_synthetic_table_from_image
2025-12-15 10:47:53 - generate_synthetic_table.flow - INFO - Entering node: self_reflection
2025-12-15 10:48:14 - generate_synthetic_table.flow - INFO - Entering node: revise_synthetic_table
2025-12-15 10:48:27 - generate_synthetic_table.flow - INFO - Entering node: self_reflection
2025-12-15 10:48:33 - generate_synthetic_table.flow - INFO - Entering node: parse_synthetic_table
2025-12-15 10:48:38 - generate_synthetic_table.flow - INFO - Entering node: generate_qa

# Result JSON
{
  "image_path": "./image.png",
  "synthetic_json": [
    {"경과기간": "1년", "납입보험료 누계": 600000, "해지환급금": 0, "환급률": 0},
    {"경과기간": "3년", "납입보험료 누계": 1800000, "해지환급금": 540000, "환급률": 30},
    {"경과기간": "5년", "납입보험료 누계": 3000000, "해지환급금": 1650000, "환급률": 55},
    ...
  ]
}

4. Running the validation tool (Web UI)

You can inspect and edit the generated output.json in the web interface.

Start the server:

python annotate_tools/server.py --file output.json

Start the client (in a separate terminal):

cd annotate_tools
npm install
npm run dev

๋ธŒ๋ผ์šฐ์ €์—์„œ http://localhost:5173์œผ๋กœ ์ ‘์†ํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ํ™•์ธํ•˜์„ธ์š”.

polling_gemini module

polling_gemini is a pooling system that rotates automatically across multiple Gemini API keys. When the free quota of one key is exhausted, it switches to the next key automatically.
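The rotation idea can be illustrated with a small, self-contained pool. This is a sketch of the general technique only, not the code in api_pool.py: the `KeyPool` class, the `QuotaExhausted` exception, and the `request` callable are all hypothetical names introduced for the example.

```python
from typing import Callable, List


class QuotaExhausted(Exception):
    """Hypothetical error raised when a key has no quota left."""


class KeyPool:
    """Try keys in turn, moving to the next one when quota runs out."""

    def __init__(self, keys: List[str]) -> None:
        if not keys:
            raise ValueError("at least one API key is required")
        self._keys = list(keys)
        self._index = 0

    def call(self, request: Callable[[str], str]) -> str:
        # Give every key one chance, starting from the current one.
        for _ in range(len(self._keys)):
            key = self._keys[self._index]
            try:
                return request(key)
            except QuotaExhausted:
                self._index = (self._index + 1) % len(self._keys)
        raise QuotaExhausted("all keys are exhausted")


def fake_request(key: str) -> str:
    # Stand-in for a real API call: pretend the first key is out of quota.
    if key == "key1":
        raise QuotaExhausted(key)
    return f"ok:{key}"


pool = KeyPool(["key1", "key2"])
print(pool.call(fake_request))
```

Remembering the index of the last working key avoids retrying an exhausted key on every call.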

Standalone usage example

from langchain_core.messages import HumanMessage
from polling_gemini import create_gemini_chat_model, invoke_gemini

# Use as a LangChain-compatible model
model = create_gemini_chat_model()
response = model.invoke([HumanMessage(content="Hello!")])

# Simple function call
response = invoke_gemini("Please analyze the table data.")

Using it in LangGraph

from langchain_core.messages import HumanMessage
from polling_gemini import create_gemini_chat_model
from langgraph.graph import StateGraph

model = create_gemini_chat_model()

def my_node(state):
    response = model.invoke([HumanMessage(content=state["query"])])
    return {"response": response.content}

graph = StateGraph(...)
graph.add_node("process", my_node)

🗄️ Google Drive & MongoDB integration (Data Handling)

With the db/data_handling.py module you can automatically traverse data stored in Google Drive and save its metadata to MongoDB.

1. Preliminary setup

  1. Enable the Google Drive API: create a project in Google Cloud Console and enable the Drive API.
  2. OAuth client ID: create an OAuth client ID of type 'Desktop app' and download the client.json file.
  3. Place the file: put the downloaded client.json file in the project's info/ folder or at the configured path.

2. Main features and usage

  • Automatic Drive traversal: explores the per-domain folders starting from the configured START_FOLDER_ID.
  • Metadata storage: saves the ID, name, and domain of each discovered file to the TableInformation database in MongoDB.
  • Image download: download_file_bytes() fetches a specific image as bytes.

3. How to run

Check the settings inside db/data_handling.py, or manage them with the following environment variables.

  • GOOGLE_DRIVE_CLIENT_JSON: path to the client.json file (default: info/client.json)
  • GOOGLE_DRIVE_START_FOLDER_ID: ID of the Google Drive folder where traversal starts

# Example of setting the environment variable (PowerShell)
$env:GOOGLE_DRIVE_START_FOLDER_ID="your_folder_id_here"
uv run python db/data_handling.py
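For reference, reading these two variables in Python could look like the sketch below. The function `load_drive_settings` is hypothetical and is not claimed to match db/data_handling.py; only the variable names and the default path come from the list above, and failing fast on a missing folder ID is an assumption.

```python
import os
from typing import Dict, Mapping, Optional


def load_drive_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the Drive settings, applying the documented default path."""
    source = os.environ if env is None else env
    client_json = source.get("GOOGLE_DRIVE_CLIENT_JSON", "info/client.json")
    start_folder_id = source.get("GOOGLE_DRIVE_START_FOLDER_ID")
    if not start_folder_id:
        # Assumed behavior: stop early rather than traverse an unknown folder.
        raise RuntimeError("GOOGLE_DRIVE_START_FOLDER_ID is not set")
    return {"client_json": client_json, "start_folder_id": start_folder_id}


print(load_drive_settings({"GOOGLE_DRIVE_START_FOLDER_ID": "your_folder_id_here"}))
```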

Important

Before running, be sure to place the client.json file in the info/ folder and verify the target folder ID.


Contributors 😃

License 🗞

This project is licensed under the MIT License.