A Python library that splits Markdown into hierarchical sections. It intelligently handles code blocks, normalizes Unicode characters, and maintains parent-child relationships between sections.
- Code-Aware: Preserves code blocks and comments while processing markdown
- Hierarchy Tracking: Automatically tracks parent headers for each section (H1-H4)
- Unicode Normalization: Converts non-ASCII characters to their ASCII equivalents
pip install -e .
For Development Setup:
git clone https://github.com/Gal-Gilor/markdown-chunkify.git
cd markdown-chunkify
poetry install
poetry run pytest tests -vv
The following environment variables are required for the UnicodeReplaceProcessor
component.
Variable | Description | Default |
---|---|---|
GOOGLE_API_KEY |
A Gemini API key | None |
GEMINI_MODEL_NAME |
A Gemini model name | gemini-2.0-flash |
from markdown_chunkify import MarkdownSplitter
from markdown_chunkify import PyMuPDFMParser
# Convert PDF to Markdown
markdown_text = PyMuPDFMParser.to_markdown(
file_path="document.pdf",
destination_path="document.md" # Optional
)
# Initialize splitter
splitter = MarkdownSplitter()
# Split from file
sections = splitter.from_file('document.md')
# Split from text
sections = splitter.split_text(markdown_text)
MarkdownContent: The base data structure representing a Markdown section. It contains the header and content of a section. It's used as a base class for the Section
model, and as a generation schema for structured output responses
class MarkdownContent(BaseModel):
section_header: str # The header of the section (without #)
section_text: str # The content of the section
Section: The primary data structure representing a Markdown section. It contains the header level, metadata, and content of a section.
class Section(MarkdownContent):
header_level: int # Number of # symbols (1-4)
metadata: SectionMetadata # Processing and hierarchy information
def to_markdown(self) -> str: # Convert section back to Markdown
SectionMetadata: A Section's metadata, containing information about the section's processing and hierarchy (i.e., from information about the parent headers and text normalization status).
class SectionMetadata(BaseModel):
token_count: int | None # Generation token count
model_version: str | None # Model used for normalization
normalized: bool # Whether Unicode normalization succeeded
error: str | None # Error message if normalization failed
original_content: MarkdownContent | None # Pre-normalization content
parents: dict[str, str | None] # Header hierarchy information
to_markdown()
: Convert to Markdown format
- Python 3.12+
Apache License 2.0 - See LICENSE