Hitting token limits when passing large contexts to your LLM? Not anymore: this chunker solves the problem. It is a token-aware, LangChain-compatible chunker that splits text (from PDF, Markdown, or plain text) into semantically coherent chunks while respecting model token limits.
- 🔍 **Model-Aware Token Limits**: Automatically adjusts chunk size for GPT-3.5, GPT-4, Claude, and others (see the sketch after this list).
- 📄 **Multi-format Input Support**: PDF via `pdfplumber`, plain `.txt`, and Markdown (extendable to `.docx` and `.html`).
- 🔁 **Overlapping Chunks**: Smart overlap between paragraphs to preserve context.
- 🧠 **Smart Merging**: Merges chunks smaller than 300 tokens.
- 🧩 **Retriever-Ready**: Direct integration with LangChain retrievers via FAISS.
- 🔧 **CLI Support**: Run from the terminal with one command.
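Model-aware sizing works on tokens, not characters: the same paragraph occupies a different number of tokens depending on the model's tokenizer. A minimal sketch of that idea using `tiktoken` (an assumption for illustration; the package may count tokens differently internally):

```python
# Sketch only: tiktoken is assumed here and is not necessarily what
# semantic-chunker-langchain uses under the hood.
import tiktoken

def count_tokens(text: str, model_name: str = "gpt-3.5-turbo") -> int:
    """Count tokens with the tokenizer matched to the target model."""
    encoding = tiktoken.encoding_for_model(model_name)
    return len(encoding.encode(text))

paragraph = "Semantic chunking keeps related sentences in the same chunk."
print(count_tokens(paragraph))  # chunk budgets are expressed in these token counts
```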
Install from PyPI (requires Python 3.9 - 3.12):

```bash
pip install semantic-chunker-langchain
```
Chunk a PDF from the terminal, writing the chunks to both a text file and a JSON file:

```bash
semantic-chunker sample.pdf --txt chunks.txt --json chunks.json
```
Python usage:

```python
from semantic_chunker_langchain.chunker import SemanticChunker, SimpleSemanticChunker
from semantic_chunker_langchain.extractors.pdf import extract_pdf
from semantic_chunker_langchain.outputs.formatter import write_to_txt

# Extract documents from the PDF
docs = extract_pdf("sample.pdf")

# Using SemanticChunker, sized for the target model's token limit
chunker = SemanticChunker(model_name="gpt-3.5-turbo")
chunks = chunker.split_documents(docs)

# Save the chunks to a text file
write_to_txt(chunks, "output.txt")

# Alternatively, using SimpleSemanticChunker
simple_chunker = SimpleSemanticChunker(model_name="gpt-3.5-turbo")
simple_chunks = simple_chunker.split_documents(docs)
```
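The chunkers return standard LangChain `Document` objects, so the results can be inspected directly; a quick look at the first few chunks (assuming the usual `page_content` field):

```python
# Peek at the first few chunks; each is a LangChain Document
for i, chunk in enumerate(chunks[:3]):
    print(f"--- chunk {i}: {len(chunk.page_content)} characters ---")
    print(chunk.page_content[:120])
```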
Turn the chunks into a FAISS-backed LangChain retriever:

```python
from langchain_community.embeddings import OpenAIEmbeddings

retriever = chunker.to_retriever(chunks, embedding=OpenAIEmbeddings())
```
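Assuming `to_retriever` returns a standard LangChain retriever (and that `OPENAI_API_KEY` is set so the embeddings can be computed), it can be queried like any other retriever:

```python
# Fetch the chunks most similar to a query from the FAISS index
results = retriever.invoke("What is the document's main conclusion?")
for doc in results:
    print(doc.page_content[:100])
```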
Run the test suite:

```bash
poetry run pytest tests/
```
Authors:

- Prajwal Shivaji Mandale
- Sudhnwa Ghorpade
This project is licensed under the MIT License.