A Python library for parsing Word documents (.docx) and splitting them into sections using the unstructured
library.
- Parse Word documents into sections based on headings
- Extract section content and metadata
- Save sections to individual text files
- Support for nested sections with heading levels
- Robust handling of document structure
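To illustrate the idea behind heading-based splitting, here is a minimal, standard-library-only sketch. It is not the package's implementation: the `Section` dataclass, the `split_by_headings` function, and the `(kind, level, text)` tuples are hypothetical stand-ins for whatever the parser produces internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """A heading plus the body text that follows it (hypothetical structure)."""
    title: str
    level: int
    lines: List[str] = field(default_factory=list)


def split_by_headings(elements):
    """Group (kind, level, text) tuples into sections.

    `elements` is an illustrative stand-in for parsed document elements:
    kind is "heading" or "text"; level is the heading depth (1 = top level)
    and is ignored for text elements.
    """
    sections = []
    current = None
    for kind, level, text in elements:
        if kind == "heading":
            # Every heading starts a new section, whatever its depth.
            current = Section(title=text, level=level)
            sections.append(current)
        elif current is not None:
            current.lines.append(text)
        # Text that appears before the first heading is skipped in this sketch.
    return sections


elements = [
    ("heading", 1, "Introduction"),
    ("text", 0, "Background and goals."),
    ("heading", 1, "Methods"),
    ("heading", 2, "Data collection"),
    ("text", 0, "Surveys were sent by email."),
]

sections = split_by_headings(elements)
for s in sections:
    # Indent by heading level to show the nesting.
    print("  " * (s.level - 1) + s.title, len(s.lines))
```

Keeping the heading level on each section makes it possible to rebuild the nesting later without storing an explicit tree.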
- Clone the repository:
git clone [email protected]:project-delphi/word_file_parser.git
cd word_file_parser
- Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the package in development mode:
pip install -e .
from word_file_parser import DocxParser
# Initialize parser with a Word document
parser = DocxParser("path/to/document.docx")
# Parse sections
sections = parser.parse_sections()
# Get a specific section
introduction = parser.get_section("Introduction")
# Save all sections to files
parser.save_sections_to_files("output_directory")
# Save a specific section
parser.save_section("Methods", "output_directory")
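As a rough picture of what saving sections to individual text files can involve, the sketch below writes a title-to-content mapping into a directory. It assumes nothing about the package's internals: `safe_filename`, `save_sections`, and the file naming scheme are hypothetical and may differ from what `save_sections_to_files` actually does.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(title):
    """Turn a section title into a filesystem-friendly file name (hypothetical scheme)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_").lower()
    return (slug or "section") + ".txt"


def save_sections(sections, output_dir):
    """Write each {title: content} pair to its own text file and return the paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for title, content in sections.items():
        path = out / safe_filename(title)
        path.write_text(content, encoding="utf-8")
        paths.append(path)
    return paths


example_sections = {
    "Introduction": "Background and goals.",
    "Methods & Materials": "Surveys were sent by email.",
}

# Write into a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    written = save_sections(example_sections, tmp)
    names = sorted(p.name for p in written)
    contents = (Path(tmp) / "introduction.txt").read_text(encoding="utf-8")
```

In this sketch, two titles that reduce to the same slug would overwrite each other, so a real implementation would likely need a numeric prefix or another way to keep names unique.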
python -m pytest tests/
This project uses:
- black for code formatting
- isort for import sorting
- flake8 for linting
MIT License