A library for processing common file formats within the Nodetool ecosystem. It provides a collection of nodes for handling PDFs, Excel spreadsheets, Word documents, Markdown, and other document types.
- Text extraction with page range control
- Image extraction from PDF documents
- Table extraction with conversion to dataframes
- Page metadata extraction (dimensions, rotation, etc.)
- Page counting and document analysis
- Create new workbooks and worksheets
- Convert between DataFrames and Excel worksheets
- Apply cell formatting and styles
- Auto-fit column widths
- Save workbooks with customizable naming
- Word document (.docx) handling
- Markdown processing
- Pandoc integration for document conversion
- PyMuPDF integration for advanced PDF processing
Each file format is handled through specialized nodes that inherit from BaseNode. These nodes can be used individually or chained together in workflows.
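To make the idea of chaining concrete, here is a minimal, self-contained sketch. It does not use Nodetool's actual `BaseNode`; the `SketchNode` class, its `process` method, the two example nodes, and `run_chain` are all hypothetical stand-ins, chosen only to show how the output of one node can feed the next.

```python
from __future__ import annotations

from typing import Any, Iterable


class SketchNode:
    """Hypothetical stand-in for a node base class (NOT Nodetool's BaseNode)."""

    def process(self, value: Any) -> Any:
        raise NotImplementedError


class SplitLines(SketchNode):
    """Turn a block of text into a list of non-empty, stripped lines."""

    def process(self, value: str) -> list[str]:
        return [line.strip() for line in value.splitlines() if line.strip()]


class CountItems(SketchNode):
    """Count the items produced by the previous node."""

    def process(self, value: list) -> int:
        return len(value)


def run_chain(nodes: Iterable[SketchNode], value: Any) -> Any:
    """Feed the output of each node into the next one, in order."""
    for node in nodes:
        value = node.process(value)
    return value


if __name__ == "__main__":
    print(run_chain([SplitLines(), CountItems()], "first line\n\nsecond line\n"))
```

In a real workflow the workflow engine, not a helper like `run_chain`, would presumably be responsible for wiring node outputs to node inputs.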
    # Extract text from a PDF
    text_node = ExtractText(
        pdf=document_ref,
        start_page=0,
        end_page=4,
    )

    # Extract tables from a PDF
    tables_node = ExtractTables(
        pdf=document_ref,
        start_page=0,
        end_page=-1,  # All pages
    )

    # Create a new workbook
    workbook_node = CreateWorkbook(
        sheet_name="Data",
    )

    # Write a DataFrame to Excel
    excel_writer = DataFrameToExcel(
        workbook=workbook_ref,
        dataframe=df_ref,
        sheet_name="Data",
        include_header=True,
    )

PDF nodes:

- ExtractText: Extract text content from PDF files
- ExtractImages: Extract embedded images from PDFs
- ExtractTables: Convert PDF tables to dataframes
- GetPageCount: Count pages in PDF documents
- ExtractPageMetadata: Get page dimensions and properties

Excel nodes:

- CreateWorkbook: Initialize new Excel workbooks
- DataFrameToExcel: Convert DataFrames to Excel worksheets
- ExcelToDataFrame: Import Excel data as DataFrames
- FormatCells: Apply styling to Excel cells
- AutoFitColumns: Optimize column widths
- SaveWorkbook: Export Excel files

Other formats:

- Word document processing
- Markdown conversion
- Document format conversion via Pandoc
- Advanced PDF manipulation with PyMuPDF
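The PDF examples pass `start_page=0` with either `end_page=4` or `end_page=-1` (all pages). The sketch below shows one plausible way such a range could be resolved against a document's page count. It assumes 0-based page indices, an inclusive `end_page`, and `-1` meaning "through the last page"; only the `-1` convention comes from the example above, and the helper name `resolve_page_range` is hypothetical, not part of the library.

```python
def resolve_page_range(start_page: int, end_page: int, page_count: int) -> range:
    """Return the 0-based page indices selected by start_page/end_page.

    Assumptions (not confirmed by the library): end_page is inclusive,
    and end_page == -1 selects everything through the last page.
    """
    if page_count < 0:
        raise ValueError("page_count must be >= 0")
    if start_page < 0:
        raise ValueError("start_page must be >= 0")

    last_index = page_count - 1
    # Clamp the requested end to the last available page.
    last = last_index if end_page == -1 else min(end_page, last_index)

    if start_page > last:
        return range(0)  # nothing to extract
    return range(start_page, last + 1)


if __name__ == "__main__":
    print(list(resolve_page_range(0, 4, 10)))
    print(list(resolve_page_range(0, -1, 3)))
```

Clamping to the last page, rather than raising an error, is a design choice made for this sketch; the actual nodes may behave differently for out-of-range values.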
The library relies on several Python packages for file processing:
- pdfplumber: PDF text and table extraction
- openpyxl: Excel file handling
- python-docx: Word document processing
- pandoc: Document format conversion
- PyMuPDF: Advanced PDF processing
- markdown: Markdown processing
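Before running a workflow, it can help to confirm that these dependencies are actually available. The following sketch uses only the standard library. Note that some packages are imported under a different name than they are installed under (python-docx imports as `docx`, PyMuPDF as `pymupdf` or the older `fitz`). Treating "pandoc" as the Pandoc command-line executable is an assumption; the library may instead use a Python wrapper. The `check_dependencies` helper is hypothetical and not part of the library.

```python
from __future__ import annotations

import importlib.util
import shutil

# Package name as listed above -> module name(s) to try importing.
IMPORT_NAMES = {
    "pdfplumber": ("pdfplumber",),
    "openpyxl": ("openpyxl",),
    "python-docx": ("docx",),
    "PyMuPDF": ("pymupdf", "fitz"),
    "markdown": ("markdown",),
}


def check_dependencies() -> dict[str, bool]:
    """Report which dependencies can be found in the current environment."""
    status = {
        package: any(importlib.util.find_spec(name) is not None for name in modules)
        for package, modules in IMPORT_NAMES.items()
    }
    # Assumption: "pandoc" refers to the Pandoc executable on PATH.
    status["pandoc"] = shutil.which("pandoc") is not None
    return status


if __name__ == "__main__":
    for package, available in check_dependencies().items():
        print(f"{package}: {'found' if available else 'missing'}")
```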
- Document data extraction and analysis
- Automated report generation
- Data conversion between formats
- Document format transformation
- Image extraction and processing
- Table data extraction and structuring