HTML to Markdown Test Files for lightfeed-extract
This repository contains test files for validating the conversion of HTML into LLM-extractor-ready Markdown. It tests three conversion variants:
- Basic Conversion - Converting all HTML content to Markdown (without images)
- Main Content Extraction - Extracting and converting only the main content from HTML files (without images)
- Conversion with Images - Converting all HTML content to Markdown including images
├── html/                     # Source HTML files
│   ├── forum/                # Forum HTML samples
│   │   ├── tech-0.html
│   │   └── ...
│   └── ...
│
└── groundtruth/              # Expected Markdown output files
    ├── forum/                # Expected forum conversion results
    │   ├── tech-0.md         # Basic conversion expected output
    │   ├── tech-0.main.md    # Main-content-only expected output
    │   ├── tech-0.images.md  # Conversion with images expected output
    │   └── ...
    └── ...
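The layout above can be walked programmatically. The following is a minimal sketch, assuming the repository root is passed in as `root` and that each category is a single directory level under `html/`. The function name `list_html_samples` is hypothetical and is not part of lightfeed-extract.

```python
from collections import defaultdict
from pathlib import Path


def list_html_samples(root):
    """Group HTML sample paths by category (the directory directly under html/)."""
    html_dir = Path(root) / "html"
    samples = defaultdict(list)
    # Match html/<category>/<file-name>.html; other file types are ignored.
    for path in sorted(html_dir.glob("*/*.html")):
        # path.parent.name is the category directory, e.g. "forum"
        samples[path.parent.name].append(path)
    return dict(samples)
```

Calling `list_html_samples(".")` from the repository root would return a dictionary keyed by category name, with each value holding the sorted list of HTML files in that category.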
Files follow a specific naming pattern to clearly indicate their purpose:
- html/[category]/[file-name].html - Original HTML source files
- groundtruth/[category]/[file-name].md - Expected output for basic HTML conversion
- groundtruth/[category]/[file-name].main.md - Expected output for main content extraction
- groundtruth/[category]/[file-name].images.md - Expected output for conversion with images
For example:
- html/forum/tech.html - Original forum HTML file
- groundtruth/forum/tech.md - Expected Markdown after basic conversion (no images)
- groundtruth/forum/tech.main.md - Expected Markdown when only extracting main content (no images)
- groundtruth/forum/tech.images.md - Expected Markdown with images included
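The naming pattern can be expressed as a small path-mapping helper. This is only a sketch under stated assumptions: the variant keys `basic`, `main`, and `images` and the function name `expected_paths` are illustrative labels chosen here, not names defined by lightfeed-extract. Only the file suffixes come from the pattern described above.

```python
from pathlib import Path

# Suffix for each conversion variant, taken from the naming pattern.
# The dictionary keys are illustrative labels (an assumption).
VARIANT_SUFFIXES = {
    "basic": ".md",
    "main": ".main.md",
    "images": ".images.md",
}


def expected_paths(html_path, root="."):
    """Map an HTML sample path to its three expected Markdown files."""
    html_path = Path(html_path)
    category = html_path.parent.name  # e.g. "forum"
    stem = html_path.stem             # e.g. "tech" for "tech.html"
    base = Path(root) / "groundtruth" / category
    return {
        variant: base / (stem + suffix)
        for variant, suffix in VARIANT_SUFFIXES.items()
    }
```

For instance, `expected_paths("html/forum/tech.html")["main"]` yields the path `groundtruth/forum/tech.main.md`, matching the example list above.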
The HTML test files in this repository are used solely for testing. All files have been sanitized: personal information and sensitive content are replaced with generic placeholders, while the structure and formatting of the HTML are preserved.