HTML to Markdown Converter

A Python application that converts HTML documentation files to Markdown format while preserving directory structure and managing assets intelligently.

Features

Directory Structure Preservation: Maintains the original folder hierarchy while normalizing paths
Smart Image Management:
- Moves all images to a centralized static/img/project_name folder
- Deduplicates images based on content using SHA256 hashing
- Removes unreferenced images automatically
Path Normalization:
- Converts all paths to lowercase
- Replaces spaces with underscores
- Uses absolute paths for all references
Content Extraction: Extracts only the main content (div role="main"), excluding navigation elements
Markdown Enhancement:
- Converts HTML code blocks to proper markdown with triple backticks
- Generates proper markdown-style anchor links
- Excludes HTML title attributes from links and images

Installation

Clone the repository:

git clone <repository-url>
cd html2markdown

Install dependencies:

pip install -r requirements.txt

Usage

Basic Usage

python3 main.py --input /path/to/html/docs --output /path/to/output/docs

Options

--input, -i: Input directory containing HTML files (required)
--output, -o: Output directory for markdown files (required)
--validate: Run validation after conversion to check paths and naming
--force: Overwrite output directory if it exists

Example

python3 main.py --input ../product_docs/1secure --output ../output_docs/1secure --force --validate

Output Structure

Given an input structure like:

product_docs/
└── ProjectName/
    ├── index.html
    ├── guide/
    │   └── setup.html
    └── images/
        └── screenshot.png

The converter produces:

output_docs/
└── projectname/
    ├── index.md
    └── guide/
        └── setup.md

static/
└── img/
    └── projectname/
        └── screenshot.png

How It Works

Preprocessing: HTML files are parsed and only the main content is extracted
Path Resolution: All links and image references are resolved to absolute paths
Conversion: HTML is converted to Markdown using customized conversion rules
Image Processing: Images are deduplicated and moved to the static folder
Path Updates: All image references in Markdown files are updated to point to the new locations
Cleanup: Empty directories and unreferenced images are removed

Architecture

main.py: CLI entry point
converter.py: Main orchestrator for the conversion process
preprocessor.py: HTML preprocessing and content extraction
markdown_converter.py: Custom Markdown conversion rules
image_manager.py: Image deduplication and management
path_resolver.py: Path resolution and normalization
file_handler.py: File system operations
validator.py: Output validation
utils.py: Utility functions

Requirements

Python 3.6+
Dependencies listed in requirements.txt:
- markdownify>=0.11.6
- beautifulsoup4>=4.12.0
- lxml>=4.9.0
- click>=8.1.0
- tqdm>=4.65.0
- pathlib2>=2.3.7

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

HTML to Markdown Converter

Features

Installation

Usage

Basic Usage

Options

Example

Output Structure

How It Works

Architecture

Requirements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
README.md		README.md
__init__.py		__init__.py
__main__.py		__main__.py
converter.py		converter.py
file_handler.py		file_handler.py
image_manager.py		image_manager.py
main.py		main.py
markdown_converter.py		markdown_converter.py
path_resolver.py		path_resolver.py
preprocessor.py		preprocessor.py
requirements.txt		requirements.txt
run_converter.py		run_converter.py
utils.py		utils.py
validator.py		validator.py

netwrix/html2markdown

Folders and files

Latest commit

History

Repository files navigation

HTML to Markdown Converter

Features

Installation

Usage

Basic Usage

Options

Example

Output Structure

How It Works

Architecture

Requirements

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages