Added comprehensive Markdown validation toolkit #2961

Tejassveer08 · 2025-10-21T09:57:29Z

Overview

This PR introduces a robust, modular Python-based validation toolkit for the TiDB Operator documentation repository. The toolkit provides automated checking of Markdown files for broken links, missing anchors, and invalid image references.

Changes Made

New Files Added

Core Package (`scripts/markdown_checks/`)

__init__.py - Package initialization with clean API exports
fs_utils.py - File system utilities for safe traversal and reading
markdown_parser.py - Lightweight regex-based Markdown content extraction
link_checker.py - Comprehensive link and anchor validation
image_checker.py - Image reference validation
report.py - Structured reporting with JSON and text output formats
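
To illustrate what "lightweight regex-based" extraction could look like, here is a minimal sketch. It is not the code from `markdown_parser.py`; the names `INLINE_REF`, `Ref`, and `extract_refs` are hypothetical, and the sketch only handles inline `[text](target)` and `![alt](target)` forms outside fenced code blocks.

```python
import re
from typing import List, NamedTuple

# Hypothetical pattern: inline links and images, e.g. [text](target) or ![alt](target).
# Reference-style links and titles inside the parentheses are not handled.
INLINE_REF = re.compile(r"(?P<bang>!?)\[(?P<text>[^\]]*)\]\((?P<target>[^)\s]*)\)")

FENCE = "`" * 3  # marker that opens or closes a fenced code block


class Ref(NamedTuple):
    line: int       # 1-based line number of the reference
    is_image: bool  # True for ![alt](target)
    target: str     # raw target, possibly empty


def extract_refs(markdown: str) -> List[Ref]:
    """Collect inline link and image targets, skipping fenced code blocks."""
    refs: List[Ref] = []
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        for match in INLINE_REF.finditer(line):
            refs.append(Ref(lineno, match.group("bang") == "!", match.group("target")))
    return refs
```

Recording the line number at extraction time is what would let later checkers report issues against a precise location.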

CLI Tools

scripts/validate_docs.py - Main Python CLI for running validation
scripts/validate-docs.ps1 - Windows PowerShell wrapper for convenience

Features

Link Validation

✅ External HTTP(S) link syntax validation
✅ Local relative file path verification
✅ Cross-file anchor fragment validation (file.md#section)
✅ Intra-file anchor validation (#section)
✅ Path traversal protection
✅ Empty link detection
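
As a rough sketch of how anchor validation might work, the example below derives anchors from headings and checks a fragment against them. The slug rule (lowercase, drop punctuation, spaces to hyphens) is an assumption that approximates GitHub-style anchors; the PR does not state which rule `link_checker.py` uses, and the function names are hypothetical.

```python
import re
from typing import Set


def slugify(heading: str) -> str:
    """Approximate a GitHub-style anchor (assumed rule, not taken from the PR)."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)  # drop punctuation, keep word chars, hyphens, spaces
    return slug.replace(" ", "-")


def heading_anchors(markdown: str) -> Set[str]:
    """Return the anchors produced by ATX headings (# through ######)."""
    anchors = set()
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*?)\s*#*\s*$", line)
        if match:
            anchors.add(slugify(match.group(2)))
    return anchors


def anchor_exists(fragment: str, markdown: str) -> bool:
    """Check an intra-file fragment such as '#section' against the headings."""
    return fragment.lstrip("#").lower() in heading_anchors(markdown)
```

For a cross-file link such as `file.md#section`, the same check would run against the contents of the target file. This sketch ignores duplicate headings and headings inside code fences.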

Image Validation

✅ Image file existence verification
✅ Path traversal protection
✅ Common image format validation (PNG, JPG, GIF, WebP, SVG)
✅ Uncommon extension warnings
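
The extension check could be as small as the sketch below. The set name and function name are hypothetical, and including `.jpeg` alongside `.jpg` is an assumption; the list above only names PNG, JPG, GIF, WebP, and SVG.

```python
from pathlib import PurePosixPath

# Assumed set of "common" extensions; .jpeg is added as an alias of .jpg.
COMMON_IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}


def has_common_image_ext(target: str) -> bool:
    """Return True when the image target ends in a common extension (case-insensitive)."""
    return PurePosixPath(target).suffix.lower() in COMMON_IMAGE_EXTS
```

A target that fails this check would presumably produce a warning rather than an error, since the file may still be a valid image.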

Reporting

✅ Human-readable text output
✅ Machine-readable JSON output
✅ Structured error codes and severity levels
✅ Line numbers reported for every issue
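
One way to get both output formats from a single structure is sketched below. The `Issue` fields, the text layout, and the top-level `"issues"` key are assumptions for illustration; the actual schema produced by `report.py` is not shown in this PR description.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Issue:
    file: str      # path of the Markdown file
    line: int      # 1-based line number
    code: str      # e.g. "MISSING_FILE"
    severity: str  # assumed values: "error" or "warning"
    message: str


def render_text(issues: List[Issue]) -> str:
    """One line per issue, in a compiler-style file:line format."""
    return "\n".join(
        f"{i.file}:{i.line}: {i.severity} [{i.code}] {i.message}" for i in issues
    )


def render_json(issues: List[Issue]) -> str:
    """Same data as JSON, for consumption by CI tooling."""
    return json.dumps({"issues": [asdict(i) for i in issues]}, indent=2)
```

Keeping the renderers separate from the checkers means a new output format would not require touching any validation logic.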

Usage

Python CLI

# Validate default directories (en/, zh/)
python scripts/validate_docs.py

# Validate specific files/directories
python scripts/validate_docs.py README.md en/deploy/

# JSON output for CI integration
python scripts/validate_docs.py --format json

# Custom repository root
python scripts/validate_docs.py --repo-root /path/to/repo

Windows PowerShell

# Validate default directories
./scripts/validate-docs.ps1

# Validate specific paths
./scripts/validate-docs.ps1 en README.md

Technical Details

Architecture

Dependency-free: Uses only Python standard library
Modular design: Each checker is independently testable
Safe traversal: Rejects link and image targets that resolve outside the repository root
Cross-platform: Works on Windows, macOS, and Linux
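
A standard-library-only traversal check might look like the following sketch. The function name `escapes_root` is hypothetical, and this is one common approach (resolve, then test containment), not necessarily the one `fs_utils.py` takes.

```python
from pathlib import Path


def escapes_root(repo_root: Path, md_file: Path, target: str) -> bool:
    """Return True when a relative target resolves outside repo_root."""
    root = repo_root.resolve()
    # Relative targets are interpreted from the directory of the linking file.
    resolved = (md_file.parent / target).resolve()
    try:
        resolved.relative_to(root)
    except ValueError:
        # relative_to raises ValueError when resolved is not under root.
        return True
    return False
```

Resolving both paths before comparing collapses `..` segments, and `pathlib` handles the separator differences between Windows, macOS, and Linux.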

Error Codes

LINK_WHITESPACE - External links with spaces
EMPTY_LINK - Empty link targets
MISSING_ANCHOR - Referenced anchor not found
PATH_TRAVERSAL - Link escapes repository root
MISSING_FILE - Referenced file not found
MISSING_IMAGE - Referenced image not found
UNCOMMON_EXT - Uncommon image file extension

Exit Codes

0 - Success, no errors found
1 - Errors found during validation
2 - Execution failure (invalid args, exceptions)
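
The mapping from results to exit codes can be sketched as follows. The constant names and `exit_code_for` are hypothetical, and treating warning-only runs as success is an assumption; the list above only says that `1` means errors were found.

```python
import sys
from typing import Callable, List

EXIT_OK, EXIT_ERRORS, EXIT_FAILURE = 0, 1, 2


def exit_code_for(run_checks: Callable[[], List[str]]) -> int:
    """Run the checks and map the returned severities to an exit code."""
    try:
        severities = run_checks()
    except Exception as exc:
        # The tool itself failed to run, which is distinct from finding broken links.
        print(f"validation failed to run: {exc}", file=sys.stderr)
        return EXIT_FAILURE
    # Assumption: warnings alone do not fail the run.
    return EXIT_ERRORS if "error" in severities else EXIT_OK
```

Separating `1` from `2` lets a CI job distinguish "the docs have problems" from "the validator could not run".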

Benefits

Automated Quality Assurance: Catches broken links and missing resources before they reach users
CI/CD Integration: JSON output enables easy integration with automated workflows
Developer Experience: Clear error messages with precise line numbers
Maintainability: Modular design makes it easy to add new validation rules
Cross-Platform: Works consistently across different operating systems

Testing

The toolkit can be tested by running it against the existing documentation:

# Test against current docs
python scripts/validate_docs.py --format text

# Test specific problematic files
python scripts/validate_docs.py en/deploy/aws-eks.md

Future Enhancements

This foundation enables future additions such as:

Spell checking integration
Style guide enforcement
Link freshness checking
Image optimization validation
Accessibility compliance checking

Files Changed

Added: scripts/markdown_checks/ (6 files, ~500 lines)
Added: scripts/validate_docs.py (~120 lines)
Added: scripts/validate-docs.ps1 (~30 lines)

Total: ~650 lines of new Python and PowerShell code

This contribution provides a solid foundation for maintaining documentation quality and can be easily extended with additional validation rules as needed.

First-time contributors' checklist

I've signed the Contributor License Agreement, which is required for the repository owners to accept my contribution.

What is changed, added, or deleted? (Required)

Which TiDB Operator version(s) do your changes apply to? (Required)

master (the latest development version for v1.x)
feature/v2 (the latest development version for v2.x)
v2.0 (TiDB Operator 2.0 versions)
v1.6 (TiDB Operator 1.6 versions)
v1.5 (TiDB Operator 1.5 versions)
v1.4 (TiDB Operator 1.4 versions)
v1.3 (TiDB Operator 1.3 versions)

What is the related PR or file link(s)?

This PR is translated from:
Other reference link(s):


ti-chi-bot · 2025-10-21T09:57:33Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lance6716 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2025-10-21T09:57:40Z

Welcome @Tejassveer08!

It looks like this is your first PR to pingcap/docs-tidb-operator 🎉.

I'm the bot that helps you request reviewers, add labels, and more. See available commands.

We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to pingcap/docs-tidb-operator. 😃

pingcap-cla-assistant · 2025-10-21T09:57:43Z

All committers have signed the CLA.

ti-chi-bot bot added the contribution (This PR is from a community contributor.) and first-time-contributor (Indicates that the PR was contributed by an external member and is a first-time contributor.) labels Oct 21, 2025

ti-chi-bot bot added the missing-translation-status (This PR does not have translation status info.) label Oct 21, 2025

ti-chi-bot bot added the size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.) label Oct 21, 2025
