Added comprehensive Markdown validation toolkit #2961
base: master
## Overview

This PR introduces a robust, modular Python-based validation toolkit for the TiDB Operator documentation repository. The toolkit provides automated checking of Markdown files for broken links, missing anchors, and invalid image references.

## Changes Made

### New Files Added

#### Core Package (`scripts/markdown_checks/`)

- **`__init__.py`** - Package initialization with clean API exports
- **`fs_utils.py`** - File system utilities for safe traversal and reading
- **`markdown_parser.py`** - Lightweight regex-based Markdown content extraction
- **`link_checker.py`** - Comprehensive link and anchor validation
- **`image_checker.py`** - Image reference validation
- **`report.py`** - Structured reporting with JSON and text output formats

#### CLI Tools

- **`scripts/validate_docs.py`** - Main Python CLI for running validation
- **`scripts/validate-docs.ps1`** - Windows PowerShell wrapper for convenience

## Features

### Link Validation

- ✅ External HTTP(S) links syntax validation
- ✅ Local relative file path verification
- ✅ Cross-file anchor fragment validation (`file.md#section`)
- ✅ Intra-file anchor validation (`#section`)
- ✅ Path traversal protection
- ✅ Empty link detection

### Image Validation

- ✅ Image file existence verification
- ✅ Path traversal protection
- ✅ Common image format validation (PNG, JPG, GIF, WebP, SVG)
- ✅ Uncommon extension warnings

### Reporting

- ✅ Human-readable text output
- ✅ Machine-readable JSON output
- ✅ Structured error codes and severity levels
- ✅ Line number precision for all issues

## Usage

### Python CLI

```bash
# Validate default directories (en/, zh/)
python scripts/validate_docs.py

# Validate specific files/directories
python scripts/validate_docs.py README.md en/deploy/

# JSON output for CI integration
python scripts/validate_docs.py --format json

# Custom repository root
python scripts/validate_docs.py --repo-root /path/to/repo
```

### Windows PowerShell

```powershell
# Validate default directories
./scripts/validate-docs.ps1

# Validate specific paths
./scripts/validate-docs.ps1 en README.md
```

## Technical Details

### Architecture

- **Dependency-free**: Uses only Python standard library
- **Modular design**: Each checker is independently testable
- **Safe traversal**: Prevents path traversal attacks
- **Cross-platform**: Works on Windows, macOS, and Linux

### Error Codes

- `LINK_WHITESPACE` - External links with spaces
- `EMPTY_LINK` - Empty link targets
- `MISSING_ANCHOR` - Referenced anchor not found
- `PATH_TRAVERSAL` - Link escapes repository root
- `MISSING_FILE` - Referenced file not found
- `MISSING_IMAGE` - Referenced image not found
- `UNCOMMON_EXT` - Uncommon image file extension

### Exit Codes

- `0` - Success, no errors found
- `1` - Errors found during validation
- `2` - Execution failure (invalid args, exceptions)

## Benefits

1. **Automated Quality Assurance**: Catches broken links and missing resources before they reach users
2. **CI/CD Integration**: JSON output enables easy integration with automated workflows
3. **Developer Experience**: Clear error messages with precise line numbers
4. **Maintainability**: Modular design makes it easy to add new validation rules
5. **Cross-Platform**: Works consistently across different operating systems

## Testing

The toolkit can be tested by running it against the existing documentation:

```bash
# Test against current docs
python scripts/validate_docs.py --format text

# Test specific problematic files
python scripts/validate_docs.py en/deploy/aws-eks.md
```

## Future Enhancements

This foundation enables future additions such as:

- Spell checking integration
- Style guide enforcement
- Link freshness checking
- Image optimization validation
- Accessibility compliance checking

## Files Changed

- **Added**: `scripts/markdown_checks/` (6 files, ~500 lines)
- **Added**: `scripts/validate_docs.py` (~120 lines)
- **Added**: `scripts/validate-docs.ps1` (~30 lines)

**Total**: ~650 lines of new Python and PowerShell code

This contribution provides a solid foundation for maintaining documentation quality and can be easily extended with additional validation rules as needed.
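
For readers who want a feel for how the local-link checks map onto the error codes listed under Error Codes, here is a minimal sketch. It is not the code in this PR: the helper name `check_local_link`, its signature, and the order of the checks are assumptions made for illustration. Only the error-code strings (`EMPTY_LINK`, `PATH_TRAVERSAL`, `MISSING_FILE`) come from the description above. Anchor validation is deliberately left out.

```python
from pathlib import Path
from typing import Optional


def check_local_link(repo_root: Path, source_file: Path, target: str) -> Optional[str]:
    """Return an error code for a local link target, or None if no problem is found.

    Hypothetical helper: the name, signature, and check order are illustrative
    assumptions, not the toolkit's actual API.
    """
    if not target.strip():
        return "EMPTY_LINK"

    # Drop any "#fragment"; anchor checking is out of scope for this sketch.
    path_part = target.split("#", 1)[0]
    if not path_part:
        return None  # pure intra-file anchor such as "#section"

    root = repo_root.resolve()
    # Relative links are resolved against the directory of the linking file.
    resolved = (source_file.parent / path_part).resolve()

    # A target that resolves outside the repository root is a traversal attempt.
    try:
        resolved.relative_to(root)
    except ValueError:
        return "PATH_TRAVERSAL"

    if not resolved.exists():
        return "MISSING_FILE"
    return None
```

Resolving both paths before comparing them means `..` segments and symlinks are normalized first, so a link such as `../../outside.md` is caught even when the file it points at happens to exist.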
[APPROVALNOTIFIER] This PR is NOT APPROVED.

Welcome @Tejassveer08!
First-time contributors' checklist
What is changed, added, or deleted? (Required)
Which TiDB Operator version(s) do your changes apply to? (Required)
What is the related PR or file link(s)?