@Tejassveer08 Tejassveer08 commented Oct 21, 2025

First-time contributors' checklist

What is changed, added, or deleted? (Required)

Which TiDB Operator version(s) do your changes apply to? (Required)

  • master (the latest development version for v1.x)
  • feature/v2 (the latest development version for v2.x)
  • v2.0 (TiDB Operator 2.0 versions)
  • v1.6 (TiDB Operator 1.6 versions)
  • v1.5 (TiDB Operator 1.5 versions)
  • v1.4 (TiDB Operator 1.4 versions)
  • v1.3 (TiDB Operator 1.3 versions)

What is the related PR or file link(s)?

  • This PR is translated from:
  • Other reference link(s):

## Overview

This PR introduces a robust, modular Python-based validation toolkit for the TiDB Operator documentation repository. The toolkit provides automated checking of Markdown files for broken links, missing anchors, and invalid image references.

## Changes Made

### New Files Added

#### Core Package (`scripts/markdown_checks/`)
- **`__init__.py`** - Package initialization with clean API exports
- **`fs_utils.py`** - File system utilities for safe traversal and reading
- **`markdown_parser.py`** - Lightweight regex-based Markdown content extraction
- **`link_checker.py`** - Comprehensive link and anchor validation
- **`image_checker.py`** - Image reference validation
- **`report.py`** - Structured reporting with JSON and text output formats
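
For reviewers who want a feel for what "lightweight regex-based extraction" can look like, here is a minimal sketch. It is not the code in `markdown_parser.py`; the names `INLINE_REF`, `Ref`, and `extract_refs` are hypothetical, and the pattern only covers inline `[text](target)` and `![alt](target)` forms without titles or nested brackets.

```python
import re
from typing import Iterator, NamedTuple

# Hypothetical pattern: inline links and images, no title or nested brackets.
INLINE_REF = re.compile(r"(!?)\[([^\]]*)\]\(([^)]*)\)")
FENCE = re.compile(r"^\s*(```|~~~)")


class Ref(NamedTuple):
    line: int        # 1-based line number in the Markdown file
    is_image: bool   # True for ![alt](target)
    text: str
    target: str


def extract_refs(markdown: str) -> Iterator[Ref]:
    """Yield inline link and image references, skipping fenced code blocks."""
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        if FENCE.match(line):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        for m in INLINE_REF.finditer(line):
            yield Ref(lineno, m.group(1) == "!", m.group(2), m.group(3).strip())
```

Tracking the line number at extraction time is what lets later checks report a precise location without re-reading the file.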

#### CLI Tools
- **`scripts/validate_docs.py`** - Main Python CLI for running validation
- **`scripts/validate-docs.ps1`** - Windows PowerShell wrapper for convenience

## Features

### Link Validation
- ✅ External HTTP(S) link syntax validation
- ✅ Local relative file path verification
- ✅ Cross-file anchor fragment validation (`file.md#section`)
- ✅ Intra-file anchor validation (`#section`)
- ✅ Path traversal protection
- ✅ Empty link detection
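
As an illustration of anchor validation, the sketch below collects heading anchors from a file and checks a fragment against them. It assumes a simplified GitHub-style slug rule (lowercase, punctuation dropped, spaces turned into hyphens); `slugify`, `collect_anchors`, and `anchor_exists` are hypothetical names, and the sketch does not handle duplicate-heading suffixes or `#` lines inside code fences.

```python
import re
from typing import Set

# ATX headings only: one to six '#' characters followed by the heading text.
HEADING = re.compile(r"^#{1,6}\s+(.*?)\s*#*\s*$")


def slugify(heading: str) -> str:
    """Approximate a GitHub-style anchor for a heading (simplified rule)."""
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)  # drop punctuation, keep word chars
    return text.replace(" ", "-")


def collect_anchors(markdown: str) -> Set[str]:
    """Return the set of anchors generated by the headings in a document."""
    anchors = set()
    for line in markdown.splitlines():
        m = HEADING.match(line)
        if m:
            anchors.add(slugify(m.group(1)))
    return anchors


def anchor_exists(fragment: str, markdown: str) -> bool:
    """Check a '#section' fragment against the document's heading anchors."""
    return fragment.lstrip("#").lower() in collect_anchors(markdown)
```

For a cross-file link such as `file.md#section`, the same check would run against the contents of the target file rather than the file containing the link.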

### Image Validation
- ✅ Image file existence verification
- ✅ Path traversal protection
- ✅ Common image format validation (PNG, JPG, GIF, WebP, SVG)
- ✅ Uncommon extension warnings
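
The image checks could be sketched roughly as follows. This is an assumption about the approach, not the contents of `image_checker.py`: `check_image` and `COMMON_IMAGE_EXTS` are hypothetical names, and treating `.jpeg` as common alongside `.jpg` is also an assumption.

```python
from pathlib import Path
from typing import Optional

# Formats listed in this PR, plus ".jpeg" (assumed) as an alias for JPG.
COMMON_IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}


def check_image(md_file: Path, target: str) -> Optional[str]:
    """Return an error code for a local image reference, or None if it looks fine."""
    if target.startswith(("http://", "https://")):
        return None  # remote images are out of scope for this sketch
    image_path = (md_file.parent / target).resolve()
    if not image_path.is_file():
        return "MISSING_IMAGE"
    if image_path.suffix.lower() not in COMMON_IMAGE_EXTS:
        return "UNCOMMON_EXT"
    return None
```

Existence is checked before the extension so that a missing file is reported as missing rather than as an unusual format.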

### Reporting
- ✅ Human-readable text output
- ✅ Machine-readable JSON output
- ✅ Structured error codes and severity levels
- ✅ Line number precision for all issues
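
To make the reporting model concrete, here is one possible shape for an issue record with text and JSON renderers. The `Issue` fields, the `file:line: severity [CODE] message` text layout, and the function names are assumptions for illustration; the actual output of `report.py` may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Issue:
    file: str      # path of the Markdown file, relative to the repo root
    line: int      # 1-based line number of the offending reference
    code: str      # e.g. "MISSING_FILE"
    severity: str  # e.g. "error" or "warning" (assumed levels)
    message: str


def render_text(issues: List[Issue]) -> str:
    """One line per issue, in a grep-friendly file:line prefix format."""
    return "\n".join(
        f"{i.file}:{i.line}: {i.severity} [{i.code}] {i.message}" for i in issues
    )


def render_json(issues: List[Issue]) -> str:
    """A JSON array of issue objects for CI tooling to consume."""
    return json.dumps([asdict(i) for i in issues], indent=2)
```

Keeping both renderers on top of the same record type means a new check only has to produce `Issue` values, not format them.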

## Usage

### Python CLI
```bash
# Validate default directories (en/, zh/)
python scripts/validate_docs.py

# Validate specific files/directories
python scripts/validate_docs.py README.md en/deploy/

# JSON output for CI integration
python scripts/validate_docs.py --format json

# Custom repository root
python scripts/validate_docs.py --repo-root /path/to/repo
```
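
The invocations above imply an argument parser along these lines. This is a sketch built only from the flags shown in this description; `build_parser` is a hypothetical name, and the `text` default for `--format` and the `.` default for `--repo-root` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the documented CLI usage (sketch)."""
    parser = argparse.ArgumentParser(
        prog="validate_docs.py",
        description="Validate Markdown links, anchors, and image references.",
    )
    # With nargs="*", argparse falls back to the default when no paths are given.
    parser.add_argument("paths", nargs="*", default=["en", "zh"])
    parser.add_argument("--format", choices=["text", "json"], default="text")
    parser.add_argument("--repo-root", default=".")
    return parser
```

By default `argparse` exits with status 2 on invalid arguments, which lines up with the "execution failure" exit code listed under Exit Codes.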

### Windows PowerShell
```powershell
# Validate default directories
./scripts/validate-docs.ps1

# Validate specific paths
./scripts/validate-docs.ps1 en README.md
```

## Technical Details

### Architecture
- **Dependency-free**: Uses only Python standard library
- **Modular design**: Each checker is independently testable
- **Safe traversal**: Prevents path traversal attacks
- **Cross-platform**: Works on Windows, macOS, and Linux
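
The "safe traversal" point can be illustrated with a small `pathlib` sketch: resolve the link target relative to the file that contains it, then confirm the result is still under the repository root. `resolve_within_root` is a hypothetical helper, not necessarily what `fs_utils.py` does.

```python
from pathlib import Path
from typing import Optional


def resolve_within_root(repo_root: Path, md_file: Path, target: str) -> Optional[Path]:
    """Resolve a relative link target; return None if it escapes repo_root."""
    root = repo_root.resolve()
    candidate = (md_file.parent / target).resolve()
    try:
        # relative_to raises ValueError when candidate is not under root.
        candidate.relative_to(root)
    except ValueError:
        return None
    return candidate
```

A `None` result here would correspond to the `PATH_TRAVERSAL` error code. Resolving before comparing matters because a plain string prefix check can be fooled by `..` segments.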

### Error Codes
- `LINK_WHITESPACE` - External links with spaces
- `EMPTY_LINK` - Empty link targets
- `MISSING_ANCHOR` - Referenced anchor not found
- `PATH_TRAVERSAL` - Link escapes repository root
- `MISSING_FILE` - Referenced file not found
- `MISSING_IMAGE` - Referenced image not found
- `UNCOMMON_EXT` - Uncommon image file extension
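
As an example of how two of these codes might be produced, the sketch below classifies a raw link target. The function name `check_link_target` and the exact rules (strip surrounding whitespace first, then flag any whitespace left inside an HTTP(S) URL) are assumptions.

```python
from typing import Optional


def check_link_target(target: str) -> Optional[str]:
    """Return EMPTY_LINK or LINK_WHITESPACE for a bad target, else None."""
    stripped = target.strip()
    if not stripped:
        return "EMPTY_LINK"
    is_external = stripped.startswith(("http://", "https://"))
    if is_external and any(ch.isspace() for ch in stripped):
        return "LINK_WHITESPACE"
    return None
```

Local targets fall through with `None` here because they would be handled by the file, anchor, and traversal checks instead.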

### Exit Codes
- `0` - Success, no errors found
- `1` - Errors found during validation
- `2` - Execution failure (invalid args, exceptions)
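
The exit-code contract could be wired up as in the sketch below. It assumes findings are `(severity, message)` pairs and that only `error` severity yields exit code 1, so warnings alone still exit 0; both points are assumptions, as is the name `exit_code_for`.

```python
import sys
from typing import Callable, Iterable, List, Tuple

Finding = Tuple[str, str]  # (severity, message); hypothetical shape


def exit_code_for(
    validate: Callable[[Iterable[str]], List[Finding]],
    paths: Iterable[str],
) -> int:
    """Map a validation run to the documented exit codes 0, 1, and 2."""
    try:
        findings = validate(paths)
    except Exception as exc:  # any crash counts as an execution failure
        print(f"execution failure: {exc}", file=sys.stderr)
        return 2
    return 1 if any(sev == "error" for sev, _ in findings) else 0
```

A CI job can then fail the build on a non-zero status and tell a broken docs tree (1) apart from a broken validator run (2).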

## Benefits

1. **Automated Quality Assurance**: Catches broken links and missing resources before they reach users
2. **CI/CD Integration**: JSON output enables easy integration with automated workflows
3. **Developer Experience**: Clear error messages with precise line numbers
4. **Maintainability**: Modular design makes it easy to add new validation rules
5. **Cross-Platform**: Works consistently across different operating systems

## Testing

The toolkit can be tested by running it against the existing documentation:

```bash
# Test against current docs
python scripts/validate_docs.py --format text

# Test specific problematic files
python scripts/validate_docs.py en/deploy/aws-eks.md
```

## Future Enhancements

This foundation enables future additions such as:
- Spell checking integration
- Style guide enforcement
- Link freshness checking
- Image optimization validation
- Accessibility compliance checking

## Files Changed

- **Added**: `scripts/markdown_checks/` (6 files, ~500 lines)
- **Added**: `scripts/validate_docs.py` (~120 lines)
- **Added**: `scripts/validate-docs.ps1` (~30 lines)

**Total**: ~650 lines of new Python and PowerShell code

This contribution provides a solid foundation for maintaining documentation quality and can be easily extended with additional validation rules as needed.

ti-chi-bot bot commented Oct 21, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign lance6716 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added contribution This PR is from a community contributor. first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. labels Oct 21, 2025

ti-chi-bot bot commented Oct 21, 2025

Welcome @Tejassveer08!

It looks like this is your first PR to pingcap/docs-tidb-operator 🎉.

I'm the bot that helps you request reviewers, add labels, and more. See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to pingcap/docs-tidb-operator. 😃

@ti-chi-bot ti-chi-bot bot added the missing-translation-status This PR does not have translation status info. label Oct 21, 2025

pingcap-cla-assistant bot commented Oct 21, 2025

CLA assistant check
All committers have signed the CLA.

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 21, 2025