Conjob: Confluence to Markdown Scraper

A Node.js tool to scrape Confluence spaces and convert them to Markdown files while preserving the page hierarchy.

Features

📚 Scrape every Confluence space or a single space
🔄 Convert Confluence storage format to Markdown
📁 Preserve page hierarchy in directory structure
🔁 Handle rate limiting with exponential backoff
🔗 Maintain page relationships and ordering

Quick Start

# Install dependencies
pnpm install

# Configure your Confluence instance
# Edit utils/index.js:
export const BASE_URL = "http://your-confluence-instance/rest/api";
export const ACCESS_TOKEN = "your-personal-access-token";

# Scrape all spaces
pnpm space:all

# Or scrape a specific space
pnpm space:single ENGINEERING

Architecture Decisions

This project follows a documented decision-making process. Key architectural decisions:

API Integration
- Native fetch with backoff
- Centralized API client
- Type-safe responses

File Structure
- Feature-based organization
- Clear separation of concerns
- Consistent patterns

Error Handling
- Centralized error handling
- Retry mechanisms
- Consistent error messages

CLI Interface
- Command-based interface
- Progress feedback
- Clear usage instructions
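
To make the "clear usage instructions" point concrete, here is a minimal sketch of how a single-space command might read its argument. The function name parseSpaceKey and the key pattern are assumptions for illustration; the real scripts may validate differently.

```javascript
// Hypothetical helper: read the space key passed to `pnpm space:single <KEY>`.
// The allowed-character pattern is an illustrative assumption, not the project's rule.
function parseSpaceKey(argv) {
  // argv[0] is the node binary and argv[1] the script path; user arguments start at index 2.
  const [key] = argv.slice(2);
  if (!key) {
    throw new Error("Usage: pnpm space:single <SPACE_KEY>");
  }
  if (!/^~?[A-Za-z0-9]+$/.test(key)) {
    throw new Error(`Invalid space key: ${key}`);
  }
  return key;
}
```

A script would typically call parseSpaceKey(process.argv) and print the error message before exiting with a non-zero status.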

Project Structure

.
├── scripts/                 # CLI Commands
│   ├── all-spaces.js       # Scrape all spaces
│   └── all-space-content.js # Scrape single space
├── utils/                  # Shared Utilities
│   └── index.js           # API client, helpers
└── docs/                  # Documentation
    ├── api-examples.md
    ├── api-integration.md
    ├── cli-interface.md
    ├── error-handling.md
    ├── file-structure.md

Development

# Format code
pnpm format

Output Structure

The scraper creates a directory structure that mirrors your Confluence space:

confluence_markdown/
├── SPACE1/
│   ├── home/
│   │   ├── index.md (Space homepage)
│   │   └── Other Root Pages.md
│   └── Parent Page/
│       ├── index.md (Parent page content)
│       └── Child Page.md
└── SPACE2/
    └── ...
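
The mapping from page hierarchy to file paths can be sketched roughly as follows. This is an illustration only: pagePath and safeName are hypothetical names, the file-name sanitizing rule is an assumption, and the special handling of the space homepage under home/ is left out.

```javascript
const OUTPUT_DIR = "confluence_markdown";

// Replace characters that are unsafe in file names (illustrative rule, an assumption).
function safeName(title) {
  return title.replace(/[\/\\:*?"<>|]/g, "-").trim();
}

// A page with children becomes a directory holding index.md;
// a leaf page becomes "<title>.md" inside its parent's directory.
function pagePath(spaceKey, ancestorTitles, title, hasChildren) {
  const dirs = [OUTPUT_DIR, spaceKey, ...ancestorTitles.map(safeName)];
  const leaf = hasChildren
    ? [safeName(title), "index.md"]
    : [`${safeName(title)}.md`];
  // Joined with "/" to keep the sketch free of imports; real code would likely use node:path.
  return [...dirs, ...leaf].join("/");
}
```

For example, pagePath("SPACE1", ["Parent Page"], "Child Page", false) yields "confluence_markdown/SPACE1/Parent Page/Child Page.md", which matches the tree above.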

Configuration

Configure your Confluence instance in utils/index.js:

export const BASE_URL = "http://your-confluence-instance/rest/api";
export const ACCESS_TOKEN = "your-personal-access-token";
export const OUTPUT_DIR = "confluence_markdown";
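
As a rough illustration of how these values might be used, the sketch below builds a request URL and headers from them. buildRequest is a hypothetical helper, not part of the project. It assumes the personal access token is sent as a Bearer token, as Confluence Data Center does; Confluence Cloud API tokens use Basic auth instead.

```javascript
// Values copied from the configuration above; redeclared so the sketch is self-contained.
const BASE_URL = "http://your-confluence-instance/rest/api";
const ACCESS_TOKEN = "your-personal-access-token";

// Hypothetical helper: build the URL and fetch options for an API path.
function buildRequest(path, params = {}) {
  const url = new URL(`${BASE_URL}${path}`);
  for (const [name, value] of Object.entries(params)) {
    url.searchParams.set(name, String(value));
  }
  return {
    url: url.toString(),
    options: {
      headers: {
        // Assumption: Bearer auth with a personal access token.
        Authorization: `Bearer ${ACCESS_TOKEN}`,
        Accept: "application/json",
      },
    },
  };
}
```

Calling buildRequest("/space", { limit: 25 }) returns a URL ending in /rest/api/space?limit=25 together with the headers to pass to fetch.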

Error Handling

The scraper handles several error cases:

- Rate limiting (429) with exponential backoff
- Network errors with retries
- Invalid space keys
- Missing configuration
- File system errors
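
The first two cases can be sketched as a small retry wrapper around fetch. This is a minimal sketch under stated assumptions, not the project's actual client: the names backoffDelay and fetchWithBackoff, the 500 ms base delay, the 30 s cap, and the retry count are all illustrative. It relies on the global fetch available in Node 18 and later.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelay(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries on HTTP 429 and on network errors; any other response is returned as-is.
// `fetchImpl` is injectable so the function can be exercised without a network.
async function fetchWithBackoff(
  url,
  options = {},
  { retries = 5, baseMs = 500, fetchImpl = fetch } = {},
) {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetchImpl(url, options);
      // Success, a non-429 error, or retries exhausted: hand the response back.
      if (res.status !== 429 || attempt >= retries) return res;
    } catch (err) {
      // Network failure: give up once the retry budget is spent.
      if (attempt >= retries) throw err;
    }
    await sleep(backoffDelay(attempt, baseMs));
  }
}
```

With the defaults, successive waits are 500 ms, 1 s, 2 s, 4 s, and 8 s. A production client might also honor a Retry-After header when the server sends one.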

Contributing

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

License

ISC

Acknowledgments

markdown-it for Markdown conversion
jsdom for HTML parsing
