This project demonstrates how to use Puppeteer to render a web page and create a WARC file of the rendered page and its resources. This can be useful for archiving web pages for long-term storage or offline browsing.
- Node.js: Ensure you have Node.js installed (version 22 or higher is recommended). You can download it from nodejs.org.
- Clone the repository: This step downloads the project files to your local machine.

      git clone https://github.com/ganapativs/puppeteer-warc.git
      cd puppeteer-warc

- Install the necessary dependencies: This command installs all the required Node.js packages specified in the package.json file.

      npm install
To create a WARC file from a website, use the src/write-warc-cli.mjs script. This script will render the specified website and create a WARC file containing the page and its resources. It will also generate a screenshot of the web page, which can be useful for debugging.
- Command:

      node src/write-warc-cli.mjs <website-url>

- Example: To create a WARC file for https://example.com, run:

      node src/write-warc-cli.mjs https://example.com
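To give a feel for what ends up inside the archive, here is a minimal sketch of how one captured HTTP response could be serialized as a WARC `response` record. It is not taken from this project's source: `buildWarcResponseRecord` and its parameters are hypothetical names, and the `WARC/1.1` version line is an assumption about which revision of the format is written. The record layout itself (version line, named header fields, a blank line, the content block, then two CRLFs) follows the WARC format.

```javascript
// Hypothetical helper, for illustration only (not part of this project).
// Serializes one HTTP response as an uncompressed WARC "response" record.
function buildWarcResponseRecord({
  targetUri,
  httpMessage, // raw HTTP response (status line, headers, body) as string or Buffer
  date = new Date(),
  recordId = globalThis.crypto.randomUUID(),
}) {
  const block = Buffer.isBuffer(httpMessage)
    ? httpMessage
    : Buffer.from(httpMessage, "utf8");

  // WARC-Date is a UTC timestamp; drop milliseconds for wider compatibility.
  const warcDate = date.toISOString().replace(/\.\d{3}Z$/, "Z");

  const headerLines = [
    "WARC/1.1", // assumed version line
    "WARC-Type: response",
    `WARC-Record-ID: <urn:uuid:${recordId}>`,
    `WARC-Date: ${warcDate}`,
    `WARC-Target-URI: ${targetUri}`,
    "Content-Type: application/http; msgtype=response",
    `Content-Length: ${block.length}`, // length of the content block in bytes
  ];

  // Header lines end with CRLF, a blank line separates them from the block,
  // and the record is terminated by two CRLFs.
  const header = Buffer.from(headerLines.join("\r\n") + "\r\n\r\n", "utf8");
  return Buffer.concat([header, block, Buffer.from("\r\n\r\n", "utf8")]);
}
```

A `.warc.gz` file conventionally stores each such record as its own gzip member, so a writer would typically compress every record separately before appending it to the output file.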
To read and print the contents of a WARC file, use the src/read-warc-cli.mjs script. This script supports two output formats: JSON and text.
- Command:

      node src/read-warc-cli.mjs <path-to-warc-file> [--format=json|text]

- Options:
  - --format=json (default): Outputs the WARC contents in JSON format
  - --format=text: Outputs a human-readable text report

- Examples:

      # Output in JSON format (default)
      node src/read-warc-cli.mjs examplecom.warc.gz

      # Output in text format
      node src/read-warc-cli.mjs examplecom.warc.gz --format=text
The JSON format includes:
- recordCount: Total number of records in the WARC file
- records: Object containing all records, where each record includes:
  - warcHeaders: WARC headers for the record
  - httpHeaders: HTTP headers (when present)
  - contentType: Type of the content
  - contentSize: Size of the content in bytes
  - content: The actual content (base64 encoded for binary data)
  - contentEncoding: Indicates if content is base64 encoded
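The JSON fields listed above lend themselves to post-processing in a script. The sketch below is one possible way to summarize a parsed report and to decode a record's content. The function names are hypothetical, and the exact value of `contentEncoding` is not documented here, so the code assumes it is the string `"base64"` when the content is encoded.

```javascript
// Hypothetical helpers for working with the parsed --format=json output.
// Field names come from the list above; everything else is illustrative.

// Count records per content type and add up their sizes.
function summarizeWarcReport(report) {
  const byType = {};
  let totalBytes = 0;
  for (const record of Object.values(report.records)) {
    const type = record.contentType || "unknown";
    byType[type] = (byType[type] || 0) + 1;
    totalBytes += record.contentSize || 0;
  }
  return { recordCount: report.recordCount, totalBytes, byType };
}

// Return a record's content as raw bytes.
// Assumption: contentEncoding === "base64" marks base64-encoded content.
function decodeContent(record) {
  const content = record.content ?? "";
  return record.contentEncoding === "base64"
    ? Buffer.from(content, "base64")
    : Buffer.from(content, "utf8");
}
```

Assuming the report has been saved to a file (for example by redirecting the script's output), it could be loaded with `JSON.parse` and passed to `summarizeWarcReport` to get a quick overview of which resource types were archived.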
The text format provides a well-structured report that includes:
- Total number of records
- WARC headers for each record
- HTTP headers (when present)
- Content type and size
- Full content for text-based resources
- Binary data indicator for non-text resources
You can preview WARC files using ReplayWeb.page, a web-based tool for viewing archived web content. This tool allows you to interact with the archived pages as if you were browsing them live.
This project is licensed under the MIT License. See the LICENSE file for details.