- From: https://gist.githubusercontent.com/mullnerz/9fff80593d6b442d5c1b/raw/2c511e82f998bc489d9e300870f8789c77c2b49b/archive-website.md
- "The command I use to archive a single website"
```sh
wget -mpck --html-extension --user-agent="" -e robots=off --wait 1 -P . www.foo.com
```
- Explanation of the parameters used
-m
- (Mirror) Turns on mirror-friendly settings like infinite recursion depth, timestamps, etc.
-c
- (Continue) Resumes a partially-downloaded transfer
-p
- (Page requisites) Downloads any page dependencies like images, style sheets, etc.
-k
- (Convert) After completing retrieval of all files:
  - converts all absolute links to other downloaded files into relative links
  - converts all relative links to any files that weren't downloaded into absolute, external links
  - In a nutshell: makes your website archive work locally
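For reference, the wget manual describes -m (--mirror) as shorthand for a set of recursion and timestamping options; spelled out, the command above would look roughly like this (flag expansion taken from the manual, worth double-checking against your wget version):

```sh
# -m is documented as shorthand for -r -N -l inf --no-remove-listing,
# so the mirror command above expands to roughly:
wget -r -N -l inf --no-remove-listing -p -c -k \
  --html-extension --user-agent="" -e robots=off --wait 1 -P . www.foo.com
```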
--html-extension
- this adds .html after the downloaded filename, to make sure it plays nicely on whatever system you're going to view the archive on
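Newer wget releases spell this option --adjust-extension and keep --html-extension as a deprecated alias; if your version warns about the old name, the same command should work as:

```sh
# Same archive command using the newer spelling of the option;
# the behavior should be identical to --html-extension.
wget -mpck --adjust-extension --user-agent="" -e robots=off --wait 1 -P . www.foo.com
```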
--user-agent=""
- Sometimes websites use robots.txt to block certain agents like web crawlers (e.g. GoogleBot) and Wget. This tells Wget to send a blank user-agent, preventing identification. You could alternatively use a web browser's user-agent and make it look like a web browser, but it probably doesn't matter.
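If you do want to go the browser route, you would just swap in a browser-style string; the one below is only an illustrative placeholder, not a recommendation:

```sh
# Variant that identifies as a desktop browser instead of sending a blank agent.
# The UA string here is a made-up example; substitute a real one if a site insists.
wget -mpck --html-extension \
  --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0" \
  -e robots=off --wait 1 -P . www.foo.com
```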
-e robots=off
- Sometimes you'll run into a site with a robots.txt that blocks everything. In these cases, this setting will tell Wget to ignore it. Like the user-agent, I usually leave this on for the sake of convenience.
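Since -e just runs a .wgetrc-style command, you could also make this setting permanent instead of passing it every time; a minimal sketch, assuming you want it in your personal ~/.wgetrc:

```sh
# Persist the setting so "-e robots=off" can be dropped from the command line.
echo "robots = off" >> ~/.wgetrc
```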
--wait 1
- Tells Wget to wait 1 second between each action. This will make it a bit less taxing on the servers.
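If you want the pacing to look a bit less mechanical, wget also has --random-wait, which varies the pause around the --wait value, e.g.:

```sh
# Same command, but the 1-second pause is randomized around that value.
wget -mpck --html-extension --user-agent="" -e robots=off --wait 1 --random-wait -P . www.foo.com
```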
-P .
- Sets the download directory. I left it at the default "." (which means "here"), but this is where you could pass in a directory path to tell wget to save the archived site. Handy if you're doing this on a regular basis (say, as a cron job or something; see the sketch below).
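For that cron-job case, a minimal sketch of a weekly crontab entry might look like this (the schedule, archive path, and URL are placeholders for your own setup):

```sh
# Run the archive every Sunday at 03:00; add via `crontab -e`.
0 3 * * 0 wget -mpck --html-extension --user-agent="" -e robots=off --wait 1 -P /home/you/archives www.foo.com
```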
http://url-to-site
- this is the full URL of the site to download. You'll likely want to change this.