Merge pull request #41 from buren/ignore-robots-by-default

buren · web-flow · commit f880dc69b1d8 · 2021-04-23T12:29:53.000+02:00
Don't respect robots.txt file by default
diff --git a/README.md b/README.md
@@ -199,6 +199,8 @@ View archive: [https://web.archive.org/web/*/http://example.com](https://web.arc
 
 ## Configuration
 
+:information_source: By default, `wayback_archiver` doesn't respect robots.txt files. See [this Internet Archive blog post](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/) for the rationale.
+
 Configuration (the below values are the defaults)
 
 ```ruby
diff --git a/lib/wayback_archiver.rb b/lib/wayback_archiver.rb
@@ -12,7 +12,7 @@ module WaybackArchiver
   # WaybackArchiver User-Agent
   USER_AGENT = "WaybackArchiver/#{WaybackArchiver::VERSION} (+#{INFO_LINK})".freeze
   # Default for whether to respect robots txt files
-  DEFAULT_RESPECT_ROBOTS_TXT = true
+  DEFAULT_RESPECT_ROBOTS_TXT = false
 
   # Default concurrency for archiving URLs
   DEFAULT_CONCURRENCY = 1
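
Since this change flips the default, callers who relied on the old behavior would need to opt back in explicitly. The following is a minimal, self-contained sketch of the constant-plus-module-accessor pattern the diff suggests. `ArchiverConfigSketch` and its accessors are hypothetical stand-ins for illustration, not the gem's actual source.

```ruby
# Hypothetical stand-in module; not the actual wayback_archiver source.
module ArchiverConfigSketch
  # Mirrors the new default introduced by this change.
  DEFAULT_RESPECT_ROBOTS_TXT = false

  class << self
    # Lets callers override the default at runtime.
    attr_writer :respect_robots_txt

    # Falls back to the default until a value has been explicitly set.
    def respect_robots_txt
      return DEFAULT_RESPECT_ROBOTS_TXT if @respect_robots_txt.nil?

      @respect_robots_txt
    end
  end
end

ArchiverConfigSketch.respect_robots_txt        # returns the default (false)
ArchiverConfigSketch.respect_robots_txt = true # opt back in to robots.txt handling
```

In the gem itself the corresponding setting is presumably exposed as `WaybackArchiver.respect_robots_txt`; the README configuration block is not shown in this diff, so treat that name as an assumption and check the README before relying on it.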