website2org.el downloads a website, transforms it into minimalist Orgmode, and either presents the result in a temporary Orgmode buffer or saves it as an .org file in a specified directory.
I now have three primary use cases for this. 1) Local storage/read it later: I often store websites locally to link them (and specific paragraphs) to my Zettelkasten in orgrr. (They are downloaded to the directory “findings”, an orgrr container, and the tag “orgrr-project” is automatically added. See orgrr for details.) 2) Quickly viewing the contents of a website in mastodon.el. 3) Loading full articles in Elfeed (see below).
This package is still in an early stage. I use it to replace orgrr-save-website, which draws on org-web-tools--eww-readable and org-web-tools--html-to-org-with-pandoc but has become more fragile by the day. orgrr-save-website also often struggles to produce the kind of Orgmode I want to have, with as few HTML fragments as possible.
website2org requires wget or curl (curl is now the default because it handles redirects better) but does not use Pandoc. The package parses HTML via RegEx to achieve rather minimal-looking Orgmode files, which are much smaller in size (the downloaded version of this readme from Github is 311KB, the website2org version just 9KB).
You can automatically forward downloaded websites to an archiving website via (setq website2org-archive t). The archiving service can be changed by modifying website2org-archive-url.
website2org ignores all information before the first <h1> headline and everything coming after the <footer>. All data about source, <div> and similar types of tags are also ignored. It respects all paragraphs, headlines, lists (ordered and not), inline code, block quotes, <pre>, links (including local links), <strong>, and <em>. Tabs and multiple spaces are reduced to one space. A new line cannot start with a space (or “- ” followed by nothing).
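The rules above can be illustrated with a few regular-expression helpers. This is only a rough sketch of the general technique, not the code used in website2org.el: the function names are hypothetical, and the actual patterns in the package are likely more involved.

```elisp
;; Illustrative sketch only. These helpers are hypothetical and are
;; not part of website2org.el.

(defun my/w2o-sketch-trim (html)
  "Return HTML from the first <h1> up to, but excluding, <footer>."
  (let* ((case-fold-search t)
         (start (or (string-match "<h1[ >]" html) 0))
         (end (or (string-match "<footer[ >]" html start)
                  (length html))))
    (substring html start end)))

(defun my/w2o-sketch-inline (html)
  "Convert <strong>, <em> and simple <a href> tags in HTML to Org markup."
  (let ((case-fold-search t)
        (s html))
    ;; The fourth argument t (FIXEDCASE) keeps the replacement as written.
    (setq s (replace-regexp-in-string "</?strong>" "*" s t))
    (setq s (replace-regexp-in-string "</?em>" "/" s t))
    (setq s (replace-regexp-in-string
             "<a [^>]*href=\"\\([^\"]*\\)\"[^>]*>\\([^<]*\\)</a>"
             "[[\\1][\\2]]" s t))
    s))

(defun my/w2o-sketch-whitespace (text)
  "Collapse tabs and runs of spaces in TEXT; strip leading spaces."
  (let ((s (replace-regexp-in-string "[ \t]+" " " text t)))
    ;; In a string, ^ also matches directly after a newline.
    (replace-regexp-in-string "^ " "" s t)))

;; Usage:
;; (my/w2o-sketch-whitespace "a \t  b\n  c")
;; => "a b\nc"
```

Chaining the three helpers (trim, then inline markup, then whitespace) gives a very rough approximation of the behavior described above; block-level elements such as lists, quotes and <pre> would need additional rules.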
Parsing HTML with RegEx comes with lots of issues. Most experienced coders strongly advise against doing so, for good reason. And there are numerous tools to parse HTML; there is even one built in (libxml-parse-html-region, which eventually might be the basis of a proper rewrite of this package). I also considered tidy-html5, hxclean of html-xml-utils fame, and htmlq. All of these worked to some degree but stopped doing so when leaving the UTF-8 world. In other words, not a single one of them produced acceptable results for Chinese websites. Given the quality of the current solution, I don’t see a pressing need to add such HTML parsing. website2org will work for most sites - the more they stick to common standards and behavior, the better the chances. Right now we may be at 85-95% of websites working, with a 5% chance of some small issue (please report the obvious ones).
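For comparison, this is roughly what working with the built-in parser looks like. It is a minimal sketch under the assumption that Emacs was built with libxml2 support; the helper name is hypothetical, and nothing here reflects how a future rewrite of website2org would actually be structured.

```elisp
(require 'dom)

;; Hypothetical helper, not part of website2org.el.
(defun my/w2o-sketch-headlines (html)
  "Return a list with the text of every <h1> element in the string HTML.
Return nil when this Emacs has no usable libxml2 support."
  (when (and (fboundp 'libxml-available-p)
             (libxml-available-p))
    (with-temp-buffer
      (insert html)
      ;; Parse the whole buffer into a DOM tree (nested lists).
      (let ((dom (libxml-parse-html-region (point-min) (point-max))))
        ;; Collect the text content of each <h1> node.
        (mapcar #'dom-texts (dom-by-tag dom 'h1))))))

;; Usage:
;; (my/w2o-sketch-headlines "<html><body><h1>Title</h1><p>x</p></body></html>")
```

The DOM approach avoids hand-written patterns for tag matching, but it does not by itself address the encoding problems with non-UTF-8 sites mentioned above.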
Still, there are some known issues even with otherwise working websites. Orgmode does not deal well with source blocks within quote blocks. These will look weird. Orgmode also sometimes has difficulties in dealing with two formatting styles for the same string. Bold and italics at the same time may or may not work.
- Fixes for detecting links in documents; transforms marked HTML files in Dired into Orgmode documents and places them into website2org-directory
- Added
now creates read-only buffer; improvements in display of quotation marks before links
- Fixed Substack double loading content bug (by deleting all scripts before processing the HTML)
- Switching from
as standard tool; better handling of Unicode escape sequences
See also the changelog.
Clone the repository:
git clone
To run Website2org, you need to load the package by adding it to your .emacs or init.el:
(load "/path/to/website2org/website2org.el")
You should set a binding to website2org and website2org-temp:
(global-set-key (kbd "C-M-s-<down>") 'website2org) ;; this is what I use on a Mac
(global-set-key (kbd "C-M-s-<up>") 'website2org-temp)
Or, if you use straight:
(use-package website2org
  :straight (:host github :repo "rtrppl/website2org")
  :config
  (setq website2org-directory "/path/to/where/websites/should/be/stored/") ;; if needed, see below
  :bind
  (:map global-map
        ("C-M-s-<down>" . website2org)
        ("C-M-s-<up>" . website2org-temp)))
Additionally you can set these values:
;; If wget should be called with a different command.
(setq website2org-wget-cmd "wget -q ")
;; Change the name of the local cache file.
(setq website2org-cache-filename "~/website2org-cache.html")
;; Turn website2org-additional-meta nil if not applicable. This is for
;; use in orgrr.
(setq website2org-additional-meta "#+roam_tags: website orgrr-project")
;; By default all websites will be stored in the org-directory.
;; Set website2org-directory, if you prefer a different directory.
;; directories must end with /
(setq website2org-directory "/path/to/where/websites/should/be/stored/")
(setq website2org-filename-time-format "%Y%m%d%H%M%S")
(setq website2org-archive nil) ;; If this is set to t, the URL called will be sent to the archiving URL below
(setq website2org-archive-url "")
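To make the directory and time-format settings more concrete, here is a small sketch of how a timestamped target path could be assembled from them. The helper and the "-title.org" naming scheme are assumptions for illustration only; website2org.el may build its filenames differently.

```elisp
;; Hypothetical helper, not part of website2org.el.
(defun my/w2o-sketch-target-file (directory time-format title)
  "Build a timestamped .org path in DIRECTORY from TIME-FORMAT and TITLE."
  (concat
   ;; `file-name-as-directory' appends the trailing slash if it is
   ;; missing, which matters when paths are joined with `concat'.
   (file-name-as-directory directory)
   ;; "%Y%m%d%H%M%S" expands to 14 digits, e.g. year through seconds.
   (format-time-string time-format)
   "-" title ".org"))

;; Usage:
;; (my/w2o-sketch-target-file "/tmp/findings" "%Y%m%d%H%M%S" "example")
```

If you prefer not to think about the trailing slash, wrapping the value in file-name-as-directory when setting website2org-directory should have the same effect.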
These are the primary functions of website2org.el:
- website2org will download the website at point (or from a provided URL) and save it as an Orgmode file.
- website2org-temp will download a website at point (or from a provided URL) and present it as a temporary Orgmode buffer (press “q” to exit the screen; press “spacebar” to scroll).
- A Dired command will transform marked HTML documents in Dired into Orgmode documents and place them into website2org-directory.
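The functions can also be called from Lisp. The sketch below saves a list of URLs in one go; it assumes that website2org-url-to-org accepts a URL string as its only required argument, which is how the Elfeed integration in this readme calls it. The wrapper itself is hypothetical and not part of the package.

```elisp
;; Hypothetical wrapper, not part of website2org.el.
(defun my/w2o-sketch-save-urls (urls &optional save-fn)
  "Call SAVE-FN on each URL in URLS and return the number processed.
SAVE-FN defaults to `website2org-url-to-org' (assumed to take a URL)."
  (let ((fn (or save-fn #'website2org-url-to-org))
        (count 0))
    (dolist (url urls count)
      (funcall fn url)
      (setq count (1+ count)))))

;; Usage (requires website2org.el to be loaded):
;; (my/w2o-sketch-save-urls '("https://example.org" "https://example.com"))
```

The optional SAVE-FN argument is only there so the wrapper can be tried out with a stand-in function before pointing it at real websites.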
I wrote a small integration for Elfeed (based on elfeed-show-visit), which may also be of interest for some:
(defun elfeed-show-visit-website2org (&optional use-generic-p)
  "Visit the current entry in a website2org temporary buffer.
Calling this function with C-u will use `website2org-url-to-org'
to create an Orgmode document."
  (interactive "P")
  (let ((link (elfeed-entry-link elfeed-show-entry)))
    (when link
      (message "Sent to website2org: %s" link)
      (if use-generic-p
          ;; With C-u: save the entry as an Orgmode file.
          (website2org-url-to-org link)
        ;; Default: show the entry in a temporary buffer.
        (website2org-to-buffer link)))))
By adding a keybinding you are able to quickly open the current entry in a temporary website2org buffer.
My Elfeed setup basically looks like this:
(use-package elfeed
  :defer t
  :bind
  (("C-x w" . elfeed)
   :map elfeed-show-mode-map
   ("w" . elfeed-show-visit-website2org)))