Commit 69606e5

Merge pull request #332 from twm/update-docs
Update docs
2 parents 9f9dfdb + deb98bb commit 69606e5

8 files changed

+82
-94
lines changed

AUTHORS.rst

+1
Original file line numberDiff line numberDiff line change
@@ -45,3 +45,4 @@ Patches and suggestions
4545
- Jon Dufresne
4646
- Ville Skyttä
4747
- Jonathan Vanasco
48+
- Tom Most

CHANGES.rst

+2-2
@@ -32,7 +32,7 @@ Released on July 14, 2016
3232

3333
* Cease supporting DATrie under PyPy.
3434

35-
* **Remove ``PullDOM`` support, as this hasn't ever been properly
35+
* **Remove PullDOM support, as this hasn't ever been properly
3636
tested, doesn't entirely work, and as far as I can tell is
3737
completely unused by anyone.**
3838

@@ -70,7 +70,7 @@ Released on July 14, 2016
7070
to clarify their status as public.**
7171

7272
* **Get rid of the sanitizer package. Merge sanitizer.sanitize into the
73-
sanitizer.htmlsanitizer module and move that to saniziter. This means
73+
sanitizer.htmlsanitizer module and move that to sanitizer. This means
7474
anyone who used sanitizer.sanitize or sanitizer.HTMLSanitizer needs no
7575
code changes.**
7676

doc/html5lib.rst

+4-8
@@ -1,13 +1,8 @@
11
html5lib Package
22
================
33

4-
:mod:`html5lib` Package
5-
-----------------------
6-
7-
.. automodule:: html5lib.__init__
8-
:members:
9-
:undoc-members:
10-
:show-inheritance:
4+
.. automodule:: html5lib
5+
:members: __version__
116

127
:mod:`constants` Module
138
-----------------------
@@ -26,7 +21,7 @@ html5lib Package
2621
:show-inheritance:
2722

2823
:mod:`serializer` Module
29-
----------------------
24+
------------------------
3025

3126
.. automodule:: html5lib.serializer
3227
:members:
@@ -41,4 +36,5 @@ Subpackages
4136
html5lib.filters
4237
html5lib.treebuilders
4338
html5lib.treewalkers
39+
html5lib.treeadapters
4440

doc/html5lib.treeadapters.rst

+20
@@ -0,0 +1,20 @@
1+
treebuilders Package
2+
====================
3+
4+
:mod:`~html5lib.treeadapters` Package
5+
-------------------------------------
6+
7+
.. automodule:: html5lib.treeadapters
8+
:members:
9+
:undoc-members:
10+
:show-inheritance:
11+
12+
.. automodule:: html5lib.treeadapters.genshi
13+
:members:
14+
:undoc-members:
15+
:show-inheritance:
16+
17+
.. automodule:: html5lib.treeadapters.sax
18+
:members:
19+
:undoc-members:
20+
:show-inheritance:

doc/html5lib.treewalkers.rst

+4-4
@@ -10,7 +10,7 @@ treewalkers Package
1010
:show-inheritance:
1111

1212
:mod:`base` Module
13-
-------------------
13+
------------------
1414

1515
.. automodule:: html5lib.treewalkers.base
1616
:members:
@@ -34,7 +34,7 @@ treewalkers Package
3434
:show-inheritance:
3535

3636
:mod:`etree_lxml` Module
37-
-----------------------
37+
------------------------
3838

3939
.. automodule:: html5lib.treewalkers.etree_lxml
4040
:members:
@@ -43,9 +43,9 @@ treewalkers Package
4343

4444

4545
:mod:`genshi` Module
46-
--------------------------
46+
--------------------
4747

4848
.. automodule:: html5lib.treewalkers.genshi
4949
:members:
5050
:undoc-members:
51-
:show-inheritance:
51+
:show-inheritance:

doc/movingparts.rst

+29-73
@@ -4,22 +4,25 @@ The moving parts
44
html5lib consists of a number of components, which are responsible for
55
handling its features.
66

7+
Parsing uses a *tree builder* to generate a *tree*, the in-memory representation of the document.
8+
Several tree representations are supported, as are translations to other formats via *tree adapters*.
9+
The tree may be translated to a token stream with a *tree walker*, from which :class:`~html5lib.serializer.HTMLSerializer` produces a stream of bytes.
10+
The token stream may also be transformed by use of *filters* to accomplish tasks like sanitization.
711

812
Tree builders
913
-------------
1014

1115
The parser reads HTML by tokenizing the content and building a tree that
12-
the user can later access. There are three main types of trees that
13-
html5lib can build:
16+
the user can later access. html5lib can build three types of trees:
1417

15-
* ``etree`` - this is the default; builds a tree based on ``xml.etree``,
18+
* ``etree`` - this is the default; builds a tree based on :mod:`xml.etree`,
1619
which can be found in the standard library. Whenever possible, the
1720
accelerated ``ElementTree`` implementation (i.e.
1821
``xml.etree.cElementTree`` on Python 2.x) is used.
1922

20-
* ``dom`` - builds a tree based on ``xml.dom.minidom``.
23+
* ``dom`` - builds a tree based on :mod:`xml.dom.minidom`.
2124

22-
* ``lxml.etree`` - uses lxml's implementation of the ``ElementTree``
25+
* ``lxml`` - uses the :mod:`lxml.etree` implementation of the ``ElementTree``
2326
API. The performance gains are relatively small compared to using the
2427
accelerated ``ElementTree`` module.
2528

@@ -31,21 +34,15 @@ You can specify the builder by name when using the shorthand API:
3134
with open("mydocument.html", "rb") as f:
3235
lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
3336
34-
When instantiating a parser object, you have to pass a tree builder
35-
class in the ``tree`` keyword attribute:
37+
To get a builder class by name, use the :func:`~html5lib.treebuilders.getTreeBuilder` function.
3638

37-
.. code-block:: python
38-
39-
import html5lib
40-
parser = html5lib.HTMLParser(tree=SomeTreeBuilder)
41-
document = parser.parse("<p>Hello World!")
42-
43-
To get a builder class by name, use the ``getTreeBuilder`` function:
39+
When instantiating a :class:`~html5lib.html5parser.HTMLParser` object, you must pass a tree builder class via the ``tree`` keyword attribute:
4440

4541
.. code-block:: python
4642
4743
import html5lib
48-
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
44+
TreeBuilder = html5lib.getTreeBuilder("dom")
45+
parser = html5lib.HTMLParser(tree=TreeBuilder)
4946
minidom_document = parser.parse("<p>Hello World!")
5047
5148
The implementation of builders can be found in `html5lib/treebuilders/
@@ -55,17 +52,13 @@ The implementation of builders can be found in `html5lib/treebuilders/
5552
Tree walkers
5653
------------
5754

58-
Once a tree is ready, you can work on it either manually, or using
59-
a tree walker, which provides a streaming view of the tree. html5lib
60-
provides walkers for all three supported types of trees (``etree``,
61-
``dom`` and ``lxml``).
55+
In addition to manipulating a tree directly, you can use a tree walker to generate a streaming view of it.
56+
html5lib provides walkers for ``etree``, ``dom``, and ``lxml`` trees, as well as ``genshi`` `markup streams <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
6257

6358
The implementation of walkers can be found in `html5lib/treewalkers/
6459
<https://github.com/html5lib/html5lib-python/tree/master/html5lib/treewalkers>`_.
6560

66-
Walkers make consuming HTML easier. html5lib uses them to provide you
67-
with has a couple of handy tools.
68-
61+
html5lib provides :class:`~html5lib.serializer.HTMLSerializer` for generating a stream of bytes from a token stream, and several filters which manipulate the stream.
6962

7063
HTMLSerializer
7164
~~~~~~~~~~~~~~
@@ -90,15 +83,14 @@ The serializer lets you write HTML back as a stream of bytes.
9083
'>'
9184
'Witam wszystkich'
9285
93-
You can customize the serializer behaviour in a variety of ways, consult
94-
the :class:`~html5lib.serializer.htmlserializer.HTMLSerializer`
95-
documentation.
86+
You can customize the serializer behaviour in a variety of ways. Consult
87+
the :class:`~html5lib.serializer.HTMLSerializer` documentation.
9688

9789

9890
Filters
9991
~~~~~~~
10092

101-
You can alter the stream content with filters provided by html5lib:
93+
html5lib provides several filters:
10294

10395
* :class:`alphabeticalattributes.Filter
10496
<html5lib.filters.alphabeticalattributes.Filter>` sorts attributes on
@@ -110,11 +102,11 @@ You can alter the stream content with filters provided by html5lib:
110102
the document
111103

112104
* :class:`lint.Filter <html5lib.filters.lint.Filter>` raises
113-
``LintError`` exceptions on invalid tag and attribute names, invalid
105+
:exc:`AssertionError` exceptions on invalid tag and attribute names, invalid
114106
PCDATA, etc.
115107

116108
* :class:`optionaltags.Filter <html5lib.filters.optionaltags.Filter>`
117-
removes tags from the stream which are not necessary to produce valid
109+
removes tags from the token stream which are not necessary to produce valid
118110
HTML
119111

120112
* :class:`sanitizer.Filter <html5lib.filters.sanitizer.Filter>` removes
@@ -125,9 +117,9 @@ You can alter the stream content with filters provided by html5lib:
125117

126118
* :class:`whitespace.Filter <html5lib.filters.whitespace.Filter>`
127119
collapses all whitespace characters to single spaces unless they're in
128-
``<pre/>`` or ``textarea`` tags.
120+
``<pre/>`` or ``<textarea/>`` tags.
129121

130-
To use a filter, simply wrap it around a stream:
122+
To use a filter, simply wrap it around a token stream:
131123

132124
.. code-block:: python
133125
@@ -142,9 +134,11 @@ To use a filter, simply wrap it around a stream:
142134
Tree adapters
143135
-------------
144136

145-
Used to translate one type of tree to another. More documentation
146-
pending, sorry.
137+
Tree adapters can be used to translate between tree formats.
138+
Two adapters are provided by html5lib:
147139

140+
* :func:`html5lib.treeadapters.genshi.to_genshi()` generates a `Genshi markup stream <https://genshi.edgewall.org/wiki/Documentation/streams.html>`_.
141+
* :func:`html5lib.treeadapters.sax.to_sax()` calls a SAX handler based on the tree.
148142

149143
Encoding discovery
150144
------------------
@@ -156,54 +150,16 @@ the following way:
156150
* The encoding may be explicitly specified by passing the name of the
157151
encoding as the encoding parameter to the
158152
:meth:`~html5lib.html5parser.HTMLParser.parse` method on
159-
``HTMLParser`` objects.
153+
:class:`~html5lib.html5parser.HTMLParser` objects.
160154

161155
* If no encoding is specified, the parser will attempt to detect the
162156
encoding from a ``<meta>`` element in the first 512 bytes of the
163157
document (this is only a partial implementation of the current HTML
164-
5 specification).
158+
specification).
165159

166-
* If no encoding can be found and the chardet library is available, an
160+
* If no encoding can be found and the :mod:`chardet` library is available, an
167161
attempt will be made to sniff the encoding from the byte pattern.
168162

169163
* If all else fails, the default encoding will be used. This is usually
170164
`Windows-1252 <http://en.wikipedia.org/wiki/Windows-1252>`_, which is
171165
a common fallback used by Web browsers.
172-
173-
174-
Tokenizers
175-
----------
176-
177-
The part of the parser responsible for translating a raw input stream
178-
into meaningful tokens is the tokenizer. Currently html5lib provides
179-
two.
180-
181-
To set up a tokenizer, simply pass it when instantiating
182-
a :class:`~html5lib.html5parser.HTMLParser`:
183-
184-
.. code-block:: python
185-
186-
import html5lib
187-
from html5lib import sanitizer
188-
189-
p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
190-
p.parse("<p>Surprise!<script>alert('Boo!');</script>")
191-
192-
HTMLTokenizer
193-
~~~~~~~~~~~~~
194-
195-
This is the default tokenizer, the heart of html5lib. The implementation
196-
can be found in `html5lib/tokenizer.py
197-
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/tokenizer.py>`_.
198-
199-
HTMLSanitizer
200-
~~~~~~~~~~~~~
201-
202-
This is a tokenizer that removes unsafe markup and CSS styles from the
203-
input. Elements that are known to be safe are passed through and the
204-
rest is converted to visible text. The default configuration of the
205-
sanitizer follows the `WHATWG Sanitization Rules
206-
<http://wiki.whatwg.org/wiki/Sanitization_rules>`_.
207-
208-
The implementation can be found in `html5lib/sanitizer.py
209-
<https://github.com/html5lib/html5lib-python/blob/master/html5lib/sanitizer.py>`_.
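
Note on the encoding discovery order kept in doc/movingparts.rst above (explicit parameter, ``<meta>`` element in the first 512 bytes, chardet if available, then the default): the following is a minimal sketch of that lookup order, not html5lib's implementation. ``guess_encoding`` and ``META_CHARSET`` are hypothetical names introduced for illustration, and the regular expression is a crude stand-in for the real ``<meta>`` prescan.

```python
import re

# Hypothetical pattern (not part of html5lib): a rough approximation of
# finding a charset declaration inside a <meta> element.
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)


def guess_encoding(data, override=None, default="windows-1252"):
    """Model the documented lookup order for a bytes document (illustrative only)."""
    # 1. An explicitly specified encoding always wins.
    if override:
        return override

    # 2. Look for a <meta> declaration, but only in the first 512 bytes.
    match = META_CHARSET.search(data[:512])
    if match:
        return match.group(1).decode("ascii").lower()

    # 3. If chardet is installed, try to sniff the encoding from the bytes.
    try:
        import chardet
    except ImportError:
        chardet = None
    if chardet is not None:
        detected = chardet.detect(data).get("encoding")
        if detected:
            return detected.lower()

    # 4. Fall back to the default encoding.
    return default
```

With this sketch, ``guess_encoding(data, override="latin-1")`` short-circuits at step 1, while a document with no declaration and no usable chardet result falls through to the Windows-1252 default.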

html5lib/__init__.py

+17-7
@@ -1,14 +1,23 @@
11
"""
2-
HTML parsing library based on the WHATWG "HTML5"
3-
specification. The parser is designed to be compatible with existing
4-
HTML found in the wild and implements well-defined error recovery that
2+
HTML parsing library based on the `WHATWG HTML specification
3+
<https://whatwg.org/html>`_. The parser is designed to be compatible with
4+
existing HTML found in the wild and implements well-defined error recovery that
55
is largely compatible with modern desktop web browsers.
66
7-
Example usage:
7+
Example usage::
88
9-
import html5lib
10-
f = open("my_document.html")
11-
tree = html5lib.parse(f)
9+
import html5lib
10+
with open("my_document.html", "rb") as f:
11+
tree = html5lib.parse(f)
12+
13+
For convenience, this module re-exports the following names:
14+
15+
* :func:`~.html5parser.parse`
16+
* :func:`~.html5parser.parseFragment`
17+
* :class:`~.html5parser.HTMLParser`
18+
* :func:`~.treebuilders.getTreeBuilder`
19+
* :func:`~.treewalkers.getTreeWalker`
20+
* :func:`~.serializer.serialize`
1221
"""
1322

1423
from __future__ import absolute_import, division, unicode_literals
@@ -22,4 +31,5 @@
2231
"getTreeWalker", "serialize"]
2332

2433
# this has to be at the top level, see how setup.py parses this
34+
#: Distribution version number.
2535
__version__ = "0.9999999999-dev"

tox.ini

+5
@@ -11,7 +11,12 @@ deps =
1111
base: webencodings
1212
py26-base: ordereddict
1313
optional: -r{toxinidir}/requirements-optional.txt
14+
doc: Sphinx
1415

1516
commands =
1617
{envbindir}/py.test {posargs}
1718
{toxinidir}/flake8-run.sh
19+
20+
[testenv:doc]
21+
changedir = doc
22+
commands = sphinx-build -b html . _build
