The Beautiful Soup module is used to parse the text source for UTC document registry pages. For each page, an initial "soup" is obtained:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "lxml")
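(Here, page is the raw HTML text of a registry page. How it's obtained isn't shown in this note; a minimal sketch using the standard library, with registry_url as a hypothetical placeholder, might be:)

import urllib.request

registry_url = "..."  # hypothetical placeholder for a UTC doc registry page URL
with urllib.request.urlopen(registry_url) as response:
    page = response.read()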
Specifically, we're interested in one particular table within the page. Each page uses multiple table elements; the following isolates the one we want:
tableSoup = soup.find(class_="contents").find(class_="subtle")
Now, a "soup" is a complex object that retains a lot of details from the source HTML that isn't relevant for our purposes. What we want is to derive a table—a list of lists—with a row for each document entry in the registry, and columns for relevant details about each doc; and we want the value for each "cell" to be a basic type: a number or string. We need to massage that out of the tableSoup
object.
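For a concrete picture of the target shape, the end result should look something like the following, with one inner list per document entry and only basic values in each cell. (The variable name docRows, the column order, and all the values here are purely hypothetical placeholders, not actual registry entries.)

docRows = [
    ["L2/NN-NNN", "Some document title", "https://example.org/some-doc.pdf", "Some author", "2020-01-01"],
    ["L2/NN-NNN", "Another title", "https://example.org/another-doc.pdf", "Another author", "2020-02-15"],
    # ... one list per document entry in the registry
]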
Also note: "soup"s don't pickle well. Elements have references back up to parent nodes, and that appears to be the cause of loops that result in stack overflows when dumping to a pickle file. No matter: the "soup" is only a temporary result, it's not that slow to parse the results, and it will only be done infrequently.
The tableSoup object has a .contents attribute that is list-like, with an element for each child of the <table> element we isolated. But there are two issues:
- The contents include extraneous details from the source we don't want: each new-line, as well as comments. So, some of the "rows" are just '\n'. And the same also occurs within the actual <tr>s: some of the "cells" are comments and some are just '\n'.
- The items in the contents that are of interest are themselves soups, not lists, numbers, or strings.
So, there's a hierarchy of "soups" with extraneous cruft that we need to massage away.
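To see that cruft directly, a quick diagnostic is to print the type of every direct child of tableSoup; the output mixes Tag (the real <tr>s) with NavigableString (the '\n's) and Comment objects:

for child in tableSoup.contents:
    print(type(child).__name__)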
When we do get down to actual <td> or <th> elements, we'll need to take special steps to keep some cells and ignore others, and to extract what we want from the cells. (E.g., from the cells that have the doc links, we want to retain both the link text and the target URL.)
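For instance, a minimal sketch of pulling both pieces out of a link cell might look like this (assuming cell is one of the <td> soups and contains an <a> element):

link = cell.find("a")
if link is not None:
    doc_text = link.get_text(strip=True)   # the visible link text
    doc_url = link.get("href")             # the target URL (possibly relative)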
Before we get down to the individual cells, we need to massage tableSoup to get a list of lists of the <td>/<th> soups. There are two ways that work well.
Technique 1: use the BeautifulSoup.find_all() method to find the <tr>s, and then use it again in a nested list comprehension to find the <td>s or <th>s within each <tr>.
import re

# assumed definition: a pattern matching text that is nothing but whitespace
whitespace_pattern = r"\s*"

rows = tableSoup.find_all("tr")
# find_all() returns a 'bs4.element.ResultSet', which is a list subclass
rows = [
    [cell for cell in row.find_all(["th", "td"])]
    for row in rows
    if not re.fullmatch(whitespace_pattern, row.text)
]
# that leaves a new list (rows) of lists, each containing <td> or <th> soups
Technique 2: Use a list comprehension with a filter to get the real rows from the table. Then use a nested list comprehension with a filter to get the real <td>s and <th>s.
from bs4 import Comment

rows = [tr for tr in tableSoup.contents
        if tr != '\n'
        and not re.fullmatch(whitespace_pattern, tr.text)
        and not isinstance(tr, Comment)
        ]
rows = [
    [td for td in row.contents
     if td != '\n' and not isinstance(td.string, Comment)
     ]
    for row in rows  # each row is a soup, so it has .contents
]
The two are conceptually similar. The main difference is that the former lets Beautiful Soup do much of the filtering, while the latter does all of the filtering manually. Note that, in the former, we still use our own filter to keep out empty rows.
Importantly, in thorough testing on doc registry tables from 2000 through May 2020, the two techniques produced equal results.
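A check along these lines is one way to do that comparison (rows_1 and rows_2 are hypothetical names for the outputs of Technique 1 and Technique 2; comparing the serialized markup of each cell avoids relying on any particular notion of soup equality):

assert len(rows_1) == len(rows_2)
for r1, r2 in zip(rows_1, rows_2):
    assert [str(cell) for cell in r1] == [str(cell) for cell in r2]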
The latter is what I did first; but after learning more about how to work with Beautiful Soup, I think the former is the more reliable approach.