The Beautiful Soup module is used to parse the text source for UTC document registry pages. For each page, an initial "soup" is obtained:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, "lxml")
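(Here, page is the raw HTML text of a registry page. How it's obtained isn't shown in this note; a minimal sketch using the standard library, with registry_url as a hypothetical placeholder, might be:)

import urllib.request

registry_url = "..."  # hypothetical placeholder for a UTC doc registry page URL
with urllib.request.urlopen(registry_url) as response:
    page = response.read()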
Specifically, we're interested in one particular table within the page. Each page uses multiple table elements; the following isolates the one we want:
tableSoup = soup.find(class_="contents").find(class_="subtle")
Now, a "soup" is a complex object that retains a lot of details from the source HTML that isn't relevant for our purposes. What we want is to derive a table—a list of lists—with a row for each document entry in the registry, and columns for relevant details about each doc; and we want the value for each "cell" to be a basic type: a number or string. We need to massage that out of the tableSoup
object.
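For a concrete picture of the target shape, the end result should look something like the following, with one inner list per document entry and only basic values in each cell. (The variable name docRows, the column order, and all the values here are purely hypothetical placeholders, not actual registry entries.)

docRows = [
    ["L2/NN-NNN", "Some document title", "https://example.org/some-doc.pdf", "Some author", "2020-01-01"],
    ["L2/NN-NNN", "Another title", "https://example.org/another-doc.pdf", "Another author", "2020-02-15"],
    # ... one list per document entry in the registry
]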
Also note: "soup"s don't pickle well. Elements have references back up to parent nodes, and that appears to be the cause of loops that result in stack overflows when dumping to a pickle file. No matter: the "soup" is only a temporary result, it's not that slow to parse the results, and it will only be done infrequently.
The tableSoup object has a .contents attribute that is list-like, with an element for each child of the <table> element we isolated. But there are two issues:
- The contents include extraneous details from the source we don't want: each new-line, as well as comments. So, some of the "rows" are just '\n'. And the same also occurs within the actual <tr>s: some of the "cells" are comments and some are just '\n'.
- The items in the contents that are of interest are themselves soups, not lists, numbers, or strings.
So, there's a hierarchy of "soups" with extraneous cruft that we need to massage away.
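To see that cruft directly, a quick diagnostic is to print the type of every direct child of tableSoup; the output mixes Tag (the real <tr>s) with NavigableString (the '\n's) and Comment objects:

for child in tableSoup.contents:
    print(type(child).__name__)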
When we do get down to actual <td> or <th> elements, we'll need to take special steps to keep some cells and ignore others, and to extract what we want from the cells. (E.g., from the cells that have the doc links, we want to retain both the link text and the target URL.)
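For instance, a minimal sketch of pulling both pieces out of a link cell might look like this (assuming cell is one of the <td> soups and contains an <a> element):

link = cell.find("a")
if link is not None:
    doc_text = link.get_text(strip=True)   # the visible link text
    doc_url = link.get("href")             # the target URL (possibly relative)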
Before we get down to the individual cells, we need to massage tableSoup to get a list of lists of the <td>/<th> soups. There are two ways that work well.
Technique 1: use the BeautifulSoup.find_all() method to find the <tr>s, and then use it again in a nested list comprehension to find the <td>s or <th>s within each <tr>.
import re

# assumed definition: a pattern matching text that is nothing but whitespace
whitespace_pattern = r"\s*"

rows = tableSoup.find_all("tr")
# find_all() returns a 'bs4.element.ResultSet', which is a list subclass
rows = [
    [cell for cell in row.find_all(["th", "td"])]
    for row in rows
    if not re.fullmatch(whitespace_pattern, row.text)
]
# that leaves a new list (rows) of lists, each containing <td> or <th> soups
Technique 2: Use a list comprehension with a filter to get the real rows from the table. Then use a nested list comprehension with a filter to get the real <td>s and <th>s.
from bs4 import Comment

rows = [tr for tr in tableSoup.contents
        if tr != '\n'
        and not re.fullmatch(whitespace_pattern, tr.text)
        and not isinstance(tr, Comment)
        ]
rows = [
    [td for td in row.contents
     if td != '\n' and not isinstance(td.string, Comment)
     ]
    for row in rows  # each row is a soup, so it has .contents
]
The two are conceptually similar. The main difference is that the former lets Beautiful Soup do much of the filtering, while the latter does all of the filtering manually. Note that, in the former, we still use our own filter to keep out empty rows.
Importantly, in thorough testing on doc registry tables from 2000 through May 2020, the two techniques produced equal results.
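A check along these lines is one way to do that comparison (rows_1 and rows_2 are hypothetical names for the outputs of Technique 1 and Technique 2; comparing the serialized markup of each cell avoids relying on any particular notion of soup equality):

assert len(rows_1) == len(rows_2)
for r1, r2 in zip(rows_1, rows_2):
    assert [str(cell) for cell in r1] == [str(cell) for cell in r2]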
The latter is what I did first; but after learning more about how to work with Beautiful Soup, I think the former is the more reliable approach.