Skip to content

Add docstrings relating to functions involving parser context to parser.pxi #449

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 22, 2025

Conversation

abepolk
Copy link
Contributor

@abepolk abepolk commented Jan 29, 2025

I wrote docstrings for a number of functions in parser.pxi. Since the docs said parser.pxi is "definitely not the right place to start reading lxml's source code," I thought it would be helpful to make it more accessible to newcomers.

In particular, I focused on functions that involved the parser context. This includes the context at the lxml/Cython level and the libxml2/C level. I wrote these based on notes I took as I was reading through parser.pxi for the first time and learning how lxml's parser wraps libxml2. I include a little bit of detail on libxml2 internals because it appears coupled to lxml and I don't anticipate any changes soon to the libxml2 code I mention.

Feedback is welcome.

Copy link
Member

@scoder scoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

I believe that documentation/docstrings should not repeat code. (Some of your proposals really just state what the code does.) It should rather explain and motivate what's not obvious from the code, why it's done, why it's needed.

@@ -573,6 +573,9 @@ cdef class _ParserContext(_ResolverContext):
return context

cdef void _initParserContext(self, xmlparser.xmlParserCtxt* c_ctxt) noexcept:
"""
Connects the libxml2-level context to the lxml-level context.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Connects the libxml2-level context to the lxml-level context.
Connects the libxml2-level context to the lxml-level parser context.

Comment on lines 594 to 602
"""
This method resets the parser context for a new parse. It also connects the
error log to the libxml-level ctxt by passing in _receiveParserError in as a
callback. Note: the HTML parser always has the libxml2 ctxt->sax not set to
null, so this always works. The ctxt->sax is set up in the libxml2 funciton
htmlInitParserCtxt in HTMLparser.c, and the lxml error log is always connected.
Then libxml2's SAX's serror is set to be the place errors are sent when schannel
is set to ctxt->sax->serror in xmlCtxtErrMemory in libxml2's parserInternals.c.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation should not duplicate code, only explain it.

Suggested change
"""
This method resets the parser context for a new parse. It also connects the
error log to the libxml-level ctxt by passing in _receiveParserError in as a
callback. Note: the HTML parser always has the libxml2 ctxt->sax not set to
null, so this always works. The ctxt->sax is set up in the libxml2 funciton
htmlInitParserCtxt in HTMLparser.c, and the lxml error log is always connected.
Then libxml2's SAX's serror is set to be the place errors are sent when schannel
is set to ctxt->sax->serror in xmlCtxtErrMemory in libxml2's parserInternals.c.
"""
"""
This method resets the parser context for a new parse. It also connects the
error log to the libxml-level ctxt.
"""

@@ -687,6 +702,15 @@ cdef xmlDoc* _handleParseResult(_ParserContext context,
xmlparser.xmlParserCtxt* c_ctxt,
xmlDoc* result, filename,
bint recover, bint free_doc) except NULL:
"""
C-level xmlDoc* result is set to NULL if the parser cannot parse the document.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better explained with an in-place code comment than in a docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback! The problem is that result = NULL appears many times in parser.pxi. result's being NULL is also checked in many different places using result is NULL and result is not NULL throughout parser.pxi. So, an in-place code comment would have to be duplicated >10 times. Do you agree?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

… or appear at the top of the function. But it's still a code comment, not a docstring. docstrings are written for users of the function. Code comments are written for readers of the code.

Comment on lines 705 to 713
"""
C-level xmlDoc* result is set to NULL if the parser cannot parse the document.
Keep in mind the libxml2-level wellFormed, which starts at 1, is set to 0 if the
parser may be able to parse the document, but there are errors that libxml2 has
to work through. lxml then uses this, along with some other logic, to determine
whether to set lxml-level well_formed to 0 and if that happens, set result to
NULL. Then, if result is NULL, raise a parse error. Some modifications to result
are made. Then the xmlDoc* result is returned.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
C-level xmlDoc* result is set to NULL if the parser cannot parse the document.
Keep in mind the libxml2-level wellFormed, which starts at 1, is set to 0 if the
parser may be able to parse the document, but there are errors that libxml2 has
to work through. lxml then uses this, along with some other logic, to determine
whether to set lxml-level well_formed to 0 and if that happens, set result to
NULL. Then, if result is NULL, raise a parse error. Some modifications to result
are made. Then the xmlDoc* result is returned.
"""
"""
Determine if we can consider the parsing successful.
Handles different error cases, e.g. with 'recover=True' or validation.
Frees a parsed document if it turns out not to be acceptable.
Raises an appropriate parse error.
"""

Copy link
Contributor Author

@abepolk abepolk Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mostly correct. However, I think the sentence Handles different error cases, e.g. with 'recover=True' or validation. is too simple. For example the block

        elif recover or (c_ctxt.wellFormed and
                         c_ctxt.lastError.level < xmlerror.XML_ERR_ERROR):
            well_formed = 1

is not so much describing any particular kind of error as much as saying to carry over libxml2 errors unless in lxml level recovery mode or libxml2 lastError has a level at least as high as than what seems to be an error level named XML_ERR_ERROR. And, it's important to understand the underlying roles of the variables c_ctxt.wellFormed and well_formed because they're different but made of the same two English words. (I mean English words as opposed to variable names or keywords.)

The reason my docstring was originally long is because the code itself is very complex. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if specifically the case in the code that you copied above is not clear, why not just add a comment to it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to add a comment to it rather then a docstring. But where does this leave us on content? Should I put what I had in the original PR, with docstrings replaced by code comments? Or what I mention above in my previous PR comment (not to be confused with code comments)? Or does it matter? I just want to be sure you're okay with whatever I come up with.

Comment on lines 898 to 904
"""
This method creates a parser context if there isn't already one. If
creating a new parser context, this function configures it, creates a
libxml2-level parser context, connects the two, and makes some changes
to the libxml2 context. Then, it returns the parser context for the
parser.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
This method creates a parser context if there isn't already one. If
creating a new parser context, this function configures it, creates a
libxml2-level parser context, connects the two, and makes some changes
to the libxml2 context. Then, it returns the parser context for the
parser.
"""

Comment on lines 935 to 937
"""
This method configures the lxml-level parser.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
This method configures the lxml-level parser.
"""
"""
Create and configure an appropriate lxml-level parser context.
"""

Comment on lines 984 to 987
"""
This method calls into libxml to configure the libxml2-level parser context,
among other things.
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""
This method calls into libxml to configure the libxml2-level parser context,
among other things.
"""
"""
Create a libxml2-level parser context.
"""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I wrote among other things is that the lines
self._registerHtmlErrorHandler(c_ctxt)
and
c_ctxt.sax.startDocument = _initSaxDocument
are important and not related to creating the parser context. This is especially true with the line about SAX, which does many things.

Copy link
Member

@scoder scoder Feb 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, there is a libxml2 function to create the context. If the creation was trivial, we wouldn't need this function here just to call it.

Suggested change
"""
This method calls into libxml to configure the libxml2-level parser context,
among other things.
"""
"""
Create and initialise a libxml2-level parser context.
"""

Better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would advocate getting rid of the docstring entirely and replace it with comments in the right places. We could avoid the words "among other things" by just not putting a comment above the lines with "other things" (for example c_ctxt.sax.startDocument = _initSaxDocument).

@abepolk
Copy link
Contributor Author

abepolk commented Feb 6, 2025

Thanks for the feedback! I'm happy to make the changes you suggested. But first, I want to ask whether there is a better way to make lxml's parser.pxi more accessible to newcomers. Would a diagram, or a gloss of some sort be helpful? Or a list of common variables and what they represent? I'd like to make it so that people can read parser.pxi and quickly learn how it works. Thanks again!

@scoder
Copy link
Member

scoder commented Feb 9, 2025

Would a diagram, or a gloss of some sort be helpful? Or a list of common variables and what they represent?

Since you digged into it, you're probably in the best position to explain what information you would have liked to have. Then we could think about how best to present it.

@abepolk
Copy link
Contributor Author

abepolk commented Feb 10, 2025

That makes sense! I'll think more about that. In the meanwhile, I'll accept your suggestions soon so we can merge the docstrings. I'm also going to send two other more minor PRs. Besides the PRs, I'm in the final stages of writing an article on Medium explaining some other work I've done on researching lxml and libxml2 errors.

@abepolk
Copy link
Contributor Author

abepolk commented Feb 15, 2025

I was thinking that instead of these docstrings, I could instead write an unofficial gloss or "guide" to reading the code in a separate article on Medium. My sense is that there is a lot of complexity in lxml and libxml2 and coupling between the two. An unofficial gloss or guide might be able to explain the complexity better. Any thoughts on that?

@scoder
Copy link
Member

scoder commented Feb 15, 2025

write an unofficial gloss or "guide" to reading the code in a separate article

That would be a separate article and many people won't expect such an article to exist (and be worth searching for) when starting to read the code. Comments in the code would certainly be more helpful.

I write "comments" because code concepts are better explained in comments than in docstrings. docstrings are for users of the library. Comments are for readers of the source code.

You can also add to this document:
https://lxml.de/lxml-source-howto.html
It's at least closer to the code than a "random article" on Medium.

(PS: As you probably notice from that doc page, external documentation tends to get outdated pretty reliably over time.)

@abepolk
Copy link
Contributor Author

abepolk commented Feb 18, 2025

That makes sense. I left a few comments on the specific suggested changes in the PR. Overall, the plan of making this PR about code comments rather than docstrings makes sense to me. It could also make sense to add to the documentation on lxml.de. Does this mean the content I originally wrote might be worth keeping? For example, the parts about how libxml2 internals affect lxml?

@scoder
Copy link
Member

scoder commented Feb 18, 2025

Does this mean the content I originally wrote might be worth keeping?

Since it seemed to help you, it probably is. If you split it into code comments in place, we'll see whether they are really helpful or redundant.

@abepolk
Copy link
Contributor Author

abepolk commented Feb 21, 2025

I converted a number of the docstrings to comments. I also reworded and removed a bunch. I tried to take your feedback into consideration. Let me know what you think! And, no problem if this PR doesn't work out—I will keep my notes for future times I come to read parser.pxi.

@@ -687,6 +702,15 @@ cdef xmlDoc* _handleParseResult(_ParserContext context,
xmlparser.xmlParserCtxt* c_ctxt,

This comment was marked as spam.

@scoder scoder merged commit c7828c9 into lxml:master Apr 22, 2025
0 of 48 checks passed
@scoder
Copy link
Member

scoder commented Apr 22, 2025

Thanks

@abepolk
Copy link
Contributor Author

abepolk commented Apr 24, 2025

Glad you merged this! Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants