-
-
Notifications
You must be signed in to change notification settings - Fork 579
Add docstrings relating to functions involving parser context to parser.pxi #449
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I believe that documentation/docstrings should not repeat code. (Some of your proposals really just state what the code does.) It should rather explain and motivate what's not obvious from the code, why it's done, why it's needed.
src/lxml/parser.pxi
Outdated
@@ -573,6 +573,9 @@ cdef class _ParserContext(_ResolverContext): | |||
return context | |||
|
|||
cdef void _initParserContext(self, xmlparser.xmlParserCtxt* c_ctxt) noexcept: | |||
""" | |||
Connects the libxml2-level context to the lxml-level context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Connects the libxml2-level context to the lxml-level context. | |
Connects the libxml2-level context to the lxml-level parser context. |
src/lxml/parser.pxi
Outdated
""" | ||
This method resets the parser context for a new parse. It also connects the | ||
error log to the libxml-level ctxt by passing in _receiveParserError in as a | ||
callback. Note: the HTML parser always has the libxml2 ctxt->sax not set to | ||
null, so this always works. The ctxt->sax is set up in the libxml2 funciton | ||
htmlInitParserCtxt in HTMLparser.c, and the lxml error log is always connected. | ||
Then libxml2's SAX's serror is set to be the place errors are sent when schannel | ||
is set to ctxt->sax->serror in xmlCtxtErrMemory in libxml2's parserInternals.c. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation should not duplicate code, only explain it.
""" | |
This method resets the parser context for a new parse. It also connects the | |
error log to the libxml-level ctxt by passing in _receiveParserError in as a | |
callback. Note: the HTML parser always has the libxml2 ctxt->sax not set to | |
null, so this always works. The ctxt->sax is set up in the libxml2 funciton | |
htmlInitParserCtxt in HTMLparser.c, and the lxml error log is always connected. | |
Then libxml2's SAX's serror is set to be the place errors are sent when schannel | |
is set to ctxt->sax->serror in xmlCtxtErrMemory in libxml2's parserInternals.c. | |
""" | |
""" | |
This method resets the parser context for a new parse. It also connects the | |
error log to the libxml-level ctxt. | |
""" |
src/lxml/parser.pxi
Outdated
@@ -687,6 +702,15 @@ cdef xmlDoc* _handleParseResult(_ParserContext context, | |||
xmlparser.xmlParserCtxt* c_ctxt, | |||
xmlDoc* result, filename, | |||
bint recover, bint free_doc) except NULL: | |||
""" | |||
C-level xmlDoc* result is set to NULL if the parser cannot parse the document. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is better explained with an in-place code comment than in a docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! The problem is that result = NULL
appears many times in parser.pxi
. result
's being NULL
is also checked in many different places using result is NULL
and result is not NULL
throughout parser.pxi
. So, an in-place code comment would have to be duplicated >10 times. Do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
… or appear at the top of the function. But it's still a code comment, not a docstring. docstrings are written for users of the function. Code comments are written for readers of the code.
src/lxml/parser.pxi
Outdated
""" | ||
C-level xmlDoc* result is set to NULL if the parser cannot parse the document. | ||
Keep in mind the libxml2-level wellFormed, which starts at 1, is set to 0 if the | ||
parser may be able to parse the document, but there are errors that libxml2 has | ||
to work through. lxml then uses this, along with some other logic, to determine | ||
whether to set lxml-level well_formed to 0 and if that happens, set result to | ||
NULL. Then, if result is NULL, raise a parse error. Some modifications to result | ||
are made. Then the xmlDoc* result is returned. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""" | |
C-level xmlDoc* result is set to NULL if the parser cannot parse the document. | |
Keep in mind the libxml2-level wellFormed, which starts at 1, is set to 0 if the | |
parser may be able to parse the document, but there are errors that libxml2 has | |
to work through. lxml then uses this, along with some other logic, to determine | |
whether to set lxml-level well_formed to 0 and if that happens, set result to | |
NULL. Then, if result is NULL, raise a parse error. Some modifications to result | |
are made. Then the xmlDoc* result is returned. | |
""" | |
""" | |
Determine if we can consider the parsing successful. | |
Handles different error cases, e.g. with 'recover=True' or validation. | |
Frees a parsed document if it turns out not to be acceptable. | |
Raises an appropriate parse error. | |
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly correct. However, I think the sentence Handles different error cases, e.g. with 'recover=True' or validation.
is too simple. For example the block
elif recover or (c_ctxt.wellFormed and
c_ctxt.lastError.level < xmlerror.XML_ERR_ERROR):
well_formed = 1
is not so much describing any particular kind of error as much as saying to carry over libxml2 errors unless in lxml level recovery mode or libxml2 lastError
has a level at least as high as than what seems to be an error level named XML_ERR_ERROR
. And, it's important to understand the underlying roles of the variables c_ctxt.wellFormed
and well_formed
because they're different but made of the same two English words. (I mean English words as opposed to variable names or keywords.)
The reason my docstring was originally long is because the code itself is very complex. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if specifically the case in the code that you copied above is not clear, why not just add a comment to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to add a comment to it rather then a docstring. But where does this leave us on content? Should I put what I had in the original PR, with docstrings replaced by code comments? Or what I mention above in my previous PR comment (not to be confused with code comments)? Or does it matter? I just want to be sure you're okay with whatever I come up with.
src/lxml/parser.pxi
Outdated
""" | ||
This method creates a parser context if there isn't already one. If | ||
creating a new parser context, this function configures it, creates a | ||
libxml2-level parser context, connects the two, and makes some changes | ||
to the libxml2 context. Then, it returns the parser context for the | ||
parser. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""" | |
This method creates a parser context if there isn't already one. If | |
creating a new parser context, this function configures it, creates a | |
libxml2-level parser context, connects the two, and makes some changes | |
to the libxml2 context. Then, it returns the parser context for the | |
parser. | |
""" |
src/lxml/parser.pxi
Outdated
""" | ||
This method configures the lxml-level parser. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""" | |
This method configures the lxml-level parser. | |
""" | |
""" | |
Create and configure an appropriate lxml-level parser context. | |
""" |
src/lxml/parser.pxi
Outdated
""" | ||
This method calls into libxml to configure the libxml2-level parser context, | ||
among other things. | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
""" | |
This method calls into libxml to configure the libxml2-level parser context, | |
among other things. | |
""" | |
""" | |
Create a libxml2-level parser context. | |
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I wrote among other things
is that the lines
self._registerHtmlErrorHandler(c_ctxt)
and
c_ctxt.sax.startDocument = _initSaxDocument
are important and not related to creating the parser context. This is especially true with the line about SAX, which does many things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, there is a libxml2 function to create the context. If the creation was trivial, we wouldn't need this function here just to call it.
""" | |
This method calls into libxml to configure the libxml2-level parser context, | |
among other things. | |
""" | |
""" | |
Create and initialise a libxml2-level parser context. | |
""" |
Better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would advocate getting rid of the docstring entirely and replace it with comments in the right places. We could avoid the words "among other things" by just not putting a comment above the lines with "other things" (for example c_ctxt.sax.startDocument = _initSaxDocument
).
Thanks for the feedback! I'm happy to make the changes you suggested. But first, I want to ask whether there is a better way to make lxml's |
Since you digged into it, you're probably in the best position to explain what information you would have liked to have. Then we could think about how best to present it. |
That makes sense! I'll think more about that. In the meanwhile, I'll accept your suggestions soon so we can merge the docstrings. I'm also going to send two other more minor PRs. Besides the PRs, I'm in the final stages of writing an article on Medium explaining some other work I've done on researching lxml and libxml2 errors. |
I was thinking that instead of these docstrings, I could instead write an unofficial gloss or "guide" to reading the code in a separate article on Medium. My sense is that there is a lot of complexity in lxml and libxml2 and coupling between the two. An unofficial gloss or guide might be able to explain the complexity better. Any thoughts on that? |
That would be a separate article and many people won't expect such an article to exist (and be worth searching for) when starting to read the code. Comments in the code would certainly be more helpful. I write "comments" because code concepts are better explained in comments than in docstrings. docstrings are for users of the library. Comments are for readers of the source code. You can also add to this document: (PS: As you probably notice from that doc page, external documentation tends to get outdated pretty reliably over time.) |
That makes sense. I left a few comments on the specific suggested changes in the PR. Overall, the plan of making this PR about code comments rather than docstrings makes sense to me. It could also make sense to add to the documentation on lxml.de. Does this mean the content I originally wrote might be worth keeping? For example, the parts about how libxml2 internals affect lxml? |
Since it seemed to help you, it probably is. If you split it into code comments in place, we'll see whether they are really helpful or redundant. |
… docstrings and some comments
I converted a number of the docstrings to comments. I also reworded and removed a bunch. I tried to take your feedback into consideration. Let me know what you think! And, no problem if this PR doesn't work out—I will keep my notes for future times I come to read |
@@ -687,6 +702,15 @@ cdef xmlDoc* _handleParseResult(_ParserContext context, | |||
xmlparser.xmlParserCtxt* c_ctxt, |
This comment was marked as spam.
This comment was marked as spam.
Sorry, something went wrong.
Thanks |
Glad you merged this! Thanks! |
I wrote docstrings for a number of functions in
parser.pxi
. Since the docs saidparser.pxi
is "definitely not the right place to start reading lxml's source code," I thought it would be helpful to make it more accessible to newcomers.In particular, I focused on functions that involved the parser context. This includes the context at the lxml/Cython level and the libxml2/C level. I wrote these based on notes I took as I was reading through
parser.pxi
for the first time and learning how lxml's parser wraps libxml2. I include a little bit of detail on libxml2 internals because it appears coupled to lxml and I don't anticipate any changes soon to the libxml2 code I mention.Feedback is welcome.