Skip to content

Refactor KaTeX parsing of inline styles and vlists; normalize survey output a bit more #1722

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 24 commits into
base: main
Choose a base branch
from

Conversation

gnprice
Copy link
Member

@gnprice gnprice commented Jul 19, 2025

Now that the KaTeX support we've had in recent releases is merged into main (with #1559) (*), I've gone and made a number of the refactors that I'd been contemplating for this code but hadn't wanted to do before those were merged. In this branch:

  • Parse each element's inline styles in a more direct way, letting the parsing logic for different types of elements apply their respective context of what styles are expected. This also means less building and copying of generic KatexSpanStyles objects.
  • Simplify some of the vlist parsing code, notably around margins.
  • Split the parsing of struts and vlists into their own respective methods.
  • At the start of the branch, a few adjustments to make the output of tools/content/check-features katex-check more stable and easier to compare across changes to the implementation. This helped me validate that on an empirical corpus, none of the above changes caused any KaTeX expressions to stop being successfully parsed.

In particular I ran tools/content/check-features katex-check near the start of the branch, after the first three commits:
8c4106d katex [nfc]: Show line numbers only on unknown hard-fails, not others
d269628 katex [nfc]: Add messages for remaining hard-fail cases seen in corpus
e74557b tools/content [nfc]: Tie-break on reason text when number of failures equal

and again at the end of the branch. I then compared the output with commands like:

$ diff -U2 <(grep -P 'failed|Because' tmp/tex.old) \
           <(grep -P 'failed|Because' tmp/tex.new)

There were no changes in the total number of messages or KaTeX expressions where the parser failed. For a small number of expressions, there were changes in the set of soft failures reported before reaching a hard failure; the reasons for those are in a couple of individual commit messages. The full diff was (with a denominator of "28109 of them were KaTeX containing messages and 3370 of those failed"):

@@ -38,4 +38,6 @@
   Because of unsupported css class: frac-line:
     335 messages failed.
+  Because of unsupported css class: delimcenter:
+    184 messages failed.
   Because of unsupported css class: mtable:
     177 messages failed.
@@ -90,6 +92,4 @@
   Because of unsupported css class: cd-arrow-pad:
     8 messages failed.
-  Because of unsupported inline css property: border-right-width:
-    5 messages failed.
   Because of unsupported css class: boxpad:
     4 messages failed.
@@ -102,4 +102,6 @@
   Because of unsupported css class: rule:
     3 messages failed.
+  Because of unsupported inline css property: border-right-width:
+    3 messages failed.
   Because of unsupported inline css property: border-top-width:
     3 messages failed.
@@ -120,5 +122,5 @@
   Because of unsupported css class: textup:
     2 messages failed.
-  Because of unsupported inline css property: border-right-style:
+  Because of unsupported css class: vertical-separator:
     2 messages failed.
   Because of unsupported css class: mover:

(*) One change is still outstanding, in #1720. But it won't much interact with the main logic.

Selected commit messages

8c4106d katex [nfc]: Show line numbers only on unknown hard-fails, not others

This makes the output of the survey script more stable as our KaTeX
parser gets refactored and otherwise edited.

d269628 katex [nfc]: Add messages for remaining hard-fail cases seen in corpus

e74557b tools/content [nfc]: Tie-break on reason text when number of failures equal

This helps make the output more stable from run to run, so that
it's easier to spot changes (or confirm the absence of changes)
when editing the code.

16deb86 katex [nfc]: Cut stray import of widgets library

This was here only for a reference in a doc. In general we try to
avoid imports of widgets from model code; it's an inversion of layers.

46c74e7 katex [nfc]: Fix a variable name to specify its units, namely em

e223229 katex: Require height on pstrut spans

If this were missing, it's not clear to me that zero would be an
appropriate default.

In any case, in an empirical corpus, it's always present.
So just require that.

b62979c katex [nfc]: Separate _parseInlineStyles from constructing a KatexSpanStyles

Also document the existing _parseSpanInlineStyles method, and
describe our plan for eliminating most of its call sites.

a88de14 katex [nfc]: Push more parsing into _parseInlineStyles; return Map

This has a small effect on the survey script's list of failure
reasons: in the rare case that the "unexpected shape of inline CSS"
error appears, it now fires before any of the CSS properties in the
same inline style are processed. That can mean fewer entries added
to unsupportedInlineCssProperties.

This difference is only possible, though, on a KaTeX expression that
is going to reach the same hard failure either way. So it has no
effect on behavior seen by a user.

ade7f66 katex: Fix a misleading log line: unexpected CSS property, not value

Until the previous commit, this bit of code was handling both the case
where the value was unexpected (for which this message was accurate)
and the case where the property itself was unexpected (for which it
wasn't). Now the first case is handled elsewhere, so fix the
remaining case.

9c32bd6 katex [nfc]: Skip building whole KatexSpanStyles for struts' two properties

We know at this spot that there are just two specific CSS properties
we expect to see, and we'll end up handling them directly rather than
through a KatexSpanStyles object. So parse them directly, rather
than build a whole KatexSpanStyles object (and then another one
with .filter()).

6c2e55c katex [nfc]: Cut vertical-align from generic style properties

All the remaining call sites of _parseSpanInlineStyles would throw
anyway if this property were actually found. We only expect it
in a specific context, namely a strut.

f12c749 katex [nfc]: Check "only has height" directly, without making KatexSpanStyles

This lets us skip allocating these objects (two in each case -- the
second one comes from .filter()). We also get to skip parsing the
value of height, since we don't intend to use it.

e1a2c9c katex [nfc]: Get pstrut height directly, without making KatexSpanStyles

7443e1a katex [nfc]: Directly handle expected inline styles on vlist child

There's only a handful of specific properties we expect to see on this
type of span; so handle those explicitly.

In fact, making this list explicit brings to light that there's one
property here which doesn't actually appear on KaTeX's vlist children:
height. We'll cut that in a separate non-NFC commit.

This will also open up ways to simplify the interaction between this
and the margin-handling logic below.

8e059b3 katex: Don't expect height on vlist child spans

These spans are highly structured; the only properties that go
into their inline styles are top, margin-left, and margin-right.

e61c6c7 katex [nfc]: Consolidate logic for computing overall styles of KatexSpanNode

This just pulls these three pieces of closely-related logic next to
each other. That will make it easier to refactor them further.

This causes one change in the survey script's list of failure reasons:
when the delimcenter class occurs with an inline top property, we
now record the unsupported class before reaching the hard fail for the
unsupported property. This has no user-visible effect, though,
because it can only happen when the expression is going to reach that
hard failure either way.

77b1c2f katex [nfc]: Inline remaining/main use of _parseSpanInlineStyles

And inline the effect of the merge method, eliminating that
method too.

This way we get to construct just one KatexSpanStyles object, rather
than constructing three of them when inline styles are present.

636ac6a katex [nfc]: Construct vlist child's styles directly, without filter

c744e45 katex [nfc]: Note that heightEm might turn out not to be needed

26ac991 katex [nfc]: Split out _parseStrut, _parseVlist, _parseGenericSpan

Each of these swathes of logic has no interaction with the others.
Splitting them into their own methods makes that structure easy for
the reader to see.

gnprice added 24 commits July 18, 2025 20:54
This makes the output of the survey script more stable as our KaTeX
parser gets refactored and otherwise edited.
… equal

This helps make the output more stable from run to run, so that
it's easier to spot changes (or confirm the absence of changes)
when editing the code.
This was here only for a reference in a doc.  In general we try to
avoid imports of widgets from model code; it's an inversion of layers.
This was here only for references in docs.  Better to avoid
the layer-inverting import.
If this were missing, it's not clear to me that zero would be an
appropriate default.

In any case, in an empirical corpus, it's always present.
So just require that.
…nStyles

Also document the existing _parseSpanInlineStyles method, and
describe our plan for eliminating most of its call sites.
This has a small effect on the survey script's list of failure
reasons: in the rare case that the "unexpected shape of inline CSS"
error appears, it now fires before any of the CSS properties in the
same inline style are processed.  That can mean fewer entries added
to unsupportedInlineCssProperties.

This difference is only possible, though, on a KaTeX expression that
is going to reach the same hard failure either way.  So it has no
effect on behavior seen by a user.
Also make the error message for this case a bit more specific.
Until the previous commit, this bit of code was handling both the case
where the value was unexpected (for which this message was accurate)
and the case where the property itself was unexpected (for which it
wasn't).  Now the first case is handled elsewhere, so fix the
remaining case.
…erties

We know at this spot that there are just two specific CSS properties
we expect to see, and we'll end up handling them directly rather than
through a KatexSpanStyles object.  So parse them directly, rather
than build a whole KatexSpanStyles object (and then another one
with `.filter()`).
All the remaining call sites of _parseSpanInlineStyles would throw
anyway if this property were actually found.  We only expect it
in a specific context, namely a strut.
…anStyles

This lets us skip allocating these objects (two in each case -- the
second one comes from `.filter()`).  We also get to skip parsing the
value of `height`, since we don't intend to use it.
There's only a handful of specific properties we expect to see on this
type of span; so handle those explicitly.

In fact, making this list explicit brings to light that there's one
property here which doesn't actually appear on KaTeX's vlist children:
height.  We'll cut that in a separate non-NFC commit.

This will also open up ways to simplify the interaction between this
and the margin-handling logic below.
These spans are highly structured; the only properties that go
into their inline styles are top, margin-left, and margin-right.
…panNode

This just pulls these three pieces of closely-related logic next to
each other.  That will make it easier to refactor them further.

This causes one change in the survey script's list of failure reasons:
when the `delimcenter` class occurs with an inline `top` property, we
now record the unsupported class before reaching the hard fail for the
unsupported property.  This has no user-visible effect, though,
because it can only happen when the expression is going to reach that
hard failure either way.
And inline the effect of the `merge` method, eliminating that
method too.

This way we get to construct just one KatexSpanStyles object, rather
than constructing three of them when inline styles are present.
This map pattern syntax looks an awful lot like it's saying that
no other keys should be present -- after all, that's what the
corresponding syntax in a list pattern would mean.  So I initially
read this to mean that this code would ignore the inline styles
if any other attribute was present on the element; which wouldn't
be desirable logic.

In fact it's just saying that this one key should be present and
match the given pattern.  But there are simpler ways to say that;
so use one.
Each of these swathes of logic has no interaction with the others.
Splitting them into their own methods makes that structure easy for
the reader to see.
@gnprice gnprice added the maintainer review PR ready for review by Zulip maintainers label Jul 19, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintainer review PR ready for review by Zulip maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants