Skip to content

Conversation

@Crozzers
Copy link
Contributor

@Crozzers Crozzers commented Dec 7, 2025

This PR fixes #632, fixes #645, fixes #652, fixes #653, fixes #654.

Many of the em and strong related issues that have been raised in the past couple of months have been weird edge cases that can technically be considered valid according to the original markdown spec. However, the original spec is very loose, and most markdown processors these days have an extended set of rules to handle these cases.

This PR adds a new italics and bold processing extra that aims to mostly implement the GFM emphasis rules, which are very comprehensive and cover pretty much every edge case.

What has been done

I've added a new GFMItalicAndBoldProcessor class that implements most of GFMs rules and processes em and strong. This is now being used by the main Markdown class, instead of the old strong_re and em_re regexes.

I've left the existing ItalicAndBoldProcessor class in place. The new class does not work the same way as the old one, and whilst it wouldn't break any custom extras implemented (ie: throw an error), they would simply no longer work.
This library doesn't really do semantic versioning, so for these reasons I've left it in place.

I've also added a new gfm_emphasis test case, which uses many snippets from the GFM spec, as well as examples from issues raised.

How the new class works

Previously we would match entire emphasis spans with regex, trying to filter out edge cases in the regex and failing that, in the sub function.

Now, we match delimiter runs, which are one or more *_ chars. For each delimiter run we decide, based on a set of rules, whether that run is an opening (left flanking) run or a closing (right flanking) run (or both). These rules are detailed in the GFM spec and implemented in code.

We do a few validity checks, like making sure our match does not cross span borders (eg: *not a</span> valid em*) before checking if the delimiters are balanced.

In a normal em/strong, this is not a concern but for things like consecutive ems (*foo**bar*) or nested em/strong (***em*strong**), not all the delimiter runs are the same size. Depending on which delimiter is larger, and what the upcoming delimiter runs look like, we may do some special processing on these runs.

Finally, once we've decided what our opening and closing runs are, we process them and return the final text

Deviations from previous behaviour and GFM spec

One major deviation from the previous behaviour is that nested em and strong is now accepted.

__**foo**__ -> <strong><strong>foo</strong></strong>

And compared to the GFM spec, we still allow consecutive ems

# us
*foo**bar* -> <em>foo</em><em>bar</em>
# GFM
*foo**bar* -> <em>foo**bar</em>

I think both of these deviations can be put down to a matter of taste or style, I'm not sure one or the other will be objectively right or wrong.

Performance

It remains to be seen what perf is like in the real world. Comparing a previous test run to this PR, the main test suite seems to complete in a comparable amount of time.

The only area I see where performance has changed significantly is for the ReDoS case highlighted in #493.
In previous runs the library handles this in around 0.8 seconds, whereas now it seems to be 2.3-2.7s for an input of around 387,300 characters.
Not catastrophic, but certainly notable.

I've added some caching to the new class to mitigate this (previous result was around 10s which is a fail), but I'll see if I can improve it a bit more

@Crozzers Crozzers marked this pull request as ready for review December 7, 2025 19:32
@Crozzers
Copy link
Contributor Author

Crozzers commented Dec 7, 2025

For reviewers, I'd recommend casting an eye over the gfm_emphasis test case, comparing the text to the HTML version and just double checking that all of those conversions are as expected, although I haven't modified the existing em/strong test cases so all previous behaviour should work as expected

@nicholasserra
Copy link
Collaborator

This is wild, thank you. Gave it a good look and LGTM. Anything else you wanna sneak in before it's merged?

@Crozzers
Copy link
Contributor Author

No. I was meaning to take another look at performance but haven't had the time. I'll submit a follow-up PR if I find anything

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants