Rewrite emphasis and strong processing to be more GFM compliant #665

Crozzers · 2025-12-07T18:21:27Z

This PR fixes #632, fixes #645, fixes #652, fixes #653, fixes #654.

Many of the em and strong related issues that have been raised in the past couple of months have been weird edge cases that can technically be considered valid according to the original markdown spec. However, the original spec is very loose, and most markdown processors these days have an extended set of rules to handle these cases.

This PR adds a new italics and bold processing extra that aims to mostly implement the GFM emphasis rules, which are very comprehensive and cover pretty much every edge case.

What has been done

I've added a new GFMItalicAndBoldProcessor class that implements most of GFM's rules and processes em and strong. The main Markdown class now uses it instead of the old strong_re and em_re regexes.

I've left the existing ItalicAndBoldProcessor class in place. The new class does not work the same way as the old one, so whilst swapping it in wouldn't break any custom extras built on the old class (i.e. they wouldn't throw an error), those extras would simply stop working.
This library doesn't really do semantic versioning, so for these reasons I've left the old class in place.

I've also added a new gfm_emphasis test case, which uses many snippets from the GFM spec, as well as examples from issues raised.

How the new class works

Previously we would match entire emphasis spans with regex, trying to filter out edge cases in the regex and, failing that, in the sub function.

Now, we match delimiter runs, which are one or more * or _ characters. For each delimiter run we decide, based on a set of rules, whether that run is an opening (left-flanking) run, a closing (right-flanking) run, or both. These rules are detailed in the GFM spec and implemented in code.
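To make the flanking rules concrete, here is a minimal sketch of how delimiter runs could be found and classified, following the left-flanking and right-flanking definitions in the GFM spec. It is not the code from this PR: the names DELIMITER_RUN, is_whitespace, is_punctuation and classify_runs are hypothetical, and the spec's extra can-open/can-close restrictions for _ are left out.

```python
import re
import string
import unicodedata

# A delimiter run is one or more consecutive `*`, or one or more consecutive `_`.
DELIMITER_RUN = re.compile(r"\*+|_+")


def is_whitespace(ch: str) -> bool:
    # The start and end of the text (passed here as "") count as whitespace.
    return ch == "" or ch.isspace()


def is_punctuation(ch: str) -> bool:
    # ASCII punctuation, or any character in a Unicode "P*" category.
    return ch != "" and (
        ch in string.punctuation or unicodedata.category(ch).startswith("P")
    )


def classify_runs(text: str):
    """Return (run, left_flanking, right_flanking) for each delimiter run."""
    results = []
    for match in DELIMITER_RUN.finditer(text):
        before = text[match.start() - 1] if match.start() > 0 else ""
        after = text[match.end()] if match.end() < len(text) else ""
        # Left-flanking: not followed by whitespace, and either not followed
        # by punctuation or preceded by whitespace/punctuation.
        left = not is_whitespace(after) and (
            not is_punctuation(after)
            or is_whitespace(before)
            or is_punctuation(before)
        )
        # Right-flanking: the mirror image of the rule above.
        right = not is_whitespace(before) and (
            not is_punctuation(before)
            or is_whitespace(after)
            or is_punctuation(after)
        )
        results.append((match.group(), left, right))
    return results
```

For example, classify_runs("*foo*") reports the first run as left-flanking only and the second as right-flanking only, while the single run in "foo*bar" is both.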

We do a few validity checks, such as making sure our match does not cross span borders (e.g. *not a</span> valid em*), before checking whether the delimiters are balanced.
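One way to picture the span border check is a function that rejects a candidate emphasis body when it contains an unmatched opening or closing tag. This is only an illustrative sketch: TAG and crosses_span_border are hypothetical names, the tag regex is deliberately simplistic, and the actual check in the new class may work quite differently.

```python
import re

# Very rough HTML tag matcher: optional leading "/", tag name, optional
# attributes, optional trailing "/" for self-closing tags.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>")


def crosses_span_border(inner: str) -> bool:
    """Return True if `inner` opens or closes a tag it does not also balance."""
    stack = []
    for match in TAG.finditer(inner):
        closing, name, self_closing = match.group(1), match.group(2).lower(), match.group(3)
        if self_closing:
            continue  # e.g. <br/> neither opens nor closes a span
        if closing:
            if not stack or stack[-1] != name:
                return True  # closes a tag that was opened outside `inner`
            stack.pop()
        else:
            # Note: void elements written without "/" (e.g. <br>) are
            # treated as unclosed openers by this simplified sketch.
            stack.append(name)
    return bool(stack)  # any leftover opener is closed outside `inner`
```

With this sketch, the body of *not a</span> valid em* is rejected because its closing tag has no opener inside the candidate span.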

Balance is not a concern in a normal em/strong, but for things like consecutive ems (*foo**bar*) or nested em/strong (***em*strong**), not all the delimiter runs are the same size. Depending on which delimiter run is larger, and what the upcoming delimiter runs look like, we may do some special processing on these runs.
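As a rough illustration of how runs of different sizes can be paired, the GFM spec's algorithm consumes two characters from each side (strong) when both runs have at least two, and one character from each side (em) otherwise. The sketch below shows only that step. pair_runs is a hypothetical helper, it ignores the spec's other restrictions, and it does not reproduce the special processing this PR applies (for instance for consecutive ems).

```python
def pair_runs(opener_len: int, closer_len: int):
    """Pair an opening run with a closing run, GFM-spec style.

    Returns (tag, opener_remaining, closer_remaining).
    """
    if opener_len < 1 or closer_len < 1:
        raise ValueError("both runs must contain at least one delimiter")
    # Strong needs two delimiters on each side; otherwise fall back to em.
    used = 2 if opener_len >= 2 and closer_len >= 2 else 1
    tag = "strong" if used == 2 else "em"
    return tag, opener_len - used, closer_len - used
```

For ***em*strong**, the opening run of 3 first meets a closing run of 1, giving an em and leaving 2 opening delimiters. Those 2 then pair with the final closing run of 2 to give the enclosing strong.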

Finally, once we've decided what our opening and closing runs are, we process them and return the final text.

Deviations from previous behaviour and GFM spec

One major deviation from the previous behaviour is that nested em and strong is now accepted.

__**foo**__ -> <strong><strong>foo</strong></strong>

Compared to the GFM spec, we still allow consecutive ems:

# us
*foo**bar* -> <em>foo</em><em>bar</em>
# GFM
*foo**bar* -> <em>foo**bar</em>

I think both of these deviations can be put down to a matter of taste or style. I'm not sure one or the other is objectively right or wrong.

Performance

It remains to be seen what perf is like in the real world. Comparing a previous test run to this PR, the main test suite seems to complete in a comparable amount of time.

The only area I see where performance has changed significantly is for the ReDoS case highlighted in #493.
In previous runs the library handled this in around 0.8 seconds, whereas now it seems to take 2.3-2.7s for an input of around 387,300 characters.
Not catastrophic, but certainly notable.

I've added some caching to the new class to mitigate this (before caching, the result was around 10s, which is a fail), but I'll see if I can improve it a bit more.
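To show why caching helps on repetitive input, here is a small sketch that memoises a pure per-character helper with functools.lru_cache. The helper char_class and the driver classify_text are hypothetical and stand in for whichever internal functions the new class actually caches.

```python
from functools import lru_cache
import string
import unicodedata


@lru_cache(maxsize=None)
def char_class(ch: str) -> str:
    """Classify a single neighbouring character for the flanking rules."""
    if ch == "" or ch.isspace():
        return "whitespace"
    if ch in string.punctuation or unicodedata.category(ch).startswith("P"):
        return "punctuation"
    return "other"


def classify_text(text: str):
    # On highly repetitive (ReDoS-style) input, almost every call is a
    # cache hit because only a handful of distinct characters occur.
    return [char_class(ch) for ch in text]
```

After classify_text("*a " * 1000), char_class.cache_info() reports 3 misses and 2997 hits, since only three distinct characters appear in the input.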

Crozzers · 2025-12-07T19:37:09Z

For reviewers, I'd recommend casting an eye over the gfm_emphasis test case, comparing the text to the HTML version and double-checking that all of those conversions are as expected. I haven't modified the existing em/strong test cases, so all previous behaviour should work as expected.

nicholasserra · 2025-12-11T20:11:55Z

This is wild, thank you. Gave it a good look and LGTM. Anything else you wanna sneak in before it's merged?

Crozzers · 2025-12-12T18:51:08Z

No. I was meaning to take another look at performance but haven't had the time. I'll submit a follow-up PR if I find anything

Crozzers added 9 commits November 30, 2025 09:49

Re-write the IAB processor to implement GFM rules

bf0a1e2

Get closer to GFM compliance

6ade9ab

Iron out some GFM edge cases

b3e512d

Acheive near full GFM compliance on iab

366ad8c

Acheive near complete GFM compliance

e8e7ced

Refactor inheritants of original IABP to use new GFM variant.

5988b09

Also refactor the GFM class to be more readable

Add issues 645, 652, 653 and 654 to gfm test case

060d48d

Improve performance in repetitive (ReDoS) scenarios by caching some IAB internal functions

3b19616

Fix python typing syntax error

749c9cb

Crozzers marked this pull request as ready for review December 7, 2025 19:32
