Rewrite emphasis and strong processing to be more GFM compliant #665
This PR fixes #632, fixes #645, fixes #652, fixes #653, fixes #654.
Many of the em- and strong-related issues raised in the past couple of months have been weird edge cases that are technically valid according to the original Markdown spec. However, the original spec is very loose, and most Markdown processors these days have an extended set of rules to handle these cases.
This PR adds a new italics and bold processing extra that aims to mostly implement the GFM emphasis rules, which are very comprehensive and cover pretty much every edge case.
## What has been done
I've added a new `GFMItalicAndBoldProcessor` class that implements most of GFM's rules and processes em and strong. This is now used by the main `Markdown` class, instead of the old `strong_re` and `em_re` regexes.

I've left the existing `ItalicAndBoldProcessor` class in place. The new class does not work the same way as the old one, and whilst it wouldn't break any custom extras built on it (i.e. throw an error), they would simply no longer work. This library doesn't really do semantic versioning, so for these reasons I've left the old class in place.
I've also added a new `gfm_emphasis` test case, which uses many snippets from the GFM spec, as well as examples from the issues raised.

## How the new class works
Previously we would match entire emphasis spans with regex, trying to filter out edge cases in the regex and, failing that, in the `sub` function.

Now we match delimiter runs, which are one or more `*` or `_` characters. For each delimiter run we decide, based on a set of rules, whether that run is an opening (left-flanking) run, a closing (right-flanking) run, or both. These rules are detailed in the GFM spec and implemented in code; a condensed sketch follows.
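To make that concrete, here's a simplified Python illustration of the spec's flanking test (not the code in this PR; the `_` delimiter also carries extra intraword restrictions that are elided here):

```python
import re
import unicodedata

ASCII_PUNCT = set("!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")

def _is_punct(ch):
    # GFM counts ASCII punctuation plus the Unicode P* categories
    return ch in ASCII_PUNCT or unicodedata.category(ch).startswith("P")

def classify_run(text, start, end):
    """Classify text[start:end] (a run of '*' or '_') as (can_open, can_close)."""
    before = text[start - 1] if start > 0 else " "  # treat the edges as whitespace
    after = text[end] if end < len(text) else " "
    left_flanking = (not after.isspace()) and (
        not _is_punct(after) or before.isspace() or _is_punct(before))
    right_flanking = (not before.isspace()) and (
        not _is_punct(before) or after.isspace() or _is_punct(after))
    return left_flanking, right_flanking

text = "*foo**bar*"
for m in re.finditer(r"\*+|_+", text):
    print(m.group(), classify_run(text, m.start(), m.end()))
# * (True, False)   ** (True, True)   * (False, True)
```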
We do a few validity checks, like making sure our match does not cross span borders (eg: `*not a</span> valid em*`), before checking whether the delimiters are balanced. In a normal em/strong this is not a concern, but for things like consecutive ems (`*foo**bar*`) or nested em/strong (`***em*strong**`), not all the delimiter runs are the same size. Depending on which delimiter run is larger, and what the upcoming delimiter runs look like, we may do some special processing on these runs (a toy sketch of the idea is below).

Finally, once we've decided what our opening and closing runs are, we process them and return the final text.
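As a toy illustration of the balancing idea (again, not the PR's actual code): pair each closing run against the nearest opener, consume `min(opener, closer)` delimiters, and leave any remainder in play for later runs. That is what makes the nested case fall out naturally:

```python
def pair_runs(runs):
    """runs: list of (length, can_open, can_close) delimiter runs.
    Returns (opener_idx, closer_idx, delimiters_used) tuples;
    used == 1 means em, used == 2 means strong."""
    stack, pairs = [], []
    for i, (length, can_open, can_close) in enumerate(runs):
        if can_close and stack:
            j, open_len = stack.pop()
            used = min(open_len, length)
            pairs.append((j, i, used))
            if open_len > used:            # opener has delimiters left over
                stack.append((j, open_len - used))
            length -= used
        if can_open and length:
            stack.append((i, length))
    return pairs

# "***em*strong**" has runs of length 3 (opens), 1 (both) and 2 (closes):
print(pair_runs([(3, True, False), (1, True, True), (2, False, True)]))
# [(0, 1, 1), (0, 2, 2)] -> an em nested inside a strong:
# <strong><em>em</em>strong</strong>
```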
## Deviations from previous behaviour and GFM spec
One major deviation from the previous behaviour is that nested em and strong (as in `***em*strong**` above) is now accepted. And compared to the GFM spec, we still allow consecutive ems (as in `*foo**bar*`). I think both of these deviations come down to a matter of taste or style; I'm not sure either one is objectively right or wrong.
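To see both deviations through the public API (output shapes hedged; the authoritative expectations live in the `gfm_emphasis` test case):

```python
import markdown2

# Nested em/strong is accepted, roughly <strong><em>em</em>strong</strong>:
print(markdown2.markdown("***em*strong**"))

# Consecutive ems are still allowed, roughly <em>foo</em><em>bar</em>,
# where GFM's "multiple of 3" rule would instead leave the ** literal:
print(markdown2.markdown("*foo**bar*"))
```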
## Performance
It remains to be seen what perf is like in the real world. Comparing a previous test run to this PR, the main test suite seems to complete in a comparable amount of time.
The only area I see where performance has changed significantly is for the ReDoS case highlighted in #493.
In previous runs the library handled this in around 0.8 seconds, whereas now it takes around 2.3-2.7s for an input of roughly 387,300 characters.
Not catastrophic, but certainly notable.
I've added some caching to the new class to mitigate this (before caching, the result was around 10s, which is a fail), but I'll see if I can improve it a bit more.
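For reference, the general shape of that mitigation (a hypothetical sketch of the caching idea, not necessarily what the class does internally): the flanking test depends only on the two neighbouring characters, so memoizing on that pair makes repeated runs in a pathological input cheap:

```python
import unicodedata
from functools import lru_cache

@lru_cache(maxsize=None)
def classify_context(before, after):
    """Cached flanking test, keyed only by the two neighbouring characters."""
    def is_punct(ch):
        return unicodedata.category(ch).startswith("P")
    left = (not after.isspace()) and (
        not is_punct(after) or before.isspace() or is_punct(before))
    right = (not before.isspace()) and (
        not is_punct(before) or after.isspace() or is_punct(after))
    return left, right
```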