Annotate tokens with their positions from the source text #98

arseniiv · 2019-07-06T17:27:07Z

In Sprache, you provided IPositionAware and Positioned to make a parse result, well, aware of the position where it was parsed. I find this feature useful for reporting the precise position of a syntax construct in post-parse checks (like “this variable right here wasn’t declared”, as opposed to the same message without a position, which leaves the user to search for the place themselves).

There is Result<T>.Location, but I don’t see how I could apply that to the resulting value via combinators. Could I achieve that here, and how would you advise doing it? (Or maybe I’m looking for the wrong thing, and this should be done another way.)

nblumhardt · 2019-07-07T23:13:45Z

Hi!

There's no built-in combinator; I think it would be reasonably easy to write one, using similar tactics to Sprache's implementation - keen to explore how it might look.

If you want to drop this into your own project I think it's roughly:

interface ILocated
{
    TextSpan Location { get; set; }
}

static TextParser<T> WithLocation<T>(this TextParser<T> parser)
    where T: ILocated
{
    return i => {
        var inner = parser(i);
        if (!inner.HasValue) return inner;
        inner.Value.Location = inner.Location;
        return inner;
    };
}

(Sketched in browser, no idea whether or not this will compile as-is ;-))

HTH,
Nick

arseniiv · 2019-07-07T23:20:54Z

Ah, thank you! I’ll look at it and write back if I run into problems that are hard to fix. (Or if it goes smoothly, anyway.)

arseniiv · 2019-07-08T23:10:50Z

Hi again, I’ve tested this code, and it works like a charm!

…Almost. The Length of every TextSpan returned seems to always be the same (the full length of the string parsed). Is that expected? I used Superpower 2.3.0 from NuGet, and here is my source and some examples.

I’m okay with having only start positions, though. Thanks once more!

nblumhardt · 2019-07-09T01:06:42Z

Thanks for the follow-up! That's great.

I think the proper span length could be reported using something like:

    return i => {
        var inner = parser(i);
        if (!inner.HasValue) return inner;
        inner.Value.Location = inner.Location.Until(inner.Remainder);
        return inner;
    };

Let's leave this open as a nod towards implementing it within Superpower sometime in the future :-)

arseniiv · 2019-07-09T15:30:02Z

This modification works nicely. 🙂

JoeMGomes · 2022-11-21T10:00:19Z

I have been trying to get this to work on a TokenListParser instead of a TextParser, but I can't figure out how to retrieve the token's source position from the parsed TokenListParserResult. I want behaviour similar to the Positioned() method from Sprache, but my grammar is fully tokenized at this stage. Below is my adaptation of the WithLocation method for TokenListParser. Is this possible with the current interface of Superpower? Am I missing something?

public static TokenListParser<TKind, T> WithLocation<TKind, T>(this TokenListParser<TKind, T> parser)
    where T : ILocated
{
    return i => {
        var inner = parser(i);
        if (!inner.HasValue) return inner;
        inner.Value.Location =   // Can't figure out how to retrieve position information from inner
        return inner;
    };
}

nblumhardt · 2022-11-21T23:00:42Z

Hi @JoeMGomes - unfortunately no time to dig in properly but hopefully this helps:

The start index of the match within the input will be inner.Location.First().Position.Absolute.

The exclusive end index will be inner.Remainder.First().Position.Absolute.

In the second case it's also possible that the remainder token list will be empty, which would mean "matched until end of stream".

JoeMGomes · 2022-11-22T10:46:01Z

> The start index of the match within the input will be inner.Location.First().Position.Absolute.

This worked nicely for me! Thank you very much! Any reason for this not to be part of Superpower?

nblumhardt · 2022-11-22T20:33:36Z

That's good to know @JoeMGomes 👍

Just design and implementation time constraints, currently, but it seems like a worthwhile inclusion for the future 👍

Xevion · 2025-02-02T05:21:45Z

Really great thread so far, although I was turned around for so long trying to figure out how to use the code above. That said, I got it working, and I think I've found issues / solutions, with tradeoffs, of course.

> In the second case it's also possible that the remainder token list will be empty, which would mean "matched until end of stream".

This isn't hard to solve (inner.Location.Last()), but after implementing it and adding some test cases for my spans, I got errors due to mismatched whitespace within the matched/expected spans. As it turns out, @nblumhardt's suggested method isn't ideal if you're looking for precise positioning data of your parser within the root source string.

I'll explain: my parser takes in something like this:

prompt my_prompt_identifier {
      "choice_a" {}
      "choice_b" {}
      "choice_c" {}
}

Looking at the spans WithLocation gave me, you would expect each branch of the prompt to be 13 characters long, from the first " to the }. But only the choice_c branch has a span of 13 characters; the other two have a span of 21 characters.

The internal span they acquire looks like this (whitespace replaced with Unicode circle bullets):

"choice_a"•{}
••••••

This is not ideal, especially if you're trying to build something of a compiler or LSP server that needs character-by-character precision. I want the ability to highlight sections of my parsed elements with precision. Any whitespace, comments or otherwise ignored content located after your parser's used tokens would be included in the span!
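To make the difference concrete, here is a minimal self-contained sketch that does not use Superpower at all. The Tok record, the SpanDemo class and both method names are hypothetical stand-ins for illustration; the token offsets assume the first two branches of the example above with CRLF line endings, which would account for the 21.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a lexer token: absolute start offset and length.
public readonly record struct Tok(int Start, int Length)
{
    public int End => Start + Length; // exclusive end offset
}

public static class SpanDemo
{
    // Tokens of:  "choice_a" {}\r\n      "choice_b" {}
    public static readonly Tok[] Tokens =
    {
        new Tok(0, 10), new Tok(11, 1), new Tok(12, 1),   // "choice_a" { }
        new Tok(21, 10), new Tok(32, 1), new Tok(33, 1),  // "choice_b" { }
    };

    // Strategy A: the span ends where the first remaining token starts
    // (or at the end of the input when nothing remains).
    public static (int Start, int End) UntilRemainder(
        IReadOnlyList<Tok> tokens, int first, int count, int sourceLength)
    {
        int next = first + count;
        int end = next < tokens.Count ? tokens[next].Start : sourceLength;
        return (tokens[first].Start, end);
    }

    // Strategy B: the span ends where the last consumed token ends,
    // so trailing whitespace and comments are left out.
    public static (int Start, int End) UntilLastToken(
        IReadOnlyList<Tok> tokens, int first, int count)
    {
        return (tokens[first].Start, tokens[first + count - 1].End);
    }
}
```

For the first branch (three tokens starting at index 0), strategy A gives the span 0 to 21 and strategy B gives 0 to 13; the 8 extra characters are the space-free gap of one CRLF plus six spaces of indentation before the next branch.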

Here's my solution that avoids this:

using System;
using System.Linq;
using Superpower;
using Superpower.Model;

public readonly record struct PositionSpan(int Start = -1, int End = -1) {
    public bool HasValue => Start >= 0 && End >= 0;

    /// <summary>
    /// Apply the position span to acquire a substring of the source string.
    /// </summary>
    public string Substring(string source) {
        if (!HasValue) throw new InvalidOperationException("PositionSpan must have a value to be used.");
        return source.Substring(Start, End - Start);
    }
}

public interface ILocated {
    PositionSpan Span { set; }
}

public static class ParserExtensions {
    public static TokenListParser<TKind, T> WithLocation<TKind, T>(this TokenListParser<TKind, T> parser)
        where T : ILocated {
        return i => {
            var inner = parser(i);
            if (!inner.HasValue) return inner;


            var last = inner.Remainder.Any()
                // inner.Location contains both tokens used and the remainder, so get the last 'used' token
                ? inner.Location.ElementAt(inner.Location.Count() - inner.Remainder.Count() - 1)
                // inner.Location is 100% used, just grab the last one
                : inner.Location.Last();

            inner.Value.Span = new PositionSpan(inner.Location.First().Position.Absolute,
                last.Position.Absolute + last.Span.Length);

            return inner;
        };
    }
}

You'll notice that I abandoned TextSpan. This is because I don't know of any way to access the full source string and recalculate the necessary line/column data here. If there's a way, maybe someone would like to add to my solution or show me, but the benefits of continued use of TextSpan aren't key to my implementation.
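If the original source string is still at hand wherever the spans are consumed (for example, where the parse is invoked), line and column can be recovered afterwards from the absolute offsets. A minimal sketch, assuming 1-based lines and columns and '\n' as the only line terminator; LineColumn.FromAbsolute is a hypothetical helper, not part of Superpower, and Superpower's own Position may count differently:

```csharp
using System;

public static class LineColumn
{
    // Returns the 1-based (line, column) of a 0-based absolute offset into source.
    // Only '\n' starts a new line; a '\r' before it counts as a column on the old line.
    public static (int Line, int Column) FromAbsolute(string source, int absolute)
    {
        if (source is null) throw new ArgumentNullException(nameof(source));
        if (absolute < 0 || absolute > source.Length)
            throw new ArgumentOutOfRangeException(nameof(absolute));

        int line = 1;
        int lineStart = 0; // offset of the first character of the current line
        for (int i = 0; i < absolute; i++)
        {
            if (source[i] == '\n')
            {
                line++;
                lineStart = i + 1;
            }
        }
        return (line, absolute - lineStart + 1);
    }
}
```

This scans from the start of the string on every call, which is fine for occasional error reporting; for frequent lookups, a precomputed array of line-start offsets with a binary search would be the usual refinement.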

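Note: as computed here, PositionSpan covers exactly the matched text: Start is the index of its first character and End is the index just past its last character (Substring uses End - Start as the length).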

For completeness, here is my implementation of the TextParser extension, too:

    public static TextParser<T> WithLocation<T>(this TextParser<T> parser)
        where T : ILocated {
        return i => {
            var inner = parser(i);
            if (!inner.HasValue) return inner;
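            // End at the start of the remainder: inner.Location.Length spans all of the
            // remaining input (as found in the 2019 comments above), not just the match.
            inner.Value.Span = new PositionSpan(
                inner.Location.Position.Absolute, inner.Remainder.Position.Absolute
            );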
            return inner;
        };
    }

Lastly, because this thread didn't include a good example of how WithLocation is actually used, here's one for the TokenListParser extension, one that is actually kind of complicated:

public static readonly TokenListParser<DialogueToken, IPureNode> PromptNodeParser =
    (from promptKeyword in Token.EqualTo(DialogueToken.PromptNodeKeyword)
        from promptIdentifier in Token.EqualTo(DialogueToken.Identifier)
        from openBracket in Token.EqualTo(DialogueToken.OpenBracket)
        from branches in (
            from branchIdentifier in StringParser
            from openBracket in Token.EqualTo(DialogueToken.OpenBracket)
            from innerNodes in NodeParser
            from closeBracket in Token.EqualTo(DialogueToken.CloseBracket)
            select new PurePromptBranch(branchIdentifier, innerNodes)
        ).WithLocation().Many()
        from closeBracket in Token.EqualTo(DialogueToken.CloseBracket)
        select new PurePromptNode(promptIdentifier.ToStringValue(), branches) as IPureNode).WithLocation();

Both IPureNode and the PurePromptBranch record struct inherit from ILocated. Any TokenListParser whose T is such a type (explicitly or, as with the anonymous inner branch parser, implicitly) can use .WithLocation().

nblumhardt added enhancement up-for-grabs labels Dec 2, 2020

nblumhardt changed the title ~~Question: an analog of Positioned from Sprache?~~ Annotate tokens with their positions from the source text Jun 17, 2024
