Annotate tokens with their positions from the source text #98

Open
arseniiv opened this issue Jul 6, 2019 · 10 comments

@arseniiv

arseniiv commented Jul 6, 2019

In Sprache, you provided IPositionAware and Positioned to make a parse result, well, aware of the position where it was parsed. I find this feature useful for reporting the precise position of a syntax construct in post-parse checks (like “this variable right here wasn’t declared”, as opposed to the same message with no position, which leaves the user to search for the place themselves).

There is Result<T>.Location, but I don’t see how I could apply that to the resulting value via combinators. Could I achieve it here, and how would you advise doing it? (Or maybe I’m looking for the wrong thing, and this should be done another way.)

@nblumhardt
Member

Hi!

There's no built-in combinator; I think it would be reasonably easy to write one, using similar tactics to Sprache's implementation - keen to explore how it might look.

If you want to drop this into your own project I think it's roughly:

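using Superpower;
using Superpower.Model;

interface ILocated
{
    TextSpan Location { get; set; }
}

// Extension methods have to live in a static class.
static class LocationExtensions
{
    // On success, copies the result's location onto the parsed value.
    public static TextParser<T> WithLocation<T>(this TextParser<T> parser)
        where T : ILocated
    {
        return i => {
            var inner = parser(i);
            if (!inner.HasValue) return inner;
            inner.Value.Location = inner.Location;
            return inner;
        };
    }
}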

(Sketched in browser, no idea whether or not this will compile as-is ;-))

HTH,
Nick

@arseniiv
Author

arseniiv commented Jul 7, 2019

Ah, thank you! I’ll look at it and write back if I run into problems that are hard to fix. (Or if it goes smoothly, anyway.)

@arseniiv
Author

arseniiv commented Jul 8, 2019

Hi again, I’ve tested this code, and it works like a charm!

…Almost. The length of every TextSpan returned always seems to be the same (the full length of the string parsed). Is that expected? I used Superpower 2.3.0 from NuGet, and here is my source and some examples.

I’m okay with having only start positions, though. Thanks once more!

@nblumhardt
Member

Thanks for the follow-up! That's great.

I think the proper span length could be reported using something like:

    return i => {
        var inner = parser(i);
        if (!inner.HasValue) return inner;
        inner.Value.Location = inner.Location.Until(inner.Remainder);
        return inner;
    };

Let's leave this open as a nod towards implementing it within Superpower sometime in the future :-)

@arseniiv
Author

arseniiv commented Jul 9, 2019

This modification works nicely. 🙂

@JoeMGomes

I have been trying to get this to work on a TokenListParser instead of a TextParser, but I can't figure out how to retrieve the token's source position from the parsed TokenListParserResult. I want behaviour similar to the Positioned() method from Sprache, but my grammar is fully tokenized at this stage. Below is my adaptation of the WithLocation method for TokenListParser. Is this possible with the current interface of Superpower? Am I missing something?

public static TokenListParser<TKind, T> WithLocation<TKind, T>(this TokenListParser<TKind, T> parser)
    where T : ILocated
{
    return i => {
        var inner = parser(i);
        if (!inner.HasValue) return inner;
        inner.Value.Location =   // Can't figure out how to retrieve position information from inner
        return inner;
    };
}

@nblumhardt
Member

Hi @JoeMGomes - unfortunately no time to dig in properly but hopefully this helps:

The start index of the match within the input will be inner.Location.First().Position.Absolute.

The exclusive end index will be inner.Remainder.First().Position.Absolute.

In the second case it's also possible that the remainder token list will be empty, which would mean "matched until end of stream".

@JoeMGomes

> The start index of the match within the input will be inner.Location.First().Position.Absolute.

This worked nicely for me! Thank you very much! Any reason for this not to be part of Superpower?

@nblumhardt
Member

That's good to know @JoeMGomes 👍

Just design and implementation time constraints, currently, but it seems like a worthwhile inclusion for the future 👍

nblumhardt changed the title from "Question: an analog of Positioned from Sprache?" to "Annotate tokens with their positions from the source text" on Jun 17, 2024
@Xevion

Xevion commented Feb 2, 2025

Really great thread so far, although I was turned around for so long trying to figure out how to use the code above. That said, I got it working, and I think I've found issues / solutions, with tradeoffs, of course.

> In the second case it's also possible that the remainder token list will be empty, which would mean "matched until end of stream".

This isn't hard to solve (inner.Location.Last()), but after implementing it and adding some test cases for my spans, I got errors due to mismatched whitespace within the matched/expected spans. As it turns out, @nblumhardt's suggested method isn't ideal if you're looking for precise positioning data of your parser within the root source string.

I'll explain. My parser takes something like this:

prompt my_prompt_identifier {
      "choice_a" {}
      "choice_b" {}
      "choice_c" {}
}

Looking at the spans WithLocation gave me, you would expect each branch of the prompt to span 13 characters, from the first " to the }. But only the choice_c branch has a span of 13 characters; the other two have spans of 21 characters.

The internal span they acquire looks like this (whitespace replaced with Unicode circle bullets):

"choice_a"•{}
••••••

This is not ideal, especially if you're trying to build something of a compiler or LSP server that needs character-by-character precision. I want the ability to highlight sections of my parsed elements with precision. Any whitespace, comments or otherwise ignored content located after your parser's used tokens would be included in the span!
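To make the arithmetic concrete, here is a small self-contained sketch that does not use Superpower at all. Tok, SpanDemo and the whitespace-splitting Tokenize are hypothetical stand-ins (a real tokenizer would emit { and } separately), and the input assumes Windows line endings, which is one way to arrive at the 21 reported above. It compares an end taken from the start of the next unconsumed token with an end taken from the end of the last consumed token.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a token: absolute start offset plus length.
public readonly record struct Tok(int Start, int Length)
{
    public int End => Start + Length; // exclusive end offset
}

public static class SpanDemo
{
    // Assumed input: the prompt example above, with CRLF line endings.
    public const string Source =
        "prompt my_prompt_identifier {\r\n" +
        "      \"choice_a\" {}\r\n" +
        "      \"choice_b\" {}\r\n" +
        "      \"choice_c\" {}\r\n" +
        "}";

    // Splits on whitespace only; enough for this input, not a real tokenizer.
    public static List<Tok> Tokenize(string source)
    {
        var tokens = new List<Tok>();
        var i = 0;
        while (i < source.Length)
        {
            if (char.IsWhiteSpace(source[i])) { i++; continue; }
            var start = i;
            while (i < source.Length && !char.IsWhiteSpace(source[i])) i++;
            tokens.Add(new Tok(start, i - start));
        }
        return tokens;
    }

    // End = start of the first unconsumed token, so trailing whitespace is included.
    // Requires that at least one token follows the consumed ones.
    public static int LengthUntilNext(IReadOnlyList<Tok> tokens, int first, int count) =>
        tokens[first + count].Start - tokens[first].Start;

    // End = end of the last consumed token, so trailing whitespace is excluded.
    public static int LengthToLastConsumed(IReadOnlyList<Tok> tokens, int first, int count) =>
        tokens[first + count - 1].End - tokens[first].Start;
}

public static class Program
{
    public static void Main()
    {
        var tokens = SpanDemo.Tokenize(SpanDemo.Source);
        // The choice_a branch is tokens 3 and 4: "choice_a" and {}.
        Console.WriteLine(SpanDemo.LengthUntilNext(tokens, 3, 2));      // prints 21
        Console.WriteLine(SpanDemo.LengthToLastConsumed(tokens, 3, 2)); // prints 13
    }
}
```

The eight extra characters in the first number are the line break and the indentation in front of "choice_b".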

Here's my solution that avoids this:

public readonly record struct PositionSpan(int Start = -1, int End = -1) {
    public bool HasValue => Start >= 0 && End >= 0;

    /// <summary>
    /// Apply the position span to acquire a substring of the source string.
    /// </summary>
    public string Substring(string source) {
        if (!HasValue) throw new InvalidOperationException("PositionSpan must have a value to be used.");
        return source.Substring(Start, End - Start);
    }
}

public interface ILocated {
    PositionSpan Span { set; }
}

public static class ParserExtensions {
    public static TokenListParser<TKind, T> WithLocation<TKind, T>(this TokenListParser<TKind, T> parser)
        where T : ILocated {
        return i => {
            var inner = parser(i);
            if (!inner.HasValue) return inner;


            var last = inner.Remainder.Any()
                // inner.Location contains both tokens used and the remainder, so get the last 'used' token
                ? inner.Location.ElementAt(inner.Location.Count() - inner.Remainder.Count() - 1)
                // inner.Location is 100% used, just grab the last one
                : inner.Location.Last();

            inner.Value.Span = new PositionSpan(inner.Location.First().Position.Absolute,
                last.Position.Absolute + last.Span.Length);

            return inner;
        };
    }
}

You'll notice that I abandoned TextSpan. This is because I don't know of any way to access the full source string and recalculate the necessary line/column data here. If there's a way, maybe someone would like to add to my solution or show me, but the benefits of continued use of TextSpan aren't key to my implementation.
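If the caller still has the original source string, line and column can be recomputed from an absolute offset after parsing. The helper below is a minimal sketch, not Superpower's own logic: SourcePositions and LineAndColumn are hypothetical names, and the 1-based line/column convention and the choice to treat only '\n' as a line break are assumptions.

```csharp
using System;

public static class SourcePositions
{
    // Returns the 1-based (line, column) of a 0-based absolute offset into source.
    // Only '\n' starts a new line; a '\r' before it is counted as an ordinary column.
    public static (int Line, int Column) LineAndColumn(string source, int absolute)
    {
        if (source is null) throw new ArgumentNullException(nameof(source));
        if (absolute < 0 || absolute > source.Length)
            throw new ArgumentOutOfRangeException(nameof(absolute));

        int line = 1, column = 1;
        for (var i = 0; i < absolute; i++)
        {
            if (source[i] == '\n') { line++; column = 1; }
            else column++;
        }
        return (line, column);
    }
}
```

For example, SourcePositions.LineAndColumn("ab\ncd", 3) returns (2, 1), the position of the c. The scan is linear in the offset, so a tool that looks up many positions would probably precompute the line start offsets once and search those instead.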

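Note: as written, PositionSpan is start-inclusive and end-exclusive. End is the index just past the last matched character, which is what Substring relies on when it takes End - Start as the length.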

For completeness, here is my implementation of the TextParser extension, too:

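    public static TextParser<T> WithLocation<T>(this TextParser<T> parser)
        where T : ILocated {
        return i => {
            var inner = parser(i);
            if (!inner.HasValue) return inner;
            // inner.Location.Length covers the rest of the input rather than the match
            // (the issue reported earlier in this thread), so the end is taken from
            // the position where the remainder begins.
            inner.Value.Span = new PositionSpan(
                inner.Location.Position.Absolute, inner.Remainder.Position.Absolute
            );
            return inner;
        };
    }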

Lastly, because this thread didn't include a good example of how WithLocation is actually used, here's one for the TokenListParser extension, one that is actually kind of complicated:

public static readonly TokenListParser<DialogueToken, IPureNode> PromptNodeParser =
    (from promptKeyword in Token.EqualTo(DialogueToken.PromptNodeKeyword)
        from promptIdentifier in Token.EqualTo(DialogueToken.Identifier)
        from openBracket in Token.EqualTo(DialogueToken.OpenBracket)
        from branches in (
            // Inner range variables are named differently from the outer ones so they cannot clash.
            from branchIdentifier in StringParser
            from branchOpen in Token.EqualTo(DialogueToken.OpenBracket)
            from innerNodes in NodeParser
            from branchClose in Token.EqualTo(DialogueToken.CloseBracket)
            select new PurePromptBranch(branchIdentifier, innerNodes)
        ).WithLocation().Many()
        from closeBracket in Token.EqualTo(DialogueToken.CloseBracket)
        select new PurePromptNode(promptIdentifier.ToStringValue(), branches) as IPureNode).WithLocation();

IPureNode inherits from ILocated, and the PurePromptBranch record struct implements it as well. Any TokenListParser whose T is such a type (explicitly or, as in the anonymous inner branch parser, implicitly) can use .WithLocation().
