
Consider changing entry points for string creations, return types and their names #43


gkdn opened this issue Sep 9, 2022 · 8 comments



gkdn commented Sep 9, 2022

Currently, stringref is the main entry point for all string creation and the related high-level methods. On the other side, there is a bunch of specialized functions on the views. Once you have a view instance you cannot go back to stringref, so one needs to keep the stringref as the main reference to carry around. With that, every specialized operation requires calling a view-creation function to get the view and then calling the function on it, like get_codeunit.
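For concreteness, a minimal sketch of that dance in WebAssembly text format, using instruction names from the current proposal (exact signatures may differ as the proposal evolves):

    ;; Every specialized access first creates a view from the stringref.
    (func $char_at (param $s stringref) (param $i i32) (result i32)
      (stringview_wtf16.get_codeunit
        (string.as_wtf16 (local.get $s))  ;; stringref -> stringview_wtf16
        (local.get $i)))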

I found this setup quite odd and asked why we don't simply have two types, string_wtf8 and string_wtf16, with conversion functions between them. Jakob pointed out that it is designed this way so that we can keep/encourage a portable representation for strings.

When I changed my mental model to think of stringref as the portable abstraction, I still find the current style odd. One calls new_wtf16, new_wtf16_array, etc. and ends up with a stringref, for which you then need to create a view for wtf16. This is not very intuitive, and the portable representation leaks into a lot of places where one doesn't care about it.

It feels a lot more natural to me to have string_wtf8 and string_wtf16 as the entry points, with all the necessary APIs (e.g. string_wtf16.new, string_wtf16.encode, string_wtf16.eq, etc.), and to have an instruction like string_wtf16.as_portable that returns a string_portable when you need to cross the module boundary. And it seems to me that in this model, the engine could still perform the delayed encoding/decoding and other similar optimizations that can be done in the original design.
This could also help avoid other naming inconsistencies around these types.
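For illustration only, a hypothetical sketch of that shape; none of these instructions or types exist in the proposal, the names are just the ones suggested above:

    ;; Hypothetical: string_wtf16 as the entry point, no view creation needed.
    (func $same (param $a (ref null string_wtf16)) (param $b (ref null string_wtf16)) (result i32)
      (string_wtf16.eq (local.get $a) (local.get $b)))

    ;; Hypothetical: conversion happens only where the module boundary is crossed.
    (func $to_boundary (param $s (ref null string_wtf16)) (result (ref null string_portable))
      (string_wtf16.as_portable (local.get $s)))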

Anyway I wanted to move the discussion here and hear your thoughts.


dcodeIO commented Sep 9, 2022

I agree that the views are odd, as of this comment, and would prefer a structure akin to this suggestion as well. Could this perhaps be modeled as

      stringref
     /         \
string8ref  string16ref

so that a stringXref can be trivially passed over a boundary, implicitly becoming a portable stringref if indicated by the receiver? I guess then there's an argument to be had about string8ref <-> string16ref as well.


wingo commented Sep 12, 2022

If I understand correctly, in Java you really want to be treating strings as WTF-16 all the time, and anything else is a distraction. But there are other languages: Scheme, ML, or Python, for example, won't care what encoding is used internally as long as there is by-codepoint access; C and Rust will want WTF-8, but usually just as a stopping point on the way to linear memory; and so on. So what good would splitting into string_wtf8 and string_wtf16 do for e.g. Python? Not much, I would think.

In general I think that given any specific source language, there are parts of stringrefs that are not interesting. But there is common functionality, too. Should that common functionality be duplicated for each view? Probably not, right?

All that said, @gkdn, I think you highlight an issue that is also well described by @dcodeIO here: #12 (comment) -- either you decide to use e.g. stringview_wtf16 all the time internally and use strings only on the boundary, or you obtain wtf16 views as needed when you want to access string contents. I think the former pattern makes more sense for Java-like languages (including JS, Dart, Kotlin, C#, etc.). But there is a missing piece: the ability to go from a string view back to a string. I propose to add this, wdyt: #44
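A rough sketch of the two patterns (assuming, for the first one, the view-to-string conversion proposed in #44 for the way back at the boundary):

    ;; Pattern A: stringview_wtf16 everywhere internally, stringref only at the boundary.
    (func $on_entry (param $s stringref) (result stringview_wtf16)
      (string.as_wtf16 (local.get $s)))  ;; convert once on entry
    ;; ...internal code passes stringview_wtf16 around; #44 would add the way back...

    ;; Pattern B: stringref everywhere, a view obtained at each use site.
    (func $length (param $s stringref) (result i32)
      (stringview_wtf16.length (string.as_wtf16 (local.get $s))))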

jakobkummerow commented

To clarify, @gkdn and @dcodeIO, is this mostly an aesthetic/stylistic issue ("I find it odd"), or do you have specific concerns about efficiency of the current design?
While we can certainly take majority preferences on style questions into account, the latter would be significantly more important, I think.

For the record:

  • It is reasonable to expect that string.as_wtf16 will be a no-op in browser engines (and potentially other engines too, of course), in particular when the string was created from wtf16 data. So while it does cost an instruction in the Wasm module, it won't cause performance overhead (see the sketch after this list).
  • The current design (among other goals) aims to support the situation where a mostly-wtf16-based application fetches utf8-encoded strings from external sources (e.g. downloading a text file/stream), and then performs operations on them that may or may not require encoding changes. This scenario illustrates that 8-bit and 16-bit encodings aren't as strictly separated as it may seem at first.
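To make the first point concrete, a sketch (instruction names per the proposal; memory immediates and exact signatures elided):

    ;; The string is created from wtf16 data, so the later view creation
    ;; is expected to compile away in a wtf16-based engine.
    (func $f (param $ptr i32) (param $len i32) (result i32)
      (local $s stringref)
      (local.set $s (string.new_wtf16 (local.get $ptr) (local.get $len)))
      ;; one extra instruction in the module, but expected to be a runtime no-op:
      (stringview_wtf16.length (string.as_wtf16 (local.get $s))))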


gkdn commented Sep 13, 2022

Yes, it is mostly about style and understanding the model (i.e. I find it odd). There will be some extra instructions, though they probably won't add much to code size in the end. The underlying problem is that the "view" concept leaks deeply into the resulting API, where it seems neither necessary nor very natural to the consumer.

re. a wtf16-based app dealing with utf8 from external source:

IIUC, there are 2 approaches to this problem:

  1. The compiler will model all strings as stringref and delay any calls to as_wtf16. This way the engine can postpone the conversion until the view call is made. The engine still needs to be smart, since there will be multiple such calls and it needs to manage around them. And there is still the risk of somebody calling as_wtf8 at any point.
  2. The compiler will model strings as the view, stringview_wtf16, and the engine needs to be smart about when to trigger the potential conversion.

The optimization story around both scenarios looks similarly complicated to me, which is why it is not clear to me how the currently proposed modeling helps either the engine or the compilers (e.g. for the wtf16-heavy app consuming utf8).

In an ideal world (since I'm coming from an OO perspective), there would be an inheritance-like relation between the two types: the factory methods would go on the relevant subtypes and the common methods would go on the parent type. I proposed as_portable on the assumption that you are trying to avoid such a relation between these types.


dcodeIO commented Sep 13, 2022

To me it seems that the string.* and stringview_xy.* types and operations introduce asymmetries as a result of the preference to think in terms of views - but they don't necessarily need to. One such asymmetry is indicated by the observation that eq, concat etc. should also be usable with views, since conceptually the views are just specializations of a generic string type. This led me to the hierarchy above, which likely cannot exist in the type system, but is what conceptually exists. Similarly, it has been mentioned that the views should be disallowed at the boundary, which leads to more asymmetry: if a function is called both internally and externally, duplication or a wrapper will be required to provide the different signatures, which again seems avoidable.

Ultimately, it doesn't really matter to me how things are named, views or not, but I'd prefer if we could avoid such asymmetries that lead to more back and forth between refs and views than necessary, as I expect the final result to be less efficient and less compact than what we'd have achieved if we had thought in terms of specialized string types from the start.

jakobkummerow commented

@gkdn: To give another shot at explaining the model, follow this train of thought:

  1. Given that different string encodings exist in the world (and aren't going to go away any time soon), and are deeply baked into programming models, it's impossible to design a Wasm string system that gets by without any conversions in all circumstances.
  2. The way to minimize the non-zero number of unavoidable conversions is to give engines maximum freedom to perform them lazily. That's why the main stringref type is encoding agnostic. Aside from creating opportunities for avoiding conversions, this choice also creates implementation freedom: engines can choose how to implement strings internally, and they can revise their choices over time.
  3. The way to make those conversions that do have to happen as predictable and controllable as possible is to make them explicit. That's why views are created explicitly, and we don't just have get_nth_utf8_byte and get_nth_wtf16_codeunit and get_nth_codepoint instructions that operate on stringref.

Further illustrations/examples on these points:

  1. One effect causing bake-in is that different encodings have different complexities for various operations: e.g. wtf16/ucs2 allows O(1) access to the n-th code unit, utf8/wtf8 allows O(1) access to the n-th byte but only O(n) access to the n-th codepoint. This impacts how you'd want to implement various algorithms.
  2. Imagine one component of a system creating a string from utf8 input, and another component of a system writing it to utf8-encoded output. These components don't necessarily know about each other (one could be a library, or written by another member of a large team, etc), and they may well be written in a wtf16-based source language. The current design allows engines to avoid any conversions in this particular case. (Of course, if a string is created from utf8 data and then indexed by wtf16 code unit index, then a conversion is unavoidable, no matter how stringrefs are designed.)
    As an example of implementation freedom: a wtf16-centric browser engine could decide to implement native support for wtf8 strings as well, if/when engineering resources and code complexity concerns permit; or it could experiment with the tradeoffs between re-encodings and storing "breadcrumbs" or other shortcutting techniques.
  3. We wouldn't want to have a situation where code wants to iterate over a string by wtf16 code unit index, but the string happens to have a utf8 representation internally, and each get_nth_codeunit instruction in the loop has to perform O(n) work. We could demand/hope that engines be sufficiently smart to recognize this pattern and optimize it; having a way to create a wtf16 view outside the loop and then iterate over that is a way to make this expectation explicit and provide engines with a nice clear hint (see the sketch after this list).
    Regarding predictability, with point (2) in mind, there'll always be some unpredictability: a module could run on an engine where it gets lucky. But the existence of faster and slower engines is a fact of life for pretty much every Wasm feature, and with point (1) in mind the goal is not to guarantee zero-cost anyway, it's to minimize the unavoidable non-zero cost of doing business.
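A sketch of point (3): the view is created once outside the loop, so each iteration only does an O(1) access (instruction names per the proposal):

    ;; Sum all wtf16 code units; the conversion hint sits outside the loop.
    (func $sum_codeunits (param $s stringref) (result i32)
      (local $view stringview_wtf16)
      (local $i i32)
      (local $n i32)
      (local $sum i32)
      (local.set $view (string.as_wtf16 (local.get $s)))  ;; once, not per iteration
      (local.set $n (stringview_wtf16.length (local.get $view)))
      (block $done
        (loop $next
          (br_if $done (i32.ge_u (local.get $i) (local.get $n)))
          (local.set $sum
            (i32.add (local.get $sum)
              (stringview_wtf16.get_codeunit (local.get $view) (local.get $i))))
          (local.set $i (i32.add (local.get $i) (i32.const 1)))
          (br $next)))
      (local.get $sum))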

I'm not saying that things couldn't be designed differently, especially when looking at only one of these points at a time. But it does seem difficult to come up with a concept that doesn't regress at least one of the benefits that the current design provides.

As a (counter-)example: the idea to have string_utf8 <: portable_string and string_wtf16 <: portable_string and no views would certainly address example (2) above (avoiding back-and-forth conversions). But to be efficient, it would force all engines into one particular implementation corner (which is to have full native support for both utf8 and wtf16 strings, so that upcasts can be explicit). Besides, code that wants to iterate over a string's codepoints (which is probably one of the most common forms of iteration?) would have to bring its own (utf8 or wtf16) decoder, because there's no (obvious) room for a stringview_iter in this concept.
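For contrast, a sketch of the codepoint iteration the current design's iter view enables (assuming, per the proposal, that stringview_iter.next returns the next codepoint or -1 at the end of the string):

    ;; Count codepoints without the module bringing its own decoder.
    (func $count_codepoints (param $s stringref) (result i32)
      (local $it stringview_iter)
      (local $cp i32)
      (local $count i32)
      (local.set $it (string.as_iter (local.get $s)))
      (block $done
        (loop $next
          (local.set $cp (stringview_iter.next (local.get $it)))
          (br_if $done (i32.eq (local.get $cp) (i32.const -1)))
          (local.set $count (i32.add (local.get $count) (i32.const 1)))
          (br $next)))
      (local.get $count))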


gkdn commented Sep 13, 2022

To make sure we are on the same page and I have the correct assumptions:

If the module uses a stringview to represent its strings, it doesn't matter whether there is an explicit "view" concept or whether they are called views or not, right? If the encoding matches, it will work directly; if it doesn't, the engine needs to delay the conversion (or other optimization) until it really needs it.

If the module uses stringref to represent its strings, then it gives the engine a potential hint to make the conversion (or create breadcrumbs, etc.).

And I'm thinking that you are thinking about the latter scenario.
However, I argue that even in this scenario, it might be too naive an approach for the engine to take that as the hint; it very likely needs to be smarter than that.
For example, if the app is accessing code units of this stringref at random points in time, there will be instructions to create the view and access the code unit scattered throughout. So it doesn't seem like much of a different scenario from the app creating the view eagerly.

And given that both kinds of modules can exist (i.e. modules with stringrefs or with eager stringviews), I'm not sure the engine is left with many optimization options except delaying the conversion as long as it can, until an API is actually called that requires the conversion.

Anyway, this is mostly speculation based on my limited perspective on the topic. If you think that the engines get meaningful opportunities, you are the expert and I trust your judgement :)

jakobkummerow commented

Yes, the benefits I described mostly apply when module producers choose to represent strings as stringref. If they choose to represent strings as stringview_wtf16, then (especially if engines interpret view creations as conversion hints) that probably leads to about as many eager conversions as a design without views would. So my intuition is that the former approach (using stringref as default, creating views only where necessary) is preferable.

But that's just a guess: data to be gained from experimentation may prove it right or wrong. In particular, I don't know what fraction of strings in a typical application can avoid conversions. Also, as mentioned before, in V8 (and probably other browser engines too), stringref and stringview_wtf16 are the same thing under the hood anyway and conversion from utf8 always happens eagerly when creating strings, so thinking about lazy/delayed conversions is just guesswork to account for alternative engine implementations and/or future optimizations and/or utf8-based source languages running in browsers and interacting with JS/DOM there.

This is one aspect of the general theme that the proposed stringref design aims to support much more than the Java-in-the-browser use case; if we only cared about that case, we could indeed simplify the design by a lot. An incremental approach would be very interesting, but something like the ref/view split in particular is difficult to retrofit if we wanted an MVP version where it doesn't exist yet. We could conceivably drop the _wtf8 and _iter views for now (leaving only _wtf16), but that'd make it even harder to explain why the split exists at all.
