Commit 82e9e2c

Updated RFC with the current prototype design of the traits
1 parent 33b9490 commit 82e9e2c

1 file changed: +80 -57 lines changed

text/0000-string-patterns.md

@@ -24,9 +24,9 @@ This presents a couple of issues:
2424

2525
- The API is inconsistent.
2626
- The API duplicates similar operations on different types. (`contains` vs `contains_char`)
27-
- The API does not provide all operations for all types. (No `rsplit` for `&str` patterns)
27+
- The API does not provide all operations for all types. (For example, no `rsplit` for `&str` patterns)
2828
- The API is not extensible, e.g. to allow splitting at regex matches.
29-
- The API offers no way to statically decide between different basic search algorithms
29+
- The API offers no way to explicitly decide between different search algorithms
3030
for the same pattern, for example to use Boyer-Moore string searching.
3131

3232
At the moment, the full set of relevant string methods roughly looks like this:
@@ -79,24 +79,24 @@ First, new traits will be added to the `str` module in the std library:
7979

8080
```rust
8181
trait Pattern<'a> {
82-
type MatcherImpl: Matcher<'a>;
82+
type Searcher: Searcher<'a>;
83+
fn into_searcher(self, haystack: &'a str) -> Self::Searcher;
8384

84-
fn into_matcher(self, haystack: &'a str) -> Self::MatcherImpl;
85-
86-
// Can be implemented to optimize the "find only" case.
87-
fn is_contained_in(self, haystack: &'a str) -> bool {
88-
self.into_matcher(s).next_match().is_some()
89-
}
85+
fn is_contained_in(self, haystack: &'a str) -> bool { /* default*/ }
86+
fn match_starts_at(self, haystack: &'a str, idx: usize) -> bool { /* default*/ }
87+
fn match_ends_at(self, haystack: &'a str, idx: usize) -> bool
88+
where Self::Searcher: ReverseSearcher<'a> { /* default*/ }
9089
}
9190
```
9291

9392
A `Pattern` represents a builder for an associated type implementing a
94-
family of `Matcher` traits (see below), and will be implemented by all types that
93+
family of `Searcher` traits (see below), and will be implemented by all types that
9594
represent string patterns, which includes:
9695

97-
- `char` and `&str`
98-
- Everything implementing `CharEq`
99-
- Additional types like `&Regex` or `Ascii`
96+
- `&str`
97+
- `char`, and everything else implementing `CharEq`
98+
- Third party types like `&Regex` or `Ascii`
99+
- Alternative algorithm wrappers like `struct BoyerMoore(&str)`
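To make the intended uniformity concrete, here is a small sketch using string methods as they exist in the current stable standard library. This is an assumption about today's API rather than something this RFC text guarantees, and the helper names `first_matches` and `fields_from_right` are purely illustrative:

```rust
/// Returns the byte offset of the first match for three kinds of pattern,
/// all passed to the same `find` method.
fn first_matches(haystack: &str) -> (Option<usize>, Option<usize>, Option<usize>) {
    let by_char = haystack.find('b'); // `char` pattern
    let by_str = haystack.find("bar"); // `&str` pattern
    let by_pred = haystack.find(|c: char| c.is_ascii_digit()); // predicate pattern
    (by_char, by_str, by_pred)
}

/// Splits at a `&str` separator starting from the right, the operation the
/// motivation lists as missing for `&str` patterns.
fn fields_from_right(haystack: &str) -> Vec<&str> {
    haystack.rsplit("::").collect()
}
```

Each call site picks the pattern type, while the method name stays the same.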
100100

101101
```rust
102102
impl<'a> Pattern<'a> for char { /* ... */ }
@@ -112,51 +112,62 @@ The lifetime parameter on `Pattern` exists in order to allow threading the lifet
112112
of the haystack (the string to be searched through) through the API, and is a workaround
113113
for not having associated higher kinded types yet.
114114

115-
Consumers of this API can then call `into_matcher()` on the pattern to convert it into
116-
a type implementing a family of `Matcher` traits:
115+
Consumers of this API can then call `into_searcher()` on the pattern to convert it into
116+
a type implementing a family of `Searcher` traits:
117117

118118
```rust
119-
unsafe trait Matcher<'a> {
120-
fn haystack(&self) -> &'a str
121-
fn next_match(&mut self) -> Option<(uint, uint)>;
119+
pub enum SearchStep {
120+
Match(usize, usize),
121+
Reject(usize, usize),
122+
Done
122123
}
124+
pub unsafe trait Searcher<'a> {
125+
fn haystack(&self) -> &'a str;
126+
fn next(&mut self) -> SearchStep;
123127

124-
unsafe trait ReverseMatcher<'a>: Matcher<'a> {
125-
fn next_match_back(&mut self) -> Option<(uint, uint)>;
128+
fn next_match(&mut self) -> Option<(usize, usize)> { /* default*/ }
129+
fn next_reject(&mut self) -> Option<(usize, usize)> { /* default*/ }
126130
}
131+
pub unsafe trait ReverseSearcher<'a>: Searcher<'a> {
132+
fn next_back(&mut self) -> SearchStep;
127133

128-
trait DoubleEndedMatcher<'a>: ReverseMatcher<'a> {}
134+
fn next_match_back(&mut self) -> Option<(usize, usize)> { /* default*/ }
135+
fn next_reject_back(&mut self) -> Option<(usize, usize)> { /* default*/ }
136+
}
137+
pub trait DoubleEndedSearcher<'a>: ReverseSearcher<'a> {}
129138
```
130139

131-
The basic idea of a `Matcher` is to expose a `Iterator`-like interface for
132-
iterating through all matches of a pattern in the given haystack.
140+
The basic idea of a `Searcher` is to expose an interface for
141+
iterating through all connected string fragments of the haystack while classifying them as either a match or a reject.
133142

134-
Similar to iterators, depending on the concrete implementation a matcher can have
143+
This happens in the form of the returned enum value. A `Match` needs to contain the start and end indices of a complete non-overlapping match, while a `Reject` may be emitted for arbitrary non-overlapping rejected parts of the string, as long as the start and end indices lie on valid utf8 boundaries.
144+
145+
Similar to iterators, depending on the concrete implementation a searcher can have
135146
additional capabilities that build on each other, which is why they will be
136147
defined in terms of a three-tier hierarchy:
137148

138-
- `Matcher<'a>` is the basic trait that all matchers need to implement.
139-
It contains a `next_match()` method that returns the `start` and `end` indices of
140-
the next non-overlapping match in the haystack, with the search beginning at the front
149+
- `Searcher<'a>` is the basic trait that all searchers need to implement.
150+
It contains a `next()` method that returns the `start` and `end` indices of
151+
the next match or reject in the haystack, with the search beginning at the front
141152
(left) of the string. It also contains a `haystack()` getter for returning the
142153
actual haystack, which is the source of the `'a` lifetime on the hierarchy.
143154
The reason for this getter being made part of the trait is twofold:
144-
- Every matcher needs to store some reference to the haystack anyway.
155+
- Every searcher needs to store some reference to the haystack anyway.
145156
- Users of this trait will need access to the haystack in order
146157
for the individual match results to be useful.
147-
- `ReverseMatcher<'a>` adds an `next_match_back` method, for also allowing to efficiently
148-
search for matches in reverse (starting from the right).
158+
- `ReverseSearcher<'a>` adds a `next_back()` method, which also allows searching
159+
efficiently in reverse (starting from the right).
149160
However, the results are not required to be equal to the results of
150-
`next_match` in reverse, (as would be the case for the `DoubleEndedIterator` trait)
151-
as that can not be efficiently guaranteed for all matchers. (For an example, see further below)
152-
- Instead `DoubleEndedMatcher<'a>` is provided as an marker trait for expressing
153-
that guarantee - If a matcher implements this trait, all results found from the
161+
`next()` in reverse (as would be the case for the `DoubleEndedIterator` trait),
162+
because that cannot be efficiently guaranteed for all searchers. (For an example, see further below.)
163+
- Instead, `DoubleEndedSearcher<'a>` is provided as a marker trait for expressing
164+
that guarantee: if a searcher implements this trait, all results found from the
154165
left need to be equal to all results found from the right in reverse order.
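The hierarchy above can be sketched with a toy searcher for a single `char`. This is a minimal illustration, not the std implementation: the enum and traits are local stand-ins that mirror the declarations in this RFC so the block compiles on its own, and `CharSearcher` with its `front`/`back` fields is a hypothetical name and layout.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum SearchStep {
    Match(usize, usize),
    Reject(usize, usize),
    Done,
}

pub unsafe trait Searcher<'a> {
    fn haystack(&self) -> &'a str;
    fn next(&mut self) -> SearchStep;

    /// Default method: skip rejects until a match or the end is reached.
    fn next_match(&mut self) -> Option<(usize, usize)> {
        loop {
            match self.next() {
                SearchStep::Match(a, b) => return Some((a, b)),
                SearchStep::Done => return None,
                SearchStep::Reject(..) => {}
            }
        }
    }
}

pub unsafe trait ReverseSearcher<'a>: Searcher<'a> {
    fn next_back(&mut self) -> SearchStep;
}

/// Hypothetical searcher for a single `char` needle.
pub struct CharSearcher<'a> {
    haystack: &'a str,
    needle: char,
    front: usize, // byte offset where the unsearched region starts
    back: usize,  // byte offset where the unsearched region ends
}

impl<'a> CharSearcher<'a> {
    pub fn new(haystack: &'a str, needle: char) -> Self {
        CharSearcher { haystack, needle, front: 0, back: haystack.len() }
    }

    fn classify(&self, start: usize, end: usize, c: char) -> SearchStep {
        if c == self.needle {
            SearchStep::Match(start, end)
        } else {
            SearchStep::Reject(start, end)
        }
    }
}

// SAFETY: offsets only ever advance by whole `char`s, so every index
// handed out lies on a UTF-8 boundary of the haystack.
unsafe impl<'a> Searcher<'a> for CharSearcher<'a> {
    fn haystack(&self) -> &'a str {
        self.haystack
    }

    fn next(&mut self) -> SearchStep {
        let first = self.haystack[self.front..self.back].chars().next();
        match first {
            Some(c) => {
                let start = self.front;
                self.front += c.len_utf8();
                self.classify(start, self.front, c)
            }
            None => SearchStep::Done,
        }
    }
}

// SAFETY: same argument as above, walking from the right.
unsafe impl<'a> ReverseSearcher<'a> for CharSearcher<'a> {
    fn next_back(&mut self) -> SearchStep {
        let last = self.haystack[self.front..self.back].chars().next_back();
        match last {
            Some(c) => {
                let end = self.back;
                self.back -= c.len_utf8();
                self.classify(self.back, end, c)
            }
            None => SearchStep::Done,
        }
    }
}
```

Because this toy searcher classifies one `char` at a time, walking from either end yields the same fragments, which is the property `DoubleEndedSearcher` is meant to express.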
155166

156167
As an important last detail, both
157-
`Matcher` and `ReverseMatcher` are marked as `unsafe` traits, even though the actual methods
168+
`Searcher` and `ReverseSearcher` are marked as `unsafe` traits, even though the actual methods
158169
aren't. This is because every implementation of these traits needs to ensure that all
159-
indices returned by `next_match` and `next_match_back` lay on valid utf8 boundaries
170+
indices returned by `next()` and `next_back()` lie on valid utf8 boundaries
160171
in the haystack.
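As a rough illustration of what the boundary guarantee buys a consumer, compare checked and unchecked slicing in today's standard library. The function names are hypothetical; only `str::get` and `str::get_unchecked` are real std methods.

```rust
/// Checked slicing: returns `None` if an index is out of range or
/// falls inside a multi-byte code point.
fn slice_checked(haystack: &str, start: usize, end: usize) -> Option<&str> {
    haystack.get(start..end)
}

/// Unchecked slicing, as a consumer could use if it trusts the searcher.
///
/// # Safety
/// Requires `start <= end <= haystack.len()`, with both indices on
/// UTF-8 boundaries. That is the contract the `unsafe` traits impose
/// on searcher implementations.
unsafe fn slice_trusted(haystack: &str, start: usize, end: usize) -> &str {
    unsafe { haystack.get_unchecked(start..end) }
}
```

In `"héllo"` the `é` occupies bytes 1 and 2, so `slice_checked("héllo", 1, 2)` yields `None`, while passing the same indices to `slice_trusted` would be undefined behavior.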
161172

162173
Without that guarantee, every single match returned by a matcher would need to be
@@ -171,6 +182,15 @@ Given that most implementations of these traits will likely
171182
live in the std library anyway, and are thoroughly tested, marking these traits `unsafe`
172183
doesn't seem like a huge burden to bear for good, optimizable performance.
173184

185+
### The role of the additional default methods
186+
187+
`Pattern`, `Searcher` and `ReverseSearcher` each offer a few additional
188+
default methods that give better optimization opportunities.
189+
190+
Most consumers of the pattern API will use them to more narrowly constrain
191+
how they are looking for a pattern, which, given an optimized implementation,
192+
should lead to mostly optimal code being generated.
193+
174194
### Example for the issue with double-ended searching
175195

176196
Let the haystack be the string `"fooaaaaabar"`, and let the pattern be the string `"aa"`.
@@ -190,10 +210,11 @@ be considered a different operation than "matching from the back".
190210

191211
### Why `(uint, uint)` instead of `&str`
192212

193-
It would be possible to define `next_match` and `next_match_back` to return an `&str`
194-
to the match instead of `(uint, uint)`.
213+
> Note: This section is a bit outdated now
195214
196-
A concrete matcher impl could then make use of unsafe code to construct such an slice cheaply,
215+
It would be possible to define `next` and `next_back` to return `&str`s instead of `(uint, uint)` tuples.
216+
217+
A concrete searcher impl could then make use of unsafe code to construct such a slice cheaply,
197218
and by its very nature it is guaranteed to lie on utf8 boundaries,
198219
which would also allow not marking the traits as unsafe.
199220

@@ -224,7 +245,7 @@ as the "simple" default design.
224245

225246
## New methods on `StrExt`
226247

227-
With the `Pattern` and `Matcher` traits defined and implemented, the actual `str`
248+
With the `Pattern` and `Searcher` traits defined and implemented, the actual `str`
228249
methods will be changed to make use of them:
229250

230251
```rust
@@ -245,17 +266,17 @@ pub trait StrExt for ?Sized {
245266

246267
fn starts_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>;
247268
fn ends_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>,
248-
P::MatcherImpl: ReverseMatcher<'a>;
269+
P::Searcher: ReverseSearcher<'a>;
249270

250271
fn trim_matches<'a, P>(&'a self, pat: P) -> &'a str where P: Pattern<'a>,
251-
P::MatcherImpl: ReverseMatcher<'a>;
272+
P::Searcher: DoubleEndedSearcher<'a>;
252273
fn trim_left_matches<'a, P>(&'a self, pat: P) -> &'a str where P: Pattern<'a>;
253274
fn trim_right_matches<'a, P>(&'a self, pat: P) -> &'a str where P: Pattern<'a>,
254-
P::MatcherImpl: ReverseMatcher<'a>;
275+
P::Searcher: ReverseSearcher<'a>;
255276

256277
fn find<'a, P>(&'a self, pat: P) -> Option<uint> where P: Pattern<'a>;
257278
fn rfind<'a, P>(&'a self, pat: P) -> Option<uint> where P: Pattern<'a>,
258-
P::MatcherImpl: ReverseMatcher<'a>;
279+
P::Searcher: ReverseSearcher<'a>;
259280

260281
// ...
261282
}
@@ -278,7 +299,7 @@ changed to uniformly use the new pattern API. The main differences are:
278299
to behave like a double-ended queue where you just pop elements from both sides.
279300

280301
_However_, all iterators will still implement `DoubleEndedIterator` if the underlying
281-
matcher implements `DoubleEndedMatcher`, to keep the ability to do things like `foo.split('a').rev()`.
302+
searcher implements `DoubleEndedSearcher`, to keep the ability to do things like `foo.split('a').rev()`.
282303

283304
## Transition and deprecation plans
284305

@@ -288,7 +309,7 @@ methods will still compile, or give deprecation warning.
288309
It would even be possible to generically implement `Pattern` for all `CharEq` types,
289310
making the transition more painless.
290311

291-
Long-term, post 1.0, it would be possible to define new sets of `Pattern` and `Matcher`
312+
Long-term, post 1.0, it would be possible to define new sets of `Pattern` and `Searcher`
292313
without a lifetime parameter by making use of higher kinded types in order to simplify the
293314
string APIs. Eg, instead of `fn starts_with<'a, P>(&'a self, pat: P) -> bool where P: Pattern<'a>;`
294315
you'd have `fn starts_with<P>(&self, pat: P) -> bool where P: Pattern;`.
@@ -298,30 +319,30 @@ forward to the old traits, which would roughly look like this:
298319

299320
```rust
300321
unsafe trait NewPattern {
301-
type MatcherImpl<'a> where MatcherImpl: NewMatcher;
322+
type Searcher<'a> where Searcher: NewSearcher;
302323

303-
fn into_matcher<'a>(self, s: &'a str) -> Self::MatcherImpl<'a>;
324+
fn into_matcher<'a>(self, s: &'a str) -> Self::Searcher<'a>;
304325
}
305326

306327
unsafe impl<'a, P> Pattern<'a> for P where P: NewPattern {
307-
type MatcherImpl = <Self as NewPattern>::MatcherImpl<'a>;
328+
type Searcher = <Self as NewPattern>::Searcher<'a>;
308329

309-
fn into_matcher(self, haystack: &'a str) -> Self::MatcherImpl {
330+
fn into_matcher(self, haystack: &'a str) -> Self::Searcher {
310331
<Self as NewPattern>::into_matcher(self, haystack)
311332
}
312333
}
313334

314-
unsafe trait NewMatcher for Self<'_> {
335+
unsafe trait NewSearcher for Self<'_> {
315336
fn haystack<'a>(self: &Self<'a>) -> &'a str;
316337
fn next_match<'a>(self: &mut Self<'a>) -> Option<(uint, uint)>;
317338
}
318339

319-
unsafe impl<'a, M> Matcher<'a> for M<'a> where M: NewMatcher {
340+
unsafe impl<'a, M> Searcher<'a> for M<'a> where M: NewSearcher {
320341
fn haystack(&self) -> &'a str {
321-
<M as NewMatcher>::haystack(self)
342+
<M as NewSearcher>::haystack(self)
322343
}
323344
fn next_match(&mut self) -> Option<(uint, uint)> {
324-
<M as NewMatcher>::next_match(self)
345+
<M as NewSearcher>::next_match(self)
325346
}
326347
}
327348
```
@@ -346,6 +367,8 @@ the `prelude` (which would be unneeded anyway).
346367

347368
# Alternatives
348369

370+
> Note: This section is not updated to the new naming scheme
371+
349372
In general:
350373

351374
- Keep status quo, with all issues listed at the beginning.
@@ -371,8 +394,8 @@ some negative trade-offs:
371394
for immediate results.
372395
- Extend `Pattern` into `Pattern` and `ReversePattern`, starting the forward-reverse split at the level of
373396
patterns directly. The two would still be in an inherits-from relationship like
374-
`Matcher` and `ReverseMatcher`, and be interchangeable if the later also implement `DoubleEndedMatcher`,
375-
but on the `str` API where clauses like `where P: Pattern<'a>, P::MatcherImpl: ReverseMatcher<'a>`
397+
`Searcher` and `ReverseSearcher`, and be interchangeable if the latter also implements `DoubleEndedSearcher`,
398+
but on the `str` API where clauses like `where P: Pattern<'a>, P::Searcher: ReverseSearcher<'a>`
376399
would turn into `where P: ReversePattern<'a>`.
377400

378401
Lastly, there are alternatives that don't seem very favorable, but are listed for completeness' sake:
@@ -400,7 +423,7 @@ Lastly, there are alternatives that don't seem very favorable, but are listed fo
400423
- Should the API split in regard to forward-reverse matching be as symmetrical as possible,
401424
or as minimal as possible?
402425
In the first case, iterators like `Matches` and `RMatches` could both implement `DoubleEndedIterator` if a
403-
`DoubleEndedMatcher` exists, in the latter only `Matches` would, with `RMatches` only providing the
426+
`DoubleEndedSearcher` exists, in the latter only `Matches` would, with `RMatches` only providing the
404427
minimum to support reverse operation.
405428
A ruling in favor of symmetry would also speak for the `ReversePattern` alternative.
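To see what is at stake in the forward-reverse split, the following sketch uses the `"fooaaaaabar"` / `"aa"` example from earlier in this RFC with the iterators available in the current stable standard library. That library is an assumption here, since this text predates it, and the helper names are illustrative only.

```rust
/// Start offsets of `"aa"` matches, searching from the left.
fn forward(haystack: &str) -> Vec<usize> {
    haystack.match_indices("aa").map(|(i, _)| i).collect()
}

/// Start offsets of `"aa"` matches, searching from the right.
fn backward(haystack: &str) -> Vec<usize> {
    haystack.rmatch_indices("aa").map(|(i, _)| i).collect()
}

/// A `char` pattern can be searched from both ends consistently, so the
/// forward iterator can simply be reversed.
fn char_matches_reversed(haystack: &str) -> Vec<usize> {
    haystack.match_indices('a').rev().map(|(i, _)| i).collect()
}
```

For `"fooaaaaabar"`, `forward` finds matches starting at 3 and 5 while `backward` finds 6 and 4, so the reverse search is not the forward result read backwards. Calling `.rev()` on `match_indices("aa")` is rejected at compile time for that reason.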
406429
