-
Notifications
You must be signed in to change notification settings - Fork 355
RFC 3986 conformant URI encoding policy #517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@arturobernalg It makes no sense to change formatting of the fragment component and ignore all other components. |
286850f to
8a46343
Compare
@ok2c please take another look. |
|
@arturobernalg What about |
5517ff9 to
41ba051
Compare
@ok2c here's a consolidated fix ensuring consistent RFC 3986–compliant encoding for userinfo, reg-name, path segments, query, and fragment. |
ok2c
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg Do you mind investing some more effort into it and adding a new attribute encodingPolicy as an enum of ALL_RESERVED|RFC_3986 or similar and encoding components based on that policy?
| URIC.or(UNRESERVED); | ||
| } | ||
|
|
||
| static final BitSet FRAGMENT_SAFE = new BitSet(256); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg FRAGMENT_SAFE does not seem to be used anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed
@ok2c Done. Please do another pas |
ok2c
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg Do you mind keeping ALL_RESERVED mode for compatibility with the behavior of previous versions?
| if (idx != -1) { | ||
| PercentCodec.encode(sb, this.userInfo.substring(0, idx), this.charset); | ||
| PercentCodec.encode(sb, this.userInfo.substring(0, idx), this.charset, | ||
| encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.URIC, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg Should not it be USERINFO, not URIC? Same below.
| } else { | ||
| sb.append(PercentCodec.encode(this.host, this.charset)); | ||
| PercentCodec.encode(sb, this.host, this.charset, | ||
| encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.URIC, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg Should not it be REG_NAME, not URIC?
| sb.append("?"); | ||
| PercentCodec.encode(sb, this.query, this.charset, PercentCodec.URIC, false); | ||
| PercentCodec.encode(sb, this.query, this.charset, | ||
| encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.FRAGMENT, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg Careful there. Here we are encoding a complete query that may use sub-delims. Even in ALL_RESERVED mode it should still be URIC.
I suppose this line should be
encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.URIC : PercentCodec.QUERY, false)
|
The initial motivation for HTTPCORE-778 was the need to put some With the above changes, if I read the diff correctly, we'll be able to use I understand that this probably opens a can of worms as it likely would require functionality similar to what is available for query parameters, but unlike query parameters in case of fragment this is only one of many ways how fragments are used. |
|
@peterhalicky no matter how hard we try we just cannot make everyone happy, can we? The only way of accommodating your request would be to add support for |
Yes, I agree, I'd be quite hesitant to implement this myself. For my use case I can get away with restricting the keys and values to not containing |
@peterhalicky The whole point of representing the query component as |
@ok2c The fragment component often contains all kinds of content. Sometimes it is A possible solution could be to let the client build the fragment by parts, specifying how much encoding should be done for each part. Something like |
c9ee307 to
bca8af6
Compare
|
@peterhalicky Those folks who need this stuff are welcome to build their own builders. |
| private Charset charset; | ||
| private String fragment; | ||
| private String encodedFragment; | ||
| private EncodingPolicy encodingPolicy = EncodingPolicy.RFC_3986; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg Should not it be EncodingPolicy.ALL_RESERVED by default?
6c63fce to
c685037
Compare
Previously, URIBuilder relied on partial or inconsistent character sets for various components, causing valid sub-delims or other characters to be unnecessarily percent-encoded or left unencoded in the fragment, query, userinfo, reg-name, and path segments. This patch introduces dedicated BitSets for each URI component (userinfo, host/reg-name, path segments, query, and fragment) and updates URIBuilder to use them. As a result, characters like ':', '@', '/', and '?' remain unencoded in the fragment and query where allowed by RFC 3986, while certain sub-delimiters in the path and host are now percent-encoded for strictness. This ensures consistent, RFC 3986–compliant encoding across all URI components.
88acd57 to
cb71a7e
Compare
| sb.append("#"); | ||
| PercentCodec.encode(sb, this.fragment, this.charset, PercentCodec.URIC, false); | ||
| PercentCodec.encode(sb, this.fragment, this.charset, | ||
| encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.FRAGMENT, false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg We should use URIC here for consistency with the latest changes in 5.3.x
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
03616b7 to
1daec67
Compare
| USERINFO.set(':'); | ||
| REG_NAME.or(UNRESERVED); | ||
| REG_NAME.or(SUB_DELIMS); | ||
| REG_NAME.clear('!'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg Why clearing '!'? The reg-name is defimed as *( unreserved / pct-encoded / sub-delims ). The '!' should be permitted or am I missing something?
| PATH_SEGMENT.clear('('); | ||
| PATH_SEGMENT.clear(')'); | ||
| PATH_SEGMENT.clear('&'); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg I cannot find anything in the RFC stating that. What am I missing? What a super strict mode we have ALL_RESERVED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg I cannot find anything in the RFC stating that. What am I missing? What a super strict mode we have
ALL_RESERVED
@olegk The test expects (, ), & to remain percent-encoded in path segments (e.g., blah-%28%20-blah-%20%26%20-blah-%20%29-blah/), but without the PATH_SEGMENT.clear('('), clear(')'), clear('&') in PercentCodec.java, it outputs blah-(%20-blah-%20&%20-blah-%20)-blah/. Reintroducing the clear() calls fixes it, but I’m wondering if this aligns with our intended behavior or if we should adjust formatPath to handle encoding differently based on encodingPolicy.
I noticed the older formatPath used PercentCodec.encode(buf, segment, charset, false) with UNRESERVED, which worked, but the new version with encodingPolicy support might need tweaking. Could you take a look and advise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg You have overlooked static formatQuery and formatPath methods. They should also take the encoding policy into account. Apply this patch and everything should work
===================================================================
diff --git a/httpcore5/src/main/java/org/apache/hc/core5/net/PercentCodec.java b/httpcore5/src/main/java/org/apache/hc/core5/net/PercentCodec.java
--- a/httpcore5/src/main/java/org/apache/hc/core5/net/PercentCodec.java (revision 53477ecc93d331f8348a6748a918d86b25666471)
+++ b/httpcore5/src/main/java/org/apache/hc/core5/net/PercentCodec.java (date 1742025384945)
@@ -137,10 +137,6 @@
FRAGMENT.or(PCHAR);
FRAGMENT.set('/');
FRAGMENT.set('?');
- // Some sub-delims remain encoded (RFC 3986 allows them unencoded, but we choose to be strict).
- PATH_SEGMENT.clear('(');
- PATH_SEGMENT.clear(')');
- PATH_SEGMENT.clear('&');
}
private static final int RADIX = 16;
Index: httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java b/httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java
--- a/httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java (revision 53477ecc93d331f8348a6748a918d86b25666471)
+++ b/httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java (date 1742026397263)
@@ -34,6 +34,7 @@
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
+import java.util.BitSet;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;
@@ -343,19 +344,25 @@
return list;
}
- static void formatPath(final StringBuilder buf, final Iterable<String> segments, final boolean rootless, final Charset charset) {
+ static void formatPath(final StringBuilder buf, final Iterable<String> segments, final boolean rootless,
+ final Charset charset, final BitSet safechars) {
int i = 0;
for (final String segment : segments) {
if (i > 0 || !rootless) {
buf.append(PATH_SEPARATOR);
}
- PercentCodec.encode(buf, segment, charset, PercentCodec.PATH_SEGMENT, false);
+ PercentCodec.encode(buf, segment, charset, safechars, false);
i++;
}
}
- static void formatQuery(final StringBuilder buf, final Iterable<? extends NameValuePair> params, final Charset charset,
- final boolean blankAsPlus) {
+ static void formatPath(final StringBuilder buf, final Iterable<String> segments, final boolean rootless,
+ final Charset charset) {
+ formatPath(buf, segments, rootless, charset, PercentCodec.UNRESERVED);
+ }
+
+ static void formatQuery(final StringBuilder buf, final Iterable<? extends NameValuePair> params,
+ final Charset charset, final BitSet safechars, final boolean blankAsPlus) {
int i = 0;
for (final NameValuePair parameter : params) {
if (i > 0) {
@@ -364,12 +371,17 @@
PercentCodec.encode(buf, parameter.getName(), charset, blankAsPlus);
if (parameter.getValue() != null) {
buf.append(PARAM_VALUE_SEPARATOR);
- PercentCodec.encode(buf, parameter.getValue(), charset, blankAsPlus);
+ PercentCodec.encode(buf, parameter.getValue(), charset, safechars, blankAsPlus);
}
i++;
}
}
+ static void formatQuery(final StringBuilder buf, final Iterable<? extends NameValuePair> params,
+ final Charset charset, final boolean blankAsPlus) {
+ formatQuery(buf, params, charset, PercentCodec.UNRESERVED, blankAsPlus);
+ }
+
/**
* Builds a {@link URI} instance.
*/
@@ -429,13 +441,15 @@
}
sb.append(this.encodedPath);
} else if (this.pathSegments != null) {
- formatPath(sb, this.pathSegments, !authoritySpecified && this.pathRootless, this.charset);
+ formatPath(sb, this.pathSegments, !authoritySpecified && this.pathRootless, this.charset,
+ encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.PATH_SEGMENT);
}
if (this.encodedQuery != null) {
sb.append("?").append(this.encodedQuery);
} else if (this.queryParams != null && !this.queryParams.isEmpty()) {
sb.append("?");
- formatQuery(sb, this.queryParams, this.charset, false);
+ formatQuery(sb, this.queryParams, this.charset,
+ encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.QUERY, false);
} else if (this.query != null) {
sb.append("?");
PercentCodec.encode(sb, this.query, this.charset,
395ae51 to
53477ec
Compare
167591a to
00f5bba
Compare
| .setScheme("https") | ||
| .setHost("somehost.com") | ||
| .setFragment("some fragment with all sorts of $tuff in it!!!") | ||
| .setEncodingPolicy(URIBuilder.EncodingPolicy.RFC_3986) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arturobernalg I think the test should also pass with ALL_RESERVED default
httpcore5/src/test/java/org/apache/hc/core5/net/TestURIBuilder.java
Outdated
Show resolved
Hide resolved
| * is {@link EncodingPolicy#RFC_3986}. | ||
| * | ||
| * @param encodingPolicy the encoding policy to apply, or {@code null} to reset | ||
| * to the default ({@link EncodingPolicy#RFC_3986}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is no longer the case ALL_RESERVED seems to be the default now (see above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
corrected
… and fragment encoding to PercentCodec.FRAGMENT under RFC_3986
45d5bd3 to
62bd76d
Compare
This pull request addresses issue HTTPCORE-778, which noted that our encoding for fragment identifiers in selectors was not fully compliant with RFC 3986. The RFC defines a fragment as:
fragment = *( pchar / "/" / "?" )with:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"