Skip to content

Conversation

@arturobernalg
Copy link
Member

This pull request addresses issue HTTPCORE-778, which noted that our encoding for fragment identifiers in selectors was not fully compliant with RFC 3986. The RFC defines a fragment as:

fragment = *( pchar / "/" / "?" )

with:

pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

@ok2c
Copy link
Member

ok2c commented Mar 12, 2025

@arturobernalg It makes no sense to change formatting of the fragment component and ignore all other components.

@arturobernalg arturobernalg force-pushed the HTTPCORE-778 branch 3 times, most recently from 286850f to 8a46343 Compare March 12, 2025 12:38
@arturobernalg
Copy link
Member Author

@arturobernalg It makes no sense to change formatting of the fragment component and ignore all other components.

@ok2c please take another look.

@ok2c
Copy link
Member

ok2c commented Mar 12, 2025

@arturobernalg What about userinfo, reg-name, path segment?

@arturobernalg
Copy link
Member Author

@arturobernalg What about userinfo, reg-name, path segment?

@ok2c here's a consolidated fix ensuring consistent RFC 3986–compliant encoding for userinfo, reg-name, path segments, query, and fragment.

Copy link
Member

@ok2c ok2c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Do you mind investing some more effort into it and adding a new attribute encodingPolicy as an enum of ALL_RESERVED|RFC_3986 or similar and encoding components based on that policy?

URIC.or(UNRESERVED);
}

static final BitSet FRAGMENT_SAFE = new BitSet(256);
Copy link
Member

@ok2c ok2c Mar 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg FRAGMENT_SAFE does not seem to be used anymore.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

@arturobernalg
Copy link
Member Author

@arturobernalg What about userinfo, reg-name, path segment?

@ok2c here's a consolidated fix ensuring consistent RFC 3986–compliant encoding for userinfo, reg-name, path segments, query, and fragment.

@ok2c Done. Please do another pas

Copy link
Member

@ok2c ok2c left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Do you mind keeping ALL_RESERVED mode for compatibility with the behavior of previous versions?

if (idx != -1) {
PercentCodec.encode(sb, this.userInfo.substring(0, idx), this.charset);
PercentCodec.encode(sb, this.userInfo.substring(0, idx), this.charset,
encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.URIC, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Should not it be USERINFO, not URIC? Same below.

} else {
sb.append(PercentCodec.encode(this.host, this.charset));
PercentCodec.encode(sb, this.host, this.charset,
encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.URIC, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Should not it be REG_NAME, not URIC?

sb.append("?");
PercentCodec.encode(sb, this.query, this.charset, PercentCodec.URIC, false);
PercentCodec.encode(sb, this.query, this.charset,
encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.FRAGMENT, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Careful there. Here we are encoding a complete query that may use sub-delims. Even in ALL_RESERVED mode it should still be URIC.

I suppose this line should be

encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.URIC : PercentCodec.QUERY, false)

@peterhalicky
Copy link

peterhalicky commented Mar 14, 2025

The initial motivation for HTTPCORE-778 was the need to put some application/x-www-form-urlencoded formatted values into the fragment part of a URI. The format is pretty much the same as what is used in the query component of the URI, i.e. contains one or more key=value assignments separated by &.

With the above changes, if I read the diff correctly, we'll be able to use & as an unencoded separator. However, the keys/values (data) should be fully encoded as they can possibly contain the & character, and there likely isn't a way to create such URI -- either it will encode all the separators, or it won't any.

I understand that this probably opens a can of worms as it likely would require functionality similar to what is available for query parameters, but unlike query parameters in case of fragment this is only one of many ways how fragments are used.

@ok2c
Copy link
Member

ok2c commented Mar 14, 2025

@peterhalicky no matter how hard we try we just cannot make everyone happy, can we?

The only way of accommodating your request would be to add support for List<NameValuePair> fragmentParams and make them work the same way as the query parameters.

@peterhalicky
Copy link

@peterhalicky no matter how hard we try we just cannot make everyone happy, can we?

The only way of accommodating your request would be to add support for List<NameValuePair> fragmentParams and make them work the same way as the query parameters.

Yes, I agree, I'd be quite hesitant to implement this myself. For my use case I can get away with restricting the keys and values to not containing &. I just wanted to point out that this is still a restriction that will not allow users to create all possible URIs that would otherwise be legal as per the RFC (not sure if the library aims for that or not).

@ok2c
Copy link
Member

ok2c commented Mar 14, 2025

I just wanted to point out that this is still a restriction that will not allow users to create all possible URIs that would otherwise be legal as per the RFC

@peterhalicky The whole point of representing the query component as List<NameValuePair> is make it possible for name/value pairs to contain arbitrary characters. The same could be done for the fragment component.

@peterhalicky
Copy link

peterhalicky commented Mar 14, 2025

The same could be done for the fragment component.

@ok2c The fragment component often contains all kinds of content. Sometimes it is application/x-www-form-urlencoded, other times it looks like a path, i.e. / separated path segments. Representing the fragment component as List<NameValuePair> will only support the former, but not the latter.

A possible solution could be to let the client build the fragment by parts, specifying how much encoding should be done for each part. Something like .appendFragment("&", URIC) and .appendFragment("valuewith&", RESERVED).

@ok2c
Copy link
Member

ok2c commented Mar 14, 2025

@peterhalicky Those folks who need this stuff are welcome to build their own builders.

@arturobernalg arturobernalg requested a review from ok2c March 14, 2025 12:53
private Charset charset;
private String fragment;
private String encodedFragment;
private EncodingPolicy encodingPolicy = EncodingPolicy.RFC_3986;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Should not it be EncodingPolicy.ALL_RESERVED by default?

Previously, URIBuilder relied on partial or inconsistent character sets for
various components, causing valid sub-delims or other characters to be
unnecessarily percent-encoded or left unencoded in the fragment, query, userinfo,
reg-name, and path segments. This patch introduces dedicated BitSets for each
URI component (userinfo, host/reg-name, path segments, query, and fragment)
and updates URIBuilder to use them. As a result, characters like ':', '@', '/',
and '?' remain unencoded in the fragment and query where allowed by RFC 3986,
while certain sub-delimiters in the path and host are now percent-encoded for
strictness. This ensures consistent, RFC 3986–compliant encoding across all
URI components.
@arturobernalg arturobernalg force-pushed the HTTPCORE-778 branch 3 times, most recently from 88acd57 to cb71a7e Compare March 14, 2025 13:34
@arturobernalg arturobernalg requested a review from ok2c March 14, 2025 14:09
sb.append("#");
PercentCodec.encode(sb, this.fragment, this.charset, PercentCodec.URIC, false);
PercentCodec.encode(sb, this.fragment, this.charset,
encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.FRAGMENT, false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg We should use URIC here for consistency with the latest changes in 5.3.x

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

USERINFO.set(':');
REG_NAME.or(UNRESERVED);
REG_NAME.or(SUB_DELIMS);
REG_NAME.clear('!');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg Why clearing '!'? The reg-name is defimed as *( unreserved / pct-encoded / sub-delims ). The '!' should be permitted or am I missing something?

PATH_SEGMENT.clear('(');
PATH_SEGMENT.clear(')');
PATH_SEGMENT.clear('&');
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg I cannot find anything in the RFC stating that. What am I missing? What a super strict mode we have ALL_RESERVED

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg I cannot find anything in the RFC stating that. What am I missing? What a super strict mode we have ALL_RESERVED

@olegk The test expects (, ), & to remain percent-encoded in path segments (e.g., blah-%28%20-blah-%20%26%20-blah-%20%29-blah/), but without the PATH_SEGMENT.clear('('), clear(')'), clear('&') in PercentCodec.java, it outputs blah-(%20-blah-%20&%20-blah-%20)-blah/. Reintroducing the clear() calls fixes it, but I’m wondering if this aligns with our intended behavior or if we should adjust formatPath to handle encoding differently based on encodingPolicy.
I noticed the older formatPath used PercentCodec.encode(buf, segment, charset, false) with UNRESERVED, which worked, but the new version with encodingPolicy support might need tweaking. Could you take a look and advise?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg You have overlooked static formatQuery and formatPath methods. They should also take the encoding policy into account. Apply this patch and everything should work

===================================================================
diff --git a/httpcore5/src/main/java/org/apache/hc/core5/net/PercentCodec.java b/httpcore5/src/main/java/org/apache/hc/core5/net/PercentCodec.java
--- a/httpcore5/src/main/java/org/apache/hc/core5/net/PercentCodec.java	(revision 53477ecc93d331f8348a6748a918d86b25666471)
+++ b/httpcore5/src/main/java/org/apache/hc/core5/net/PercentCodec.java	(date 1742025384945)
@@ -137,10 +137,6 @@
         FRAGMENT.or(PCHAR);
         FRAGMENT.set('/');
         FRAGMENT.set('?');
-        // Some sub-delims remain encoded (RFC 3986 allows them unencoded, but we choose to be strict).
-        PATH_SEGMENT.clear('(');
-        PATH_SEGMENT.clear(')');
-        PATH_SEGMENT.clear('&');
     }
 
     private static final int RADIX = 16;
Index: httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java b/httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java
--- a/httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java	(revision 53477ecc93d331f8348a6748a918d86b25666471)
+++ b/httpcore5/src/main/java/org/apache/hc/core5/net/URIBuilder.java	(date 1742026397263)
@@ -34,6 +34,7 @@
 import java.nio.charset.StandardCharsets;
 import java.util.ArrayList;
 import java.util.Arrays;
+import java.util.BitSet;
 import java.util.Collections;
 import java.util.LinkedList;
 import java.util.List;
@@ -343,19 +344,25 @@
         return list;
     }
 
-    static void formatPath(final StringBuilder buf, final Iterable<String> segments, final boolean rootless, final Charset charset) {
+    static void formatPath(final StringBuilder buf, final Iterable<String> segments, final boolean rootless,
+                           final Charset charset, final BitSet safechars) {
         int i = 0;
         for (final String segment : segments) {
             if (i > 0 || !rootless) {
                 buf.append(PATH_SEPARATOR);
             }
-            PercentCodec.encode(buf, segment, charset, PercentCodec.PATH_SEGMENT, false);
+            PercentCodec.encode(buf, segment, charset, safechars, false);
             i++;
         }
     }
 
-    static void formatQuery(final StringBuilder buf, final Iterable<? extends NameValuePair> params, final Charset charset,
-                            final boolean blankAsPlus) {
+    static void formatPath(final StringBuilder buf, final Iterable<String> segments, final boolean rootless,
+                           final Charset charset) {
+        formatPath(buf, segments, rootless, charset, PercentCodec.UNRESERVED);
+    }
+
+    static void formatQuery(final StringBuilder buf, final Iterable<? extends NameValuePair> params,
+                            final Charset charset, final BitSet safechars, final boolean blankAsPlus) {
         int i = 0;
         for (final NameValuePair parameter : params) {
             if (i > 0) {
@@ -364,12 +371,17 @@
             PercentCodec.encode(buf, parameter.getName(), charset, blankAsPlus);
             if (parameter.getValue() != null) {
                 buf.append(PARAM_VALUE_SEPARATOR);
-                PercentCodec.encode(buf, parameter.getValue(), charset, blankAsPlus);
+                PercentCodec.encode(buf, parameter.getValue(), charset, safechars, blankAsPlus);
             }
             i++;
         }
     }
 
+    static void formatQuery(final StringBuilder buf, final Iterable<? extends NameValuePair> params,
+                            final Charset charset, final boolean blankAsPlus) {
+        formatQuery(buf, params, charset, PercentCodec.UNRESERVED, blankAsPlus);
+    }
+
     /**
      * Builds a {@link URI} instance.
      */
@@ -429,13 +441,15 @@
                 }
                 sb.append(this.encodedPath);
             } else if (this.pathSegments != null) {
-                formatPath(sb, this.pathSegments, !authoritySpecified && this.pathRootless, this.charset);
+                formatPath(sb, this.pathSegments, !authoritySpecified && this.pathRootless, this.charset,
+                        encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.PATH_SEGMENT);
             }
             if (this.encodedQuery != null) {
                 sb.append("?").append(this.encodedQuery);
             } else if (this.queryParams != null && !this.queryParams.isEmpty()) {
                 sb.append("?");
-                formatQuery(sb, this.queryParams, this.charset, false);
+                formatQuery(sb, this.queryParams, this.charset,
+                        encodingPolicy == EncodingPolicy.ALL_RESERVED ? PercentCodec.UNRESERVED : PercentCodec.QUERY, false);
             } else if (this.query != null) {
                 sb.append("?");
                 PercentCodec.encode(sb, this.query, this.charset,

@arturobernalg arturobernalg force-pushed the HTTPCORE-778 branch 2 times, most recently from 167591a to 00f5bba Compare March 15, 2025 13:45
@arturobernalg arturobernalg requested a review from ok2c March 15, 2025 13:49
@ok2c ok2c changed the title Fix percent-encoding in fragment identifiers per RFC 3986 RFC 3986 conformant URI encoding policy Mar 15, 2025
.setScheme("https")
.setHost("somehost.com")
.setFragment("some fragment with all sorts of $tuff in it!!!")
.setEncodingPolicy(URIBuilder.EncodingPolicy.RFC_3986)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arturobernalg I think the test should also pass with ALL_RESERVED default

* is {@link EncodingPolicy#RFC_3986}.
*
* @param encodingPolicy the encoding policy to apply, or {@code null} to reset
* to the default ({@link EncodingPolicy#RFC_3986})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is no longer the case ALL_RESERVED seems to be the default now (see above).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

corrected

… and fragment encoding to PercentCodec.FRAGMENT under RFC_3986
@arturobernalg arturobernalg merged commit 817b265 into apache:master Mar 16, 2025
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants