Skip to content

Add Uri\WhatWg classes to ext/uri #18672

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: master
Choose a base branch
from
Open

Conversation

kocsismate
Copy link
Member

Copy link
Member

@TimWolla TimWolla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments regarding the tests.

Copy link
Member

@TimWolla TimWolla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some first comments before ending my day. I'm having some troubles with the amount of “indirection” / “function pointers” handling things in a very abstract fashion that I'm not sure is always necessary.

Comment on lines 140 to 146
if (property_handler->write_func(new_internal_uri, property_zv, &errors) == FAILURE) { \
throw_invalid_uri_exception(new_internal_uri->handler, &errors); \
zval_ptr_dtor(&errors); \
zend_object_release(new_object); \
zval_ptr_dtor(property_zv); \
RETURN_THROWS(); \
} \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be cleaner if the write_func() would throw the exception by itself. This gives it best control over the exception data.

If necessary a silent parameter could be added, but extensions are already able to suppress any exceptions if they are undesired, so this is probably not necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to throw in the handlers themselves because of the internal API: different kinds of exceptions/errors may be used in different contexts, especially when code needs to be backward compatible. For example SOAP uses a SoapFault with the "Unable to parse URL" message (https://github.com/php/php-src/blob/cf0c982b0ba937f4b94923c8fe4d541f0181a283/ext/soap/php_http.c#L467) to indicate an error, while the FILTER_VALIDATE_URL filter uses the false return value etc. Another example from the code of the RFC itself is the parse_uri handler that you spotted below: it should generally throw an InvalidUriException, except when it's used during unserialization.

That's why my reasoning is that handlers should be free of any exception throwing, and the caller should decide based on the context what to do with the error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why my reasoning is that handlers should be free of any exception throwing, and the caller should decide based on the context what to do with the error.

When internal code throws an exception while an Exception is already active, it will automatically set the $previous value to the active exception. This matches the:

It MUST set the $previous property to the original exception when doing so.

rule from the Throwable policy RFC (https://externals.io/message/127340) that is very likely to pass. Thus it is super easy to wrap the exception during unserialization (and probably also in Soap). You can see this pattern in action at:

if (FAILURE == php_random_bytes_throw(ZSTR_VAL(buffer), ZSTR_LEN(buffer))) {
zend_value_error("Unable to generate salt");
zend_string_release_ex(buffer, 0);
return NULL;
}

Where the Random\RandomException will be wrapped into the ValueError.


Generally speaking, APIs that do not throw an Exception on error should become more rare going forward, thus throwing an exception is the "standard case". In case an exception is absolutely undesired either a silent parameter would work or alternatively the exception can be suppressed with zend_clear_exception() after extracting the message.

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round, didn't check the tests yet

@kocsismate
Copy link
Member Author

I addressed most review comments (that I marked with 👀 ). I'll continue tomorrow with the rest of the comments.

Niels, please let me know if you'd prefer to resolve your comments yourself or I should do it. I'll resolve the obvious ones for Tim tomorrow.

@nielsdos
Copy link
Member

Niels, please let me know if you'd prefer to resolve your comments yourself or I should do it. I'll resolve the obvious ones for Tim tomorrow.

It's fine by me if you resolve them. If you'd like some extra hands helping with the coding just hmu.

kocsismate and others added 5 commits June 2, 2025 08:55
@kocsismate
Copy link
Member Author

I'll push the fix for the test failures later today

@TimWolla TimWolla self-requested a review June 2, 2025 18:49
@nielsdos
Copy link
Member

nielsdos commented Jun 2, 2025

Question btw: Why are the "Cannot recompose %s to string" exceptions of type \Exception instead of \Uri\UriException ? I tried to find this in the RFC but couldn't

@kocsismate
Copy link
Member Author

Why are the "Cannot recompose %s to string" exceptions of type \Exception instead of \Uri\UriException ? I tried to find this in the RFC but couldn't.

I'm not sure anymore why I used a regular exception, but maybe because the possible errors are not something people want to catch (since normally, recomposition should always be successful. If not, then there's some legitimate exceptional situation).

Currently, this exception cannot be triggered, as neither lexbor, nor uriparser can fail (only memory errors can happen, but we bail out (I hope I used the correct term) then anyway). However, possible 3rd party implementations may fail (the example scenario I can bring up is when an URI parser is not integrated into PHP's memory handler), so I kept the error handling in place.

BTW this practice was questioned by Tim here: #18672 (comment)

Comment on lines 207 to 244
PHPAPI void php_uri_instantiate_uri(
INTERNAL_FUNCTION_PARAMETERS, const uri_handler_t *handler, const zend_string *uri_str, const zend_object *base_url_object,
bool should_throw, bool should_update_this_object, zval *errors_zv
) {
zval errors;
ZVAL_UNDEF(&errors);

void *base_url = NULL;
if (base_url_object != NULL) {
uri_internal_t *internal_base_url = uri_internal_from_obj(base_url_object);
URI_ASSERT_INITIALIZATION(internal_base_url);
base_url = internal_base_url->uri;
}

void *uri = handler->parse_uri(uri_str, base_url, should_throw || errors_zv != NULL ? &errors : NULL);
if (UNEXPECTED(uri == NULL)) {
if (should_throw) {
throw_invalid_uri_exception(handler, &errors);
zval_ptr_dtor(&errors);
RETURN_THROWS();
} else {
if (pass_errors_by_ref(errors_zv, &errors) == FAILURE) {
RETURN_THROWS();
}
RETURN_NULL();
}
}

if (pass_errors_by_ref(errors_zv, &errors) == FAILURE) {
RETURN_THROWS();
}

uri_object_t *uri_object;
if (should_update_this_object) {
uri_object = Z_URI_OBJECT_P(ZEND_THIS);
} else {
if (EX(func)->common.fn_flags & ZEND_ACC_STATIC) {
object_init_ex(return_value, Z_CE_P(ZEND_THIS));
} else {
object_init_ex(return_value, Z_OBJCE_P(ZEND_THIS));
}
uri_object = Z_URI_OBJECT_P(return_value);
}

uri_object->internal.handler = handler;
uri_object->internal.uri = uri;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to my remark regarding throw_invalid_uri_exception, I have a feeling that trying to handle the constructor in a generic fashion internally leads to great complexity that we don't need and that is not helping folks. The external API is different for good reason, so trying to unify the internal API is going to result in very awkward logic.

My suggestion would be the following:

  • Expose lexbor_parse_uri() directly: Users that know they need WHATWG semantics can then use the function directly and get improved type-safety.
  • Make the parse_uri handler a wrapper around lexbor_parse_uri(): This one exposes a simple API void *parse_uri(zend_string *uri, bool silent) for generic internal usage where the exact implementation does not matter. If NULL is returned an error happened.
  • Make the constructor for Uri\WhatWg\Url use lexbor_parse_uri() (and possibly fill_errors) directly.

My reasoning is that folks that use the “generic” API will not be able to meaningfully make use of the errors value, since the possible errors depend on the URI implementation. The only thing they can do is pass it to the create_invalid_uri_exception handler, which then is able to make use of it. We can thus avoid that layer of indirection.

Trying to handle things generically that are quite different in the details is also something that burned me in ext/random, specifically the seeding. That's why every engine now exposes engine-specific php_random_*_seed* functions.

I was thinking something like this:

diff --git i/ext/uri/php_lexbor.c w/ext/uri/php_lexbor.c
index ba8f1440e0b..0703e7be228 100644
--- i/ext/uri/php_lexbor.c
+++ w/ext/uri/php_lexbor.c
@@ -537,27 +537,39 @@ void lexbor_request_shutdown(void)
 	lexbor_urls = 0;
 }
 
-static void *lexbor_parse_uri(const zend_string *uri_str, const void *base_url, zval *errors)
+lxb_url_t *lexbor_parse_uri(const zend_string *uri_str, const void *base_url, zval *errors, bool silent)
 {
 	lexbor_cleanup_parser();
+	zval z;
+	if (!silent && errors == NULL) {
+		errors = &z;
+	}
 
 	const lxb_url_t *lexbor_base_url = base_url;
 	lxb_url_t *url = lxb_url_parse(lexbor_parser, lexbor_base_url, (unsigned char *) ZSTR_VAL(uri_str), ZSTR_LEN(uri_str));
 	fill_errors(errors);
 
+	if (url == NULL && !silent) {
+		zval exception;
+
+		object_init_ex(&exception, uri_whatwg_invalid_url_exception_ce);
+
+		zval value;
+		ZVAL_STRING(&value, "URL parsing failed");
+		zend_update_property_ex(uri_whatwg_invalid_url_exception_ce, Z_OBJ(exception), ZSTR_KNOWN(ZEND_STR_MESSAGE), &value);
+		zval_ptr_dtor_str(&value);
+
+		zend_update_property(uri_whatwg_invalid_url_exception_ce, Z_OBJ(exception), ZEND_STRL("errors"), errors);
+
+		zend_throw_exception_object(&exception);
+	}
+
 	return url;
 }
 
-static void lexbor_create_invalid_uri_exception(zval *exception_zv, zval *errors)
+static void *parse_uri(const zend_string *uri_str, bool silent)
 {
-	object_init_ex(exception_zv, uri_whatwg_invalid_url_exception_ce);
-
-	zval value;
-	ZVAL_STRING(&value, "URL parsing failed");
-	zend_update_property_ex(uri_whatwg_invalid_url_exception_ce, Z_OBJ_P(exception_zv), ZSTR_KNOWN(ZEND_STR_MESSAGE), &value);
-	zval_ptr_dtor_str(&value);
-
-	zend_update_property(uri_whatwg_invalid_url_exception_ce, Z_OBJ_P(exception_zv), ZEND_STRL("errors"), errors);
+	return lexbor_parse_uri(uri_str, NULL, NULL, silent);
 }
 
 static void *lexbor_clone_uri(void *uri)
@@ -599,8 +611,7 @@ static void lexbor_free_uri(void *uri)
 
 const uri_handler_t lexbor_uri_handler = {
 	.name = URI_PARSER_WHATWG,
-	.parse_uri = lexbor_parse_uri,
-	.create_invalid_uri_exception = lexbor_create_invalid_uri_exception,
+	.parse_uri = parse_uri,
 	.clone_uri = lexbor_clone_uri,
 	.uri_to_string = lexbor_uri_to_string,
 	.free_uri = lexbor_free_uri,
diff --git i/ext/uri/php_lexbor.h w/ext/uri/php_lexbor.h
index 24f4fbfffe5..e3ce19161b9 100644
--- i/ext/uri/php_lexbor.h
+++ w/ext/uri/php_lexbor.h
@@ -22,6 +22,8 @@
 
 extern const uri_handler_t lexbor_uri_handler;
 
+lxb_url_t *lexbor_parse_uri(const zend_string *uri_str, const void *base_url, zval *errors, bool silent);
+
 void lexbor_module_init(void);
 void lexbor_module_shutdown(void);
 
diff --git i/ext/uri/php_uri.c w/ext/uri/php_uri.c
index 8493d0136cf..43dd8158beb 100644
--- i/ext/uri/php_uri.c
+++ w/ext/uri/php_uri.c
@@ -218,11 +218,9 @@ PHPAPI void php_uri_instantiate_uri(
 		base_url = internal_base_url->uri;
 	}
 
-	void *uri = handler->parse_uri(uri_str, base_url, should_throw || errors_zv != NULL ? &errors : NULL);
+	lxb_url_t *uri = lexbor_parse_uri(uri_str, base_url, errors_zv != NULL ? &errors : NULL, !should_throw);
 	if (UNEXPECTED(uri == NULL)) {
 		if (should_throw) {
-			throw_invalid_uri_exception(handler, &errors);
-			zval_ptr_dtor(&errors);
 			RETURN_THROWS();
 		} else {
 			if (pass_errors_by_ref(errors_zv, &errors) == FAILURE) {

And the implementation for Rfc3986 would have a rfc3986_parse_uri() that does not have an errors parameter, since it can't emit detailed errors.

return SUCCESS;
}

static ZEND_RESULT_CODE init_idna(void)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest being consistent about using ZEND_RESULT_CODE or zend_result

Suggested change
static ZEND_RESULT_CODE init_idna(void)
static zend_result init_idna(void)


zval_string_or_null_to_lexbor_str(value, &str);

if ( lxb_url_api_search_set(lexbor_uri, lexbor_parser, str.data, str.length) != LXB_STATUS_OK) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if ( lxb_url_api_search_set(lexbor_uri, lexbor_parser, str.data, str.length) != LXB_STATUS_OK) {
if (lxb_url_api_search_set(lexbor_uri, lexbor_parser, str.data, str.length) != LXB_STATUS_OK) {

RETURN_THROWS();
}

zend_update_property(uri_whatwg_url_validation_error_ce, Z_OBJ_P(ZEND_THIS), ZEND_STRL("type"), type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
zend_update_property(uri_whatwg_url_validation_error_ce, Z_OBJ_P(ZEND_THIS), ZEND_STRL("type"), type);
zend_update_property_ex(uri_whatwg_url_validation_error_ce, Z_OBJ_P(ZEND_THIS), ZSTR_KNOWN(ZEND_STR_TYPE), type);

internal_uri->uri = internal_uri->handler->parse_uri(Z_STR_P(uri_zv), NULL, &errors);
if (internal_uri->uri == NULL) {
zend_throw_exception_ex(NULL, 0, "Invalid serialization data for %s object", ZSTR_VAL(object->ce->name));
zval_ptr_dtor(&errors);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm reading the API for internal_uri->handler->parse_uri() correctly, the errors parameter can be set to NULL if the errors are not going to be used. Here, they are not used, on either success or failure - suggest omitting them entirely

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice catch, thanks!


zval error_type;
zend_enum_new(&error_type, uri_whatwg_url_validation_error_type_ce, error_str, NULL);
zend_update_property(uri_whatwg_url_validation_error_ce, Z_OBJ(error), ZEND_STRL("type"), &error_type);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
zend_update_property(uri_whatwg_url_validation_error_ce, Z_OBJ(error), ZEND_STRL("type"), &error_type);
zend_update_property_ex(uri_whatwg_url_validation_error_ce, Z_OBJ(error), ZSTR_KNOWN(ZEND_STR_TYPE), &error_type);

@nielsdos
Copy link
Member

nielsdos commented Jun 5, 2025

I don't immediately spot something that still needs changing.
I just wonder whether it makes sense to have the lexbor parser structure as a normal variable instead of as a pointer: https://gist.github.com/nielsdos/2e9e46ca83e3c16f6a014598c84f21f2 . That makes more sense to me as it doesn't need to be on the heap per se.

@kocsismate
Copy link
Member Author

https://gist.github.com/nielsdos/2e9e46ca83e3c16f6a014598c84f21f2 . That makes more sense to me as it doesn't need to be on the heap per se.

This makes sense to me :)

I don't immediately spot something that still needs changing.

Yes, I would also prefer to merge it soon, so that the rest of the changes can also be continued (and of course this code can still be fine tuned later). First, I'd probably create a PR for the uriparser stuffs without the write handlers, than probably the internal API related code.

I'm still thinking about Tim's suggestion above (#18672 (comment)). I'm currently playing with the idea. I like the fact that the exception handler wouldn't be needed anymore, but in the same time, all the write handlers should also take care of throwing the exception themselves.

@nielsdos
Copy link
Member

nielsdos commented Jun 5, 2025

I'm still thinking about Tim's suggestion above (#18672 (comment)). I'm currently playing with the idea. I like the fact that the exception handler wouldn't be needed anymore, but in the same time, all the write handlers should also take care of throwing the exception themselves.

I don't have strong preference for either design decision.

Copy link
Member

@nielsdos nielsdos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay for me.
Please add an entry to UPGRADING.INTERNALS for the zend_exceptions.c zend_update_exception_properties API addition

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants