PS-9237 feature: Include support for utf8mb4_0900_ai_ci in MySQL 5.7 #5364
Conversation
…(connection) https://perconadev.atlassian.net/browse/PS-9237 Extended the behavior of the 'default_collation_for_utf8mb4' system variable. Now, when 'default_collation_for_utf8mb4' is set to 'utf8mb4_general_ci', its value also affects the value of the 'character_set_client', 'collation_connection' and 'character_set_results' variables. This helps in cases when a 8.0 mysql client (for instance, 'mysql' command line utility or 'mysqldump' utility) that is used with the '--default-character-set' command line option explicitly set to 'utf8mb4' or not set at all is trying to connect to a 8.0 server with 'default_collation_for_utf8mb4' set to 'utf8mb4_general_ci'. In both cases 'MYSQL_SET_CHARSET_NAME' API option will be set to 255 and sent to the server as part of the protocol, and without the fix, the value of the 'character_set_client' variable will be 'utf8mb4_0900_ai_ci' (255). This behavior is not desired as it has a lot of negative side effects, especially for the binary logging intended to be compatible with 5.7 (where there is no 'utf8mb4_0900_ai_ci' collation).
…(set_var) https://perconadev.atlassian.net/browse/PS-9237 Extended the behavior of the 'default_collation_for_utf8mb4' system variable. Now, when user sets one of the charset-related variables (for instance, 'character_set_client') to 'utf8mb4', the value written to the binary log for the subsequent QUERY event will have 'SET @@session.character_set_client=...' set to a value based on the value of 'default_collation_for_utf8mb4' (45 for 'utf8mb4_general_ci') instead of pre-defined 255 ('utf8mb4_0900_ai_ci'). This helps in scenarios when an sql file generated by the 'mysqldump' utility has explicit 'SET character_set_client = utf8mb4;' statements and is executed on a server running with 'default_collation_for_utf8mb4' set to 'utf8mb4_general_ci'. With the fix this the default collation for the 'utf8mb4' will be identified as 'utf8mb4_general_ci' (45) instead of `utf8mb4_0900_ai_ci` (255) and will help the generated binary log to be backward compatible with 5.7.
|
This needs to be rebased on top of |
sql/sys_vars.cc
Outdated
| get_charset_by_csname("utf8mb4", MY_CS_PRIMARY, MYF(0)); | ||
| if (thd->variables.default_collation_for_utf8mb4 != | ||
| primary_utf8mb4_collation) { | ||
| if (var->save_result.ptr == primary_utf8mb4_collation) { |
There was a problem hiding this comment.
do not access members of unions; use (boost::)variant instead
| if (thd->variables.default_collation_for_utf8mb4 != | ||
| primary_utf8mb4_collation) { | ||
| if (var->save_result.ptr == primary_utf8mb4_collation) { | ||
| var->save_result.ptr = thd->variables.default_collation_for_utf8mb4; |
There was a problem hiding this comment.
do not access members of unions; use (boost::)variant instead
…(mtr) https://perconadev.atlassian.net/browse/PS-9237 Added new 'main.percona_default_collation_for_utf8mb4_extensions' MTR test case which checks the following: * if setting 'default_collation_for_utf8mb4' affects collation variables for connections established from an external mysql client ('mysql' command line utility) * if setting 'default_collation_for_utf8mb4' affects the behavior of 'SET character_set_client = utf8mb4;' * if collation 'utf8mb4_0900_ai_ci' (255) never appears in the binary log when 'default_collation_for_utf8mb4' is set to 'utf8mb4_general_ci' * if default 'mysqldump ... > mysql ...' backup-restore procedure does not lead to changing collations back to 'utf8mb4_0900_ai_ci' when 'default_collation_for_utf8mb4' is set to 'utf8mb4_general_ci'
dlenev
left a comment
There was a problem hiding this comment.
Hello Yura!
I would like to suggest a few changes to this patch:
sql/sql_connect.cc
Outdated
| // we need to fix 'cs_number' here by setting it to the corresponding number | ||
| // of 'default_collation_for_utf8mb4' (currently only 'utf8mb4_general_ci', | ||
| // number 45, is supported) | ||
| const auto *primary_utf8mb4_collation = |
There was a problem hiding this comment.
As far as I can see in other places where we check if default_collation_for_utf8mb4 needs to kick in (e.g. see sql_lex.cc) we directly use my_charset_utf8mb4_0900_ai_ci instead of getting access to it though get_charset_by_csname().
IMO it makes sense to be consistent and do the same here... Especially since this code seems to be called for each connect so saving even a few CPU cycles would be nice.
What do you think?
There was a problem hiding this comment.
Also I think that instead of two ifs you can simply do:
if (cs_number == my_charset_utf8mb4_0900_ai_ci.number)
cs_numer = thd->variables.default_collation_for_utf8mb4->number;
What do you think?
There was a problem hiding this comment.
Yes, @dlenev I had exactly the same doubts and was going back and forth with this. From one side I did not want to hardcode primary_utf8mb4_collation to be my_charset_utf8mb4_0900_ai_ci (because who knows may be in the next version the default collation will change again). On the other hand, I totally agree that establishing the connection is a critical path and we should not add any unnecessary cycles here.
Anyway, if this caught your attention as well, then it is probably more serious than I thought.
Let's wait for the final feedback from the customer and I will add the changes you suggested into the final patch.
There was a problem hiding this comment.
@dlenev I reworked the critical paths code with simplified versions
if (thd->variables.default_collation_for_utf8mb4 !=
&my_charset_utf8mb4_0900_ai_ci) {
if (client_cs == &my_charset_utf8mb4_0900_ai_ci) {
client_cs = thd->variables.default_collation_for_utf8mb4;
}
}that does not involve charset by name lookup.
sql/sys_vars.cc
Outdated
| // 'utf8mb4', we need to fix the returned value depending on the value of | ||
| // 'default_collation_for_utf8mb4' (currently, only 'utf8mb4_general_ci' | ||
| // is possible) | ||
| const auto *primary_utf8mb4_collation = |
There was a problem hiding this comment.
The same comment as for previous patch:
Perhaps it is better simply do:
if (var->save_result.ptr == &my_charset_utf8mb4_0900_ai_ci)
var->save_result.ptr = thd->variables.default_collation_for_utf8mb4;
instead ?
What do you think?
There was a problem hiding this comment.
Reworked the same way.
dlenev
left a comment
There was a problem hiding this comment.
A bit more comments about second and third parts of your patch.
Otherwise it looks OK to me.
| my_error(ER_UNKNOWN_CHARACTER_SET, MYF(0), err.ptr()); | ||
| return true; | ||
| } | ||
| // if 'default_collation_for_utf8mb4' is set to something other than |
There was a problem hiding this comment.
What about similar code in check_collation_not_null() ?
In theory one can do SET @@global.collation_connection = utf8mb4.
Perhaps we should handle this case as well?
Or at least add comment why we don't think it is necessary/don't want to do it (indeed it might be non-trivial to distinguish cases when one uses charset name only - utf8mb4 vs case when one specifies collation explicitly -utf8mb4_0900_ai_ci.
There was a problem hiding this comment.
After a bit more pondering about it, I think it is a good idea to have test coverage for cases when user can use both charset name and collation name. Like SET NAMES utf8mb4; vs SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci and SET @@global.collation_connection = utf8mb4 vs SET @@global.collation_connection = utf8mb4_0900_ai_ci.
Do you agree?
There was a problem hiding this comment.
@dlenev I extended coverage with the following statements
SET character_set_client = utf8mb4;
SET NAMES DEFAULT;
SET NAMES utf8mb4;
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
SET CHARACTER SET DEFAULT;
SET CHARACTER SET utf8mb4;
SET collation_connection = utf8mb4_general_ci;Note that 'SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci' will generate
character_set_client=255 in the binary log - I believe this is correct behavior as uses pecifies the collation explicitly.
Likewise, 'SET collation_connection = utf8mb4_0900_ai_ci' will generate
character_set_client=255 in the binary log.
As for 'SET collation_connection = utf8mb4' (when we try to assign a character set name to a collation variable), there is now problem here as this statatement is considered
syntactically incorrect In other words, no need to change check_collation_not_null().
…(connection postfix) https://perconadev.atlassian.net/browse/PS-9237 Eliminated unnecessary call to 'get_charset_by_csname()' inside 'thd_init_client_charset()' as this function is in critical path.
…(set_var postfix) https://perconadev.atlassian.net/browse/PS-9237 Eliminated unnecessary call to 'get_charset_by_csname()' inside 'check_charset()'.
…(triggers) https://perconadev.atlassian.net/browse/PS-9237 Fixed the value of the 'character_set_client' field for the binary log events generated implicitly from triggers. Updated 'main.percona_default_collation_for_utf8mb4_extensions' MTR test case with 'INSERT' statements executed from triggers and stored procedures.
…(SET CHARACTER SET) https://perconadev.atlassian.net/browse/PS-9237 'SET CHARACTER SET' statement now also takes into account the value of the 'default_collation_for_utf8mb4' system variable. 'main.percona_default_collation_for_utf8mb4_extensions' MTR test case extended with additional checks for 'SET NAMES' and `SET CHARACTER SET`.
…(SET collation_connection mtr) https://perconadev.atlassian.net/browse/PS-9237 'main.percona_default_collation_for_utf8mb4_extensions' MTR test case extended with additional checks for 'SET collation_connection'.
| // 'utf8mb4_general_ci' is possible) | ||
| if (thd->variables.default_collation_for_utf8mb4 != | ||
| &my_charset_utf8mb4_0900_ai_ci) { | ||
| if (var->save_result.ptr == &my_charset_utf8mb4_0900_ai_ci) { |
There was a problem hiding this comment.
do not access members of unions; use (boost::)variant instead
a35c35d to
bfc10f2
Compare
No description provided.