Skip to content

add FAQs about collation for JDBC connections #20848

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Apr 28, 2025
Merged
16 changes: 15 additions & 1 deletion develop/dev-guide-sample-application-java-jdbc.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,23 @@ In this tutorial, you can learn how to use TiDB and JDBC to accomplish the follo
- Connect to your TiDB cluster using JDBC.
- Build and run your application. Optionally, you can find [sample code snippets](#sample-code-snippets) for basic CRUD operations.

<CustomContent platform="tidb">

> **Note:**
>
> This tutorial works with TiDB Cloud Serverless, TiDB Cloud Dedicated, and TiDB Self-Managed.
> - This tutorial works with TiDB Cloud Serverless, TiDB Cloud Dedicated, and TiDB Self-Managed.
> - Starting from TiDB v7.4, if `connectionCollation` is not configured, and `characterEncoding` is either not configured or set to `UTF-8` in the JDBC URL, the collation used in a JDBC connection depends on the JDBC driver version. For more information, see [Collation used in JDBC connections](/faq/sql-faq.md#collation-used-in-jdbc-connections).

</CustomContent>

<CustomContent platform="tidb-cloud">

> **Note:**
>
> - This tutorial works with TiDB Cloud Serverless, TiDB Cloud Dedicated, and TiDB Self-Managed.
> - Starting from TiDB v7.4, if `connectionCollation` is not configured, and `characterEncoding` is either not configured or set to `UTF-8` in the JDBC URL, the collation used in a JDBC connection depends on the JDBC driver version. For more information, see [Collation used in JDBC connections](https://docs.pingcap.com/tidb/stable/sql-faq#collation-used-in-jdbc-connections).

</CustomContent>

## Prerequisites

Expand Down
67 changes: 67 additions & 0 deletions faq/sql-faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -337,6 +337,73 @@ Whether your cluster is a new cluster or an upgraded cluster from an earlier ver
- If the owner does not exist, try manually triggering owner election with: `curl -X POST http://{TiDBIP}:10080/ddl/owner/resign`.
- If the owner exists, export the Goroutine stack and check for the possible stuck location.

## Collation used in JDBC connections

This section lists questions related to collations used in JDBC connections. For information about character sets and collations supported by TiDB, see [Character Set and Collation](/character-set-and-collation.md).

### What collation is used in a JDBC connection when `connectionCollation` is not configured in the JDBC URL?

When `connectionCollation` is not configured in the JDBC URL, there are two scenarios:

**Scenario 1**: Neither `connectionCollation` nor `characterEncoding` is configured in the JDBC URL

- For Connector/J 8.0.25 and earlier versions, the JDBC driver attempts to use the server's default character set. Because the default character set of TiDB is `utf8mb4`, the driver uses `utf8mb4_bin` as the connection collation.
- For Connector/J 8.0.26 and later versions, the JDBC driver uses the `utf8mb4` character set and automatically selects the collation based on the return value of `SELECT VERSION()`.

- When the return value is less than `8.0.1`, the driver uses `utf8mb4_general_ci` as the connection collation. TiDB follows the driver and uses `utf8mb4_general_ci` as the collation.
- When the return value is greater than or equal to `8.0.1`, the driver uses `utf8mb4_0900_ai_ci` as the connection collation. TiDB v7.4.0 and later versions follow the driver and use `utf8mb4_0900_ai_ci` as the collation, while TiDB versions earlier than v7.4.0 fall back to using the default collation `utf8mb4_bin` because the `utf8mb4_0900_ai_ci` collation is not supported in these versions.

**Scenario 2**: `characterEncoding=utf8` is configured in the JDBC URL but `connectionCollation` is not configured. The JDBC driver uses the `utf8mb4` character set according to the mapping rules. The collation is determined according to the rules described in scenario 1.

### How to handle collation changes after upgrading TiDB?

In TiDB v7.4 and earlier versions, if `connectionCollation` is not configured, and `characterEncoding` is either not configured or set to `UTF-8` in the JDBC URL, the TiDB [`collation_connection`](/system-variables.md#collation_connection) variable defaults to the `utf8mb4_bin` collation.

Starting from TiDB v7.4, if `connectionCollation` is not configured, and `characterEncoding` is either not configured or set to `UTF-8` in the JDBC URL, the value of the [`collation_connection`](/system-variables.md#collation_connection) variable depends on the JDBC driver version. For example, for Connector/J 8.0.26 and later versions, the JDBC driver defaults to the `utf8mb4` character set and uses `utf8mb4_general_ci` as the connection collation. TiDB follows the driver, and the [`collation_connection`](/system-variables.md#collation_connection) variable uses the `utf8mb4_0900_ai_ci` collation. For more information, see [Collation used in JDBC connections](#what-collation-is-used-in-a-jdbc-connection-when-connectioncollation-is-not-configured-in-the-jdbc-url).

When upgrading from an earlier version to v7.4 or later (for example, from v6.5 to v7.5), if you need to maintain the `collation_connection` as `utf8mb4_bin` for JDBC connections, it is recommended to configure the `connectionCollation` parameter in the JDBC URL.

The following is a common JDBC URL configuration in TiDB v6.5:

```
spring.datasource.url=JDBC:mysql://{TiDBIP}:{TiDBPort}/{DBName}?characterEncoding=UTF-8&useSSL=false&useServerPrepStmts=true&cachePrepStmts=true&prepStmtCacheSqlLimit=10000&prepStmtCacheSize=1000&useConfigs=maxPerformance&rewriteBatchedStatements=true&defaultfetchsize=-2147483648&allowMultiQueries=true
```

After upgrading to TiDB v7.5 or a later version, it is recommended to configure the `connectionCollation` parameter in the JDBC URL:

```
spring.datasource.url=JDBC:mysql://{TiDBIP}:{TiDBPort}/{DBName}?characterEncoding=UTF-8&connectionCollation=utf8mb4_bin&useSSL=false&useServerPrepStmts=true&cachePrepStmts=true&prepStmtCacheSqlLimit=10000&prepStmtCacheSize=1000&useConfigs=maxPerformance&rewriteBatchedStatements=true&defaultFetchSize=-2147483648&allowMultiQueries=true
```

### What are the differences between the `utf8mb4_bin` and `utf8mb4_0900_ai_ci` collations?

| Collation | Case-sensitive | Ignore trailing spaces | Accent-sensitive | Comparison method |
|----------------------|----------------|------------------|--------------|------------------------|
| `utf8mb4_bin` | Yes | Yes | Yes | Compare binary values |
| `utf8mb4_0900_ai_ci` | No | No | No | Use Unicode sorting algorithm |

For example:

```sql
-- utf8mb4_bin is case-sensitive
SELECT 'apple' = 'Apple' COLLATE utf8mb4_bin; -- Returns 0 (FALSE)

-- utf8mb4_0900_ai_ci is case-insensitive
SELECT 'apple' = 'Apple' COLLATE utf8mb4_0900_ai_ci; -- Returns 1 (TRUE)

-- utf8mb4_bin ignores trailing spaces
SELECT 'Apple ' = 'Apple' COLLATE utf8mb4_bin; -- Returns 1 (TRUE)

-- utf8mb4_0900_ai_ci does not ignore trailing spaces
SELECT 'Apple ' = 'Apple' COLLATE utf8mb4_0900_ai_ci; -- Returns 0 (FALSE)

-- utf8mb4_bin is accent-sensitive
SELECT 'café' = 'cafe' COLLATE utf8mb4_bin; -- Returns 0 (FALSE)

-- utf8mb4_0900_ai_ci is accent-insensitive
SELECT 'café' = 'cafe' COLLATE utf8mb4_0900_ai_ci; -- Returns 1 (TRUE)
```

## SQL optimization

### TiDB execution plan description
Expand Down
6 changes: 6 additions & 0 deletions faq/upgrade-faq.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,12 @@ It is not recommended to upgrade TiDB using the binary. Instead, it is recommend

This section lists some FAQs and their solutions after you upgrade TiDB.

### The collation in JDBC connections changes after upgrading TiDB

When upgrading from an earlier version to v7.4 or later, if the `connectionCollation` is not configured, and the `characterEncoding` is either not configured or configured as `UTF-8` in the JDBC URL, the default collation in your JDBC connections might change from `utf8mb4_bin` to `utf8mb4_0900_ai_ci` after upgrading. If you need to maintain the collation as `utf8mb4_bin`, configure `connectionCollation=utf8mb4_bin` in the JDBC URL.

For more information, see [Collation used in JDBC connections](/faq/sql-faq.md#collation-used-in-jdbc-connections).

### The character set (charset) errors when executing DDL operations

In v2.1.0 and earlier versions (including all versions of v2.0), the character set of TiDB is UTF-8 by default. But starting from v2.1.1, the default character set has been changed into UTF8MB4.
Expand Down