docs/integrations/data-ingestion/data-formats/arrow-avro-orc.md (3 additions, 3 deletions)
@@ -48,7 +48,7 @@ FORMAT Avro;
### Avro and ClickHouse data types {#avro-and-clickhouse-data-types}
-Consider [data types matching](/interfaces/formats.md/#data_types-matching) when importing or exporting Avro files. Use explicit type casting to convert when loading data from Avro files:
+Consider [data types matching](/interfaces/formats/Avro#data-types-matching) when importing or exporting Avro files. Use explicit type casting to convert when loading data from Avro files:
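As a quick illustration of the explicit casting that paragraph recommends, a minimal sketch (the `data.avro` file and the `id`/`created_at` columns are hypothetical):

```sql
SELECT
    id,
    toDate(created_at) AS created_at -- cast the value read from Avro to a ClickHouse Date explicitly
FROM file('data.avro', Avro);
```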
```sql
SELECT
@@ -100,7 +100,7 @@ INTO OUTFILE 'export.arrow'
FORMAT Arrow
```
-Also, check [data types matching](/interfaces/formats.md/#data-types-matching-arrow) to know if any should be converted manually.
+Also, check [data types matching](/interfaces/formats/Arrow#data-types-matching) to know if any should be converted manually.
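A manual conversion on re-import might look like the following sketch; `export.arrow` comes from the snippet above, while the `path` and `time` columns are assumed:

```sql
SELECT
    path,
    toDateTime(time) AS time -- the Arrow file stores this as an integer; convert it back to DateTime
FROM file('export.arrow', Arrow);
```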
### Arrow data streaming {#arrow-data-streaming}
@@ -150,7 +150,7 @@ FROM INFILE 'data.orc'
FORMAT ORC;
```
-Also, check [data types matching](/interfaces/formats.md/#data-types-matching-orc) as well as [additional settings](/interfaces/formats.md/#parquet-format-settings) to tune export and import.
+Also, check [data types matching](/interfaces/formats/ORC) as well as [additional settings](/interfaces/formats/Parquet#format-settings) to tune export and import.
-You can see a lot of the columns are detected as Nullable. We [do not recommend using the Nullable](/sql-reference/data-types/nullable#storage-features) type when not absolutely needed. You can use [schema_inference_make_columns_nullable](/interfaces/schema-inference#schema_inference_make_columns_nullable) to control the behavior of when Nullable is applied.
+You can see a lot of the columns are detected as Nullable. We [do not recommend using the Nullable](/sql-reference/data-types/nullable#storage-features) type when not absolutely needed. You can use [schema_inference_make_columns_nullable](/operations/settings/formats#schema_inference_make_columns_nullable) to control the behavior of when Nullable is applied.
:::
We can see that most columns have automatically been detected as `String`, with `update_date` column correctly detected as a `Date`. The `versions` column has been created as an `Array(Tuple(created String, version String))` to store a list of objects, with `authors_parsed` being defined as `Array(Array(String))` for nested arrays.
:::note Controlling type detection
-The auto-detection of dates and datetimes can be controlled through the settings [`input_format_try_infer_dates`](/interfaces/schema-inference#input_format_try_infer_dates) and [`input_format_try_infer_datetimes`](/interfaces/schema-inference#input_format_try_infer_datetimes) respectively (both enabled by default). The inference of objects as tuples is controlled by the setting [`input_format_json_try_infer_named_tuples_from_objects`](/operations/settings/formats#input_format_json_try_infer_named_tuples_from_objects). Other settings which control schema inference for JSON, such as the auto-detection of numbers, can be found [here](/interfaces/schema-inference#text-formats).
+The auto-detection of dates and datetimes can be controlled through the settings [`input_format_try_infer_dates`](/operations/settings/formats#input_format_try_infer_dates) and [`input_format_try_infer_datetimes`](/operations/settings/formats#input_format_try_infer_datetimes) respectively (both enabled by default). The inference of objects as tuples is controlled by the setting [`input_format_json_try_infer_named_tuples_from_objects`](/operations/settings/formats#input_format_json_try_infer_named_tuples_from_objects). Other settings which control schema inference for JSON, such as the auto-detection of numbers, can be found [here](/interfaces/schema-inference#text-formats).
:::
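The settings discussed above can be applied while inspecting the inferred schema; a rough sketch, with a hypothetical file name:

```sql
DESCRIBE file('arxiv.json.gz', JSONEachRow)
SETTINGS schema_inference_make_columns_nullable = 0, -- do not wrap inferred columns in Nullable
         input_format_try_infer_dates = 1            -- keep date auto-detection on (the default)
```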
## Querying JSON {#querying-json}
@@ -183,7 +183,7 @@ ORDER BY update_date
SETTINGS index_granularity = 8192
```
-The above is the correct schema for this data. Schema inference is based on sampling the data and reading the data row by row. Column values are extracted according to the format, with recursive parsers and heuristics used to determine the type for each value. The maximum number of rows and bytes read from the data in schema inference is controlled by the settings [`input_format_max_rows_to_read_for_schema_inference`](/interfaces/schema-inference#input_format_max_rows_to_read_for_schema_inferenceinput_format_max_bytes_to_read_for_schema_inference) (25000 by default) and [`input_format_max_bytes_to_read_for_schema_inference`](/interfaces/schema-inference#input_format_max_rows_to_read_for_schema_inferenceinput_format_max_bytes_to_read_for_schema_inference) (32MB by default). In the event detection is not correct, users can provide hints as described [here](/interfaces/schema-inference#schema_inference_hints).
+The above is the correct schema for this data. Schema inference is based on sampling the data and reading the data row by row. Column values are extracted according to the format, with recursive parsers and heuristics used to determine the type for each value. The maximum number of rows and bytes read from the data in schema inference is controlled by the settings [`input_format_max_rows_to_read_for_schema_inference`](/operations/settings/formats#input_format_max_rows_to_read_for_schema_inference) (25000 by default) and [`input_format_max_bytes_to_read_for_schema_inference`](/interfaces/schema-inference#input_format_max_rows_to_read_for_schema_inferenceinput_format_max_bytes_to_read_for_schema_inference) (32MB by default). In the event detection is not correct, users can provide hints as described [here](/interfaces/schema-inference#schema_inference_hints).
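A sketch of raising those sampling limits and supplying a hint; the file name is again hypothetical:

```sql
DESCRIBE file('arxiv.json.gz', JSONEachRow)
SETTINGS input_format_max_rows_to_read_for_schema_inference = 100000,   -- sample more rows
         input_format_max_bytes_to_read_for_schema_inference = 134217728,
         schema_inference_hints = 'update_date Date'                    -- pin a type inference might miss
```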
### Creating tables from snippets {#creating-tables-from-snippets}
@@ -272,7 +272,7 @@ FORMAT PrettyJSONEachRow
## Handling errors {#handling-errors}
-Sometimes, you might have bad data. For example, specific columns that do not have the right type or an improperly formatted JSON. For this, you can use the setting [`input_format_allow_errors_ratio`](/operations/settings/formats#input_format_allow_errors_ratio) to allow a certain number of rows to be ignored if the data is triggering insert errors. Additionally, [hints](/interfaces/schema-inference#schema_inference_hints) can be provided to assist inference.
+Sometimes, you might have bad data. For example, specific columns that do not have the right type or an improperly formatted JSON. For this, you can use the setting [`input_format_allow_errors_ratio`](/operations/settings/formats#input_format_allow_errors_ratio) to allow a certain number of rows to be ignored if the data is triggering insert errors. Additionally, [hints](/operations/settings/formats#schema_inference_hints) can be provided to assist inference.
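As a concrete sketch of that error tolerance (the `events` table and `dirty.json.gz` file are hypothetical):

```sql
INSERT INTO events
SELECT *
FROM file('dirty.json.gz', JSONEachRow)
SETTINGS input_format_allow_errors_ratio = 0.1 -- skip up to 10% of unparsable rows instead of failing
```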
docs/integrations/data-ingestion/data-formats/json/schema.md (1 addition, 1 deletion)
@@ -508,7 +508,7 @@ SELECT JSONExtractString(tags, 'holidays') as holidays FROM people
1 row in set. Elapsed: 0.002 sec.
```
-Notice how the functions require both a reference to the `String` column `tags` and a path in the JSON to extract. Nested paths require functions to be nested e.g. `JSONExtractUInt(JSONExtractString(tags, 'car'), 'year')` which extracts the column `tags.car.year`. The extraction of nested paths can be simplified through the functions [JSON_QUERY](/sql-reference/functions/json-functions.md/#json_queryjson-path) AND [JSON_VALUE](/sql-reference/functions/json-functions.md/#json_valuejson-path).
+Notice how the functions require both a reference to the `String` column `tags` and a path in the JSON to extract. Nested paths require functions to be nested e.g. `JSONExtractUInt(JSONExtractString(tags, 'car'), 'year')` which extracts the column `tags.car.year`. The extraction of nested paths can be simplified through the functions [JSON_QUERY](/sql-reference/functions/json-functions#json_query) AND [JSON_VALUE](/sql-reference/functions/json-functions#json_value).
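Restated as a runnable query against the `people` table and `tags` column from that example, with `JSON_VALUE` shown as the shorter path-based alternative:

```sql
SELECT
    JSONExtractUInt(JSONExtractString(tags, 'car'), 'year') AS car_year,
    JSON_VALUE(tags, '$.car.year') AS car_year_via_path -- same value, addressed with a JSON path
FROM people;
```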
Consider the extreme case with the `arxiv` dataset where we consider the entire body to be a `String`.
-By default, ClickHouse is strict with column names, types, and values. But sometimes, we can skip nonexistent columns or unsupported values during import. This can be managed with [Parquet settings](/interfaces/formats.md/#parquet-format-settings).
+By default, ClickHouse is strict with column names, types, and values. But sometimes, we can skip nonexistent columns or unsupported values during import. This can be managed with [Parquet settings](/interfaces/formats/Parquet#format-settings).
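One such setting, sketched with a hypothetical table, file, and structure (the explicit structure includes an `extra` column the Parquet file may not contain):

```sql
INSERT INTO imported
SELECT *
FROM file('data.parquet', Parquet, 'id UInt32, name String, extra String')
SETTINGS input_format_parquet_allow_missing_columns = 1 -- columns absent from the file are filled with defaults
```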
## Exporting to Parquet format {#exporting-to-parquet-format}
@@ -146,7 +146,7 @@ FORMAT Parquet
This will create the `export.parquet` file in a working directory.
## ClickHouse and Parquet data types {#clickhouse-and-parquet-data-types}
-ClickHouse and Parquet data types are mostly identical but still [differ a bit](/interfaces/formats.md/#data-types-matching-parquet). For example, ClickHouse will export `DateTime` type as a Parquets' `int64`. If we then import that back to ClickHouse, we're going to see numbers ([time.parquet file](assets/time.parquet)):
+ClickHouse and Parquet data types are mostly identical but still [differ a bit](/interfaces/formats/Parquet#data-types-matching-parquet). For example, ClickHouse will export `DateTime` type as a Parquets' `int64`. If we then import that back to ClickHouse, we're going to see numbers ([time.parquet file](assets/time.parquet)):
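Converting those numbers back on import might look like the sketch below; the `time` column name is an assumption about `time.parquet`:

```sql
SELECT toDateTime(time) AS time -- re-interpret the exported int64 as a DateTime
FROM file('time.parquet', Parquet);
```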
docs/integrations/data-ingestion/dbms/jdbc-with-clickhouse.md (1 addition, 1 deletion)
@@ -17,7 +17,7 @@ Using JDBC requires the ClickHouse JDBC bridge, so you will need to use `clickho
**Overview:** The <a href="https://github.com/ClickHouse/clickhouse-jdbc-bridge" target="_blank">ClickHouse JDBC Bridge</a> in combination with the [jdbc table function](/sql-reference/table-functions/jdbc.md) or the [JDBC table engine](/engines/table-engines/integrations/jdbc.md) allows ClickHouse to access data from any external data source for which a <a href="https://en.wikipedia.org/wiki/JDBC_driver" target="_blank">JDBC driver</a> is available:
-This is handy when there is no native built-in [integration engine](/engines/table-engines/index.md#integration-engines-integration-engines), table function, or external dictionary for the external data source available, but a JDBC driver for the data source exists.
+This is handy when there is no native built-in [integration engine](/engines/table-engines/integrations), table function, or external dictionary for the external data source available, but a JDBC driver for the data source exists.
You can use the ClickHouse JDBC Bridge for both reads and writes. And in parallel for multiple external data sources, e.g. you can run distributed queries on ClickHouse across multiple external and internal data sources in real time.
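For orientation, a read through the bridge via the `jdbc` table function might look like this sketch; the JDBC URL, schema, and table are hypothetical, and the bridge must already be running:

```sql
SELECT *
FROM jdbc('jdbc:postgresql://postgres-host:5432/mydb?user=app&password=secret', 'public', 'users')
LIMIT 10;
```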
docs/integrations/data-ingestion/insert-local-files.md (1 addition, 1 deletion)
@@ -37,7 +37,7 @@ ENGINE = MergeTree
ORDER BY toYYYYMMDD(timestamp)
```
-3. We want to lowercase the `author` column, which is easily done with the [`lower` function](/sql-reference/functions/string-functions/#lower-lcase). We also want to split the `comment` string into tokens and store the result in the `tokens` column, which can be done using the [`extractAll` function](/sql-reference/functions/string-search-functions/#extractallhaystack-pattern). You do all of this in one `clickhouse-client` command - notice how the `comments.tsv` file is piped into the `clickhouse-client` using the `<` operator:
+3. We want to lowercase the `author` column, which is easily done with the [`lower` function](/sql-reference/functions/string-functions#lower). We also want to split the `comment` string into tokens and store the result in the `tokens` column, which can be done using the [`extractAll` function](/sql-reference/functions/string-search-functions#extractall). You do all of this in one `clickhouse-client` command - notice how the `comments.tsv` file is piped into the `clickhouse-client` using the `<` operator:
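The transformation described there, sketched with the `input` table function; the column list is illustrative rather than the guide's exact schema:

```sql
INSERT INTO comments
SELECT
    id,
    timestamp,
    lower(author) AS author,
    comment,
    extractAll(comment, '\\w+') AS tokens -- split the comment into word tokens
FROM input('id UInt32, timestamp DateTime, author String, comment String')
FORMAT TSV
```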
docs/integrations/data-ingestion/kafka/confluent/kafka-connect-http.md (1 addition, 1 deletion)
@@ -137,7 +137,7 @@ The following additional parameters are relevant to using the HTTP Sink with Cli
* `ssl.enabled` - set to true if using SSL.
* `connection.user` - username for ClickHouse.
* `connection.password` - password for ClickHouse.
-* `batch.max.size` - The number of rows to send in a single batch. Ensure this set is to an appropriately large number. Per ClickHouse [recommendations](../../../../concepts/why-clickhouse-is-so-fast.md#performance-when-inserting-data) a value of 1000 is should be considered a minimum.
+* `batch.max.size` - The number of rows to send in a single batch. Ensure this set is to an appropriately large number. Per ClickHouse [recommendations](/sql-reference/statements/insert-into#performance-considerations) a value of 1000 should be considered a minimum.
* `tasks.max` - The HTTP Sink connector supports running one or more tasks. This can be used to increase performance. Along with batch size this represents your primary means of improving performance.
* `key.converter` - set according to the types of your keys.
* `value.converter` - set based on the type of data on your topic. This data does not need a schema. The format here must be consistent with the FORMAT specified in the parameter `http.api.url`. The simplest here is to use JSON and the org.apache.kafka.connect.json.JsonConverter converter. Treating the value as a string, via the converter org.apache.kafka.connect.storage.StringConverter, is also possible - although this will require the user to extract a value in the insert statement using functions. [Avro format](../../../../interfaces/formats.md#data-format-avro) is also supported in ClickHouse if using the io.confluent.connect.avro.AvroConverter converter.
docs/integrations/data-ingestion/kafka/kafka-connect-jdbc.md (1 addition, 1 deletion)
@@ -55,7 +55,7 @@ The following parameters are relevant to using the JDBC connector with ClickHous
* `_connection.url_` - this should take the form of `jdbc:clickhouse://<clickhouse host>:<clickhouse http port>/<target database>`
* `connection.user` - a user with write access to the target database
* `table.name.format`- ClickHouse table to insert data. This must exist.
-* `batch.size` - The number of rows to send in a single batch. Ensure this set is to an appropriately large number. Per ClickHouse [recommendations](../../../concepts/why-clickhouse-is-so-fast.md#performance-when-inserting-data) a value of 1000 should be considered a minimum.
+* `batch.size` - The number of rows to send in a single batch. Ensure this set is to an appropriately large number. Per ClickHouse [recommendations](/sql-reference/statements/insert-into#performance-considerations) a value of 1000 should be considered a minimum.
* `tasks.max` - The JDBC Sink connector supports running one or more tasks. This can be used to increase performance. Along with batch size this represents your primary means of improving performance.
* `value.converter.schemas.enable` - Set to false if using a schema registry, true if you embed your schemas in the messages.
* `value.converter` - Set according to your datatype e.g. for JSON, `io.confluent.connect.json.JsonSchemaConverter`.