tutorials/data-packs.md
+16 -8 (16 additions & 8 deletions)
@@ -1,14 +1,18 @@
-# Interactive Date Picker - Custom Data
+# Introduction to ICU4X - Data packs
+
+If you're happy shipping your app with the recommended set of locales included in `ICU4X`, you can stop reading now. If you want to include additional locales, do runtime data loading, or build your own complex data pipelines, this tutorial is for you.
 
 In this tutorial, we will add additional locale data to your app. ICU4X compiled data contains data for hundreds of languages, but some languages that have data in CLDR are not included (generally because they don't have comprehensive coverage). For example, if you try using the locale `ccp` (Chakma) in your app, you will get output like `2023 M11 7`. Believe it or not, this is not actually correct output for Chakma. Instead, ICU4X fell back to the "root locale", which tries to be as neutral as possible. Note how it avoided calling the month by name by using `M11`, even though we requested a format with a non-numeric month name.
 
 So, let's add some data for Chakma.
 
-## 1. Installing `icu4x-datagen`
+## 1. Prerequisites
+
+This tutorial assumes you have finished the [introductory tutorial](quickstart.md) and continues where that tutorial left off. In particular, you should still have the latest version of your code.
 
 Data generation is done using the `icu4x-datagen` tool, which pulls data from [Unicode's *Common Locale Data Repository* (*CLDR*)](http://cldr.unicode.org/index/downloads) and from `ICU4C` releases.
 
-Verify that Rust is installed. If it's not, you can install it in a few seconds from [https://rustup.rs/](https://rustup.rs/).
+Verify that Rust is installed (even if you're following the JavaScript tutorial). If it's not, you can install it in a few seconds from [https://rustup.rs/](https://rustup.rs/).
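The root-locale fallback described in the introduction of this hunk can be pictured as walking from the requested locale to progressively less specific ones until data is found. The following is a minimal, standard-library-only sketch of that idea. `fallback_chain` and `resolve` are hypothetical helpers written for illustration, not ICU4X APIs, and the fallback algorithm that ICU4X actually uses is more involved than simply dropping subtags.

```rust
/// Hypothetical helper (not an ICU4X API): returns a simplified fallback
/// chain for a hyphen-separated locale identifier, ending in "und" (root).
fn fallback_chain(locale: &str) -> Vec<String> {
    let mut chain = Vec::new();
    let mut current = locale;
    while !current.is_empty() && current != "und" {
        chain.push(current.to_string());
        // Drop the last subtag, e.g. "en-US" -> "en".
        current = match current.rfind('-') {
            Some(idx) => &current[..idx],
            None => "",
        };
    }
    // The root locale is always the last resort.
    chain.push("und".to_string());
    chain
}

/// Hypothetical helper: picks the first locale in the chain that has data,
/// or the root locale if none of the candidates is available.
fn resolve(requested: &str, available: &[&str]) -> String {
    fallback_chain(requested)
        .into_iter()
        .find(|candidate| available.iter().any(|a| *a == candidate.as_str()))
        .unwrap_or_else(|| "und".to_string())
}
```

With only `en` and `ja` available, `resolve("ccp", &["en", "ja"])` yields `und`, which corresponds to the neutral `2023 M11 7` output shown above, while `resolve("en-US", &["en", "ja"])` yields `en`.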
This will generate a `ccp.blob` file containing data for Chakma.
 
+`icu4x-datagen` has many options, some of which we'll discover below. The default options should work for most purposes, but check out `icu4x-datagen --help` to learn more about fine-tuning your data.
+
 💡 Note: if you're having technical difficulties, this file is available [here](https://storage.googleapis.com/static-493776/icu4x_2023-11-03/ccp.blob).
 
 
 ## 3. Using the data pack
 
-### Rust Part 3
+<details>
+<summary>Rust</summary>
 
 To use blob data, we will need to add the `icu_provider_blob` crate to our project:
 
@@ -54,11 +61,9 @@ Now, update the instantiation of the datetime formatter to load data from the bl
 locale is Chakma:
 
 ```rust
-// At the top of the file:
 use icu::locale::locale;
 use icu_provider_blob::BlobDataProvider;
 
-// replace the date_formatter creation
 let date_formatter = if locale == locale!("ccp") {
     println!("Using buffer provider");
 
@@ -78,9 +83,10 @@ let date_formatter = if locale == locale!("ccp") {
 };
 ```
 
-Try using `ccp` now!
+</details>
 
-### JavaScript Part 3
+<details>
+<summary>JavaScript</summary>
 
 Update the formatting logic to load data from the blob if the locale is Chakma. Note that this code uses a callback, as it does an HTTP request:
-This tutorial introduces data providers as well as the `icu4x-datagen` tool.
+# Introduction to ICU4X - Data slimming
 
 If you're happy shipping your app with the recommended set of locales included in `ICU4X`, you can stop reading now. If you want to reduce code size, do runtime data loading, or build your own complex data pipelines, this tutorial is for you.
 
+In this tutorial, we will remove unneeded locale data from our app. ICU4X compiled data contains data for hundreds of languages, but not all locales might be required at runtime. Usually there is a fixed set that a user can choose from, which in our example is going to be Japanese and English (`ja` and `en`).
+
 ## 1. Prerequisites
 
-This tutorial assumes you have finished the [introductory tutorial](quickstart.md) and continues where that tutorial left off. In particular, you should still have the latest version of code for `myapp`.
+This tutorial assumes you have finished the [introductory tutorial](quickstart.md) and continues where that tutorial left off. In particular, you should still have the latest version of your code.
+
+Data generation is done using the `icu4x-datagen` tool, which pulls data from [Unicode's *Common Locale Data Repository* (*CLDR*)](http://cldr.unicode.org/index/downloads) and from `ICU4C` releases.
 
-## 2. Generating data
+Verify that Rust is installed (even if you're following the JavaScript tutorial). If it's not, you can install it in a few seconds from [https://rustup.rs/](https://rustup.rs/).
 
-Data generation is done using the `icu4x-datagen` tool, which pulls in data from [Unicode's *Common Locale Data Repository* (*CLDR*)](http://cldr.unicode.org/index/downloads) and from `ICU4C` releases to generate `ICU4X` data.
+```console
+cargo --version
+# cargo 1.86.0 (adf9b6ad1 2025-02-28)
+```
 
-First we will need to install the binary:
+Now you can run:
 
 ```console
 cargo install icu4x-datagen
 ```
 
25
21
-
Get a coffee, this might take a while ☕.
26
+
## 2. Generating custom data
22
27
23
28
Once installed, run:
24
29
25
30
```console
26
-
icu4x-datagen --markers all --locales ja --format baked --pretty --out my_data
31
+
icu4x-datagen --markers all --locales ja en --format baked --pretty --out my_data
27
32
```
28
33
29
-
This will generate a `my_data` directory containing the data for all components in the `ja`locale.
34
+
This will generate a `my_data` directory containing the data for all components in the `ja`and `en` locales.
30
35
31
36
`icu4x-datagen` has many options, some of which we'll discover below. The default options should work for most purposes, but check out `icu4x-datagen --help` to learn more about fine-tuning your data.
32
37
33
-
### Should you check in data to your repository?
38
+
<details>
39
+
<summary>Aside: Should you check in data to your repository?</summary>
34
40
35
41
You can check in the generated data to your version control system, or you can add it to a build script. There are pros and cons of both approaches.
36
42
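As an illustration of the build-script option mentioned above, here is a minimal sketch that shells out to `icu4x-datagen` with the same flags used in section 2. It assumes the tool is already installed and on `PATH`. `datagen_args` and `run_datagen` are hypothetical helper names, and a real build script might prefer the datagen Rust API over spawning a process.

```rust
use std::process::Command;

/// Hypothetical helper: builds the argument list for `icu4x-datagen`,
/// mirroring the command shown in section 2 of the tutorial.
fn datagen_args(locales: &[&str], out_dir: &str) -> Vec<String> {
    let mut args = vec![
        "--markers".to_string(),
        "all".to_string(),
        "--locales".to_string(),
    ];
    // One argument per requested locale, e.g. "ja" and "en".
    args.extend(locales.iter().map(|l| l.to_string()));
    args.extend(
        ["--format", "baked", "--pretty", "--out", out_dir]
            .iter()
            .map(|s| s.to_string()),
    );
    args
}

/// Hypothetical helper: runs the tool and reports a readable error if it
/// is missing from `PATH` or exits unsuccessfully.
fn run_datagen(locales: &[&str], out_dir: &str) -> Result<(), String> {
    let status = Command::new("icu4x-datagen")
        .args(datagen_args(locales, out_dir))
        .status()
        .map_err(|e| format!("could not start icu4x-datagen: {e}"))?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("icu4x-datagen exited with {status}"))
    }
}
```

A `build.rs` could call `run_datagen(&["ja", "en"], "my_data")` from its `main` function and fail the build when it returns an error.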
@@ -46,8 +52,12 @@ You should generate it automatically at build time if:
 
 If you check in the generated data, it is recommended that you configure a job in continuous integration that verifies that the data in your repository reflects the latest CLDR/Unicode releases; otherwise, your app may drift out of date.
 
+</details>
+
 ## 3. Using the generated data
 
+Note: this section is currently only possible in Rust. 🤷
+
 Once we have generated the data, we need to instruct `ICU4X` to use it. To do this, set the `ICU4X_DATA_DIR` environment variable during the compilation of your app:
 
 ```console
@@ -79,41 +89,26 @@ Because of these two data provider types, every `ICU4X` API has three constructo
 
 ## 5. Using the generated data explicitly
 
+Note: this section is currently only possible in Rust. 🤷
+
 The data we generated in section 2 is actually just Rust code defining `DataProvider` implementations for all markers using hardcoded data (go take a look!).
 
 So far we've used it through the default `try_new` constructor by using the environment variable to replace the built-in data. However, we can also directly access the `DataProvider` implementations if we want, for example to combine it with other providers.
 
 We include the generated code with the `include!` macro. The `impl_data_provider!` macro adds the generated implementations to any type.
 
-```rust,compile_fail
-extern crate alloc; // required as my-data is written for #[no_std]
-use icu::locale::{locale, Locale};
-use icu::calendar::Date;
-use icu::datetime::{DateTimeFormatter, fieldsets::YMD};
-
-const LOCALE: Locale = locale!("ja");
+Replace your `date_formatter` construction with the following code:
 
-struct MyDataProvider;
+```rust,compile_fail
+extern crate alloc; // required as my_data is written for #[no_std]
 include!("../my_data/mod.rs");
+struct MyDataProvider;
 impl_data_provider!(MyDataProvider);
 
-fn main() {
-    let baked_provider = MyDataProvider;
-
-    let dtf = DateTimeFormatter::try_new_unstable(
-        &baked_provider,
-        LOCALE.into(),
-        YMD::long()
-    )
-    .expect("ja data should be available");
-
-    let date = Date::try_new_iso(2020, 10, 14)
-        .expect("date should be valid");
-
-    let formatted_date = dtf.format(&date);
-
-    println!("📅: {}", formatted_date);
-}
+// Create and use an ICU4X date formatter:
+let date_formatter = DateTimeFormatter::try_new_unstable(MyDataProvider, locale.into(), YMDT::medium())
The `impl_data_provider!` code will require additional crates; see its documentation for a list.
@@ -152,52 +147,53 @@ This will generate a `my_data_blob.postcard` file containing the serialized data
 
 ### Locale Fallbacking
 
+<details>
+<summary>Rust</summary>
+
 Unlike `BakedDataProvider`, `BlobDataProvider` (and `FsDataProvider`) does not perform locale fallbacking. For example, if `en-US` is requested but only `en` data is available, then the data request will fail. To enable fallback, we can wrap the provider in a `LocaleFallbackProvider`.
 
 Note that fallback comes at a cost, as fallbacking code and data have to be included and executed on every request. If you don't need fallback (disclaimer: you probably do), you can use the `BlobDataProvider` directly (for baked data, see [`Options::skip_internal_fallback`](https://docs.rs/icu_provider_baked/latest/icu_provider_baked/export/struct.Options.html)).
 
 We can then use the provider in our code:
 
 ```rust,no_run
-use icu::locale::{locale, Locale, fallback::LocaleFallbacker};
-use icu::calendar::Date;
-use icu::datetime::{DateTimeFormatter, fieldsets::YMD};
+use icu::locale::fallback::LocaleFallbacker;
 use icu_provider_adapters::fallback::LocaleFallbackProvider;
 use icu_provider_blob::BlobDataProvider;
 
-const LOCALE: Locale = locale!("ja");
+let blob = std::fs::read("my_data_blob.postcard").expect("Failed to read file");
let dtf = DateTimeFormatter::try_new_with_buffer_provider(
-        &buffer_provider,
-        LOCALE.into(),
-        YMD::long()
-    )
-    .expect("blob should contain required markers and `ja` data");
+As you can see in the second `expect` message, it's not possible to statically tell whether the correct data markers are included. While `BakedDataProvider` would result in a compile error for missing `DataProvider<M>` implementations, `BlobDataProvider` returns runtime errors if markers are missing.
 
-    let date = Date::try_new_iso(2020, 10, 14)
-        .expect("date should be valid");
+</details>
 
-    let formatted_date = dtf.format(&date);
+<details>
+<summary>JavaScript</summary>
 
-    println!("📅: {}", formatted_date);
-}
-```
+TODO
+
+</details>
 
-As you can see in the second `expect` message, it's not possible to statically tell whether the correct data markers are included. While `BakedDataProvider` would result in a compile error for missing `DataProvider<M>` implementations, `BlobDataProvider` returns runtime errors if markers are missing.
 
 ## 7. Data slicing
 
+Note: this section is currently only possible in Rust. 🤷
+
 You might have noticed that the blob we generated is a hefty 5MB. This is no surprise, as we used `--markers all`. However, our binary only uses date formatting data in Japanese. There's room for optimization:
 
 ```console
@@ -211,38 +207,13 @@ But there is more to optimize. You might have noticed this in the output of the
 We can instead use `FixedCalendarDateTimeFormatter<Gregorian>`, which only supports formatting `Date<Gregorian>`s:
 
 ```rust,no_run
-use icu::locale::{locale, Locale, fallback::LocaleFallbacker};
-use icu::calendar::{Date, Gregorian};
-use icu::datetime::{FixedCalendarDateTimeFormatter, fieldsets::YMD};
-use icu_provider_adapters::fallback::LocaleFallbackProvider;
-use icu_provider_blob::BlobDataProvider;
-
-const LOCALE: Locale = locale!("ja");
-
-fn main() {
-    let blob = std::fs::read("my_data_blob.postcard").expect("Failed to read file");
@@ -260,6 +231,6 @@ These API-level optimizations also apply to compiled data (there's no need to us
 
 We have learned how to generate data, load it into our programs, and optimize its size, and we have gotten to know the different data providers that are part of `ICU4X`.
 
-For a deeper dive into configuring your data providers in code, see [data-provider-runtime.md].
+For a deeper dive into configuring your data providers in code, see [the runtime data provider tutorial](data-provider-runtime.md).
 
 You can learn more about datagen, including the Rust API which we have not used in this tutorial, by reading [the docs](https://docs.rs/icu_provider_export/latest/).