@@ -17,9 +17,17 @@ The input string can be a file path or URL.

## Read from CSV

+ Before you can read data from CSV, make sure you have the following dependency:
+
+ ```kotlin
+ implementation("org.jetbrains.kotlinx:dataframe-csv:$dataframe_version")
+ ```
+
+ It's included by default if you have `org.jetbrains.kotlinx:dataframe:$dataframe_version` already.
+
To read a CSV file, use the `.readCsv()` function.

- Since DataFrame v0.15, a new experimental CSV integration is available.
+ Since DataFrame v0.15, this new CSV integration is available.
It is faster and more flexible than the old one, now being based on
[Deephaven CSV](https://github.com/deephaven/deephaven-csv).

@@ -43,6 +51,21 @@ import java.net.URL
DataFrame.readCsv(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
```

+ Zip and GZip files are supported as well.
+
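+ For instance, assuming the compression is inferred from the file extension (a sketch with a hypothetical file name):
+
+ ```kotlin
+ // Read a gzipped CSV directly; the .gz extension is assumed to select GZip decompression.
+ val dfFromGz = DataFrame.readCsv("data.csv.gz")
+ ```
+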
+ To read CSV from `String`:
+
+ ```kotlin
+ val csv = """
+     A,B,C,D
+     12,tuv,0.12,true
+     41,xyz,3.6,not assigned
+     89,abc,7.1,false
+ """.trimIndent()
+
+ DataFrame.readCsvStr(csv)
+ ```
+
### Specify delimiter

By default, CSV files are parsed using `,` as the delimiter. To specify a custom delimiter, use the `delimiter` argument:
@@ -60,9 +83,19 @@ val df = DataFrame.readCsv(

<!---END-->

+ Aside from the delimiter, there are many other parameters you can change.
+ These include the header, the number of rows to skip, the number of rows to read, the quote character, and more.
+ Check out the KDocs for more information.
+
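+ As an illustration only (a sketch; the exact parameter names may differ between versions, so verify against the KDocs):
+
+ ```kotlin
+ val df = DataFrame.readCsv(
+     "data.csv",      // hypothetical file
+     skipLines = 1,   // skip one leading line before the header
+     readLines = 100, // read at most 100 rows of data
+     quote = '\'',    // values are quoted with single quotes instead of double quotes
+ )
+ ```
+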
### Column type inference from CSV

- Column types are inferred from the CSV data. Suppose that the CSV from the previous
+ Column types are inferred from the CSV data.
+
+ We rely on the fast implementation of [Deephaven CSV](https://github.com/deephaven/deephaven-csv) for inferring and
+ parsing to (nullable) `Int`, `Long`, `Double`, and `Boolean` types.
+ For other types we fall back to [the parse operation](parse.md).
+
+ Suppose that the CSV from the previous
example had the following content:

<table>
@@ -81,15 +114,15 @@ C: Double
D: Boolean?
```

- [`DataFrame`](DataFrame.md) tries to parse columns as JSON, so when reading the following table with JSON object in column D:
+ [`DataFrame`](DataFrame.md) can [parse](parse.md) columns as JSON too, so when reading the following table with a JSON object in column D:

<table>
<tr><th>A</th><th>D</th></tr>
<tr><td>12</td><td>{"B":2,"C":3}</td></tr>
<tr><td>41</td><td>{"B":3,"C":2}</td></tr>
</table>

- We get this data schema where D is [`ColumnGroup`](DataColumn.md#columngroup) with 2 children columns:
+ We get this data schema where D is [`ColumnGroup`](DataColumn.md#columngroup) with two nested columns:

```text
A: Int
@@ -123,10 +156,10 @@ Sometimes columns in your CSV can be interpreted differently depending on your s
<tr><td>41,111</td></tr>
</table>

- Here a comma can be decimal or thousands separator, thus different values.
- You can deal with it in two ways:
+ Here a comma can be either a decimal separator or a thousands separator, and thus yield different values.
+ You can deal with it in multiple ways, for instance:

- 1) Provide locale as a parser option
+ 1) Provide a locale as a parser option

<!---FUN readNumbersWithSpecificLocale-->

@@ -168,23 +201,26 @@ columns like this may be recognized as simple `String` values rather than actual

You can fix this whenever you [parse](parse.md) a string-based column (e.g., using [`DataFrame.readCsv()`](read.md#read-from-csv),
[`DataFrame.readTsv()`](read.md#read-from-csv), or [`DataColumn<String>.convertTo<>()`](convert.md)) by providing
- a custom date-time pattern. There are two ways to do this:
+ a custom date-time pattern.
+
+ There are two ways to do this:

1) By providing the date-time pattern as a raw string to the `ParserOptions` argument:

- <!---FUN readNumbersWithSpecificDateTimePattern-->
+ <!---FUN readDatesWithSpecificDateTimePattern-->

```kotlin
val df = DataFrame.readCsv(
    file,
    parserOptions = ParserOptions(dateTimePattern = "dd/MMM/yy h:mm a")
)
```
+
<!---END-->

2) By providing a `DateTimeFormatter` to the `ParserOptions` argument:

- <!---FUN readNumbersWithSpecificDateTimeFormatter-->
+ <!---FUN readDatesWithSpecificDateTimeFormatter-->

```kotlin
val df = DataFrame.readCsv(
@@ -204,6 +240,50 @@ The result will be a dataframe with properly parsed `DateTime` columns.
>
> For more details on the parse operation, see the [`parse operation`](parse.md).

+ ### Provide a default type for all columns
+
+ While you can provide a `ColType` per column, you might not
+ always know how many columns there are or what their names are.
+ In such cases, you can disable type inference by providing
+ a single default type for all columns:
+
+ <!---FUN readDatesWithDefaultType-->
+
+ ```kotlin
+ val df = DataFrame.readCsv(
+     file,
+     colTypes = mapOf(ColType.DEFAULT to ColType.String),
+ )
+ ```
+
+ <!---END-->
+
+ This default can be combined with specific types for other columns as well.
+
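+ For example (a sketch; the "age" column name is hypothetical):
+
+ ```kotlin
+ val df = DataFrame.readCsv(
+     file,
+     colTypes = mapOf(
+         ColType.DEFAULT to ColType.String, // every column not listed here stays a String
+         "age" to ColType.Int,              // except "age", which is read as Int
+     ),
+ )
+ ```
+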
+ ### Unlocking Deephaven CSV features
+
+ For each group of functions (`readCsv`, `readDelim`, `readTsv`, etc.)
+ we provide one overload which has the `adjustCsvSpecs` parameter.
+ This is an advanced option because it exposes the
+ [CsvSpecs.Builder](https://github.com/deephaven/deephaven-csv/blob/main/src/main/java/io/deephaven/csv/CsvSpecs.java)
+ of the underlying Deephaven implementation.
+ Generally, we don't recommend using this feature unless there's no other way to achieve your goal.
+
+ For example, to enable the (unconfigurable but) very fast [ISO DateTime Parser of Deephaven CSV](https://medium.com/@deephavendatalabs/a-high-performance-csv-reader-with-type-inference-4bf2e4baf2d1):
+
+ <!---FUN readDatesWithDeephavenDateTimeParser-->
+
+ ```kotlin
+ val df = DataFrame.readCsv(
+     inputStream = file.openStream(),
+     adjustCsvSpecs = { // it: CsvSpecs.Builder
+         it.putParserForName("date", Parsers.DATETIME)
+     },
+ )
+ ```
+
+ <!---END-->
+

## Read from JSON

To read a JSON file, use the `.readJson()` function. JSON files can be read from a file or a URL.
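+ For example (file name hypothetical):
+
+ ```kotlin
+ val df = DataFrame.readJson("input.json")
+ ```
+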
@@ -434,6 +514,8 @@ Before you can read data from Excel, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
```

+ It's included by default if you have `org.jetbrains.kotlinx:dataframe:$dataframe_version` already.
+
To read an Excel spreadsheet, use the `.readExcel()` function. Excel spreadsheets can be read from a file or a URL. Supported
Excel spreadsheet formats are: xls, xlsx.

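+ For example (file name and sheet name hypothetical):
+
+ ```kotlin
+ val df = DataFrame.readExcel("input.xlsx", sheetName = "Sheet1")
+ ```
+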
@@ -484,6 +566,8 @@ Before you can read data from Apache Arrow format, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
```

+ It's included by default if you have `org.jetbrains.kotlinx:dataframe:$dataframe_version` already.
+
To read Apache Arrow formats, use the `.readArrowFeather()` function:

<!---FUN readArrowFeather-->