
Commit b57ec5b

committed: updating csv docs and examples

1 parent a6f9758

File tree

3 files changed: +147 -10 lines

  • docs/StardustDocs/topics
  • tests/src/test


docs/StardustDocs/topics/read.md

+94 -10
@@ -17,9 +17,17 @@ The input string can be a file path or URL.

## Read from CSV

+Before you can read data from CSV, make sure you have the following dependency:
+
+```kotlin
+implementation("org.jetbrains.kotlinx:dataframe-csv:$dataframe_version")
+```
+
+It's included by default if you have `org.jetbrains.kotlinx:dataframe:$dataframe_version` already.
+
To read a CSV file, use the `.readCsv()` function.

-Since DataFrame v0.15, a new experimental CSV integration is available.
+Since DataFrame v0.15, this new CSV integration is available.
It is faster and more flexible than the old one, now being based on
[Deephaven CSV](https://github.com/deephaven/deephaven-csv).

@@ -43,6 +51,21 @@ import java.net.URL
DataFrame.readCsv(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
```

+Zip and GZip files are supported as well.
+
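As an editorial aside (not part of this commit): reading a compressed file looks the same as reading a plain one. A minimal sketch, assuming the path-based overload detects compression from the `.gz` extension; the file name is hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readCsv

// Hypothetical file; compression is assumed to be picked up from the extension.
val df = DataFrame.readCsv("data/jetbrains_repositories.csv.gz")
```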
+To read CSV from `String`:
+
+```kotlin
+val csv = """
+A,B,C,D
+12,tuv,0.12,true
+41,xyz,3.6,not assigned
+89,abc,7.1,false
+""".trimIndent()
+
+DataFrame.readCsvStr(csv)
+```
+
### Specify delimiter

By default, CSV files are parsed using `,` as the delimiter. To specify a custom delimiter, use the `delimiter` argument:
@@ -60,9 +83,19 @@ val df = DataFrame.readCsv(

<!---END-->

+Aside from the delimiter, there are many other parameters to change.
+These include the header, the number of rows to skip, the number of rows to read, the quote character, and more.
+Check out the KDocs for more information.
+
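As a quick illustration of those parameters (not part of this commit): a hedged sketch in which the parameter names `header`, `skipLines`, `readLines`, and `quote` are assumed from the KDocs mentioned above, and `"people.csv"` with its column names is hypothetical.

```kotlin
val df = DataFrame.readCsv(
    "people.csv",
    header = listOf("name", "age"), // assumed: supplies column names when the file has no header row
    skipLines = 1,                  // assumed: skips a leading line before parsing starts
    readLines = 1000,               // assumed: reads at most 1000 data rows
    quote = '\'',                   // assumed: uses single quotes as the quote character
)
```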
### Column type inference from CSV

-Column types are inferred from the CSV data. Suppose that the CSV from the previous
+Column types are inferred from the CSV data.
+
+We rely on the fast implementation of [Deephaven CSV](https://github.com/deephaven/deephaven-csv) for inferring and
+parsing to (nullable) `Int`, `Long`, `Double`, and `Boolean` types.
+For other types we fall back to [the parse operation](parse.md).
+
+Suppose that the CSV from the previous
example had the following content:

<table>
@@ -81,15 +114,15 @@ C: Double
D: Boolean?
```

-[`DataFrame`](DataFrame.md) tries to parse columns as JSON, so when reading the following table with JSON object in column D:
+[`DataFrame`](DataFrame.md) can [parse](parse.md) columns as JSON too, so when reading the following table with a JSON object in column D:

<table>
<tr><th>A</th><th>D</th></tr>
<tr><td>12</td><td>{"B":2,"C":3}</td></tr>
<tr><td>41</td><td>{"B":3,"C":2}</td></tr>
</table>

-We get this data schema where D is [`ColumnGroup`](DataColumn.md#columngroup) with 2 children columns:
+We get this data schema where D is [`ColumnGroup`](DataColumn.md#columngroup) with two nested columns:

```text
A: Int
@@ -123,10 +156,10 @@ Sometimes columns in your CSV can be interpreted differently depending on your s
<tr><td>41,111</td></tr>
</table>

-Here a comma can be decimal or thousands separator, thus different values.
-You can deal with it in two ways:
+Here a comma can be either a decimal or a thousands separator, and thus yield different values.
+You can deal with it in multiple ways, for instance:

-1) Provide locale as a parser option
+1) Provide locale as parser option

<!---FUN readNumbersWithSpecificLocale-->
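The `readNumbersWithSpecificLocale` sample itself lies outside this hunk. As a rough sketch only (assuming `ParserOptions` exposes a `locale` parameter; German is picked purely as an example of a comma-decimal locale, and `file` refers to the same value as in the surrounding samples):

```kotlin
// Sketch, not the actual sample: parse numbers using German number formatting.
val df = DataFrame.readCsv(
    file,
    parserOptions = ParserOptions(locale = Locale.GERMAN),
)
```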

@@ -168,23 +201,26 @@ columns like this may be recognized as simple `String` values rather than actual

You can fix this whenever you [parse](parse.md) a string-based column (e.g., using [`DataFrame.readCsv()`](read.md#read-from-csv),
[`DataFrame.readTsv()`](read.md#read-from-csv), or [`DataColumn<String>.convertTo<>()`](convert.md)) by providing
-a custom date-time pattern. There are two ways to do this:
+a custom date-time pattern.
+
+There are two ways to do this:

1) By providing the date-time pattern as raw string to the `ParserOptions` argument:

-<!---FUN readNumbersWithSpecificDateTimePattern-->
+<!---FUN readDatesWithSpecificDateTimePattern-->

```kotlin
val df = DataFrame.readCsv(
    file,
    parserOptions = ParserOptions(dateTimePattern = "dd/MMM/yy h:mm a")
)
```
+
<!---END-->

2) By providing a `DateTimeFormatter` to the `ParserOptions` argument:

-<!---FUN readNumbersWithSpecificDateTimeFormatter-->
+<!---FUN readDatesWithSpecificDateTimeFormatter-->

```kotlin
val df = DataFrame.readCsv(
@@ -204,6 +240,50 @@ The result will be a dataframe with properly parsed `DateTime` columns.
>
> For more details on the parse operation, see the [`parse operation`](parse.md).

+### Provide a default type for all columns
+
+While you can provide a `ColType` per column, you might not
+always know how many columns there are or what their names are.
+In such cases, you can disable type inference for all columns
+by providing a default column type:
+
+<!---FUN readDatesWithDefaultType-->
+
+```kotlin
+val df = DataFrame.readCsv(
+    file,
+    colTypes = mapOf(ColType.DEFAULT to ColType.String),
+)
+```
+
+<!---END-->
+
+This default can be combined with specific types for other columns as well.
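For example (a sketch, not part of the commit): assuming a hypothetical `age` column that should still get a concrete type while every other column stays `String`.

```kotlin
val df = DataFrame.readCsv(
    file,
    colTypes = mapOf(
        ColType.DEFAULT to ColType.String, // fallback for every column not listed explicitly
        "age" to ColType.Int,              // hypothetical column with a specific type
    ),
)
```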
+
+### Unlocking Deephaven CSV features
+
+For each group of functions (`readCsv`, `readDelim`, `readTsv`, etc.)
+we provide one overload which has the `adjustCsvSpecs` parameter.
+This is an advanced option because it exposes the
+[CsvSpecs.Builder](https://github.com/deephaven/deephaven-csv/blob/main/src/main/java/io/deephaven/csv/CsvSpecs.java)
+of the underlying Deephaven implementation.
+Generally, we don't recommend using this feature unless there's no other way to achieve your goal.
+
+For example, to enable the (unconfigurable but) very fast [ISO DateTime Parser of Deephaven CSV](https://medium.com/@deephavendatalabs/a-high-performance-csv-reader-with-type-inference-4bf2e4baf2d1):
+
+<!---FUN readDatesWithDeephavenDateTimeParser-->
+
+```kotlin
+val df = DataFrame.readCsv(
+    inputStream = file.openStream(),
+    adjustCsvSpecs = { // it: CsvSpecs.Builder
+        it.putParserForName("date", Parsers.DATETIME)
+    },
+)
+```
+
+<!---END-->
+
## Read from JSON

To read a JSON file, use the `.readJson()` function. JSON files can be read from a file or a URL.
@@ -434,6 +514,8 @@ Before you can read data from Excel, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
```

+It's included by default if you have `org.jetbrains.kotlinx:dataframe:$dataframe_version` already.
+
To read an Excel spreadsheet, use the `.readExcel()` function. Excel spreadsheets can be read from a file or a URL. Supported
Excel spreadsheet formats are: xls, xlsx.
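The `readExcel` samples live further down in the unchanged part of the document. A minimal sketch with a hypothetical file and sheet name, assuming the optional `sheetName` parameter:

```kotlin
// Hypothetical workbook; reads the sheet named "Q1" into a DataFrame.
val df = DataFrame.readExcel("sales.xlsx", sheetName = "Q1")
```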

@@ -484,6 +566,8 @@ Before you can read data from Apache Arrow format, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
```

+It's included by default if you have `org.jetbrains.kotlinx:dataframe:$dataframe_version` already.
+
To read Apache Arrow formats, use the `.readArrowFeather()` function:

<!---FUN readArrowFeather-->

tests/src/test/kotlin/org/jetbrains/kotlinx/dataframe/samples/api/Read.kt

+50
@@ -2,6 +2,7 @@

package org.jetbrains.kotlinx.dataframe.samples.api

+import io.deephaven.csv.parsers.Parsers
import io.kotest.matchers.shouldBe
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.DataRow
@@ -19,6 +20,7 @@ import org.jetbrains.kotlinx.dataframe.testCsv
import org.jetbrains.kotlinx.dataframe.testJson
import org.junit.Ignore
import org.junit.Test
+import java.time.format.DateTimeFormatter
import java.util.Locale
import kotlin.reflect.typeOf

@@ -102,4 +104,52 @@ class Read {
        )
        // SampleEnd
    }
+
+    @Test
+    fun readDatesWithSpecificDateTimePattern() {
+        val file = testCsv("dates")
+        // SampleStart
+        val df = DataFrame.readCsv(
+            file,
+            parserOptions = ParserOptions(dateTimePattern = "dd/MMM/yy h:mm a")
+        )
+        // SampleEnd
+    }
+
+    @Test
+    fun readDatesWithSpecificDateTimeFormatter() {
+        val file = testCsv("dates")
+        // SampleStart
+        val df = DataFrame.readCsv(
+            file,
+            parserOptions = ParserOptions(dateTimeFormatter = DateTimeFormatter.ofPattern("dd/MMM/yy h:mm a"))
+        )
+        // SampleEnd
+    }
+
+    @Test
+    fun readDatesWithDefaultType() {
+        val file = testCsv("dates")
+        // SampleStart
+        val df = DataFrame.readCsv(
+            file,
+            colTypes = mapOf(ColType.DEFAULT to ColType.String),
+        )
+        // SampleEnd
+    }
+
+    @Test
+    fun readDatesWithDeephavenDateTimeParser() {
+        val file = testCsv("dates")
+        try {
+            // SampleStart
+            val df = DataFrame.readCsv(
+                inputStream = file.openStream(),
+                adjustCsvSpecs = { // it: CsvSpecs.Builder
+                    it.putParserForName("date", Parsers.DATETIME)
+                },
+            )
+            // SampleEnd
+        } catch (_: Exception) {}
+    }
}

tests/src/test/resources/dates.csv

+3
@@ -0,0 +1,3 @@
+date
+13/Jan/23 11:49 AM
+14/Mar/23 5:35 PM

0 commit comments
