
Commit b57ec5b

committed: updating csv docs and examples

1 parent a6f9758

File tree

3 files changed: +147 -10 lines

  • docs/StardustDocs/topics
  • tests/src/test


docs/StardustDocs/topics/read.md

+94 -10
@@ -17,9 +17,17 @@ The input string can be a file path or URL.

## Read from CSV

+Before you can read data from CSV, make sure you have the following dependency:
+
+```kotlin
+implementation("org.jetbrains.kotlinx:dataframe-csv:$dataframe_version")
+```
+
+It's included by default if you have `org.jetbrains.kotlinx:dataframe:$dataframe_version` already.
+
To read a CSV file, use the `.readCsv()` function.

-Since DataFrame v0.15, a new experimental CSV integration is available.
+Since DataFrame v0.15, this new CSV integration is available.
It is faster and more flexible than the old one, now being based on
[Deephaven CSV](https://github.com/deephaven/deephaven-csv).

@@ -43,6 +51,21 @@ import java.net.URL
DataFrame.readCsv(URL("https://raw.githubusercontent.com/Kotlin/dataframe/master/data/jetbrains_repositories.csv"))
```

+Zip and GZip files are supported as well.
+
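As an editorial aside (not part of this commit): reading a compressed file looks the same as reading a plain one. A minimal sketch, assuming the path-based overload detects compression from the `.gz` extension; the file name is hypothetical.

```kotlin
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.io.readCsv

// Hypothetical file; compression is assumed to be picked up from the extension.
val df = DataFrame.readCsv("data/jetbrains_repositories.csv.gz")
```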
+To read CSV from `String`:
+
+```kotlin
+val csv = """
+A,B,C,D
+12,tuv,0.12,true
+41,xyz,3.6,not assigned
+89,abc,7.1,false
+""".trimIndent()
+
+DataFrame.readCsvStr(csv)
+```
+
### Specify delimiter

By default, CSV files are parsed using `,` as the delimiter. To specify a custom delimiter, use the `delimiter` argument:
@@ -60,9 +83,19 @@ val df = DataFrame.readCsv(

<!---END-->

+Aside from the delimiter, there are many other parameters to change.
+These include the header, the number of rows to skip, the number of rows to read, the quote character, and more.
+Check out the KDocs for more information.
+
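As a quick illustration of those parameters (not part of this commit): a hedged sketch in which the parameter names `header`, `skipLines`, `readLines`, and `quote` are assumed from the KDocs mentioned above, and `"people.csv"` with its column names is hypothetical.

```kotlin
val df = DataFrame.readCsv(
    "people.csv",
    header = listOf("name", "age"), // assumed: supplies column names when the file has no header row
    skipLines = 1,                  // assumed: skips a leading line before parsing starts
    readLines = 1000,               // assumed: reads at most 1000 data rows
    quote = '\'',                   // assumed: uses single quotes as the quote character
)
```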
### Column type inference from CSV

-Column types are inferred from the CSV data. Suppose that the CSV from the previous
+Column types are inferred from the CSV data.
+
+We rely on the fast implementation of [Deephaven CSV](https://github.com/deephaven/deephaven-csv) for inferring and
+parsing to (nullable) `Int`, `Long`, `Double`, and `Boolean` types.
+For other types we fall back to [the parse operation](parse.md).
+
+Suppose that the CSV from the previous
example had the following content:

<table>
@@ -81,15 +114,15 @@ C: Double
D: Boolean?
```

-[`DataFrame`](DataFrame.md) tries to parse columns as JSON, so when reading the following table with JSON object in column D:
+[`DataFrame`](DataFrame.md) can [parse](parse.md) columns as JSON too, so when reading the following table with a JSON object in column D:

<table>
<tr><th>A</th><th>D</th></tr>
<tr><td>12</td><td>{"B":2,"C":3}</td></tr>
<tr><td>41</td><td>{"B":3,"C":2}</td></tr>
</table>

-We get this data schema where D is [`ColumnGroup`](DataColumn.md#columngroup) with 2 children columns:
+We get this data schema where D is [`ColumnGroup`](DataColumn.md#columngroup) with two nested columns:

```text
A: Int
@@ -123,10 +156,10 @@ Sometimes columns in your CSV can be interpreted differently depending on your s
<tr><td>41,111</td></tr>
</table>

-Here a comma can be decimal or thousands separator, thus different values.
-You can deal with it in two ways:
+Here a comma can be either a decimal or a thousands separator, and thus yield different values.
+You can deal with it in multiple ways, for instance:

-1) Provide locale as a parser option
+1) Provide locale as parser option

<!---FUN readNumbersWithSpecificLocale-->
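The `readNumbersWithSpecificLocale` sample itself lies outside this hunk. As a rough sketch only (assuming `ParserOptions` exposes a `locale` parameter; German is picked purely as an example of a comma-decimal locale, and `file` refers to the same value as in the surrounding samples):

```kotlin
// Sketch, not the actual sample: parse numbers using German number formatting.
val df = DataFrame.readCsv(
    file,
    parserOptions = ParserOptions(locale = Locale.GERMAN),
)
```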

@@ -168,23 +201,26 @@ columns like this may be recognized as simple `String` values rather than actual

You can fix this whenever you [parse](parse.md) a string-based column (e.g., using [`DataFrame.readCsv()`](read.md#read-from-csv),
[`DataFrame.readTsv()`](read.md#read-from-csv), or [`DataColumn<String>.convertTo<>()`](convert.md)) by providing
-a custom date-time pattern. There are two ways to do this:
+a custom date-time pattern.
+
+There are two ways to do this:

1) By providing the date-time pattern as raw string to the `ParserOptions` argument:

-<!---FUN readNumbersWithSpecificDateTimePattern-->
+<!---FUN readDatesWithSpecificDateTimePattern-->

```kotlin
val df = DataFrame.readCsv(
    file,
    parserOptions = ParserOptions(dateTimePattern = "dd/MMM/yy h:mm a")
)
```
+
<!---END-->

2) By providing a `DateTimeFormatter` to the `ParserOptions` argument:

-<!---FUN readNumbersWithSpecificDateTimeFormatter-->
+<!---FUN readDatesWithSpecificDateTimeFormatter-->

```kotlin
val df = DataFrame.readCsv(
@@ -204,6 +240,50 @@ The result will be a dataframe with properly parsed `DateTime` columns.
>
> For more details on the parse operation, see the [`parse operation`](parse.md).

+### Provide a default type for all columns
+
+While you can provide a `ColType` per column, you might not
+always know how many columns there are or what their names are.
+In such cases, you can disable type inference for all columns
+by providing a default column type:
+
+<!---FUN readDatesWithDefaultType-->
+
+```kotlin
+val df = DataFrame.readCsv(
+    file,
+    colTypes = mapOf(ColType.DEFAULT to ColType.String),
+)
+```
+
+<!---END-->
+
+This default can be combined with specific types for other columns as well.
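For example (a sketch, not part of the commit): assuming a hypothetical `age` column that should still get a concrete type while every other column stays `String`.

```kotlin
val df = DataFrame.readCsv(
    file,
    colTypes = mapOf(
        ColType.DEFAULT to ColType.String, // fallback for every column not listed explicitly
        "age" to ColType.Int,              // hypothetical column with a specific type
    ),
)
```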
+
+### Unlocking Deephaven CSV features
+
+For each group of functions (`readCsv`, `readDelim`, `readTsv`, etc.)
+we provide one overload which has the `adjustCsvSpecs` parameter.
+This is an advanced option because it exposes the
+[CsvSpecs.Builder](https://github.com/deephaven/deephaven-csv/blob/main/src/main/java/io/deephaven/csv/CsvSpecs.java)
+of the underlying Deephaven implementation.
+Generally, we don't recommend using this feature unless there's no other way to achieve your goal.
+
+For example, to enable the (unconfigurable but) very fast [ISO DateTime Parser of Deephaven CSV](https://medium.com/@deephavendatalabs/a-high-performance-csv-reader-with-type-inference-4bf2e4baf2d1):
+
+<!---FUN readDatesWithDeephavenDateTimeParser-->
+
+```kotlin
+val df = DataFrame.readCsv(
+    inputStream = file.openStream(),
+    adjustCsvSpecs = { // it: CsvSpecs.Builder
+        it.putParserForName("date", Parsers.DATETIME)
+    },
+)
+```
+
+<!---END-->
+
## Read from JSON

To read a JSON file, use the `.readJson()` function. JSON files can be read from a file or a URL.
@@ -434,6 +514,8 @@ Before you can read data from Excel, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-excel:$dataframe_version")
```

+It's included by default if you have `org.jetbrains.kotlinx:dataframe:$dataframe_version` already.
+
To read an Excel spreadsheet, use the `.readExcel()` function. Excel spreadsheets can be read from a file or a URL. Supported
Excel spreadsheet formats are: xls, xlsx.
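The `readExcel` samples live further down in the unchanged part of the document. A minimal sketch with a hypothetical file and sheet name, assuming the optional `sheetName` parameter:

```kotlin
// Hypothetical workbook; reads the sheet named "Q1" into a DataFrame.
val df = DataFrame.readExcel("sales.xlsx", sheetName = "Q1")
```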

@@ -484,6 +566,8 @@ Before you can read data from Apache Arrow format, add the following dependency:
implementation("org.jetbrains.kotlinx:dataframe-arrow:$dataframe_version")
```

+It's included by default if you have `org.jetbrains.kotlinx:dataframe:$dataframe_version` already.
+
To read Apache Arrow formats, use the `.readArrowFeather()` function:

<!---FUN readArrowFeather-->

tests/src/test/kotlin/org/jetbrains/kotlinx/dataframe/samples/api/Read.kt

+50
@@ -2,6 +2,7 @@

package org.jetbrains.kotlinx.dataframe.samples.api

+import io.deephaven.csv.parsers.Parsers
import io.kotest.matchers.shouldBe
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.DataRow
@@ -19,6 +20,7 @@ import org.jetbrains.kotlinx.dataframe.testCsv
import org.jetbrains.kotlinx.dataframe.testJson
import org.junit.Ignore
import org.junit.Test
+import java.time.format.DateTimeFormatter
import java.util.Locale
import kotlin.reflect.typeOf

@@ -102,4 +104,52 @@ class Read {
        )
        // SampleEnd
    }
+
+    @Test
+    fun readDatesWithSpecificDateTimePattern() {
+        val file = testCsv("dates")
+        // SampleStart
+        val df = DataFrame.readCsv(
+            file,
+            parserOptions = ParserOptions(dateTimePattern = "dd/MMM/yy h:mm a")
+        )
+        // SampleEnd
+    }
+
+    @Test
+    fun readDatesWithSpecificDateTimeFormatter() {
+        val file = testCsv("dates")
+        // SampleStart
+        val df = DataFrame.readCsv(
+            file,
+            parserOptions = ParserOptions(dateTimeFormatter = DateTimeFormatter.ofPattern("dd/MMM/yy h:mm a"))
+        )
+        // SampleEnd
+    }
+
+    @Test
+    fun readDatesWithDefaultType() {
+        val file = testCsv("dates")
+        // SampleStart
+        val df = DataFrame.readCsv(
+            file,
+            colTypes = mapOf(ColType.DEFAULT to ColType.String),
+        )
+        // SampleEnd
+    }
+
+    @Test
+    fun readDatesWithDeephavenDateTimeParser() {
+        val file = testCsv("dates")
+        try {
+            // SampleStart
+            val df = DataFrame.readCsv(
+                inputStream = file.openStream(),
+                adjustCsvSpecs = { // it: CsvSpecs.Builder
+                    it.putParserForName("date", Parsers.DATETIME)
+                },
+            )
+            // SampleEnd
+        } catch (_: Exception) {}
+    }
}

tests/src/test/resources/dates.csv

+3
@@ -0,0 +1,3 @@
+date
+13/Jan/23 11:49 AM
+14/Mar/23 5:35 PM

0 commit comments
