Commit 98df6cd

Merge pull request #60 from alza-bitz/data-engineering-clojure-support-for-popular-data-tools
Clojure Support for Popular Data Tools: A Data Engineer's Perspective, and a New Clojure API for Snowflake
2 parents 75914c9 + 11a8484 commit 98df6cd

6 files changed, 309 additions and 3 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -35,3 +35,4 @@ _site/
 .calva/repl.calva-repl
 .calva/mcp-server/port
 .portal/vs-code.edn
+**/password.edn

deps.edn

Lines changed: 7 additions & 3 deletions
@@ -20,16 +20,20 @@
         org.eclipse.elk/org.eclipse.elk.alg.common {:mvn/version "0.10.0"}
         org.eclipse.elk/org.eclipse.elk.alg.layered {:mvn/version "0.10.0"}
         backtick/backtick {:mvn/version "0.3.5"}
-        camel-snake-kebab/camel-snake-kebab {:mvn/version "0.4.3"}}
+        camel-snake-kebab/camel-snake-kebab {:mvn/version "0.4.3"}
+        io.github.alza-bitz/snowpark-clj {:git/url "https://github.com/alza-bitz/snowpark-clj.git"
+                                          :git/sha "7856d9ca2080b188f9feec115ca709d3f54877b0"}}
 
  :aliases
  {;; Build the site with `clojure -M:clay -A:markdown`
   ;; Run Clay in watch mode with `clojure -M:clay`
   :clay {:main-opts ["-m" "scicloj.clay.v2.main"]
-         :jvm-opts ["-Dclojure.main.report=stderr"]}
+         :jvm-opts ["-Dclojure.main.report=stderr"
+                    "--add-opens=java.base/java.nio=ALL-UNNAMED"]}
   ;; When debugging libraries
   :local-deps {:override-deps {org.scicloj/clay {:local/root "../clay"}
                                org.scicloj/kindly {:local/root "../kindly"}
                                org.scicloj/kindly-advice {:local/root "../kindly-advice"}
                                org.scicloj/kindly-render {:local/root "../kindly-render"}}}
-  :neil {:project {:name io.github.timothypratley/clojurecivitas}}}}
+  :neil {:project {:name io.github.timothypratley/clojurecivitas}}
+  :dev {:jvm-opts ["--add-opens=java.base/java.nio=ALL-UNNAMED"]}}}
Lines changed: 181 additions & 0 deletions
@@ -0,0 +1,181 @@
1+
---
2+
author:
3+
name: Alex Coyle
4+
url: https://github.com/alza-bitz
5+
image: https://avatars.githubusercontent.com/u/1161048?v=4
6+
email: ''
7+
affiliation:
8+
- {name: Scicloj, url: 'https://scicloj.github.io/'}
9+
links:
10+
- {icon: github, href: 'https://github.com/alza-bitz'}
11+
draft: true
12+
type: post
13+
date: '2025-09-04'
14+
category: clojure
15+
tags: [metadata, civitas]
16+
canonical-url: https://alza-bitz.github.io/clojure-support-for-popular-data-tools
17+
format:
18+
html: {title: 'Clojure Support for Popular Data Tools: A Data Engineer''s Perspective, and a New Clojure API for Snowflake'}
19+
20+
---
21+
<style></style><style>.printedClojure .sourceCode {
22+
background-color: transparent;
23+
border-style: none;
24+
}
25+
</style><style>.clay-limit-image-width .clay-image {max-width: 100%}
26+
.clay-side-by-side .sourceCode {margin: 0}
27+
.clay-side-by-side {margin: 1em 0}
28+
</style>
29+
<script src="snowflake_files/md-default0.js" type="text/javascript"></script><script src="snowflake_files/md-default1.js" type="text/javascript"></script>
30+
In this article I look at the extent of Clojure support for some popular on-cluster data processing tools that Clojure users might need for their data engineering or data science tasks. Then for [Snowflake](https://snowflake.com) in particular **I go further and present a new Clojure API.**
31+
32+
Why is the level of Clojure support important? As an example, consider that [Scicloj](https://scicloj.org) is mostly focused on in-memory processing. As such, if you need to work with a large dataset it will be necessary to compute on-cluster and extract a smaller result before continuing your data science task locally.
33+
34+
However, without sufficient Clojure support for on-cluster processing, anyone needing that facility for their data science or data engineering task would be forced to reach outside the Clojure ecosystem. That adds complexity in terms of interop, compatibility and overall stack requirements.
35+
36+
With that in mind, let's examine the level of Clojure support for some popular on-cluster data processing tools. For each tool I selected its official Clojure library if one exists, or if not the most popular and well-known community-supported alternative with at least 100 stars and 10 contributors on GitHub. I then used the following criteria against the library to classify it as "supported" or "support unknown":
37+
38+
1. CI/CD build passing
39+
1. Most recent commit less than 12 months ago
40+
1. Most recent release less than 12 months ago
41+
1. Maintainers responded to any issue or question less than 12 months ago
42+
1. Maintainers either accepted or rejected any PR less than 12 months ago
43+
44+
If I couldn't find any such library at all, I classified it as having "no support".
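
The criteria above can be read as a small decision procedure. The following is only a minimal sketch of that reading, assuming that all five criteria must hold for a library to count as "supported". The function names (`recent?`, `classify`) and the map keys are hypothetical and chosen for illustration; they are not part of any library mentioned in this article.

```clojure
(import 'java.time.LocalDate)

(defn recent?
  "True when `date` is non-nil and less than 12 months before `today`."
  [^LocalDate today ^LocalDate date]
  (boolean (and date (.isAfter date (.minusMonths today 12)))))

(defn classify
  "Returns :no-support when no library was found (nil), :supported when
  every criterion holds (an assumption of this sketch), and
  :support-unknown otherwise. Each date key holds the most recent such
  event, or nil when none could be found."
  [today library]
  (cond
    (nil? library) :no-support

    (and (:ci-passing? library)
         (every? #(recent? today (get library %))
                 [:last-commit :last-release
                  :last-issue-response :last-pr-decision]))
    :supported

    :else :support-unknown))

;; A made-up library whose last release is more than 12 months old
(classify (LocalDate/of 2025 9 4)
          {:ci-passing?         true
           :last-commit         (LocalDate/of 2025 6 1)
           :last-release        (LocalDate/of 2023 12 1)
           :last-issue-response (LocalDate/of 2025 5 1)
           :last-pr-decision    (LocalDate/of 2025 5 1)})
;; => :support-unknown
```

Passing `nil` for the library models the case where no candidate library was found at all, which yields `:no-support`.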
45+
46+
| Tool Category | Supported | Support Unknown | No Support |
47+
|---------------|---------------------|--------------|---------------|
48+
| **On-cluster batch processing** | | 1. [Spark](https://spark.apache.org) (see [Spark Interop with Geni](#spark_interop_with_geni) below) | |
49+
| **On-cluster stream processing** | | 2. [Kafka Streams](https://kafka.apache.org/documentation/streams) (see [Kafka Interop with Jackdaw](#kafka_interop_with_jackdaw) below) | 3. [Spark Structured Streaming](https://spark.apache.org/streaming),\<br\>4. [Flink](https://flink.apache.org) |
50+
| **On-cluster batch and stream processing** | | | 5. [Databricks](https://databricks.com) (see [Spark Interop with Geni](#spark_interop_with_geni) below),\<br\>6. [Snowflake](https://snowflake.com) (see [Snowflake Interop](#snowflake_interop_with_a_new_clojure_api!) below) |
51+
52+
Please note, I don't wish to make any critical judgments based on either the summary analysis above or the more detailed analysis below. The goal is to understand the situation with respect to Clojure support and highlight any gaps, although I suppose I am also inadvertently highlighting the difficulties of maintaining open source software!
53+
54+
With that said, let's dive into the details for Spark, Kafka and Snowflake.
55+
56+
57+
### Spark Interop with Geni
58+
59+
[Geni](https://github.com/zero-one-group/geni) is the go-to library for Spark interop. Some months back, I was motivated to evaluate the coverage of Spark features. In particular, I wanted to understand what would be involved to support [Spark Connect](https://spark.apache.org/spark-connect/) as it would reduce the complexity of computing on-cluster directly from the Clojure REPL.
60+
61+
However, I found a number of issues that would need to be addressed in order to support Spark Connect and Databricks:
62+
63+
1. Problems with the [default session](https://github.com/zero-one-group/geni/issues/345).
64+
1. Problems with [support for Databricks](https://github.com/zero-one-group/geni/issues/356), although I suspect this is related to point 1.
65+
66+
Also, in general by my criteria the support classification is "support unknown":
67+
1. CI/CD build [failing.](https://github.com/zero-one-group/geni/actions)
68+
1. Version [0.0.42 API docs](https://cljdoc.org/d/zero.one/geni/0.0.42/doc/readme) are [broken](https://cljdoc.org/builds/73977); this also affects version 0.0.41.
69+
1. No commits since November 2023.
70+
1. No releases since November 2023.
71+
1. No PRs accepted or rejected since November 2023.
72+
1. No response when attempting to contact the author or maintainers.
73+
74+
75+
### Kafka Interop with Jackdaw
76+
77+
[Jackdaw](https://github.com/FundingCircle/jackdaw) is the go-to library for Kafka interop. However, by my criteria the support classification is also "support unknown":
78+
79+
1. No commits since August 2024.
80+
1. No releases since December 2023.
81+
1. No PRs accepted or rejected since August 2024. As a further example, [here's a PR](https://github.com/FundingCircle/jackdaw/pull/374) raised in May 2024 but not yet commented on either way by the maintainers.
82+
83+
84+
### Snowflake Interop with a New Clojure API!
85+
86+
Although the [Snowpark](https://docs.snowflake.com/en/developer-guide/snowpark/java) library has Java and Scala bindings, it doesn't provide anything for Clojure. As such, it's currently not possible to interact with Snowflake using the Clojure way.
87+
88+
To address this gap, I decided to try my hand at creating a [Clojure API for Snowflake](https://github.com/alza-bitz/snowpark-clj) as part of a broader effort to improve the overall situation regarding Clojure support for popular data tools.
89+
90+
The aim is to validate this approach as a foundation for enabling a wide range of data science or data engineering use cases from the Clojure REPL, in situations where Snowflake is the data warehouse of choice.
91+
92+
The [README](https://github.com/alza-bitz/snowpark-clj/blob/main/README.md) provides usage examples for all the current features, but I've copied the essential ones here to illustrate the API:
93+
94+
95+
#### Feature 1. Load data from local and save to a Snowflake table
96+
97+
(require '[clojure.repl.deps :refer [add-lib]])
98+
(add-lib 'io.github.alza-bitz/snowpark-clj {:git/url "https://github.com/alza-bitz/snowpark-clj.git"
99+
:git/sha "7856d9ca2080b188f9feec115ca709d3f54877b0"})
100+
101+
102+
::: {.sourceClojure}
103+
```clojure
104+
(require '[snowpark-clj.core :as sp])
105+
```
106+
:::
107+
108+
109+
Sample data
110+
111+
112+
::: {.sourceClojure}
113+
```clojure
114+
(def employee-data
115+
[{:id 1 :name "Alice" :age 25 :department "Engineering" :salary 75000}
116+
{:id 2 :name "Bob" :age 30 :department "Marketing" :salary 65000}
117+
{:id 3 :name "Charlie" :age 35 :department "Engineering" :salary 80000}])
118+
```
119+
:::
120+
121+
122+
Create session and save data
123+
124+
125+
::: {.sourceClojure}
126+
```clojure
127+
(with-open [session (sp/create-session "src/data_engineering/support_for_popular_data_tools/snowflake.edn")]
128+
(-> (sp/create-dataframe session employee-data)
129+
(sp/save-as-table "employees" :overwrite)))
130+
```
131+
:::
132+
133+
134+
135+
::: {.printedClojure}
136+
```clojure
137+
nil
138+
139+
```
140+
:::
141+
142+
143+
144+
#### Feature 2. Compute over Snowflake table(s) on-cluster to produce a smaller result for local processing
145+
146+
147+
::: {.sourceClojure}
148+
```clojure
149+
(with-open [session (sp/create-session "src/data_engineering/support_for_popular_data_tools/snowflake.edn")]
150+
(let [table-df (sp/table session "employees")]
151+
(-> table-df
152+
(sp/filter (sp/gt (sp/col table-df :salary) (sp/lit 70000)))
153+
(sp/select [:name :salary])
154+
(sp/collect))))
155+
```
156+
:::
157+
158+
159+
160+
::: {.printedClojure}
161+
```clojure
162+
[{:name "Alice", :salary 75000} {:name "Charlie", :salary 80000}]
163+
164+
```
165+
:::
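
Once the smaller result has been collected, it is an ordinary vector of maps, so local processing needs nothing beyond core Clojure. The following is a minimal sketch that takes the printed result above as literal data; the names `high-earners` and `mean-salary` are hypothetical and used only for illustration.

```clojure
;; The rows returned by the query above, copied from the printed result
(def high-earners
  [{:name "Alice", :salary 75000} {:name "Charlie", :salary 80000}])

(defn mean-salary
  "Mean of :salary over `rows`, or nil for an empty collection."
  [rows]
  (when (seq rows)
    (/ (reduce + (map :salary rows)) (count rows))))

(mean-salary high-earners)
;; => 77500

;; Names ordered by salary, highest first
(mapv :name (sort-by :salary > high-earners))
;; => ["Charlie" "Alice"]
```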
166+
167+
168+
As an early-stage proof-of-concept, it only covers the essential parts of the underlying API without being too concerned with performance or completeness. Other more advanced features are noted and planned, pending further elaboration.
169+
170+
**I hope you find it useful and I welcome any feedback or contributions!**
171+
172+
173+
```{=html}
174+
<div style="background-color:grey;height:2px;width:100%;"></div>
175+
```
176+
177+
178+
179+
```{=html}
180+
<div><pre><small><small>source: <a href="https://github.com/ClojureCivitas/clojurecivitas.github.io/blob/main/src/data_engineering/support_for_popular_data_tools/snowflake.clj">src/data_engineering/support_for_popular_data_tools/snowflake.clj</a></small></small></pre></div>
181+
```

site/db.edn

Lines changed: 7 additions & 0 deletions
@@ -88,6 +88,13 @@
    :email ""
    :affiliation [:scicloj]
    :links [{:icon "github" :href "https://github.com/joinr"}]}
+  {:id :alza-bitz
+   :name "Alex Coyle"
+   :url "https://github.com/alza-bitz"
+   :image "https://avatars.githubusercontent.com/u/1161048?v=4"
+   :email ""
+   :affiliation [:scicloj]
+   :links [{:icon "github" :href "https://github.com/alza-bitz"}]}
   {:id :emilbengtsson
    :name "Emil Bengtsson"
    :url "https://emil0r.com"
Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
1+
^{:kindly/hide-code true ; don't render this code to the HTML document
2+
:clay {:title "Clojure Support for Popular Data Tools: A Data Engineer's Perspective, and a New Clojure API for Snowflake"
3+
:external-requirements ["password.edn"]
4+
:quarto {:author :alza-bitz
5+
:draft true ; remove to publish
6+
:type :post
7+
:date "2025-09-04"
8+
:category :clojure
9+
:tags [:metadata :civitas]
10+
:canonical-url "https://alza-bitz.github.io/clojure-support-for-popular-data-tools"}}}
11+
(ns data-engineering.support-for-popular-data-tools.snowflake)
12+
13+
;; In this article I look at the extent of Clojure support for some popular on-cluster data processing tools that Clojure users might need for their data engineering or data science tasks. Then for [Snowflake](https://snowflake.com) in particular **I go further and present a new Clojure API.**
14+
15+
;; Why is the level of Clojure support important? As an example, consider that [Scicloj](https://scicloj.org) is mostly focused on cases where your data fits on a single machine. As such, if you need to work with a large dataset it will be necessary to compute on-cluster and extract a smaller result before continuing your data science task locally.
16+
17+
;; However, without sufficient Clojure support for on-cluster processing, anyone needing that facility for their data science or data engineering task would be forced to reach outside the Clojure ecosystem. That adds complexity in terms of interop, compatibility and overall stack requirements.
18+
19+
;; With that in mind, let's examine the level of Clojure support for some popular on-cluster data processing tools. For each tool I selected its official Clojure library if one exists, or if not the most popular and well-known community-supported alternative with at least 100 stars and 10 contributors on GitHub. I then used the following criteria against the library to classify it as "supported" or "support unknown":
20+
21+
;; 1. CI/CD build passing
22+
;; 1. Most recent commit less than 12 months ago
23+
;; 1. Most recent release less than 12 months ago
24+
;; 1. Maintainers responded to any issue or question less than 12 months ago
25+
;; 1. Maintainers either accepted or rejected any PR less than 12 months ago
26+
27+
;; If I couldn't find any such library at all, I classified it as having "no support".
28+
29+
;; | Tool Category | Supported | Support Unknown | No Support |
30+
;; |---------------|---------------------|--------------|---------------|
31+
;; | **On-cluster batch processing** | | 1. [Spark](https://spark.apache.org) (see [Spark Interop with Geni](#spark_interop_with_geni) below) | |
32+
;; | **On-cluster stream processing** | | 2. [Kafka Streams](https://kafka.apache.org/documentation/streams) (see [Kafka Interop with Jackdaw](#kafka_interop_with_jackdaw) below) | 3. [Spark Structured Streaming](https://spark.apache.org/streaming),\<br\>4. [Flink](https://flink.apache.org) |
33+
;; | **On-cluster batch and stream processing** | | | 5. [Databricks](https://databricks.com) (see [Spark Interop with Geni](#spark_interop_with_geni) below),\<br\>6. [Snowflake](https://snowflake.com) (see [Snowflake Interop](#snowflake_interop_with_a_new_clojure_api!) below) |
34+
35+
;; Please note, I don't wish to make any critical judgments based on either the summary analysis above or the more detailed analysis below. The goal is to understand the situation with respect to Clojure support and highlight any gaps, although I suppose I am also inadvertently highlighting the difficulties of maintaining open source software!
36+
37+
;; With that said, let's dive into the details for Spark, Kafka and Snowflake.
38+
39+
;; ### Spark Interop with Geni
40+
41+
;; [Geni](https://github.com/zero-one-group/geni) is the go-to library for Spark interop. Some months back, I was motivated to evaluate the coverage of Spark features. In particular, I wanted to understand what would be involved to support [Spark Connect](https://spark.apache.org/spark-connect/) as it would reduce the complexity of computing on-cluster directly from the Clojure REPL.
42+
43+
;; However, I found a number of issues that would need to be addressed in order to support Spark Connect and Databricks:
44+
45+
;; 1. Problems with the [default session](https://github.com/zero-one-group/geni/issues/345).
46+
;; 1. Problems with [support for Databricks](https://github.com/zero-one-group/geni/issues/356), although I suspect this is related to point 1.
47+
48+
;; Also, in general by my criteria the support classification is "support unknown":
49+
;; 1. CI/CD build [failing.](https://github.com/zero-one-group/geni/actions)
50+
;; 1. Version [0.0.42 API docs](https://cljdoc.org/d/zero.one/geni/0.0.42/doc/readme) are [broken](https://cljdoc.org/builds/73977); this also affects version 0.0.41.
51+
;; 1. No commits since November 2023.
52+
;; 1. No releases since November 2023.
53+
;; 1. No PRs accepted or rejected since November 2023.
54+
;; 1. No response when attempting to contact the author or maintainers.
55+
56+
;; ### Kafka Interop with Jackdaw
57+
58+
;; [Jackdaw](https://github.com/FundingCircle/jackdaw) is the go-to library for Kafka interop. However, by my criteria the support classification is also "support unknown":
59+
60+
;; 1. No commits since August 2024.
61+
;; 1. No releases since December 2023.
62+
;; 1. No PRs accepted or rejected since August 2024. As a further example, [here's a PR](https://github.com/FundingCircle/jackdaw/pull/374) raised in May 2024 but not yet commented on either way by the maintainers.
63+
64+
;; ### Snowflake Interop with a New Clojure API!
65+
66+
;; Although the [Snowpark](https://docs.snowflake.com/en/developer-guide/snowpark/java) library has Java and Scala bindings, it doesn't provide anything for Clojure. As such, it's currently not possible to interact with Snowflake using the Clojure way.
67+
68+
;; To address this gap, I decided to try my hand at creating a [Clojure API for Snowflake](https://github.com/alza-bitz/snowpark-clj) as part of a broader effort to improve the overall situation regarding Clojure support for popular data tools.
69+
70+
;; The aim is to validate this approach as a foundation for enabling a wide range of data science or data engineering use cases from the Clojure REPL, in situations where Snowflake is the data warehouse of choice.
71+
72+
;; The [README](https://github.com/alza-bitz/snowpark-clj/blob/main/README.md) provides usage examples for all the current features, but I've copied the essential ones here to illustrate the API:
73+
74+
;; #### Feature 1. Load data from local and save to a Snowflake table
75+
76+
(require '[snowpark-clj.core :as sp])
77+
78+
;; Sample data
79+
(def employee-data
80+
[{:id 1 :name "Alice" :age 25 :department "Engineering" :salary 75000}
81+
{:id 2 :name "Bob" :age 30 :department "Marketing" :salary 65000}
82+
{:id 3 :name "Charlie" :age 35 :department "Engineering" :salary 80000}])
83+
84+
;; Create session and save data
85+
(with-open [session (sp/create-session "src/data_engineering/support_for_popular_data_tools/snowflake.edn")]
86+
(-> (sp/create-dataframe session employee-data)
87+
(sp/save-as-table "employees" :overwrite)))
88+
89+
;; #### Feature 2. Compute over Snowflake table(s) on-cluster to produce a smaller result for local processing
90+
91+
(with-open [session (sp/create-session "src/data_engineering/support_for_popular_data_tools/snowflake.edn")]
92+
(let [table-df (sp/table session "employees")]
93+
(-> table-df
94+
(sp/filter (sp/gt (sp/col table-df :salary) (sp/lit 70000)))
95+
(sp/select [:name :salary])
96+
(sp/collect))))
97+
98+
;; As an early-stage proof-of-concept, it only covers the essential parts of the underlying API without being too concerned with performance or completeness. Other more advanced features are noted and planned, pending further elaboration.
99+
100+
;; **I hope you find it useful and I welcome any feedback or contributions!**
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+{:url "https://vkvupnf-yyb56283.snowflakecomputing.com"
+ :user "ALZA"
+ :password #mask #include "password.edn"
+ :role "SNOWFLAKE_LEARNING_ROLE"
+ :warehouse "SNOWFLAKE_LEARNING_WH"
+ :db "SNOWFLAKE_LEARNING_DB"
+ :schema "SNOWPARK_CLJ_TEST_SCHEMA"
+ ;; SSL Configuration - for dev container environment
+ ;; https://community.snowflake.com/s/article/How-to-turn-off-OCSP-checking-in-Snowflake-client-drivers
+ ;; Driver version 3.22.0 or higher:
+ ;; :disableOCSPChecks true
+ ;; Driver version 3.5.0 or higher:
+ :insecureMode true}