This repository implements Spark applications for transforming raw incoming data into a set of schemas for analysis. You can extend these schemas by deploying additional Spark applications.

## Data Flow Overview

### Sourced from Postgres

Transactional metadata on the devices under test and the test sequences for each device.

- `device_metadata` - Stores static information about each device under test.
- `device_sequence` - Contains the test sequences for each device.

### Sourced from Kafka
Event and telemetry streams from Kafka are consumed and persisted.

- `event_log` - Logs discrete events such as 'test started', 'test stopped', and other significant occurrences.
- `timeseries_raw` - Holds raw data from each test sequence, such as voltage, current, and temperature.

### Transformed Data
Batch processing jobs concatenate telemetry sequences and perform aggregations.

- `timeseries` - Common schema for individual telemetry records from the battery.
- `timeseries_aggregated` - Concatenates multiple sequences, providing a comprehensive device history.
- `statistics_steps` - Aggregates data at the charge/discharge step level, providing statistics such as average voltage, maximum current, total energy, and average temperature.
- `statistics_cycles` - Aggregates data over full cycles of charge and discharge, including summaries like total energy discharged, total cycle time, and health indicators.
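
As an illustration of what the step-level aggregation computes, here is a plain-Python sketch over toy records. The real jobs run on Spark; the field names below are illustrative, not the repository's actual schema:

```python
from collections import defaultdict

# Toy telemetry records; field names are illustrative, not the repo's schema.
records = [
    {"device": "dev1", "step": 1, "voltage": 3.6, "current": 1.2},
    {"device": "dev1", "step": 1, "voltage": 3.7, "current": 1.0},
    {"device": "dev1", "step": 2, "voltage": 4.1, "current": 0.8},
]

# Group records by (device, step), mirroring a GROUP BY in the Spark job.
groups = defaultdict(list)
for r in records:
    groups[(r["device"], r["step"])].append(r)

# Per-step statistics: average voltage and maximum current per group.
stats = {
    key: {
        "avg_voltage": sum(r["voltage"] for r in rs) / len(rs),
        "max_current": max(r["current"] for r in rs),
    }
    for key, rs in groups.items()
}

print(stats[("dev1", 1)])  # avg_voltage ~= 3.65, max_current == 1.2
```

The production jobs express the same grouping and aggregation through Spark's DataFrame API or SQL rather than Python dictionaries.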

## Persistence Options

### PostgreSQL

The same PostgreSQL schema that houses the device metadata can be used to persist the incoming and transformed data. This is recommended only for small deployments.
### Delta Lake

Object storage can be used for larger deployments and is the cheapest option per GB of storage. The Hive metastore allows you to query this backend as a database using SQL.
## Spark Applications
### Streaming

Streaming applications ingest data from Kafka into persistent storage using Spark Structured Streaming. The Kafka topics are partitioned by sequence id, which the consumers take advantage of.
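
Keyed partitioning is what lets a consumer see complete sequences. Here is a minimal stand-in in plain Python; Kafka's real default partitioner hashes the key with murmur2, not this byte sum, and the partition count is an assumption:

```python
NUM_PARTITIONS = 4  # assumed partition count, for illustration only

def partition_for(sequence_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Deterministic stand-in for Kafka's key hash (really murmur2 over key bytes):
    # every record of a sequence maps to the same partition, so one consumer
    # receives the whole sequence in order, with no cross-partition merging.
    return sum(sequence_id.encode("utf-8")) % num_partitions

# All records keyed by the same sequence id land on one partition.
partitions = {partition_for("seq-42") for _ in range(100)}
assert len(partitions) == 1
```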
### Batch

Batch applications implement incremental data processing: any device with test sequences updated within the look-back window is processed by the Spark engine.
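
The incremental selection reduces to a timestamp filter. A sketch in plain Python, assuming a hypothetical `updated_at` field and a 24-hour look-back window (both placeholders, not the jobs' actual configuration):

```python
from datetime import datetime, timedelta

LOOK_BACK = timedelta(hours=24)  # assumed window size; configurable in practice

def sequences_to_process(sequences, now):
    """Select the sequences updated inside the look-back window."""
    cutoff = now - LOOK_BACK
    return [s for s in sequences if s["updated_at"] >= cutoff]

now = datetime(2024, 1, 2, 12, 0)
sequences = [
    {"id": "seq-1", "updated_at": datetime(2024, 1, 2, 9, 0)},    # inside window
    {"id": "seq-2", "updated_at": datetime(2023, 12, 30, 8, 0)},  # outside window
]
print([s["id"] for s in sequences_to_process(sequences, now)])  # ['seq-1']
```

In the Spark jobs this corresponds to a predicate pushed down on the sequence-update timestamp, so only recently changed devices are re-read and re-aggregated.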

There is also a maintenance job (for Delta storage only) that runs vacuum and compaction operations.

## Testing

- `unit` - Unit tests run transformations against static data files.
- `integration` - Integration tests exercise the data source and sink connectors.
- `system` - System tests run the applications against realistic data sources and sinks.
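
For instance, a unit test might run one transformation against a small static fixture. The function, fixture, and field names below are hypothetical, not taken from this repository:

```python
import csv
import io

# Hypothetical transformation under test: drop records with a missing voltage.
def drop_null_voltage(rows):
    return [r for r in rows if r["voltage"] not in ("", None)]

# Static fixture, inlined here; a real unit test would load it from a data file.
FIXTURE = "device,voltage\ndev1,3.7\ndev1,\ndev2,4.0\n"

def test_drop_null_voltage():
    rows = list(csv.DictReader(io.StringIO(FIXTURE)))
    out = drop_null_voltage(rows)
    assert [r["device"] for r in out] == ["dev1", "dev2"]

test_drop_null_voltage()
```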
## Deployment

You can opt for a managed service (GCP Dataproc, AWS EMR, Databricks, etc.) for deploying the Spark applications or use the provided Helm chart, which leverages the [Spark Operator](https://github.com/kubeflow/spark-operator).
### Helm Chart

This chart packages all of the Spark applications into one distribution. The streaming jobs are deployed as `SparkApplication` resources and the batch jobs as `ScheduledSparkApplication` resources.

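
As a sketch of what the chart renders, a streaming job might look roughly like the following `SparkApplication` manifest. The name, image, and file path are placeholders, not values from this chart:

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: timeseries-streaming            # placeholder name
spec:
  type: Python                          # assumes a PySpark job; could equally be Scala
  mode: cluster
  image: example.registry/spark-apps:latest               # placeholder image
  mainApplicationFile: local:///opt/app/streaming_job.py  # placeholder path
  sparkVersion: "3.5.0"
  driver:
    cores: 1
    memory: 1g
  executor:
    instances: 2
    cores: 1
    memory: 2g
```

A batch job would instead be a `ScheduledSparkApplication`, which wraps the same spec in a template and adds a cron `schedule` field.
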
See the [chart documentation](LINKHERE) for a list of the available configuration variables.

#### Kafka

You can deploy it yourself using the [Strimzi operator](https://github.com/strimzi/strimzi-kafka-operator) or use a managed service with a compatible API (recommended).
#### PostgreSQL

You can deploy it yourself using the [cloudnative-pg operator](https://github.com/cloudnative-pg/cloudnative-pg) or use a managed service (recommended).
#### Object Storage
You can deploy it yourself using the [Minio operator](https://github.com/minio/operator) or use a managed service (recommended).