
Commit 09499b0

Author: Maxwell Dylla
Message: make scope smaller
1 parent a5a854b, commit 09499b0

30 files changed: +7 −74 lines

README.md (+7 −47)
@@ -2,59 +2,25 @@
 
 This repository implements Spark applications for transforming raw incoming data into a set of schemas for analysis. You can extend these schemas by deploying additional Spark applications.
 
-## Data Flow Overview
+## Transformed Data
 
-### Sourced from Postgres
-
-Transactional metadata on the devices under test and the test sequences for each device.
-
-- `device_metadata` - Stores static information about each device under test.
-- `device_sequence` - Contains the test sequences for each device.
-
-### Sourced from Kafka
-
-Event and telemetry streams from Kafka are consumed and persisted.
-
-- `event_log` - Logs discrete events like 'test started', 'test stopped', and other significant occurrences.
-- `timeseries_raw` - Holds raw data from each test sequence, such as voltage, current, temperature, etc.
-
-### Transformed Data
-
-Batch processing jobs concatenate telemetry sequences and perform aggregations.
-
-- `timeseries_aggregated` - Concatenates multiple sequences, providing a comprehensive device history.
+- `timeseries` - Common schema for individual telemetry records from the battery.
 - `statistics_steps` - Aggregates data at the charge/discharge step level, providing statistics such as average voltage, maximum current, total energy, and average temperature.
 - `statistics_cycles` - Aggregates data over full cycles of charge and discharge, including summaries like total energy discharged, total cycle time, and health indicators.
 
-## Persistence Options
-
-### PostgreSQL
-
-The same PostgreSQL schema that houses the device metadata can be used to persist the incoming and transformed data. Only recommended for small deployments.
+## Testing
 
-### Delta Lake
-
-Object storage can be used for larger deployments. This is the cheapest option per GB of storage. The Hive metastore allows you to query this backend as a database using SQL.
-
-## Spark Applications
-
-### Streaming
-
-Streaming applications ingest data from Kafka into persistent storage using Spark Structured Streaming. The Kafka topics are partitioned by sequence id, which the consumers take advantage of.
-
-### Batch
-
-Batch applications implement incremental data processing. Any devices with test sequences that have been updated within the look-back window are processed by the Spark engine.
-
-There is also a maintenance job (for Delta storage only) that runs vacuum and compaction operations.
+- `unit` - unit tests are for transformations against static data files.
+- `integration` - integration tests are for data source and sink connectors.
+- `system` - system tests are for applications against realistic data sources and sinks.
 
 ## Deployment
 
 You can opt to leverage a managed service (GCP Dataproc, AWS EMR, Databricks, etc.) for deploying the Spark applications or use the provided helm chart. The provided helm chart leverages the [Spark Operator](https://github.com/kubeflow/spark-operator).
 
 ### Helm Chart
 
-This chart packages all of the Spark applications into one distribution. The streaming jobs are deployed as `SparkApplication` and batch jobs as `ScheduledSparkApplication`.
+This chart packages all of the Spark applications into one distribution.
 
 See the [chart documentation](LINKHERE) for a list of the available configuration variables.
 
@@ -64,12 +30,6 @@ See the [chart documentation](LINKHERE) for a list of the available configuration variables.
 
 You can deploy it yourself using the [Strimzi operator](https://github.com/strimzi/strimzi-kafka-operator) or use a managed service with a compatible API (recommended).
 
-#### PostgreSQL
-
-You can deploy it yourself using the [cloudnative-pg operator](https://github.com/cloudnative-pg/cloudnative-pg) or use a managed service (recommended).
-
 #### Object Storage
 
 You can deploy it yourself using the [Minio operator](https://github.com/minio/operator) or use a managed service (recommended).
-
-*Only required for Delta Lake configuration.*
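The `unit` tier added to the README corresponds to the `tests/unit/` files listed below. A hedged sketch of that style, assuming pytest and a local PySpark session; the inlined grouping stands in for whatever transformation function the repository actually exposes:

```python
# A hedged example of the `unit` test style: run a transformation against
# static, in-memory data. The grouping logic is inlined for illustration;
# the repository's actual transformation functions may differ.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession


def test_step_average_voltage():
    spark = SparkSession.builder.master("local[1]").appName("unit-sketch").getOrCreate()
    rows = [
        ("dev-1", 1, 3.7),
        ("dev-1", 1, 3.8),
        ("dev-1", 2, 4.1),
    ]
    df = spark.createDataFrame(rows, ["device_id", "step_id", "voltage"])

    # Collect per-(device, step) averages into a plain dict for assertions.
    result = {
        (r["device_id"], r["step_id"]): r["avg_voltage"]
        for r in df.groupBy("device_id", "step_id")
                   .agg(F.avg("voltage").alias("avg_voltage"))
                   .collect()
    }

    assert abs(result[("dev-1", 1)] - 3.75) < 1e-9
    assert abs(result[("dev-1", 2)] - 4.1) < 1e-9
```

A test in this style needs only `pyspark` and `pytest` installed and runs entirely against in-memory data, with no external source or sink.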

src/pulse_application_core/sparklib/connectors/postgres.py (−8): this file was deleted.

src/pulse_application_core/sparklib/transformations/device_metadata.py (−11): this file was deleted.

src/pulse_application_core/sparklib/transformations/device_sequence.py (−8): this file was deleted.

src/pulse_application_core/sparklib/transformations/event_log.py: whitespace-only changes.

src/pulse_application_core/sparklib/transformations/timeseries_aggregated.py: whitespace-only changes.

File renamed without changes.

tests/integration/test_postgres.py: whitespace-only changes.

File renamed without changes.

tests/system/test_system_delta.py: whitespace-only changes.

tests/system/test_system_postgres.py: whitespace-only changes.

tests/unit/test_device_metadata.py: whitespace-only changes.

tests/unit/test_device_sequence.py: whitespace-only changes.

tests/unit/test_event_log.py: whitespace-only changes.

tests/unit/test_timeseries_aggregated.py: whitespace-only changes.

tests/unit/test_timeseries_raw.py: whitespace-only changes.

0 commit comments