Commit afdc7fb

Added code and charts
Signed-off-by: chris-1187 <[email protected]>
1 parent 30a0087 commit afdc7fb

3 files changed: +31 -1 lines changed

docs/blog/images/amos_mvi.png

54.6 KB

docs/blog/images/amos_mvi_raw.png

53.1 KB

docs/blog/posts/enhancing_data_quality_amos.md

Lines changed: 31 additions & 1 deletion
@@ -37,6 +37,36 @@ Data cleansing is a vital process in enhancing the quality of data within a data
 ### Missing Value Imputation
 
 With a dataset refined to exclude unwanted data points and accounting for potential sensor failures, the next step toward ensuring high-quality data is to address any missing values through imputation. The component we developed first identifies and flags missing values by leveraging PySpark’s capabilities in windowing and UDF operations. With these techniques, we are able to dynamically determine the expected interval for each sensor by analyzing historical data patterns within defined partitions. Spline interpolation allows us to estimate missing values in time series data, seamlessly filling gaps with plausible and mathematically derived substitutes. By doing so, data scientists can not only improve the consistency of integrated datasets but also prevent errors or biases in analytics and machine learning models.
+To show how this works in practice with the new RTDIP component, here is a short example in which a few lines of code fill the gaps in a sample time series load profile:
+```python
+from rtdip_sdk.pipelines.data_quality import MissingValueImputation
+from pyspark.sql import SparkSession
+import pandas as pd
+
+spark_session = SparkSession.builder.master("local[2]").appName("test").getOrCreate()
+
+# Load the load profile with pandas, then convert it to a Spark DataFrame
+source_df = pd.read_csv('./solar_energy_production_germany_April02.csv')
+incomplete_spark_df = spark_session.createDataFrame(source_df, ['Value', 'EventTime', 'TagName', 'Status'])
+
+# Before Missing Value Imputation
+incomplete_spark_df.show()
+
+# Execute RTDIP Pipeline component
+clean_df = MissingValueImputation(spark_session, df=incomplete_spark_df).filter()
+
+# After Missing Value Imputation
+clean_df.show()
+```
+Plotting the DataFrames before and after imputation shows that all gaps have been filled with plausible values.
+
+<center>
+
+![blog](../images/amos_mvi_raw.png){width=70%}
+
+![blog](../images/amos_mvi.png){width=70%}
+
+</center>
+
 
 ### Normalization

@@ -56,7 +86,7 @@ Working on the RTDIP Project within AMOS has been a fantastic journey, highlight
 
 To look back, our regular team meetings were the key to our success. Through open communication and collaboration, we tackled challenges and kept improving our processes. This showed us the power of working together in an agile framework and growing as a dedicated SCRUM team.
 
-We’re excited about the future and how these advancements will help data scientists and engineers make better decisions. ((Thank you for joining us on this journey.))
+We’re excited about the future and how these advancements will help data scientists and engineers make better decisions.
 
 <br>

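The imputation idea described in the Missing Value Imputation section (infer a sensor's expected sampling interval from its history, flag the timestamps that are missing, then fill them by spline interpolation) can be sketched outside Spark in plain Python. This is only an illustration under stated assumptions, not the RTDIP `MissingValueImputation` implementation: the function names `expected_interval`, `natural_cubic_spline`, and `impute_missing` are hypothetical, timestamps are assumed to be sorted integers (for example minutes), the expected interval is taken as the median gap between consecutive readings, and a natural cubic spline stands in for whichever spline variant the component actually uses.

```python
from bisect import bisect_right
from statistics import median_low


def expected_interval(times):
    """Assumed heuristic: the (low) median gap between consecutive timestamps."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    return median_low(gaps)


def natural_cubic_spline(xs, ys):
    """Return a function that evaluates the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]

    # Tridiagonal system for the second derivatives m[0..n];
    # natural boundary conditions fix m[0] = m[n] = 0.
    lower = [0.0] * (n + 1)
    diag = [1.0] * (n + 1)
    upper = [0.0] * (n + 1)
    rhs = [0.0] * (n + 1)
    for i in range(1, n):
        lower[i] = h[i - 1]
        diag[i] = 2.0 * (h[i - 1] + h[i])
        upper[i] = h[i]
        rhs[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])

    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n + 1):
        w = lower[i] / diag[i - 1]
        diag[i] -= w * upper[i - 1]
        rhs[i] -= w * rhs[i - 1]
    m = [0.0] * (n + 1)
    m[n] = rhs[n] / diag[n]
    for i in range(n - 1, -1, -1):
        m[i] = (rhs[i] - upper[i] * m[i + 1]) / diag[i]

    def evaluate(x):
        # Locate the segment [xs[i], xs[i + 1]] that contains x.
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return (
            m[i] * a**3 / (6 * h[i])
            + m[i + 1] * b**3 / (6 * h[i])
            + (ys[i] / h[i] - m[i] * h[i] / 6) * a
            + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6) * b
        )

    return evaluate


def impute_missing(times, values):
    """Return (time, value, was_imputed) for every expected timestamp."""
    step = expected_interval(times)
    spline = natural_cubic_spline(times, values)
    observed = dict(zip(times, values))
    filled = []
    t = times[0]
    while t <= times[-1]:
        if t in observed:
            filled.append((t, observed[t], False))
        else:
            # Timestamp expected by the sampling interval but absent: impute it.
            filled.append((t, spline(t), True))
        t += step
    return filled


# Made-up readings every 15 minutes with the 45-minute sample missing.
minutes = [0, 15, 30, 60, 75, 90]
readings = [1.0, 4.0, 9.0, 14.0, 12.0, 7.0]
for row in impute_missing(minutes, readings):
    print(row)
```

Flagging each imputed row keeps the interpolated values distinguishable from measured ones, which matters when the filled series later feeds analytics or model training.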