Commit 41112dd

TSML Primer: Use dedicated pages for hands-on walkthroughs
1 parent 3c8405f commit 41112dd

6 files changed: +594 -562 lines changed
Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
(tsml-primer-anomaly-detection)=

# Anomaly Detection Example for Machine Data

## Prologue

**NOTE:** While this example should provide more depth to understanding time series modeling,
it is not intended to teach the foundations of this field of data science. Instead, it
will focus more on how to use machine learning models in production scenarios.

If you are interested in learning more about time series modeling, we recommend
[Time Series Analysis in Python – A Comprehensive Guide with Examples], by Selva Prabhakaran.

## About

The exercise will use a dataset from the Numenta Anomaly Benchmark (NAB), which includes
real-world and artificial time series data for anomaly detection research. We will use the
dataset of real temperature readings measured in a machine room.

The goal is to detect anomalies in the temperature readings, which could indicate a
malfunctioning machine. The dataset contains machine temperature measurements, and will be
loaded into CrateDB upfront.

## Setup

To follow this tutorial, install the prerequisites by running the following command in your
terminal. Furthermore, load the designated dataset into your [CrateDB Cloud] cluster.

```bash
pip install 'crate[sqlalchemy]' 'numpy==1.23.5' crash matplotlib pandas salesforce-merlion
```

Please note the following external dependencies of the [Merlion] library:

### OpenMP
Some forecasting models depend on OpenMP. Please install it before installing Merlion,
in order to ensure that OpenMP is configured to work with the lightgbm package, one of
Merlion's dependencies.

When using Anaconda, please run
```shell
conda install -c conda-forge lightgbm
```
When using macOS, please install the Homebrew package manager and invoke
```shell
brew install libomp
```

### Java
Some anomaly detection models depend on the Java Development Kit (JDK). On Debian or Ubuntu, run
```shell
sudo apt-get install openjdk-11-jdk
```
On macOS, install Homebrew, and invoke
```shell
brew tap adoptopenjdk/openjdk
brew install --cask adoptopenjdk11
```
Also, ensure that Java can be found on your `PATH`, and that the `JAVA_HOME` environment variable
is configured correctly.

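To verify the Java setup described above, a small check along the following lines may help.
It is a minimal sketch using only the Python standard library. The function name
`check_java_environment` is made up for illustration and is not part of Merlion or CrateDB.

```python
import os
import shutil


def check_java_environment() -> dict:
    """Report where `java` is found and what `JAVA_HOME` is set to."""
    return {
        # Absolute path of the `java` executable, or None if it is not on PATH.
        "java_on_path": shutil.which("java"),
        # Value of the JAVA_HOME environment variable, or None if it is unset.
        "java_home": os.environ.get("JAVA_HOME"),
    }


if __name__ == "__main__":
    for name, value in check_java_environment().items():
        print(f"{name}: {value or 'NOT FOUND'}")
```

If either entry is reported as `NOT FOUND`, the Java-based models will most likely fail to start.
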
## Importing Data

If you are using [CrateDB Cloud], navigate to the [Cloud Console], and use the [Data Import] feature
to import the CSV file directly from the given URL into the database table `machine_data`.
```
https://github.com/crate/cratedb-datasets/raw/main/machine-learning/timeseries/nab-machine-failure.csv
```

![CrateDB Cloud Import dialog](/_assets/img/ml-timeseries-primer/cratedb-cloud-import-url.png){width=400px}
![CrateDB Cloud Import ready](/_assets/img/ml-timeseries-primer/cratedb-cloud-import-ready.png){width=400px}

The import process will automatically infer an SQL DDL schema from the shape of the data source.
When visiting the [CrateDB Admin UI] after the import process has concluded, you can observe that
the `machine_data` table was created and populated correctly.

![CrateDB Admin UI data imported](/_assets/img/ml-timeseries-primer/cratedb-admin-ui-data-imported.png){width=400px}

If you want to exercise the data import on your workstation, use the `crash` command-line program.
```shell
crash --command 'CREATE TABLE IF NOT EXISTS "machine_data" ("timestamp" TIMESTAMP, "value" REAL);'
crash --command "COPY machine_data FROM 'https://github.com/crate/cratedb-datasets/raw/main/machine-learning/timeseries/nab-machine-failure.csv';"
```

Note: If you are connecting to CrateDB Cloud, use the options
`--hosts 'https://<hostname>:4200' --username '<username>'`. In order to run the program
non-interactively, without being prompted for a password, use `export CRATEPW='<password>'`.

## Loading Data

First, you will load the dataset into a pandas DataFrame and convert the values of the
`timestamp` column to pandas `Timestamp` objects.

```python
from crate import client
import pandas as pd

# Connect to database.
conn = client.connect(
    "https://<your-instance>.azure.cratedb.net:4200",
    username="admin",
    password="<your-password>",
    verify_ssl_cert=True)

# Query and load data.
with conn:
    cursor = conn.cursor()
    cursor.execute("SELECT timestamp, value "
                   "FROM machine_data ORDER BY timestamp ASC")
    data = cursor.fetchall()

# Convert to pandas DataFrame. CrateDB returns timestamps as epoch milliseconds.
time_series = pd.DataFrame(
    [{'timestamp': pd.Timestamp.fromtimestamp(item[0] / 1000), 'value': item[1]}
     for item in data])

# Set the timestamp as the index.
time_series = time_series.set_index('timestamp')
```

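The conversion step can also be done in a vectorized way. The following is a minimal sketch,
assuming the rows arrive as `[epoch milliseconds, value]` pairs like those returned by
`cursor.fetchall()`. The sample rows are made up for illustration. Note that
`pd.to_datetime(..., unit="ms")` interprets the values as UTC, whereas `Timestamp.fromtimestamp`
converts to the local time zone of the machine running the code.

```python
import pandas as pd

# Hypothetical rows, shaped like the result of `cursor.fetchall()`:
# [epoch milliseconds, value].
rows = [
    [1386018900000, 73.97],
    [1386019200000, 74.94],
    [1386019500000, 76.12],
]

frame = pd.DataFrame(rows, columns=["timestamp", "value"])

# Convert the whole column at once instead of row by row.
frame["timestamp"] = pd.to_datetime(frame["timestamp"], unit="ms")
frame = frame.set_index("timestamp")
print(frame)
```

Which variant is preferable depends on whether the later steps expect local time or UTC. The
anomaly windows used below are compared against these timestamps, so both should use the same
convention.
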
## Downsampling

**TIP:** CrateDB provides many useful analytical functions tailored for time series data. One of
them is `date_bin`, which bins the input timestamp to the specified interval. This makes it
very handy for resampling data.

In general, for time series modeling, you often want to sample your data with a high frequency, in
order not to miss any events. However, this results in huge data volumes, increasing the costs of
model training. Here, it is best practice to down-sample your data to reasonable intervals.

This SQL statement demonstrates CrateDB's `date_bin` function to down-sample the data to 5 minute
intervals, reducing both the amount of data and the complexity of the modeling process.

```sql
SELECT
    DATE_BIN('5 min'::INTERVAL, "timestamp", 0) AS timestamp,
    MAX(value) AS temperature
FROM machine_data
GROUP BY timestamp
ORDER BY timestamp ASC
```

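For comparison, the same kind of binning can be sketched on the client side with pandas. This
is only an illustration of what the SQL statement above computes, not a replacement for running
it in the database. The one-minute sample readings are made up.

```python
import pandas as pd

# Hypothetical one-minute readings, for illustration only.
index = pd.date_range("2013-12-02 21:15:00", periods=10, freq="1min")
readings = pd.DataFrame(
    {"value": [70.0, 71.5, 69.8, 72.1, 70.4, 73.0, 72.2, 74.9, 71.1, 70.7]},
    index=index,
)

# Bin into 5-minute intervals and keep the maximum per bin,
# mirroring DATE_BIN(...) combined with MAX(value).
downsampled = readings["value"].resample("5min").max().rename("temperature")
print(downsampled)
```

Doing the aggregation in CrateDB has the advantage that only the reduced result set travels over
the network.
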
## Plotting

Next, plot the data to get a better understanding of the dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Labeled anomaly windows as [start, end] pairs.
anomalies = [
    ["2013-12-15 17:50:00.000000", "2013-12-17 17:00:00.000000"],
    ["2014-01-27 14:20:00.000000", "2014-01-29 13:30:00.000000"],
    ["2014-02-07 14:55:00.000000", "2014-02-09 14:05:00.000000"]
]

plt.figure(figsize=(12, 7))
line, = plt.plot(time_series.index, time_series['value'], linestyle='solid', color='black', label='Temperature')

# Highlight anomalies
for ctr, timeframe in enumerate(anomalies, start=1):
    plt.axvspan(pd.to_datetime(timeframe[0]), pd.to_datetime(timeframe[1]), color='blue', alpha=0.3, label=f'Anomaly {ctr}')

# Formatting x-axis for better readability
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y/%m/%d'))
plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=7))
plt.gcf().autofmt_xdate()  # Rotate & align the x labels for a better view

plt.title('Temperature Over Time', fontsize=20, fontweight='bold', pad=30)
plt.ylabel('Temperature')
# Add legend to the right
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()
```

![Temperature over time with anomalies](/_assets/img/ml-timeseries-primer/temperature-anomaly-score.png)

## Observations

Please note the blue highlighted areas above. These are real, observed anomalies in the dataset.
You will use them later to evaluate the model. The first anomaly is a planned shutdown of the
machine. The second anomaly is difficult to detect and directly led to the third anomaly, a
catastrophic failure of the machine.

The data contains some sharp spikes, which make anomalies hard to differentiate
from ordinary measurements. However, as you will see later, modern models are quite good at finding
exactly those spots.

## Model Training

To get there, let's train a small anomaly detection model. As mentioned in the introduction, there
is a multitude of options to choose from. This article will not go into the details of model
selection, and will just use the [Merlion] library, an excellent open-source time series
analysis package developed by Salesforce.

[Merlion] implements an end-to-end machine learning framework that includes loading and
transforming data, building and training models, post-processing model outputs, and evaluating
model performance. It supports various time series learning tasks, including forecasting,
anomaly detection, and change point detection.

Start by splitting the dataset into training and test data. The exercise will use
unsupervised learning, so you want to train the model on data without anomalies, and then
check whether it is able to detect the anomalies in the test data. The data will be split at
2013-12-15.

```python
from merlion.utils import TimeSeries

train_data = TimeSeries.from_pd(time_series[time_series.index < pd.to_datetime('2013-12-15')])
test_data = TimeSeries.from_pd(time_series[time_series.index >= pd.to_datetime('2013-12-15')])
```

![Test/Train Split](/_assets/img/ml-timeseries-primer/temperature-train-test.png)

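Before training, it can be worth confirming that the split behaves as intended. The sketch
below applies the same boolean indexing to a made-up series, without Merlion, and asserts
that the two parts do not overlap and together cover all rows. Since the first labeled
anomaly window starts at 2013-12-15 17:50, splitting at midnight of that day keeps all labeled
windows out of the training part.

```python
import pandas as pd

# Hypothetical hourly series spanning the split date, for illustration only.
index = pd.date_range("2013-12-14 00:00:00", periods=48, freq="60min")
series = pd.DataFrame({"value": range(48)}, index=index)

split_at = pd.to_datetime("2013-12-15")
train = series[series.index < split_at]
test = series[series.index >= split_at]

# The two parts must not overlap and must cover the whole series.
assert train.index.max() < test.index.min()
assert len(train) + len(test) == len(series)
print(len(train), len(test))
```
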
Now, train the model using the Merlion `DefaultDetector`, which is an anomaly detection model that
balances performance and efficiency. Under the hood, the `DefaultDetector` is an ensemble of an
[ETS model] and a [Random Cut Forest] model, both of which are well suited for general-purpose
anomaly detection.

```python
from merlion.models.defaults import DefaultDetectorConfig, DefaultDetector

model = DefaultDetector(DefaultDetectorConfig())
model.train(train_data=train_data)
```

## Evaluation

Let's visually confirm the model performance:

![Temperature with detected anomalies](/_assets/img/ml-timeseries-primer/temperature-anomaly-detected.png)

The model is able to detect the anomalies, which is a very good result for a first try without any
parameter tuning. The next steps will bring this model to production.

In a real-world scenario, you would want to further improve the model by tuning its parameters and
evaluating its performance on a validation dataset. However, for the sake of simplicity,
this step will be skipped. Please refer to the [Merlion documentation] for more information on
how to do this.

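The visual check can be complemented by a simple numeric one: for each labeled anomaly window,
test whether at least one detection falls inside it. The following is a minimal sketch, not
Merlion's own evaluation procedure. The `detected` timestamps are made up, and the helper name
`windows_hit` is hypothetical. With a trained model, the detections could presumably be derived
from the non-zero entries of `model.get_anomaly_label(test_data)`.

```python
import pandas as pd

# Labeled anomaly windows, copied from the plotting example above.
anomalies = [
    ["2013-12-15 17:50:00.000000", "2013-12-17 17:00:00.000000"],
    ["2014-01-27 14:20:00.000000", "2014-01-29 13:30:00.000000"],
    ["2014-02-07 14:55:00.000000", "2014-02-09 14:05:00.000000"],
]

# Hypothetical detection timestamps, standing in for real model output.
detected = pd.to_datetime([
    "2013-12-16 10:00:00",
    "2014-01-05 03:00:00",
    "2014-02-08 20:30:00",
])


def windows_hit(windows, detections):
    """Return one boolean per window: True if any detection falls inside it."""
    hits = []
    for start, end in windows:
        start, end = pd.to_datetime(start), pd.to_datetime(end)
        hits.append(bool(((detections >= start) & (detections <= end)).any()))
    return hits


print(windows_hit(anomalies, detected))
```

A window-based check like this only counts whether a window was found at all. It says nothing
about false alarms outside the windows, which a fuller evaluation would also need to consider.
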
[Cloud Console]: https://console.cratedb.cloud/
[CrateDB Admin UI]: https://cratedb.com/docs/crate/admin-ui/
[CrateDB Cloud]: https://cratedb.com/products/cratedb-cloud
[Data Import]: https://community.cratedb.com/t/importing-data-to-cratedb-cloud-clusters/1467
[ETS model]: https://www.statsmodels.org/dev/examples/notebooks/generated/ets.html
[Merlion]: https://github.com/salesforce/Merlion
[Merlion documentation]: https://opensource.salesforce.com/Merlion/v1.0.0/examples/anomaly/1_AnomalyFeatures.html
[Random Cut Forest]: https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html
[Time Series Analysis in Python – A Comprehensive Guide with Examples]: https://www.machinelearningplus.com/time-series/time-series-analysis-python/
Lines changed: 7 additions & 2 deletions
@@ -1,11 +1,16 @@
 (timeseries-ml-primer)=
+(tsml-primer)=
+
 # Machine Learning for Time Series Data Primer
 
 Learn how to apply machine learning procedures to time series data.
 
 ```{toctree}
-:glob:
 :maxdepth: 2
 
-*
+introduction
+anomaly-detection
+mlops-intro
+mlops-cratedb-mlflow
+mlops-cratedb-sql
 ```
