| 1 | +(tsml-primer-anomaly-detection)= |
| 2 | + |
| 3 | +# Anomaly Detection Example for Machine Data |
| 4 | + |
| 5 | +## Prologue |
| 6 | + |
| 7 | +**NOTE:** While this example should provide more depth to understanding time series modeling, |
| 8 | +it is not intended to teach the foundations of this field of data science. Instead, it |
| 9 | +will focus more on how to use machine learning models in production scenarios. |
| 10 | + |
| 11 | +If you |
| 12 | +are interested in learning more about time series modeling, we recommend reading [Time |
| 13 | +Series Analysis in Python – A Comprehensive Guide with Examples], by Selva Prabhakaran. |
| 14 | + |
| 15 | +## About |
| 16 | + |
| 17 | +The exercise uses a dataset from the Numenta Anomaly Benchmark (NAB), which includes |
| 18 | +real-world and artificial time series data for anomaly detection research. We will use the |
| 19 | +dataset of real temperature readings measured in a machine room. |
| 20 | + |
| 21 | +The goal is to detect |
| 22 | +anomalies in the temperature readings, which could indicate a malfunctioning machine. The dataset |
| 23 | +contains machine temperature measurements, and will be loaded into CrateDB up front. |
| 24 | + |
| 25 | +## Setup |
| 26 | + |
| 27 | +To follow this tutorial, install the prerequisites by running the following commands in your |
| 28 | +terminal. Furthermore, load the designated dataset into your [CrateDB Cloud] cluster. |
| 29 | + |
| 30 | +```bash |
| 31 | +pip install 'crate[sqlalchemy]' 'numpy==1.23.5' crash matplotlib pandas salesforce-merlion |
| 32 | +``` |
| 33 | + |
| 34 | +Please note the following external dependencies of the [Merlion] library: |
| 35 | + |
| 36 | +### OpenMP |
| 37 | +Some forecasting models depend on OpenMP. Please install it before installing Merlion, |
| 38 | +to ensure that OpenMP is configured to work with the lightgbm package, one of |
| 39 | +Merlion's dependencies. |
| 40 | + |
| 41 | +When using Anaconda, please run |
| 42 | +```shell |
| 43 | +conda install -c conda-forge lightgbm |
| 44 | +``` |
| 45 | +When using macOS, please install the Homebrew package manager and invoke |
| 46 | +```shell |
| 47 | +brew install libomp |
| 48 | +``` |
| 49 | + |
| 50 | +### Java |
| 51 | +Some anomaly detection models depend on the Java Development Kit (JDK). On Debian or Ubuntu, run |
| 52 | +```shell |
| 53 | +sudo apt-get install openjdk-11-jdk |
| 54 | +``` |
| 55 | +On macOS, install Homebrew, and invoke |
| 56 | +```shell |
| 57 | +brew tap adoptopenjdk/openjdk |
| 58 | +brew install --cask adoptopenjdk11 |
| 59 | +``` |
| 60 | +Also, ensure that Java can be found on your `PATH`, and that the `JAVA_HOME` environment variable |
| 61 | +is configured correctly. |
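
One way to check both settings before running the Java-dependent models is a short Python script. This is a minimal sketch using only the standard library. The function name `check_java_environment` is made up for this example and is not part of Merlion or CrateDB.

```python
import os
import shutil


def check_java_environment():
    """Report where `java` resolves on PATH and what JAVA_HOME points to."""
    java_path = shutil.which("java")         # None if `java` is not on PATH
    java_home = os.environ.get("JAVA_HOME")  # None if the variable is unset
    return {
        "java_on_path": java_path,
        "java_home": java_home,
        # JAVA_HOME is only useful if it points to an existing directory.
        "java_home_is_dir": bool(java_home) and os.path.isdir(java_home),
    }


if __name__ == "__main__":
    for key, value in check_java_environment().items():
        print(f"{key}: {value}")
```

If `java_on_path` is `None` or `java_home_is_dir` is `False`, revisit the installation steps above before training models that need the JDK.
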
| 62 | + |
| 63 | + |
| 64 | + |
| 65 | +## Importing Data |
| 66 | + |
| 67 | +If you are using [CrateDB Cloud], navigate to the [Cloud Console], and use the [Data Import] feature |
| 68 | +to import the CSV file directly from the given URL into the database table `machine_data`. |
| 69 | +``` |
| 70 | +https://github.com/crate/cratedb-datasets/raw/main/machine-learning/timeseries/nab-machine-failure.csv |
| 71 | +``` |
| 72 | + |
| 75 | + |
| 76 | +The import process will automatically infer an SQL DDL schema from the shape of the data source. |
| 77 | +When visiting the [CrateDB Admin UI] after the import process has concluded, you can verify that the |
| 78 | +`machine_data` table was created and populated correctly. |
| 79 | + |
| 81 | + |
| 82 | +If you want to exercise the data import on your workstation, use the `crash` command-line program. |
| 83 | +```shell |
| 84 | +crash --command 'CREATE TABLE IF NOT EXISTS "machine_data" ("timestamp" TIMESTAMP, "value" REAL);' |
| 85 | +crash --command "COPY machine_data FROM 'https://github.com/crate/cratedb-datasets/raw/main/machine-learning/timeseries/nab-machine-failure.csv';" |
| 86 | +``` |
| 87 | + |
| 88 | +Note: If you are connecting to CrateDB Cloud, use the options |
| 89 | +`--hosts 'https://<hostname>:4200' --username '<username>'`. In order to run the program |
| 90 | +non-interactively, without being prompted for a password, use `export CRATEPW='<password>'`. |
| 91 | + |
| 92 | + |
| 93 | +## Loading Data |
| 94 | + |
| 95 | +First, load the dataset into a pandas DataFrame and convert the `timestamp` column from |
| 96 | +epoch milliseconds to pandas `Timestamp` values. |
| 97 | + |
| 98 | +```python |
| 99 | +from crate import client |
| 100 | +import pandas as pd |
| 101 | + |
| 102 | +# Connect to database. |
| 103 | +conn = client.connect( |
| 104 | +    "https://<your-instance>.azure.cratedb.net:4200", |
| 105 | +    username="admin", |
| 106 | +    password="<your-password>", |
| 107 | +    verify_ssl_cert=True) |
| 108 | + |
| 109 | +# Query and load data. |
| 110 | +with conn: |
| 111 | +    cursor = conn.cursor() |
| 112 | +    cursor.execute("SELECT timestamp, value " |
| 113 | +                   "FROM machine_data ORDER BY timestamp ASC") |
| 114 | +    data = cursor.fetchall() |
| 115 | + |
| 116 | +# Convert to pandas DataFrame; epoch milliseconds become timezone-naive UTC timestamps. |
| 117 | +time_series = pd.DataFrame( |
| 118 | +    [{'timestamp': pd.to_datetime(item[0], unit='ms'), 'value': item[1]} |
| 119 | +     for item in data]) |
| 120 | + |
| 121 | +# Set the timestamp as the index. |
| 122 | +time_series = time_series.set_index('timestamp') |
| 123 | +``` |
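
The conversion step can also be looked at in isolation. The following is a minimal sketch using only the standard library, assuming (as the division by 1000 in the code above implies) that the timestamps arrive as epoch milliseconds. The helper name `epoch_ms_to_utc` and the two sample rows are made up for illustration. Passing an explicit UTC timezone keeps the result independent of the workstation's local timezone.

```python
from datetime import datetime, timezone


def epoch_ms_to_utc(epoch_ms):
    """Convert integer epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


# Hypothetical rows in the shape returned by `cursor.fetchall()`:
# (timestamp in epoch milliseconds, value). The values are invented.
rows = [(1386006300000, 73.97), (1386006600000, 74.94)]

records = [(epoch_ms_to_utc(ts), value) for ts, value in rows]
for ts, value in records:
    print(ts.isoformat(), value)
```
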
| 124 | + |
| 125 | +## Downsampling |
| 126 | + |
| 127 | +**TIP:** CrateDB provides many useful analytical functions tailored for time series data. One of |
| 128 | +them is `date_bin`, which bins the input timestamp to the specified interval and is therefore |
| 129 | +very handy for resampling data. |
| 130 | + |
| 131 | +In general, for time series modeling, you often want to sample your data at a high frequency, in |
| 132 | +order not to miss any events. However, this results in huge data volumes, increasing the cost of |
| 133 | +model training. Here, it is best practice to downsample your data to reasonable intervals. |
| 134 | + |
| 135 | +This SQL statement uses CrateDB's `date_bin` function to downsample the data to 5-minute |
| 136 | +intervals, reducing both the amount of data and the complexity of the modeling process. |
| 137 | + |
| 138 | +```sql |
| 139 | +SELECT |
| 140 | +    DATE_BIN('5 min'::INTERVAL, "timestamp", 0) AS timestamp, |
| 141 | +    MAX(value) AS temperature |
| 142 | +FROM machine_data |
| 143 | +GROUP BY 1 |
| 144 | +ORDER BY 1 ASC |
| 145 | +``` |
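
To make the binning arithmetic concrete, here is a pure-Python sketch of the same idea: floor each timestamp to the start of its 5-minute interval, then keep the maximum value per interval. It assumes epoch-millisecond timestamps and an origin of 0, and it illustrates the concept only; it is not CrateDB's implementation. The function names and sample readings are hypothetical.

```python
from collections import defaultdict

FIVE_MINUTES_MS = 5 * 60 * 1000


def bin_timestamp(epoch_ms, interval_ms, origin_ms=0):
    """Floor a timestamp to the start of its interval, counted from the origin."""
    return origin_ms + ((epoch_ms - origin_ms) // interval_ms) * interval_ms


def downsample_max(rows, interval_ms=FIVE_MINUTES_MS):
    """Group (epoch_ms, value) rows into bins and keep the maximum per bin."""
    bins = defaultdict(list)
    for epoch_ms, value in rows:
        bins[bin_timestamp(epoch_ms, interval_ms)].append(value)
    return sorted((start, max(values)) for start, values in bins.items())


# Made-up readings, one per minute, spanning two 5-minute bins.
rows = [(60_000 * minute, 70.0 + minute) for minute in range(7)]
print(downsample_max(rows))
```

Seven one-minute readings collapse into two rows here, which is the kind of volume reduction the SQL statement achieves on the database side.
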
| 146 | + |
| 147 | +## Plotting |
| 148 | + |
| 149 | +Next, plot the data to get a better understanding of the dataset. |
| 150 | + |
| 151 | +```python |
| 152 | +import pandas as pd |
| 153 | +import matplotlib.pyplot as plt |
| 154 | +import matplotlib.dates as mdates |
| 155 | + |
| 156 | +anomalies = [ |
| 157 | + ["2013-12-15 17:50:00.000000", "2013-12-17 17:00:00.000000"], |
| 158 | + ["2014-01-27 14:20:00.000000", "2014-01-29 13:30:00.000000"], |
| 159 | + ["2014-02-07 14:55:00.000000", "2014-02-09 14:05:00.000000"] |
| 160 | +] |
| 161 | + |
| 162 | +plt.figure(figsize=(12,7)) |
| 163 | +line, = plt.plot(time_series.index, time_series['value'], linestyle='solid', color='black', label='Temperature') |
| 164 | + |
| 165 | +# Highlight anomalies |
| 166 | +ctr = 0 |
| 167 | +for timeframe in anomalies: |
| 168 | +    ctr += 1 |
| 169 | +    plt.axvspan(pd.to_datetime(timeframe[0]), pd.to_datetime(timeframe[1]), color='blue', alpha=0.3, label=f'Anomaly {ctr}') |
| 170 | + |
| 171 | +# Formatting x-axis for better readability |
| 172 | +plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y/%m/%d')) |
| 173 | +plt.gca().xaxis.set_major_locator(mdates.DayLocator(interval=7)) |
| 174 | +plt.gcf().autofmt_xdate() # Rotate & align the x labels for a better view |
| 175 | + |
| 176 | +plt.title('Temperature Over Time', fontsize=20, fontweight='bold', pad=30) |
| 177 | +plt.ylabel('Temperature') |
| 178 | +# Add legend to the right |
| 179 | +plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left') |
| 180 | + |
| 181 | +plt.tight_layout() |
| 182 | +plt.show() |
| 183 | +``` |
| 184 | + |
| 185 | + |
| 186 | + |
| 187 | +## Observations |
| 188 | + |
| 189 | +Please note the blue highlighted areas above - these are real, observed anomalies in the dataset. |
| 190 | +You will use them later to evaluate the model. The first anomaly is a planned shutdown of the |
| 191 | +machine. The second anomaly is difficult to detect and directly led to the third anomaly, a |
| 192 | +catastrophic failure of the machine. |
| 193 | + |
| 194 | +You see that there are some nasty spikes in the data, which make anomalies hard to differentiate |
| 195 | +from ordinary measurements. However, as you will see later, modern models are quite good at finding |
| 196 | +exactly those spots. |
| 197 | + |
| 198 | +## Model Training |
| 199 | + |
| 200 | +To get there, let's train a small anomaly detection model. As mentioned in the introduction, there |
| 201 | +are a multitude of options to choose from. This article will not go into the details of model |
| 202 | +selection, and will just use the [Merlion] library, an excellent open-source time series |
| 203 | +analysis package developed by Salesforce. |
| 204 | + |
| 205 | +[Merlion] implements an end-to-end machine |
| 206 | +learning framework that includes loading and transforming data, building and training models, |
| 207 | +post-processing model outputs, and evaluating model performance. It supports various time series |
| 208 | +learning tasks, including forecasting, anomaly detection, and change point detection. |
| 209 | + |
| 210 | +Start by splitting the dataset into training and test data. The exercise will use |
| 211 | +unsupervised learning, so you want to train the model on data without anomalies, and then |
| 212 | +check whether it is able to detect the anomalies in the test data. The data will be split at |
| 213 | +2013-12-15. |
| 214 | + |
| 215 | +```python |
| 216 | +from merlion.utils import TimeSeries |
| 217 | + |
| 218 | +train_data = TimeSeries.from_pd(time_series[time_series.index < pd.to_datetime('2013-12-15')]) |
| 219 | +test_data = TimeSeries.from_pd(time_series[time_series.index >= pd.to_datetime('2013-12-15')]) |
| 220 | +``` |
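
Because the model should be trained on anomaly-free data, it is worth checking that none of the labeled anomaly windows starts before the split date. The sketch below does this with the standard library only. The anomaly start times are copied from the plotting code above; the helper names and the three sample records are hypothetical.

```python
from datetime import datetime

SPLIT = datetime(2013, 12, 15)
FMT = "%Y-%m-%d %H:%M:%S.%f"

# Start times of the labeled anomaly windows, copied from the plotting code.
anomaly_starts = [
    "2013-12-15 17:50:00.000000",
    "2014-01-27 14:20:00.000000",
    "2014-02-07 14:55:00.000000",
]


def split_by_time(records, split=SPLIT):
    """Split (timestamp, value) records into train (< split) and test (>= split)."""
    train = [r for r in records if r[0] < split]
    test = [r for r in records if r[0] >= split]
    return train, test


def starts_before_split(starts, split=SPLIT):
    """Return the anomaly start times that would leak into the training data."""
    return [s for s in starts if datetime.strptime(s, FMT) < split]


# Invented sample records around the split date.
records = [
    (datetime(2013, 12, 14, 23, 55), 80.1),
    (datetime(2013, 12, 15, 0, 0), 80.3),
    (datetime(2013, 12, 15, 0, 5), 80.2),
]
train, test = split_by_time(records)
print(len(train), len(test), starts_before_split(anomaly_starts))
```

An empty list from `starts_before_split` means the training portion contains none of the labeled anomalies.
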
| 221 | + |
| 222 | + |
| 223 | + |
| 224 | +Now, train the model using the Merlion `DefaultDetector`, an anomaly detection model that |
| 225 | +balances performance and efficiency. Under the hood, the `DefaultDetector` is an ensemble of an |
| 226 | +[ETS model] and a [Random Cut Forest] model, both of which are well suited to general-purpose anomaly detection. |
| 227 | + |
| 228 | +```python |
| 229 | +from merlion.models.defaults import DefaultDetectorConfig, DefaultDetector |
| 230 | + |
| 231 | +model = DefaultDetector(DefaultDetectorConfig()) |
| 232 | +model.train(train_data=train_data) |
| 233 | +``` |
| 234 | + |
| 235 | +## Evaluation |
| 236 | + |
| 237 | +Let's visually confirm the model performance: |
| 238 | + |
| 239 | + |
| 240 | + |
| 241 | +The model is able to detect the anomalies, a very good result for the first try, and without any |
| 242 | +parameter tuning. The next steps will bring this model to production. |
| 243 | + |
| 244 | +In a real-world scenario, you would want to further improve the model by tuning its parameters and |
| 245 | +evaluating its performance on a validation dataset. However, for the sake of simplicity, |
| 246 | +this step will be skipped. Please refer to the [Merlion documentation] for more information on |
| 247 | +how to do this. |
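
Besides visual inspection, a simple numeric check is to count how many labeled anomaly windows contain at least one alarm, and how many alarms fall outside every window. The following is a minimal sketch under stated assumptions: the windows are copied from the plotting code above, while the `detections` list holds invented placeholder timestamps, not actual model output. In practice, you would fill it with the timestamps at which the trained model raises an alarm. The function name `evaluate_detections` is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Labeled anomaly windows, copied from the plotting code.
anomalies = [
    ("2013-12-15 17:50:00.000000", "2013-12-17 17:00:00.000000"),
    ("2014-01-27 14:20:00.000000", "2014-01-29 13:30:00.000000"),
    ("2014-02-07 14:55:00.000000", "2014-02-09 14:05:00.000000"),
]


def evaluate_detections(detections, windows):
    """Count windows hit by at least one detection, and detections outside all windows."""
    parsed = [(datetime.strptime(a, FMT), datetime.strptime(b, FMT)) for a, b in windows]
    hits = [any(start <= d <= end for d in detections) for start, end in parsed]
    false_alarms = [d for d in detections
                    if not any(start <= d <= end for start, end in parsed)]
    return {
        "windows_detected": sum(hits),
        "windows_total": len(parsed),
        "false_alarms": len(false_alarms),
    }


# Placeholder alarm timestamps, invented for illustration only.
detections = [
    datetime(2013, 12, 16, 6, 0),
    datetime(2014, 1, 10, 12, 0),
    datetime(2014, 2, 8, 3, 0),
]
print(evaluate_detections(detections, anomalies))
```

Counting per window rather than per data point is a deliberate simplification: a window counts as detected as soon as one alarm falls inside it.
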
| 248 | + |
| 249 | + |
| 250 | +[Cloud Console]: https://console.cratedb.cloud/ |
| 251 | +[CrateDB Admin UI]: https://cratedb.com/docs/crate/admin-ui/ |
| 252 | +[CrateDB Cloud]: https://cratedb.com/products/cratedb-cloud |
| 253 | +[Data Import]: https://community.cratedb.com/t/importing-data-to-cratedb-cloud-clusters/1467 |
| 254 | +[ETS model]: https://www.statsmodels.org/dev/examples/notebooks/generated/ets.html |
| 255 | +[Merlion]: https://github.com/salesforce/Merlion |
| 256 | +[Merlion documentation]: https://opensource.salesforce.com/Merlion/v1.0.0/examples/anomaly/1_AnomalyFeatures.html |
| 257 | +[Random Cut Forest]: https://docs.aws.amazon.com/sagemaker/latest/dg/randomcutforest.html |
| 258 | +[Time Series Analysis in Python – A Comprehensive Guide with Examples]: https://www.machinelearningplus.com/time-series/time-series-analysis-python/ |