3-variants-of-classification-problems-in-machine-learning.md (+3 −3)
@@ -1,9 +1,9 @@
 ---
 title: "3 Variants of Classification Problems in Machine Learning"
 date: "2020-10-19"
-categories:
+categories:
 - "deep-learning"
-tags:
+tags:
 - "classification"
 - "classifier"
 - "deep-learning"
@@ -57,7 +57,7 @@ This process - distinguishing between object types or _classes_ by automatically
 The first variant of classification problems is called **binary classification**. If you know the binary system of numbers, you'll know that it's related to the number _two_:
 
 > In mathematics and digital electronics, a binary number is a number expressed in the base-2 numeral system or binary numeral system, which uses only two symbols: typically "0" (zero) and "1" (one).
->
+>
 > Wikipedia (2003)
 
 Binary classification, here, equals the assembly line scenario that we already covered and will repeat now:
a-gentle-introduction-to-long-short-term-memory-networks-lstm.md (+4 −4)
@@ -1,9 +1,9 @@
 ---
 title: "A gentle introduction to Long Short-Term Memory Networks (LSTM)"
 date: "2020-12-29"
-categories:
+categories:
 - "deep-learning"
-tags:
+tags:
 - "deep-learning"
 - "long-short-term-memory"
 - "lstm"
@@ -53,13 +53,13 @@ After tokenizing a sequence such as a phrase, we can feed individual tokens (e.g
 Especially when you unfold this structure showing the parsing of subsequent tokens \[latex\]x\_{t-1}\[/latex\] etc., we see that hidden state passes across tokens in a left-to-right fashion. Each token can use information from the previous steps and hence benefit from additional context when transducing (e.g. translating) a token.
 
 > The structure of the network is similar to that of a standard multilayer perceptron, with the distinction that we allow connections among hidden units associated with a time delay. Through these connections the model can retain information about the past, enabling it to discover temporal correlations between events that are far away from each other in the data.
->
+>
 > Pascanu et al. (2013)
 
 While being a relatively great step forward, especially with larger sequences, classic RNNs did not show great improvements over classic neural networks where the inputs were sets of time steps (i.e. multiple tokens just at once), according to Hochreiter & Schmidhuber (1997). Diving into Hochreiter's thesis work from 6 years earlier, the researchers have identified the [vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) and the relatively large distances error flow has to go when sequences are big as one of the leading causes why such models don't perform well.
 
 > The vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlation between temporally distant events.
about-loss-and-loss-functions.md (+9 −9)
@@ -1,10 +1,10 @@
 ---
 title: "About loss and loss functions"
 date: "2019-10-04"
-categories:
+categories:
 - "deep-learning"
 - "svms"
-tags:
+tags:
 - "classifier"
 - "deep-learning"
 - "loss-function"
@@ -156,7 +156,7 @@ Remember the MSE?
 
 
 
-There's also something called the RMSE, or the **Root Mean Squared Error** or Root Mean Squared Deviation (RMSD). It goes like this:
+There's also something called the RMSE, or the **Root Mean Squared Error** or Root Mean Squared Deviation (RMSD). It goes like this:
 
 
 
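The formula images referenced in this hunk are not captured here. As a stand-in illustration of the relation the hunk describes, RMSE being the square root of the MSE, here is a hedged NumPy sketch; the target and prediction arrays are made up:

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])

# MSE: mean of the squared differences between targets and predictions
mse = np.mean((y_true - y_pred) ** 2)

# RMSE (a.k.a. RMSD): simply the square root of the MSE
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```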
@@ -254,18 +254,18 @@ Because the benefit of the \[latex\]\\delta\[/latex\] is also becoming your bott
 Loss functions are also applied in classifiers. I already discussed in another post what classification is all about, so I'm going to repeat it here:
 
 > Suppose that you work in the field of separating non-ripe tomatoes from the ripe ones. It’s an important job, one can argue, because we don’t want to sell customers tomatoes they can’t process into dinner. It’s the perfect job to illustrate what a human classifier would do.
->
-> Humans have a perfect eye to spot tomatoes that are not ripe or that have any other defect, such as being rotten. They derive certain characteristics for those tomatoes, e.g. based on color, smell and shape:
->
+>
+> Humans have a perfect eye to spot tomatoes that are not ripe or that have any other defect, such as being rotten. They derive certain characteristics for those tomatoes, e.g. based on color, smell and shape:
+>
 > \- If it’s green, it’s likely to be unripe (or: not sellable);
 > \- If it smells, it is likely to be unsellable;
 > \- The same goes for when it’s white or when fungus is visible on top of it.
->
+>
 > If none of those occur, it’s likely that the tomato can be sold. We now have _two classes_: sellable tomatoes and non-sellable tomatoes. Human classifiers _decide about which class an object (a tomato) belongs to._
->
+>
 > The same principle occurs again in machine learning and deep learning.
 > Only then, we replace the human with a machine learning model. We’re then using machine learning for _classification_, or for deciding about some “model input” to “which class” it belongs.
->
+>
 > Source: [How to create a CNN classifier with Keras?](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/)
 
 We'll now cover loss functions that are used for classification.
albert-explained-a-lite-bert.md (+6 −6)
@@ -1,9 +1,9 @@
 ---
 title: "ALBERT explained: A Lite BERT"
 date: "2021-01-06"
-categories:
+categories:
 - "deep-learning"
-tags:
+tags:
 - "albert"
 - "bert"
 - "deep-learning"
@@ -51,13 +51,13 @@ However, let's take a quick look at BERT here as well before we move on. Below,
 Previous studies (such as the [study creating BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) or the [one creating GPT](https://www.machinecurve.com/index.php/2021/01/05/dall-e-openai-gpt-3-model-can-draw-pictures-based-on-text/)) have demonstrated that the size of language models is related to performance. The bigger the language model, the better the model performs, is the general finding.
 
 > Evidence from these improvements reveals that a large network is of crucial importance for achieving state-of-the-art performance
->
+>
 > Lam et al. (2019)
 
 While this allows us to build models that really work well, this also comes at a cost: models are really huge and therefore cannot be used widely in practice.
 
 > An obstacle to answering this question is the memory limitations of available hardware. Given that current state-of-the-art models often have hundreds of millions or even billions of parameters, it is easy to hit these limitations as we try to scale our models. Training speed can also be significantly hampered in distributed training, as the communication overhead is directly proportional to the number of parameters in the model.
->
+>
 > Lam et al. (2019)
 
 Recall that BERT comes in two flavors: a \[latex\]\\text{BERT}\_\\text{BASE}\[/latex\] model that has 110 million trainable parameters, and a \[latex\]\\text{BERT}\_\\text{LARGE}\[/latex\] model that has 340 million ones (Devlin et al., 2018).
@@ -89,7 +89,7 @@ If things are not clear by now, don't worry - that was expected :D We're going t
 The first key difference between the BERT and ALBERT models is that **parameters of the word embeddings are factorized**.
 
 > In mathematics, **factorization** (...) or **factoring** consists of writing a number or another mathematical object as a product of several _factors_, usually smaller or simpler objects of the same kind. For example, 3 × 5 is a factorization of the integer 15
->
+>
 > Wikipedia (2002)
 
 Factorization of these parameters is achieved by taking the matrix representing the weights of the word embeddings \[latex\]E\[/latex\] and decomposing it into two different matrices. Instead of projecting the one-hot encoded vectors directly onto the hidden space, they are first projected on some-kind of lower-dimensional embedding space, which is then projected to the hidden space (Lan et al, 2019). Normally, this should not produce a different result, but let's wait.
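To make the effect of this factorization concrete, here is a hedged back-of-the-envelope sketch comparing parameter counts: one vocabulary-by-hidden embedding matrix versus a vocabulary-by-embedding matrix followed by an embedding-by-hidden projection. The sizes below are illustrative, roughly BERT-like, and not the exact ALBERT configuration:

```python
# Illustrative sizes only, roughly BERT-like
V = 30_000   # vocabulary size
H = 768      # hidden size
E = 128      # lower-dimensional embedding size used in the factorized variant

params_direct = V * H                  # project one-hot vectors straight onto the hidden space
params_factorized = V * E + E * H      # project onto a small embedding space first, then onto the hidden space

print(f"V x H:         {params_direct:,}")      # 23,040,000
print(f"V x E + E x H: {params_factorized:,}")  # 3,938,304
```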
@@ -168,7 +168,7 @@ The following results can be reported:
 Beyond the general results, the authors have also performed ablation experiments to see whether the changes actually cause the performance improvement, or not.
 
 > An ablation study studies the performance of an AI system by removing certain components, to understand the contribution of the component to the overall system.
an-introduction-to-tensorflow-keras-callbacks.md (+14 −14)
@@ -1,9 +1,9 @@
 ---
 title: "An introduction to TensorFlow.Keras callbacks"
 date: "2020-11-10"
-categories:
+categories:
 - "frameworks"
-tags:
+tags:
 - "callbacks"
 - "keras"
 - "tensorflow"
@@ -43,7 +43,7 @@ In Machine Learning terms, each iteration is also called an **epoch**. Hence, tr
 Now, it can be the case that you want to get insights from the training process while it is running. Or you want to provide automated steering in order to avoid wasting resources. In those cases, you might want to add a **callback** to your Keras model.
 
 > A callback is an object that can perform actions at various stages of training (e.g. at the start or end of an epoch, before or after a single batch, etc).
->
+>
 > Keras Team (n.d.)
 
 As we shall see later in this article, among others, there are [callbacks for monitoring](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/) and for stopping the training process [when it no longer makes the model better](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/). This is possible because with callbacks, we can 'capture' the training process while it is happening. They essentially 'hook' into the training process by allowing the training process to invoke certain callback definitions. In Keras, each callback implements at least one, but possibly multiple of the following definitions (Keras Team, n.d.).
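As an illustration of these 'hook' definitions (not the article's original snippet), here is a minimal sketch of a custom callback attached to a toy model; the model, the random data and the printed messages are assumptions made for demonstration only:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X = np.random.rand(256, 4)
y = np.random.rand(256, 1)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

class MyCallback(tf.keras.callbacks.Callback):
    # Invoked at the start of every epoch
    def on_epoch_begin(self, epoch, logs=None):
        print(f"Starting epoch {epoch}")

    # Invoked at the end of every epoch, with the collected logs (e.g. loss)
    def on_epoch_end(self, epoch, logs=None):
        print(f"Epoch {epoch} finished, loss = {logs['loss']:.4f}")

model.fit(X, y, epochs=3, batch_size=32, callbacks=[MyCallback()])
```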
@@ -116,7 +116,7 @@ model.fit(train_generator,
 If you want to periodically save your Keras model - or the model weights - to some file, the `ModelCheckpoint` callback is what you need.
 
 > Callback to save the Keras model or model weights at some frequency.
->
+>
 > TensorFlow (n.d.)
 
 It is available as follows:
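The article's own snippet is not part of this hunk; as a stand-in, a hedged sketch of attaching `ModelCheckpoint` to a toy model. The model, data and file path are assumptions:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Save the full model whenever the validation loss improves; the path is hypothetical
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.h5',
    monitor='val_loss',
    save_best_only=True,
    verbose=1
)
model.fit(X, y, validation_split=0.2, epochs=5, callbacks=[checkpoint])
```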
@@ -162,7 +162,7 @@ Did you know that you can visualize the training process realtime [with TensorBo
 With the `TensorBoard` callback, you can link TensorBoard with your Keras model.
 
 > Enable visualizations for TensorBoard.
->
+>
 > TensorFlow (n.d.)
 
 The callback logs a range of items from the training process into your TensorBoard log location:
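For illustration (not the article's original snippet), a hedged sketch of the `TensorBoard` callback on a toy model; the log directory is a hypothetical path:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Write training logs (metrics, histograms) to a hypothetical log directory
tensorboard = tf.keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=1)
model.fit(X, y, validation_split=0.2, epochs=5, callbacks=[tensorboard])
# Afterwards, inspect the run with: tensorboard --logdir=./logs
```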
@@ -211,7 +211,7 @@
 During this process, you want to find a model that performs well in terms of predictions (i.e., it is not underfit) but that is not too rigid with respect to the dataset it is trained on (i.e., it is neither overfit). That's why the `EarlyStopping` callback can be useful if you are dealing with a situation like this.
 
 > Stop training when a monitored metric has stopped improving.
->
+>
 > TensorBoard (n.d.)
 
 It is implemented as follows:
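For illustration (not the article's original snippet), a hedged sketch of `EarlyStopping` on a toy model; the monitored metric and patience value are assumptions:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Stop once validation loss has not improved for 3 consecutive epochs,
# and roll back to the best weights seen so far
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)
model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[early_stopping])
```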
@@ -252,7 +252,7 @@ During the optimization process, a so called _weight update_ is computed. Howeve
 Preferably being relatively large during the early iterations and lower in the later stages, we must adapt the learning rate during the training process. This is called [learning rate decay](https://www.machinecurve.com/index.php/2019/11/11/problems-with-fixed-and-decaying-learning-rates/) and shows what a _learning rate scheduler_ can be useful for. The `LearningRateScheduler` callback implements this functionality.
 
 > At the beginning of every epoch, this callback gets the updated learning rate value from `schedule` function provided at `__init__`, with the current epoch and current learning rate, and applies the updated learning rate on the optimizer.
->
+>
 > TensorFlow (n.d.)
 
 Its implementation is really simple:
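For illustration (not the article's original snippet), a hedged sketch of `LearningRateScheduler` with a made-up decay schedule on a toy model:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Hypothetical schedule: keep the initial rate for 5 epochs, then decay by 10% per epoch
def schedule(epoch, lr):
    return lr if epoch < 5 else lr * 0.9

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=1)
model.fit(X, y, epochs=15, callbacks=[lr_scheduler])
```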
@@ -294,7 +294,7 @@ Keeping your learning rate equal when close to a plateau means that your model w
 With the `ReduceLROnPlateau` callback, the optimization process can be instructed to _reduce_ the learning rate (and hence the step) when a plateau is encountered.
 
 > Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This callback monitors a quantity and if no improvement is seen for a 'patience' number of epochs, the learning rate is reduced.
->
+>
 > TensorFlow (n.d.)
 
 The callback is implemented as follows:
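For illustration (not the article's original snippet), a hedged sketch of `ReduceLROnPlateau` on a toy model; the factor, patience and floor values are assumptions:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Halve the learning rate when validation loss plateaus for 2 epochs
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    min_lr=1e-6,
    verbose=1
)
model.fit(X, y, validation_split=0.2, epochs=20, callbacks=[reduce_lr])
```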
@@ -333,7 +333,7 @@ Above, we saw that training logs can be distributed to [TensorBoard](https://www
 In those cases, you might wish to send the training logs there instead. The `RemoteMonitor` callback can help you do this.
 
 > Callback used to stream events to a server.
->
+>
 > TensorFlow (n.d.)
 
 It is implemented as follows:
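For illustration (not the article's original snippet), a hedged sketch of `RemoteMonitor` on a toy model; the endpoint URL is hypothetical and a server must actually be listening there to receive the events:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Stream epoch-end events as JSON to a hypothetical local endpoint
remote_monitor = tf.keras.callbacks.RemoteMonitor(
    root='http://localhost:9000',
    path='/publish/epoch/end/',
    field='data',
    send_as_json=True
)
model.fit(X, y, epochs=5, callbacks=[remote_monitor])
```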
@@ -369,7 +369,7 @@ model.fit(train_generator,
 Say that you want a certain function to fire after every batch or every epoch - a simple function, nothing special. However, it's not provided in the collection of callbacks presented with the `tensorflow.keras.callbacks` API. In this case, you might want to use the `LambdaCallback`.
 
 > Callback for creating simple, custom callbacks on-the-fly. This callback is constructed with anonymous functions that will be called at the appropriate time. Te
->
+>
 > TensorFlow (n.d.)
 
 It can thus be used to provide anonymous (i.e. `lambda` functions without a name) functions to the training process. The callback looks as follows:
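For illustration (not the article's original snippet), a hedged sketch of `LambdaCallback` with two made-up anonymous functions on a toy model:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Anonymous functions fired at the end of each epoch and at the end of training
lambda_callback = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: print(f"Epoch {epoch}: loss={logs['loss']:.4f}"),
    on_train_end=lambda logs: print("Training finished")
)
model.fit(X, y, epochs=3, callbacks=[lambda_callback])
```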
@@ -401,7 +401,7 @@ model.fit(train_generator,
 In some cases (e.g. when you did not apply min-max normalization to your input data), the loss value can be very strange - outputting values close to Infinity or values that are Not a Number (`NaN`). In those cases, you don't want to pursue further training. The `TerminateOnNaN` callback can help here.
 
 > Callback that terminates training when a NaN loss is encountered.
->
+>
 > TensorFlow (n.d.)
 
 It is implemented as follows:
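For illustration (not the article's original snippet), a hedged sketch of `TerminateOnNaN` on a toy model; the callback takes no arguments:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Abort training as soon as the loss becomes NaN
terminate_on_nan = tf.keras.callbacks.TerminateOnNaN()
model.fit(X, y, epochs=5, callbacks=[terminate_on_nan])
```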
@@ -428,7 +428,7 @@ model.fit(train_generator,
 CSV files can be very useful when you need to exchange data. If you want to flush your training logs into a CSV file, the `CSVLogger` callback can be useful to you.
 
 > Callback that streams epoch results to a CSV file.
->
+>
 > TensorFlow (n.d.)
 
 It is implemented as follows:
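For illustration (not the article's original snippet), a hedged sketch of `CSVLogger` on a toy model; the CSV file name is hypothetical:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Write per-epoch metrics to a hypothetical CSV file, overwriting any existing file
csv_logger = tf.keras.callbacks.CSVLogger('training_log.csv', separator=',', append=False)
model.fit(X, y, validation_split=0.2, epochs=5, callbacks=[csv_logger])
```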
@@ -461,7 +461,7 @@ model.fit(train_generator,
 When you are training a Keras model with verbosity set to `True`, you will see a progress bar in your terminal. With the `ProgbarLogger` callback, you can change what is displayed there.
 
 > Callback that prints metrics to stdout.
->
+>
 > TensorFlow (n.d.)
 
 It is implemented as follows:
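For illustration (not the article's original snippet), a hedged sketch of `ProgbarLogger` on a toy model; the `count_mode` value is an assumption and behaviour may differ between TensorFlow versions:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Print metrics to stdout, counting progress in steps (batches)
progbar = tf.keras.callbacks.ProgbarLogger(count_mode='steps')
model.fit(X, y, epochs=3, callbacks=[progbar])
```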
@@ -493,7 +493,7 @@ model.fit(train_generator,
 When you are training a neural network, especially in a [distributed setting](https://www.machinecurve.com/index.php/2020/10/16/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model/), it would be problematic if your training process suddenly stops - e.g. due to machine failure. Every iteration passed so far will be gone. With the experimental `BackupAndRestore` callback, you can instruct Keras to create temporary checkpoint files after each epoch, to which you can restore later.
 
 > `BackupAndRestore` callback is intended to recover from interruptions that happened in the middle of a model.fit execution by backing up the training states in a temporary checkpoint file (based on TF CheckpointManager) at the end of each epoch.
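For illustration (not the article's original snippet), a hedged sketch of the experimental `BackupAndRestore` callback on a toy model; the backup directory is hypothetical, and in newer TensorFlow releases the callback is exposed as `tf.keras.callbacks.BackupAndRestore` instead of the experimental namespace used here:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Back up training state each epoch to a hypothetical directory; if training is
# interrupted and the script is re-run, fit() resumes from the last completed epoch
backup = tf.keras.callbacks.experimental.BackupAndRestore(backup_dir='/tmp/training_backup')
model.fit(X, y, epochs=10, callbacks=[backup])
```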
@@ -74,7 +74,7 @@
 Here, the architectural choices you make (such as the number of filters for a `Conv2D` layer, kernel size, or the number of output nodes for your `Dense` layer) determine what are known as the _parameters_ of your neural network - the weights (and by consequence biases) of your neural network:[](https://datascience.stackexchange.com/posts/17643/timeline)
 
 > The parameters of a neural network are typically the weights of the connections. In this case, these parameters are learned during the training stage. So, the algorithm itself (and the input data) tunes these parameters.
->
+>
 > [Robin, at StackExchange](https://datascience.stackexchange.com/questions/17635/model-parameters-hyper-parameters-of-neural-network-their-tuning-in-training#:~:text=The%20parameters%20of%20a%20neural,or%20the%20number%20of%20epochs.)
 
 ### Tuning hyperparameters in your neural network
@@ -89,7 +89,7 @@ However, things don't end there. Rather, in step (2), you'll _configure_ the mod
 Here's why they are called _hyper_parameters:
 
 > The hyper parameters are typically the learning rate, the batch size or the number of epochs. The are so called "hyper" because they influence how your parameters will be learned. You optimize these hyper parameters as you want (depends on your possibilities): grid search, random search, by hand, using visualisations… The validation stage help you to both know if your parameters have been learned enough and know if your hyper parameters are good.
->
+>
 > [Robin, at StackExchange](https://datascience.stackexchange.com/questions/17635/model-parameters-hyper-parameters-of-neural-network-their-tuning-in-training#:~:text=The%20parameters%20of%20a%20neural,or%20the%20number%20of%20epochs.)
 
 As Robin suggests, hyperparameters can be selected (and optimized) in multiple ways. The easiest way of doing so is by hand: you, as a deep learning engineer, select a set of hyperparameters that you will subsequently alter in an attempt to make the model better.
@@ -103,7 +103,7 @@ However, can't we do this in a better way when training a Keras model?
 As you would have expected: yes, we can! :) Let's introduce Keras Tuner to the scene. As you would expect from engineers, the description as to what it does is really short but provides all the details:
 
 > A hyperparameter tuner for Keras, specifically for tf.keras with TensorFlow 2.0.
->
+>
 > [Keras-tuner on GitHub](https://github.com/keras-team/keras-tuner)
 
 If you already want to look around, you could visit their website, and if not, let's take a look at what it does.
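As an illustration of what such a tuner does (not the article's original snippet), here is a hedged sketch of a random search over two hyperparameters with Keras Tuner; the toy data, the tuned ranges and the import name are assumptions (older releases shipped the package as `kerastuner` rather than `keras_tuner`):

```python
import numpy as np
import tensorflow as tf
import keras_tuner  # may be importable as `kerastuner` in older releases

# Toy data, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)

# The tuner calls this function with a HyperParameters object `hp`
def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int('units', min_value=8, max_value=64, step=8),
                              activation='relu', input_shape=(4,)),
        tf.keras.layers.Dense(1)
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='mse'
    )
    return model

# Try 5 random hyperparameter combinations and keep the best one by validation loss
tuner = keras_tuner.RandomSearch(build_model, objective='val_loss', max_trials=5)
tuner.search(X, y, validation_split=0.2, epochs=5)
best_model = tuner.get_best_models(num_models=1)[0]
```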