3-variants-of-classification-problems-in-machine-learning.md (+3 −3)
@@ -1,9 +1,9 @@
 ---
 title: "3 Variants of Classification Problems in Machine Learning"
 date: "2020-10-19"
-categories:
+categories:
 - "deep-learning"
-tags:
+tags:
 - "classification"
 - "classifier"
 - "deep-learning"
@@ -57,7 +57,7 @@ This process - distinguishing between object types or _classes_ by automatically
 The first variant of classification problems is called **binary classification**. If you know the binary system of numbers, you'll know that it's related to the number _two_:
 
 > In mathematics and digital electronics, a binary number is a number expressed in the base-2 numeral system or binary numeral system, which uses only two symbols: typically "0" (zero) and "1" (one).
->
+>
 > Wikipedia (2003)
 
 Binary classification, here, equals the assembly line scenario that we already covered and will repeat now:
a-gentle-introduction-to-long-short-term-memory-networks-lstm.md (+4 −4)
@@ -1,9 +1,9 @@
 ---
 title: "A gentle introduction to Long Short-Term Memory Networks (LSTM)"
 date: "2020-12-29"
-categories:
+categories:
 - "deep-learning"
-tags:
+tags:
 - "deep-learning"
 - "long-short-term-memory"
 - "lstm"
@@ -53,13 +53,13 @@ After tokenizing a sequence such as a phrase, we can feed individual tokens (e.g
 Especially when you unfold this structure showing the parsing of subsequent tokens \[latex\]x\_{t-1}\[/latex\] etc., we see that hidden state passes across tokens in a left-to-right fashion. Each token can use information from the previous steps and hence benefit from additional context when transducing (e.g. translating) a token.
 
 > The structure of the network is similar to that of a standard multilayer perceptron, with the distinction that we allow connections among hidden units associated with a time delay. Through these connections the model can retain information about the past, enabling it to discover temporal correlations between events that are far away from each other in the data.
->
+>
 > Pascanu et al. (2013)
 
 While being a relatively great step forward, especially with larger sequences, classic RNNs did not show great improvements over classic neural networks where the inputs were sets of time steps (i.e. multiple tokens just at once), according to Hochreiter & Schmidhuber (1997). Diving into Hochreiter's thesis work from 6 years earlier, the researchers have identified the [vanishing gradients problem](https://www.machinecurve.com/index.php/2019/08/30/random-initialization-vanishing-and-exploding-gradients/) and the relatively large distances error flow has to go when sequences are big as one of the leading causes why such models don't perform well.
 
 > The vanishing gradients problem refers to the opposite behaviour, when long term components go exponentially fast to norm 0, making it impossible for the model to learn correlation between temporally distant events.
about-loss-and-loss-functions.md (+9 −9)
@@ -1,10 +1,10 @@
 ---
 title: "About loss and loss functions"
 date: "2019-10-04"
-categories:
+categories:
 - "deep-learning"
 - "svms"
-tags:
+tags:
 - "classifier"
 - "deep-learning"
 - "loss-function"
@@ -156,7 +156,7 @@ Remember the MSE?
 
 
 
-There's also something called the RMSE, or the **Root Mean Squared Error** or Root Mean Squared Deviation (RMSD). It goes like this:
+There's also something called the RMSE, or the **Root Mean Squared Error** or Root Mean Squared Deviation (RMSD). It goes like this:
 
 
 
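The formula images referenced in this hunk are not captured here. As a stand-in illustration of the relation the hunk describes, RMSE being the square root of the MSE, here is a hedged NumPy sketch; the target and prediction arrays are made up:

```python
import numpy as np

# Made-up targets and predictions, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.7])

# MSE: mean of the squared differences between targets and predictions
mse = np.mean((y_true - y_pred) ** 2)

# RMSE (a.k.a. RMSD): simply the square root of the MSE
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```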
@@ -254,18 +254,18 @@ Because the benefit of the \[latex\]\\delta\[/latex\] is also becoming your bott
 Loss functions are also applied in classifiers. I already discussed in another post what classification is all about, so I'm going to repeat it here:
 
 > Suppose that you work in the field of separating non-ripe tomatoes from the ripe ones. It’s an important job, one can argue, because we don’t want to sell customers tomatoes they can’t process into dinner. It’s the perfect job to illustrate what a human classifier would do.
->
-> Humans have a perfect eye to spot tomatoes that are not ripe or that have any other defect, such as being rotten. They derive certain characteristics for those tomatoes, e.g. based on color, smell and shape:
->
+>
+> Humans have a perfect eye to spot tomatoes that are not ripe or that have any other defect, such as being rotten. They derive certain characteristics for those tomatoes, e.g. based on color, smell and shape:
+>
 > \- If it’s green, it’s likely to be unripe (or: not sellable);
 > \- If it smells, it is likely to be unsellable;
 > \- The same goes for when it’s white or when fungus is visible on top of it.
->
+>
 > If none of those occur, it’s likely that the tomato can be sold. We now have _two classes_: sellable tomatoes and non-sellable tomatoes. Human classifiers _decide about which class an object (a tomato) belongs to._
->
+>
 > The same principle occurs again in machine learning and deep learning.
 > Only then, we replace the human with a machine learning model. We’re then using machine learning for _classification_, or for deciding about some “model input” to “which class” it belongs.
->
+>
 > Source: [How to create a CNN classifier with Keras?](https://www.machinecurve.com/index.php/2019/09/17/how-to-create-a-cnn-classifier-with-keras/)
 
 We'll now cover loss functions that are used for classification.
albert-explained-a-lite-bert.md (+6 −6)
@@ -1,9 +1,9 @@
 ---
 title: "ALBERT explained: A Lite BERT"
 date: "2021-01-06"
-categories:
+categories:
 - "deep-learning"
-tags:
+tags:
 - "albert"
 - "bert"
 - "deep-learning"
@@ -51,13 +51,13 @@ However, let's take a quick look at BERT here as well before we move on. Below,
 Previous studies (such as the [study creating BERT](https://www.machinecurve.com/index.php/2021/01/04/intuitive-introduction-to-bert/) or the [one creating GPT](https://www.machinecurve.com/index.php/2021/01/05/dall-e-openai-gpt-3-model-can-draw-pictures-based-on-text/)) have demonstrated that the size of language models is related to performance. The bigger the language model, the better the model performs, is the general finding.
 
 > Evidence from these improvements reveals that a large network is of crucial importance for achieving state-of-the-art performance
->
+>
 > Lam et al. (2019)
 
 While this allows us to build models that really work well, this also comes at a cost: models are really huge and therefore cannot be used widely in practice.
 
 > An obstacle to answering this question is the memory limitations of available hardware. Given that current state-of-the-art models often have hundreds of millions or even billions of parameters, it is easy to hit these limitations as we try to scale our models. Training speed can also be significantly hampered in distributed training, as the communication overhead is directly proportional to the number of parameters in the model.
->
+>
 > Lam et al. (2019)
 
 Recall that BERT comes in two flavors: a \[latex\]\\text{BERT}\_\\text{BASE}\[/latex\] model that has 110 million trainable parameters, and a \[latex\]\\text{BERT}\_\\text{LARGE}\[/latex\] model that has 340 million ones (Devlin et al., 2018).
@@ -89,7 +89,7 @@ If things are not clear by now, don't worry - that was expected :D We're going t
 The first key difference between the BERT and ALBERT models is that **parameters of the word embeddings are factorized**.
 
 > In mathematics, **factorization** (...) or **factoring** consists of writing a number or another mathematical object as a product of several _factors_, usually smaller or simpler objects of the same kind. For example, 3 × 5 is a factorization of the integer 15
->
+>
 > Wikipedia (2002)
 
 Factorization of these parameters is achieved by taking the matrix representing the weights of the word embeddings \[latex\]E\[/latex\] and decomposing it into two different matrices. Instead of projecting the one-hot encoded vectors directly onto the hidden space, they are first projected on some-kind of lower-dimensional embedding space, which is then projected to the hidden space (Lan et al, 2019). Normally, this should not produce a different result, but let's wait.
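To make the effect of this factorization concrete, here is a hedged back-of-the-envelope sketch comparing parameter counts: one vocabulary-by-hidden embedding matrix versus a vocabulary-by-embedding matrix followed by an embedding-by-hidden projection. The sizes below are illustrative, roughly BERT-like, and not the exact ALBERT configuration:

```python
# Illustrative sizes only, roughly BERT-like
V = 30_000   # vocabulary size
H = 768      # hidden size
E = 128      # lower-dimensional embedding size used in the factorized variant

params_direct = V * H                  # project one-hot vectors straight onto the hidden space
params_factorized = V * E + E * H      # project onto a small embedding space first, then onto the hidden space

print(f"V x H:         {params_direct:,}")      # 23,040,000
print(f"V x E + E x H: {params_factorized:,}")  # 3,938,304
```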
@@ -168,7 +168,7 @@ The following results can be reported:
 Beyond the general results, the authors have also performed ablation experiments to see whether the changes actually cause the performance improvement, or not.
 
 > An ablation study studies the performance of an AI system by removing certain components, to understand the contribution of the component to the overall system.
an-introduction-to-tensorflow-keras-callbacks.md (+14 −14)
@@ -1,9 +1,9 @@
 ---
 title: "An introduction to TensorFlow.Keras callbacks"
 date: "2020-11-10"
-categories:
+categories:
 - "frameworks"
-tags:
+tags:
 - "callbacks"
 - "keras"
 - "tensorflow"
@@ -43,7 +43,7 @@ In Machine Learning terms, each iteration is also called an **epoch**. Hence, tr
 Now, it can be the case that you want to get insights from the training process while it is running. Or you want to provide automated steering in order to avoid wasting resources. In those cases, you might want to add a **callback** to your Keras model.
 
 > A callback is an object that can perform actions at various stages of training (e.g. at the start or end of an epoch, before or after a single batch, etc).
->
+>
 > Keras Team (n.d.)
 
 As we shall see later in this article, among others, there are [callbacks for monitoring](https://www.machinecurve.com/index.php/2019/11/13/how-to-use-tensorboard-with-keras/) and for stopping the training process [when it no longer makes the model better](https://www.machinecurve.com/index.php/2019/05/30/avoid-wasting-resources-with-earlystopping-and-modelcheckpoint-in-keras/). This is possible because with callbacks, we can 'capture' the training process while it is happening. They essentially 'hook' into the training process by allowing the training process to invoke certain callback definitions. In Keras, each callback implements at least one, but possibly multiple of the following definitions (Keras Team, n.d.).
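As an illustration of these 'hook' definitions (not the article's original snippet), here is a minimal sketch of a custom callback attached to a toy model; the model, the random data and the printed messages are assumptions made for demonstration only:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X = np.random.rand(256, 4)
y = np.random.rand(256, 1)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer='adam', loss='mse')

class MyCallback(tf.keras.callbacks.Callback):
    # Invoked at the start of every epoch
    def on_epoch_begin(self, epoch, logs=None):
        print(f"Starting epoch {epoch}")

    # Invoked at the end of every epoch, with the collected logs (e.g. loss)
    def on_epoch_end(self, epoch, logs=None):
        print(f"Epoch {epoch} finished, loss = {logs['loss']:.4f}")

model.fit(X, y, epochs=3, batch_size=32, callbacks=[MyCallback()])
```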
@@ -116,7 +116,7 @@ model.fit(train_generator,
 If you want to periodically save your Keras model - or the model weights - to some file, the `ModelCheckpoint` callback is what you need.
 
 > Callback to save the Keras model or model weights at some frequency.
->
+>
 > TensorFlow (n.d.)
 
 It is available as follows:
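The article's own snippet is not part of this hunk; as a stand-in, a hedged sketch of attaching `ModelCheckpoint` to a toy model. The model, data and file path are assumptions:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Save the full model whenever the validation loss improves; the path is hypothetical
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath='best_model.h5',
    monitor='val_loss',
    save_best_only=True,
    verbose=1
)
model.fit(X, y, validation_split=0.2, epochs=5, callbacks=[checkpoint])
```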
@@ -162,7 +162,7 @@ Did you know that you can visualize the training process realtime [with TensorBo
 With the `TensorBoard` callback, you can link TensorBoard with your Keras model.
 
 > Enable visualizations for TensorBoard.
->
+>
 > TensorFlow (n.d.)
 
 The callback logs a range of items from the training process into your TensorBoard log location:
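For illustration (not the article's original snippet), a hedged sketch of the `TensorBoard` callback on a toy model; the log directory is a hypothetical path:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Write training logs (metrics, histograms) to a hypothetical log directory
tensorboard = tf.keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=1)
model.fit(X, y, validation_split=0.2, epochs=5, callbacks=[tensorboard])
# Afterwards, inspect the run with: tensorboard --logdir=./logs
```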
@@ -211,7 +211,7 @@
 During this process, you want to find a model that performs well in terms of predictions (i.e., it is not underfit) but that is not too rigid with respect to the dataset it is trained on (i.e., it is neither overfit). That's why the `EarlyStopping` callback can be useful if you are dealing with a situation like this.
 
 > Stop training when a monitored metric has stopped improving.
->
+>
 > TensorBoard (n.d.)
 
 It is implemented as follows:
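For illustration (not the article's original snippet), a hedged sketch of `EarlyStopping` on a toy model; the monitored metric and patience value are assumptions:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Stop once validation loss has not improved for 3 consecutive epochs,
# and roll back to the best weights seen so far
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True
)
model.fit(X, y, validation_split=0.2, epochs=50, callbacks=[early_stopping])
```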
@@ -252,7 +252,7 @@ During the optimization process, a so called _weight update_ is computed. Howeve
 Preferably being relatively large during the early iterations and lower in the later stages, we must adapt the learning rate during the training process. This is called [learning rate decay](https://www.machinecurve.com/index.php/2019/11/11/problems-with-fixed-and-decaying-learning-rates/) and shows what a _learning rate scheduler_ can be useful for. The `LearningRateScheduler` callback implements this functionality.
 
 > At the beginning of every epoch, this callback gets the updated learning rate value from `schedule` function provided at `__init__`, with the current epoch and current learning rate, and applies the updated learning rate on the optimizer.
->
+>
 > TensorFlow (n.d.)
 
 Its implementation is really simple:
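For illustration (not the article's original snippet), a hedged sketch of `LearningRateScheduler` with a made-up decay schedule on a toy model:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Hypothetical schedule: keep the initial rate for 5 epochs, then decay by 10% per epoch
def schedule(epoch, lr):
    return lr if epoch < 5 else lr * 0.9

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(schedule, verbose=1)
model.fit(X, y, epochs=15, callbacks=[lr_scheduler])
```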
@@ -294,7 +294,7 @@ Keeping your learning rate equal when close to a plateau means that your model w
 With the `ReduceLROnPlateau` callback, the optimization process can be instructed to _reduce_ the learning rate (and hence the step) when a plateau is encountered.
 
 > Models often benefit from reducing the learning rate by a factor of 2-10 once learning stagnates. This callback monitors a quantity and if no improvement is seen for a 'patience' number of epochs, the learning rate is reduced.
->
+>
 > TensorFlow (n.d.)
 
 The callback is implemented as follows:
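For illustration (not the article's original snippet), a hedged sketch of `ReduceLROnPlateau` on a toy model; the factor, patience and floor values are assumptions:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Halve the learning rate when validation loss plateaus for 2 epochs
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=2,
    min_lr=1e-6,
    verbose=1
)
model.fit(X, y, validation_split=0.2, epochs=20, callbacks=[reduce_lr])
```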
@@ -333,7 +333,7 @@ Above, we saw that training logs can be distributed to [TensorBoard](https://www
 In those cases, you might wish to send the training logs there instead. The `RemoteMonitor` callback can help you do this.
 
 > Callback used to stream events to a server.
->
+>
 > TensorFlow (n.d.)
 
 It is implemented as follows:
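For illustration (not the article's original snippet), a hedged sketch of `RemoteMonitor` on a toy model; the endpoint URL is hypothetical and a server must actually be listening there to receive the events:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Stream epoch-end events as JSON to a hypothetical local endpoint
remote_monitor = tf.keras.callbacks.RemoteMonitor(
    root='http://localhost:9000',
    path='/publish/epoch/end/',
    field='data',
    send_as_json=True
)
model.fit(X, y, epochs=5, callbacks=[remote_monitor])
```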
@@ -369,7 +369,7 @@ model.fit(train_generator,
 Say that you want a certain function to fire after every batch or every epoch - a simple function, nothing special. However, it's not provided in the collection of callbacks presented with the `tensorflow.keras.callbacks` API. In this case, you might want to use the `LambdaCallback`.
 
 > Callback for creating simple, custom callbacks on-the-fly. This callback is constructed with anonymous functions that will be called at the appropriate time. Te
->
+>
 > TensorFlow (n.d.)
 
 It can thus be used to provide anonymous (i.e. `lambda` functions without a name) functions to the training process. The callback looks as follows:
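For illustration (not the article's original snippet), a hedged sketch of `LambdaCallback` with two made-up anonymous functions on a toy model:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Anonymous functions fired at the end of each epoch and at the end of training
lambda_callback = tf.keras.callbacks.LambdaCallback(
    on_epoch_end=lambda epoch, logs: print(f"Epoch {epoch}: loss={logs['loss']:.4f}"),
    on_train_end=lambda logs: print("Training finished")
)
model.fit(X, y, epochs=3, callbacks=[lambda_callback])
```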
@@ -401,7 +401,7 @@ model.fit(train_generator,
 In some cases (e.g. when you did not apply min-max normalization to your input data), the loss value can be very strange - outputting values close to Infinity or values that are Not a Number (`NaN`). In those cases, you don't want to pursue further training. The `TerminateOnNaN` callback can help here.
 
 > Callback that terminates training when a NaN loss is encountered.
->
+>
 > TensorFlow (n.d.)
 
 It is implemented as follows:
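For illustration (not the article's original snippet), a hedged sketch of `TerminateOnNaN` on a toy model; the callback takes no arguments:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Abort training as soon as the loss becomes NaN
terminate_on_nan = tf.keras.callbacks.TerminateOnNaN()
model.fit(X, y, epochs=5, callbacks=[terminate_on_nan])
```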
@@ -428,7 +428,7 @@ model.fit(train_generator,
 CSV files can be very useful when you need to exchange data. If you want to flush your training logs into a CSV file, the `CSVLogger` callback can be useful to you.
 
 > Callback that streams epoch results to a CSV file.
->
+>
 > TensorFlow (n.d.)
 
 It is implemented as follows:
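For illustration (not the article's original snippet), a hedged sketch of `CSVLogger` on a toy model; the CSV file name is hypothetical:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Write per-epoch metrics to a hypothetical CSV file, overwriting any existing file
csv_logger = tf.keras.callbacks.CSVLogger('training_log.csv', separator=',', append=False)
model.fit(X, y, validation_split=0.2, epochs=5, callbacks=[csv_logger])
```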
@@ -461,7 +461,7 @@ model.fit(train_generator,
 When you are training a Keras model with verbosity set to `True`, you will see a progress bar in your terminal. With the `ProgbarLogger` callback, you can change what is displayed there.
 
 > Callback that prints metrics to stdout.
->
+>
 > TensorFlow (n.d.)
 
 It is implemented as follows:
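For illustration (not the article's original snippet), a hedged sketch of `ProgbarLogger` on a toy model; the `count_mode` value is an assumption and behaviour may differ between TensorFlow versions:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Print metrics to stdout, counting progress in steps (batches)
progbar = tf.keras.callbacks.ProgbarLogger(count_mode='steps')
model.fit(X, y, epochs=3, callbacks=[progbar])
```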
@@ -493,7 +493,7 @@ model.fit(train_generator,
 When you are training a neural network, especially in a [distributed setting](https://www.machinecurve.com/index.php/2020/10/16/tensorflow-cloud-easy-cloud-based-training-of-your-keras-model/), it would be problematic if your training process suddenly stops - e.g. due to machine failure. Every iteration passed so far will be gone. With the experimental `BackupAndRestore` callback, you can instruct Keras to create temporary checkpoint files after each epoch, to which you can restore later.
 
 > `BackupAndRestore` callback is intended to recover from interruptions that happened in the middle of a model.fit execution by backing up the training states in a temporary checkpoint file (based on TF CheckpointManager) at the end of each epoch.
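For illustration (not the article's original snippet), a hedged sketch of the experimental `BackupAndRestore` callback on a toy model; the backup directory is hypothetical, and in newer TensorFlow releases the callback is exposed as `tf.keras.callbacks.BackupAndRestore` instead of the experimental namespace used here:

```python
import numpy as np
import tensorflow as tf

# Toy data and model, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)
model = tf.keras.Sequential([tf.keras.layers.Dense(8, activation='relu', input_shape=(4,)),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')

# Back up training state each epoch to a hypothetical directory; if training is
# interrupted and the script is re-run, fit() resumes from the last completed epoch
backup = tf.keras.callbacks.experimental.BackupAndRestore(backup_dir='/tmp/training_backup')
model.fit(X, y, epochs=10, callbacks=[backup])
```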
@@ -74,7 +74,7 @@
 Here, the architectural choices you make (such as the number of filters for a `Conv2D` layer, kernel size, or the number of output nodes for your `Dense` layer) determine what are known as the _parameters_ of your neural network - the weights (and by consequence biases) of your neural network:[](https://datascience.stackexchange.com/posts/17643/timeline)
 
 > The parameters of a neural network are typically the weights of the connections. In this case, these parameters are learned during the training stage. So, the algorithm itself (and the input data) tunes these parameters.
->
+>
 > [Robin, at StackExchange](https://datascience.stackexchange.com/questions/17635/model-parameters-hyper-parameters-of-neural-network-their-tuning-in-training#:~:text=The%20parameters%20of%20a%20neural,or%20the%20number%20of%20epochs.)
 
 ### Tuning hyperparameters in your neural network
@@ -89,7 +89,7 @@ However, things don't end there. Rather, in step (2), you'll _configure_ the mod
 Here's why they are called _hyper_parameters:
 
 > The hyper parameters are typically the learning rate, the batch size or the number of epochs. The are so called "hyper" because they influence how your parameters will be learned. You optimize these hyper parameters as you want (depends on your possibilities): grid search, random search, by hand, using visualisations… The validation stage help you to both know if your parameters have been learned enough and know if your hyper parameters are good.
->
+>
 > [Robin, at StackExchange](https://datascience.stackexchange.com/questions/17635/model-parameters-hyper-parameters-of-neural-network-their-tuning-in-training#:~:text=The%20parameters%20of%20a%20neural,or%20the%20number%20of%20epochs.)
 
 As Robin suggests, hyperparameters can be selected (and optimized) in multiple ways. The easiest way of doing so is by hand: you, as a deep learning engineer, select a set of hyperparameters that you will subsequently alter in an attempt to make the model better.
@@ -103,7 +103,7 @@ However, can't we do this in a better way when training a Keras model?
 As you would have expected: yes, we can! :) Let's introduce Keras Tuner to the scene. As you would expect from engineers, the description as to what it does is really short but provides all the details:
 
 > A hyperparameter tuner for Keras, specifically for tf.keras with TensorFlow 2.0.
->
+>
 > [Keras-tuner on GitHub](https://github.com/keras-team/keras-tuner)
 
 If you already want to look around, you could visit their website, and if not, let's take a look at what it does.
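As an illustration of what such a tuner does (not the article's original snippet), here is a hedged sketch of a random search over two hyperparameters with Keras Tuner; the toy data, the tuned ranges and the import name are assumptions (older releases shipped the package as `kerastuner` rather than `keras_tuner`):

```python
import numpy as np
import tensorflow as tf
import keras_tuner  # may be importable as `kerastuner` in older releases

# Toy data, for demonstration only
X, y = np.random.rand(256, 4), np.random.rand(256, 1)

# The tuner calls this function with a HyperParameters object `hp`
def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int('units', min_value=8, max_value=64, step=8),
                              activation='relu', input_shape=(4,)),
        tf.keras.layers.Dense(1)
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
        loss='mse'
    )
    return model

# Try 5 random hyperparameter combinations and keep the best one by validation loss
tuner = keras_tuner.RandomSearch(build_model, objective='val_loss', max_trials=5)
tuner.search(X, y, validation_split=0.2, epochs=5)
best_model = tuner.get_best_models(num_models=1)[0]
```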