content/tutorial-nlp-from-scratch.md (+53 −52)
@@ -6,7 +6,7 @@ jupyter:
      extension: .md
      format_name: markdown
      format_version: '1.3'
      jupytext_version: 1.11.5
  kernelspec:
    display_name: Python 3 (ipykernel)
    language: python
@@ -23,14 +23,14 @@ Your deep learning model (the LSTM) is a form of a Recurrent Neural Network and
Today, Deep Learning is getting adopted in everyday life, and it is now more important than ever to ensure that decisions taken using AI do not reflect discriminatory behavior towards a set of populations. It is important to take fairness into consideration while consuming the output from AI. Throughout the tutorial we'll try to question all the steps in our pipeline from an ethics point of view.

## Prerequisites

You are expected to be familiar with the Python programming language and array manipulation with NumPy. In addition, some understanding of Linear Algebra and Calculus is recommended. You should also be familiar with how Neural Networks work. For reference, you can visit the [Python](https://docs.python.org/dev/tutorial/index.html), [Linear algebra on n-dimensional arrays](https://numpy.org/doc/stable/user/tutorial-svd.html) and [Calculus](https://d2l.ai/chapter_appendix-mathematics-for-deep-learning/multivariable-calculus.html) tutorials.

To get a refresher on Deep Learning basics, you should consider reading [the d2l.ai book](https://d2l.ai/chapter_recurrent-neural-networks/index.html), which is an interactive deep learning book with multi-framework code, math, and discussions. You can also go through the [Deep learning on MNIST from scratch tutorial](https://numpy.org/numpy-tutorials/content/tutorial-deep-learning-on-mnist.html) to understand how a basic neural network is implemented from scratch.

In addition to NumPy, you will be utilizing the following Python packages for data loading and processing:

- [`pandas`](https://pandas.pydata.org/docs/) for handling dataframes
- [`Matplotlib`](https://matplotlib.org/) for data visualization
- [`pooch`](https://www.fatiando.org/pooch/latest/) to download and cache datasets
@@ -39,13 +39,13 @@ This tutorial can be run locally in an isolated environment, such as [Virtualenv
## Table of contents

1. Data Collection

2. Preprocess the datasets

3. Build and train an LSTM network from scratch

4. Perform sentiment analysis on collected speeches

5. Next steps
@@ -105,24 +105,26 @@ We made sure to include different demographics in our data and included a range
>The GloVe word embeddings include sets that were trained on billions of tokens, some up to 840 billion tokens. These algorithms exhibit stereotypical biases, such as gender bias, which can be traced back to the original training data. For example, certain occupations seem to be more biased towards a particular gender, reinforcing problematic stereotypes. The nearest solution to this problem is to apply de-biasing algorithms such as the one presented in https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1184/reports/6835575.pdf, which one can use on embeddings of their choice to mitigate bias, if present.
<!-- #endregion -->

You'll start with importing the necessary packages to build our Deep Learning network.

```python
# Importing the necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pooch
import string
import re
import zipfile
import os
```

Next, you'll define a set of text preprocessing helper functions.

```python
class TextPreprocess:
    """Text Preprocessing for a Natural Language Processing model."""

    def txt_to_df(self, file):
        """Function to convert a txt file to pandas dataframe.
```
@@ -444,12 +446,11 @@ Unlike an MLP, the RNN was designed to work with sequence prediction problems.RN
The problem with an RNN however, is that it cannot retain long-term memory because the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network’s recurrent connections. This shortcoming is referred to as the vanishing gradient problem. Long Short-Term Memory (LSTM) is an RNN architecture specifically designed to address the [vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem).

In the above gif, the rectangles labeled $A$ are called `Cells` and they are the **Memory Blocks** of our LSTM network. They are responsible for choosing what to remember in a sequence and passing that information on to the next cell via two states called the `hidden state` $H_{t}$ and the `cell state` $C_{t}$, where $t$ indicates the time-step. Each `Cell` has dedicated gates which are responsible for storing, writing or reading the information passed to an LSTM. You will now look closely at the architecture of the network by implementing each mechanism happening inside of it.
Let's start with writing a function to randomly initialize the parameters which will be learned while our model trains.
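The initialization code itself sits outside this excerpt. As a rough sketch of what such a function could look like (the gate weight names other than `Wf`, `bf`, `W2` and `b2`, the small scaling factor, and the `hidden_dim`/`input_dim` arguments are assumptions for illustration, not the tutorial's exact choices):

```python
rng = np.random.default_rng(seed=2021)

def initialise_params(hidden_dim, input_dim):
    # Each gate sees the previous hidden state concatenated with the current
    # word embedding, hence the (hidden_dim, hidden_dim + input_dim) weights.
    concat_dim = hidden_dim + input_dim
    parameters = {
        'Wf': rng.standard_normal((hidden_dim, concat_dim)) * 0.01,   # forget gate
        'bf': np.zeros((hidden_dim, 1)),
        'Wi': rng.standard_normal((hidden_dim, concat_dim)) * 0.01,   # input gate
        'bi': np.zeros((hidden_dim, 1)),
        'Wcm': rng.standard_normal((hidden_dim, concat_dim)) * 0.01,  # candidate memory
        'bcm': np.zeros((hidden_dim, 1)),
        'Wo': rng.standard_normal((hidden_dim, concat_dim)) * 0.01,   # output gate
        'bo': np.zeros((hidden_dim, 1)),
        'W2': rng.standard_normal((1, hidden_dim)) * 0.01,            # fully connected layer
        'b2': np.zeros((1, 1)),
    }
    return parameters
```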
@@ -513,7 +514,7 @@ The **Forget Gate** takes the current word embedding and the previous hidden sta
```python
def fp_forget_gate(concat, parameters):
    ft = sigmoid(np.dot(parameters['Wf'], concat)
                 + parameters['bf'])
    return ft
```
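The `sigmoid` helper called above is not shown in this excerpt; a minimal NumPy version of the logistic function it presumably refers to would be:

```python
def sigmoid(x):
    # Logistic function: maps any real value into the (0, 1) range.
    return 1 / (1 + np.exp(-x))
```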
The **Input Gate** takes the current word embedding and the previous hidden state concatenated together as input, and governs how much of the new data we take into account via the **Candidate Memory Gate**, which utilizes the [Tanh](https://d2l.ai/chapter_multilayer-perceptrons/mlp.html?highlight=tanh#tanh-function) to regulate the values flowing through the network.
Finally, we have the **Output Gate** which takes information from the current word embedding, previous hidden state and the cell state which has been updated with information from the forget and input gates to update the value of the hidden state.
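The forward-pass functions for these gates are not part of this excerpt. Following the same pattern as `fp_forget_gate`, a hedged sketch could look like this (the parameter names `Wi`, `bi`, `Wcm`, `bcm`, `Wo` and `bo` are assumed, mirroring the initialization sketch above):

```python
def fp_input_gate(concat, parameters):
    # How much of the new candidate information to let into the cell state.
    it = sigmoid(np.dot(parameters['Wi'], concat) + parameters['bi'])
    return it

def fp_candidate_memory_gate(concat, parameters):
    # Candidate cell-state values, squashed to (-1, 1) with tanh.
    cmt = np.tanh(np.dot(parameters['Wcm'], concat) + parameters['bcm'])
    return cmt

def fp_output_gate(concat, parameters):
    # How much of the updated cell state to expose as the next hidden state.
    ot = sigmoid(np.dot(parameters['Wo'], concat) + parameters['bo'])
    return ot
```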

### But how do you obtain sentiment from the LSTM's output?

The hidden state you obtain from the output gate of the last memory block in a sequence is considered to be a representation of all the information contained in a sequence. To classify this information into various classes (2 in our case, positive and negative), we use a **Fully Connected layer** which first maps this information to a predefined output size (1 in our case). Then, an activation function such as the sigmoid converts this output to a value between 0 and 1. We'll consider values greater than 0.5 to be indicative of a positive sentiment.
```python
def fp_fc_layer(last_hs, parameters):
    z2 = (np.dot(parameters['W2'], last_hs)
          + parameters['b2'])
    a2 = sigmoid(z2)
    return a2
```

Now you will put all these functions together to summarize the **Forward Propagation** step in our model architecture:
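That summary code is not included in this excerpt; a simplified sketch of how the gates could be chained over one sequence (reusing the hypothetical helpers sketched above, with `X_vec` as the list of word-embedding column vectors and `hidden_dim` assumed) is:

```python
def forward_prop(X_vec, parameters, hidden_dim):
    h_prev = np.zeros((hidden_dim, 1))   # initial hidden state
    c_prev = np.zeros((hidden_dim, 1))   # initial cell state
    for x_t in X_vec:
        concat = np.vstack((h_prev, x_t))            # [h_{t-1}; x_t]
        ft = fp_forget_gate(concat, parameters)      # what to forget
        it = fp_input_gate(concat, parameters)       # what to write
        cmt = fp_candidate_memory_gate(concat, parameters)
        ot = fp_output_gate(concat, parameters)      # what to expose
        c_prev = ft * c_prev + it * cmt              # update the cell state
        h_prev = ot * np.tanh(c_prev)                # update the hidden state
    return fp_fc_layer(h_prev, parameters)           # sentiment probability
```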
Now, for each gate and the fully connected layer, we define a function to calculate the gradient of the loss with respect to the input passed and the parameters used. To understand the mathematics behind how the derivatives were calculated, we suggest you follow this helpful [blog](https://christinakouridi.blog/2019/06/19/backpropagation-lstm/) by Christina Kouridi.
Define a function to calculate the gradients in the **Forget Gate**:
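The gradient code is outside this excerpt, but as a hedged illustration of the calculation involved (the signature, the `gradients` dictionary, and `dft`, the incoming gradient at the forget-gate activation, are hypothetical names, not the tutorial's own):

```python
def bp_forget_gate(ft, dft, concat, gradients, parameters):
    # Gradient through the sigmoid: d(sigmoid(z))/dz = ft * (1 - ft).
    dzf = dft * ft * (1 - ft)
    # Accumulate parameter gradients for this time step.
    gradients['dWf'] += np.dot(dzf, concat.T)
    gradients['dbf'] += np.sum(dzf, axis=1, keepdims=True)
    # Gradient passed back to the concatenated [h_{t-1}; x_t] input.
    dconcat = np.dot(parameters['Wf'].T, dzf)
    return dconcat
```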
We update the parameters through an optimization algorithm called [Adam](https://optimization.cbe.cornell.edu/index.php?title=Adam) which is an extension to stochastic gradient descent that has recently seen broader adoption for deep learning applications in computer vision and natural language processing. Specifically, the algorithm calculates an exponential moving average of the gradient and the squared gradient, and the parameters `beta1` and `beta2` control the decay rates of these moving averages. Adam has shown increased convergence and robustness over other gradient descent algorithms and is often recommended as the default optimizer for training.
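As an illustration of that update rule, here is a minimal sketch (not the tutorial's exact implementation; it assumes gradients are stored under keys such as `'dWf'` and that `m` and `v` are dictionaries of zero arrays with the same shapes as the parameters):

```python
def adam_update(parameters, gradients, m, v, t,
                learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    for key in parameters:
        g = gradients['d' + key]
        # Exponential moving averages of the gradient and the squared gradient.
        m[key] = beta1 * m[key] + (1 - beta1) * g
        v[key] = beta2 * v[key] + (1 - beta2) * g**2
        # Bias-correct the averages (t is the 1-based step count), then update.
        m_hat = m[key] / (1 - beta1**t)
        v_hat = v[key] / (1 - beta2**t)
        parameters[key] -= learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return parameters, m, v
```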