# RNN-With-Numpy

## Project Overview

This is an RNN implemented without any deep learning framework, using only Python and NumPy. Since there is no automatic differentiation like PyTorch provides, the gradients of the loss with respect to every layer's parameters have to be computed by hand, which requires deriving the formulas; the full derivation is presented in this README. If the math does not render properly here, please read it on my blog instead: Recurrent Neural Networks.

## Conda Environment

```bash
conda create --name env_for_rnn python=3.9 numpy pandas matplotlib sympy ipykernel scikit-learn
conda activate env_for_rnn
```

## Recurrent Neural Networks


### Mathematical Description of the RNN

#### Input Layer

The input to the network is a sequence of $m$-dimensional vectors $\boldsymbol{x}^1, \boldsymbol{x}^2, \ldots, \boldsymbol{x}^t$:

$$\boldsymbol{x}^1=\begin{bmatrix}x^1_1\\x^1_2\\\vdots\\x^1_m\end{bmatrix},\quad
\boldsymbol{x}^2=\begin{bmatrix}x^2_1\\x^2_2\\\vdots\\x^2_m\end{bmatrix},\quad
\ldots,\quad
\boldsymbol{x}^t=\begin{bmatrix}x^t_1\\x^t_2\\\vdots\\x^t_m\end{bmatrix}$$

#### Recurrent Layer

The state of the network is a sequence of $n$-dimensional vectors $\boldsymbol{s}^0, \boldsymbol{s}^1, \boldsymbol{s}^2, \ldots, \boldsymbol{s}^t$:

$$\begin{bmatrix}s^t_1\\s^t_2\\\vdots\\s^t_n\end{bmatrix}=
f\left(
\begin{bmatrix}
u_{11}&u_{12}&\cdots&u_{1m}\\
u_{21}&u_{22}&\cdots&u_{2m}\\
\vdots&\vdots&\ddots&\vdots\\
u_{n1}&u_{n2}&\cdots&u_{nm}
\end{bmatrix}
\begin{bmatrix}x^t_1\\x^t_2\\\vdots\\x^t_m\end{bmatrix}+
\begin{bmatrix}
w_{11}&w_{12}&\cdots&w_{1n}\\
w_{21}&w_{22}&\cdots&w_{2n}\\
\vdots&\vdots&\ddots&\vdots\\
w_{n1}&w_{n2}&\cdots&w_{nn}
\end{bmatrix}
\begin{bmatrix}s^{t-1}_1\\s^{t-1}_2\\\vdots\\s^{t-1}_n\end{bmatrix}+
\begin{bmatrix}b^R_1\\b^R_2\\\vdots\\b^R_n\end{bmatrix}
\right),\qquad t=1,2,\ldots$$

#### Output Layer

The output of the network is a sequence of $m$-dimensional vectors $\boldsymbol{o}^1, \boldsymbol{o}^2, \ldots, \boldsymbol{o}^t$:

$$\begin{bmatrix}o^t_1\\o^t_2\\\vdots\\o^t_m\end{bmatrix}=
g\left(
\begin{bmatrix}
v_{11}&v_{12}&\cdots&v_{1n}\\
v_{21}&v_{22}&\cdots&v_{2n}\\
\vdots&\vdots&\ddots&\vdots\\
v_{m1}&v_{m2}&\cdots&v_{mn}
\end{bmatrix}
\begin{bmatrix}s^t_1\\s^t_2\\\vdots\\s^t_n\end{bmatrix}+
\begin{bmatrix}b^O_1\\b^O_2\\\vdots\\b^O_m\end{bmatrix}
\right),\qquad t=1,2,\ldots$$
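To make the notation concrete, here is a minimal NumPy sketch of this forward pass. It assumes $f=\tanh$ and $g=$ identity, which the derivation keeps generic, and all names (`rnn_forward`, `xs`, `etas`, ...) are illustrative rather than the repository's actual API:

```python
import numpy as np

def rnn_forward(xs, U, W, V, b_R, b_O, s0):
    """xs: inputs x^1..x^T, each of shape (m,); U:(n,m), W:(n,n), V:(m,n)."""
    s = s0
    etas, states, outputs = [], [s0], []
    for x in xs:
        eta = U @ x + W @ s + b_R   # recurrent-layer input eta^t
        s = np.tanh(eta)            # s^t = f(eta^t), assuming f = tanh
        xi = V @ s + b_O            # output-layer input xi^t
        outputs.append(xi)          # o^t = g(xi^t), assuming g = identity
        etas.append(eta)
        states.append(s)
    return etas, states, outputs    # states[k] is s^k, with states[0] = s^0
```

Caching `etas` and `states` matters: the backward pass derived below needs every $\boldsymbol{\eta}^k$ and $\boldsymbol{s}^{k-1}$.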

#### The Network's Output

The network's output $\boldsymbol{o}^t$ at time $t$ is determined by the inputs at all earlier time steps, $\boldsymbol{x}^t, \boldsymbol{x}^{t-1}, \ldots, \boldsymbol{x}^1$, and by the initial state $\boldsymbol{s}^0$.

(The bias terms $\boldsymbol{b}$ are omitted in the expansion below.)

$$\begin{split}
\boldsymbol{o}^t &= g(V\boldsymbol{s}^t) \\
&= g\bigl(Vf(U\boldsymbol{x}^t + W\boldsymbol{s}^{t-1})\bigr) \\
&= g\Bigl(Vf\bigl(U\boldsymbol{x}^t + Wf(U\boldsymbol{x}^{t-1} + W\boldsymbol{s}^{t-2})\bigr)\Bigr) \\
&\;\;\vdots \\
&= g\Bigl(Vf\bigl(U\boldsymbol{x}^t + Wf(U\boldsymbol{x}^{t-1} + Wf(U\boldsymbol{x}^{t-2} + \cdots + Wf(U\boldsymbol{x}^{1} + W\boldsymbol{s}^{0})))\bigr)\Bigr)
\end{split}$$

#### The Error of the Network's Output

At every time step $t$ the output $\boldsymbol{o}^t$ has a corresponding target vector $\boldsymbol{t}^t$, so every time step contributes an error, denoted $E^t$. $E^t$ is a function of $\boldsymbol{o}^t$ and $\boldsymbol{t}^t$; taking, for example, half the squared 2-norm as the error, it is computed as

$$E^t = \frac{1}{2}\left\lVert\boldsymbol{o}^t-\boldsymbol{t}^t\right\rVert_2^2
= \frac{1}{2}\sum_{i=1}^m\left(o^t_i-t^t_i\right)^2$$
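Continuing the sketch above, this error is a one-liner (`o_t` is $\boldsymbol{o}^t$ and `target_t` is the target $\boldsymbol{t}^t$; both names are illustrative):

```python
def loss(o_t, target_t):
    # E^t = 1/2 * ||o^t - t^t||_2^2
    return 0.5 * np.sum((o_t - target_t) ** 2)
```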

### Computing the Gradients (Backpropagation Through Time, BPTT)

#### From the Recurrent Layer to the Output Layer

Let $\boldsymbol{\xi}^t$ denote the input vector of the output layer at time $t$:

$$\begin{bmatrix}o^t_1\\o^t_2\\\vdots\\o^t_m\end{bmatrix}=
g\left(\begin{bmatrix}\xi^t_1\\\xi^t_2\\\vdots\\\xi^t_m\end{bmatrix}\right),\qquad
\begin{bmatrix}\xi^t_1\\\xi^t_2\\\vdots\\\xi^t_m\end{bmatrix}=
\begin{bmatrix}
v_{11}&v_{12}&\cdots&v_{1n}\\
v_{21}&v_{22}&\cdots&v_{2n}\\
\vdots&\vdots&\ddots&\vdots\\
v_{m1}&v_{m2}&\cdots&v_{mn}
\end{bmatrix}
\begin{bmatrix}s^t_1\\s^t_2\\\vdots\\s^t_n\end{bmatrix}+
\begin{bmatrix}b^O_1\\b^O_2\\\vdots\\b^O_m\end{bmatrix}$$

$$\frac{\partial E^t}{\partial v_{ij}}=
\frac{\partial E^t}{\partial \xi^t_i}\frac{\partial \xi^t_i}{\partial v_{ij}}=
\frac{\partial E^t}{\partial \xi^t_i}s^t_j,\qquad
\frac{\partial E^t}{\partial b^O_i}=
\frac{\partial E^t}{\partial \xi^t_i}\frac{\partial \xi^t_i}{\partial b^O_i}=
\frac{\partial E^t}{\partial \xi^t_i}\cdot 1,\qquad
i=1,\ldots,m,\quad j=1,\ldots,n$$

##### Vectorized Gradient Computation

$$\frac{\partial E^t}{\partial \boldsymbol{b}^O}=
\begin{bmatrix}
\frac{\partial E^t}{\partial \xi^t_1}\\
\frac{\partial E^t}{\partial \xi^t_2}\\
\vdots\\
\frac{\partial E^t}{\partial \xi^t_m}
\end{bmatrix},\qquad
\frac{\partial E^t}{\partial V}=
\begin{bmatrix}
\frac{\partial E^t}{\partial \xi^t_1}\\
\frac{\partial E^t}{\partial \xi^t_2}\\
\vdots\\
\frac{\partial E^t}{\partial \xi^t_m}
\end{bmatrix}
\begin{bmatrix}s^t_1&s^t_2&\cdots&s^t_n\end{bmatrix}=
\begin{bmatrix}
\frac{\partial E^t}{\partial \xi^t_1}s^t_1&\frac{\partial E^t}{\partial \xi^t_1}s^t_2&\cdots&\frac{\partial E^t}{\partial \xi^t_1}s^t_n\\
\frac{\partial E^t}{\partial \xi^t_2}s^t_1&\frac{\partial E^t}{\partial \xi^t_2}s^t_2&\cdots&\frac{\partial E^t}{\partial \xi^t_2}s^t_n\\
\vdots&\vdots&\ddots&\vdots\\
\frac{\partial E^t}{\partial \xi^t_m}s^t_1&\frac{\partial E^t}{\partial \xi^t_m}s^t_2&\cdots&\frac{\partial E^t}{\partial \xi^t_m}s^t_n
\end{bmatrix}$$
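In NumPy these two formulas reduce to a copy and an outer product. The sketch below additionally assumes the squared-error loss and $g=$ identity from before, in which case $\partial E^t/\partial\boldsymbol{\xi}^t = \boldsymbol{o}^t - \boldsymbol{t}^t$; `s_t` is the state $\boldsymbol{s}^t$ saved by the forward pass:

```python
dE_dxi = o_t - target_t        # row vector [dE/dxi_1, ..., dE/dxi_m]
dE_db_O = dE_dxi               # gradient w.r.t. b^O, shape (m,)
dE_dV = np.outer(dE_dxi, s_t)  # gradient w.r.t. V: column times row, shape (m, n)
```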

#### From the Input Layer to the Recurrent Layer

Let $\boldsymbol{\eta}^t$ denote the input vector of the recurrent layer at time $t$:

$$\begin{bmatrix}s^t_1\\s^t_2\\\vdots\\s^t_n\end{bmatrix}=
f\left(\begin{bmatrix}\eta^t_1\\\eta^t_2\\\vdots\\\eta^t_n\end{bmatrix}\right),\qquad
\begin{bmatrix}\eta^t_1\\\eta^t_2\\\vdots\\\eta^t_n\end{bmatrix}=
\begin{bmatrix}
u_{11}&u_{12}&\cdots&u_{1m}\\
u_{21}&u_{22}&\cdots&u_{2m}\\
\vdots&\vdots&\ddots&\vdots\\
u_{n1}&u_{n2}&\cdots&u_{nm}
\end{bmatrix}
\begin{bmatrix}x^t_1\\x^t_2\\\vdots\\x^t_m\end{bmatrix}+
\begin{bmatrix}
w_{11}&w_{12}&\cdots&w_{1n}\\
w_{21}&w_{22}&\cdots&w_{2n}\\
\vdots&\vdots&\ddots&\vdots\\
w_{n1}&w_{n2}&\cdots&w_{nn}
\end{bmatrix}
\begin{bmatrix}s^{t-1}_1\\s^{t-1}_2\\\vdots\\s^{t-1}_n\end{bmatrix}+
\begin{bmatrix}b^R_1\\b^R_2\\\vdots\\b^R_n\end{bmatrix}$$

#### Partial Derivatives with Respect to the Matrix $U$

With the notation above, the recurrent layer's input at time $t$ is $\boldsymbol{\eta}^t$, a linear transformation of the network input $\boldsymbol{x}^t$ at time $t$ and the state $\boldsymbol{s}^{t-1}$ from the previous time step:

$$\boldsymbol{\eta}^t = U\boldsymbol{x}^t + W\boldsymbol{s}^{t-1} + \boldsymbol{b}^R,\qquad
\boldsymbol{s}^{t-1} = f(\boldsymbol{\eta}^{t-1})$$

The derivation below yields a recurrence in time for $\partial E^t/\partial U$, the gradient of the output error at time $t$ with respect to $U$:

$$\begin{split}
\frac{\partial E^t}{\partial U}
% first equality
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{\eta}^t}{\partial U} \\
(\boldsymbol{\eta}^t = U\boldsymbol{x}^t + W\boldsymbol{s}^{t-1}+\boldsymbol{b}^R)\rightarrow
% second equality
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\left(
\frac{\partial U\boldsymbol{x}^t}{\partial U} +
\frac{\partial W\boldsymbol{s}^{t-1}}{\partial U}
\right) \\
% third equality
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\left(
\frac{\partial U\boldsymbol{x}^t}{\partial U} +
W\frac{\partial \boldsymbol{s}^{t-1}}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial U}
\right) \\
% fourth equality: distribute the leading factor into the parentheses
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial U\boldsymbol{x}^t}{\partial U} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial W\boldsymbol{s}^{t-1}}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial U} \\
\left(\frac{\partial W\boldsymbol{s}^{t-1}}{\partial \boldsymbol{\eta}^{t-1}}=
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{\eta}^{t-1}}\right)\rightarrow
% fifth equality
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial U\boldsymbol{x}^t}{\partial U} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial U} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial U\boldsymbol{x}^t}{\partial U} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial U}
\end{split}$$

Unrolling this recurrence gives

$$\begin{split}
\frac{\partial E^t}{\partial U}
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{\eta}^t}{\partial U} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial U\boldsymbol{x}^t}{\partial U} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial U} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial U\boldsymbol{x}^t}{\partial U} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial U\boldsymbol{x}^{t-1}}{\partial U} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-2}}
\frac{\partial \boldsymbol{\eta}^{t-2}}{\partial U} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial U\boldsymbol{x}^t}{\partial U} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial U\boldsymbol{x}^{t-1}}{\partial U} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-2}}
\frac{\partial U\boldsymbol{x}^{t-2}}{\partial U} +
\cdots +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{2}}
\frac{\partial U\boldsymbol{x}^{2}}{\partial U} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{1}}
\frac{\partial U\boldsymbol{x}^{1}}{\partial U}
\end{split}$$

##### Computing $\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}\frac{\partial U\boldsymbol{x}^k}{\partial U}$

###### Computing $\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}$

$$\begin{split}
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
&=
\frac{\partial E^t}{\partial \boldsymbol{\xi}^t}
\frac{\partial \boldsymbol{\xi}^t}{\partial \boldsymbol{s}^t}
\frac{\partial \boldsymbol{s}^t}{\partial \boldsymbol{\eta}^t}
\\
&=
% row vector
\begin{bmatrix}
\frac{\partial E^t}{\partial \xi^t_1}&\frac{\partial E^t}{\partial \xi^t_2}&\cdots&\frac{\partial E^t}{\partial \xi^t_m}
\end{bmatrix}
% first matrix
\begin{bmatrix}
\frac{\partial \xi^t_1}{\partial s^t_1} & \frac{\partial \xi^t_1}{\partial s^t_2} & \cdots & \frac{\partial \xi^t_1}{\partial s^t_n} \\
\frac{\partial \xi^t_2}{\partial s^t_1} & \frac{\partial \xi^t_2}{\partial s^t_2} & \cdots & \frac{\partial \xi^t_2}{\partial s^t_n} \\
\vdots&\vdots&\ddots&\vdots\\
\frac{\partial \xi^t_m}{\partial s^t_1} & \frac{\partial \xi^t_m}{\partial s^t_2} & \cdots & \frac{\partial \xi^t_m}{\partial s^t_n}
\end{bmatrix}
% second matrix
\begin{bmatrix}
\frac{\partial s^t_1}{\partial \eta^t_1} & \frac{\partial s^t_1}{\partial \eta^t_2} & \cdots & \frac{\partial s^t_1}{\partial \eta^t_n} \\
\frac{\partial s^t_2}{\partial \eta^t_1} & \frac{\partial s^t_2}{\partial \eta^t_2} & \cdots & \frac{\partial s^t_2}{\partial \eta^t_n} \\
\vdots&\vdots&\ddots&\vdots\\
\frac{\partial s^t_n}{\partial \eta^t_1} & \frac{\partial s^t_n}{\partial \eta^t_2} & \cdots & \frac{\partial s^t_n}{\partial \eta^t_n}
\end{bmatrix}
\\
&=
% row vector
\begin{bmatrix}
\frac{\partial E^t}{\partial \xi^t_1}&\frac{\partial E^t}{\partial \xi^t_2}&\cdots&\frac{\partial E^t}{\partial \xi^t_m}
\end{bmatrix}
% the matrix V
\begin{bmatrix}
v_{11}&v_{12}&\cdots&v_{1n}\\
v_{21}&v_{22}&\cdots&v_{2n}\\
\vdots&\vdots&\ddots&\vdots\\
v_{m1}&v_{m2}&\cdots&v_{mn}
\end{bmatrix}
% diagonal matrix: f acts elementwise
\begin{bmatrix}
\frac{\partial s^t_1}{\partial \eta^t_1} & 0 & \cdots & 0 \\
0 & \frac{\partial s^t_2}{\partial \eta^t_2} & \cdots & 0 \\
\vdots&\vdots&\ddots&\vdots\\
0 & 0 & \cdots & \frac{\partial s^t_n}{\partial \eta^t_n}
\end{bmatrix}
\\
&=
\left[
\frac{\partial s^t_1}{\partial \eta^t_1}
\sum_{i=1}^m\left(\frac{\partial E^t}{\partial \xi^t_i}v_{i1}\right)
,\quad
\frac{\partial s^t_2}{\partial \eta^t_2}
\sum_{i=1}^m\left(\frac{\partial E^t}{\partial \xi^t_i}v_{i2}\right)
,\quad
\cdots
,\quad
\frac{\partial s^t_n}{\partial \eta^t_n}
\sum_{i=1}^m\left(\frac{\partial E^t}{\partial \xi^t_i}v_{in}\right)
\right]
\\
% denote the result as delta^{tt}
&\triangleq
\begin{bmatrix}
\delta^{tt}_1&\delta^{tt}_2&\cdots&\delta^{tt}_n
\end{bmatrix}
\end{split}$$

The result of $\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}$ is denoted $\boldsymbol{\delta}^{tt}$ and called the error term of the recurrent layer's input at time $t$ (the second $t$): it is the partial derivative of the network's output error at time $t$ with respect to the recurrent layer's input at time $t$.
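As a sketch in the same notation as before (with `dE_dxi` from the output-layer step, `eta_t = etas[-1]` from the forward pass, and again assuming $f=\tanh$):

```python
def delta_tt(dE_dxi, V, eta_t):
    # (dE_dxi @ V)[j] = sum_i dE/dxi_i * v_ij; multiplying elementwise by
    # f'(eta^t_j) applies the diagonal Jacobian of f (tanh' = 1 - tanh^2).
    return (dE_dxi @ V) * (1.0 - np.tanh(eta_t) ** 2)   # shape (n,)
```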

###### Computing $\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}$

$$\begin{split}
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{\eta}^{t-1}}
&=
\frac{\partial W\boldsymbol{s}^{t-1}}{\partial \boldsymbol{\eta}^{t-1}}=
W\frac{\partial \boldsymbol{s}^{t-1}}{\partial \boldsymbol{\eta}^{t-1}}=
W
% full Jacobian of s^{t-1} w.r.t. eta^{t-1}
\begin{bmatrix}
\frac{\partial s^{t-1}_{1}}{\partial \eta^{t-1}_{1}}&
\frac{\partial s^{t-1}_{1}}{\partial \eta^{t-1}_{2}}&
\cdots&
\frac{\partial s^{t-1}_{1}}{\partial \eta^{t-1}_{n}}
\\
\frac{\partial s^{t-1}_{2}}{\partial \eta^{t-1}_{1}}&
\frac{\partial s^{t-1}_{2}}{\partial \eta^{t-1}_{2}}&
\cdots&
\frac{\partial s^{t-1}_{2}}{\partial \eta^{t-1}_{n}}
\\
\vdots&\vdots&\ddots&\vdots\\
\frac{\partial s^{t-1}_{n}}{\partial \eta^{t-1}_{1}}&
\frac{\partial s^{t-1}_{n}}{\partial \eta^{t-1}_{2}}&
\cdots&
\frac{\partial s^{t-1}_{n}}{\partial \eta^{t-1}_{n}}
\end{bmatrix}=
% f acts elementwise, so the Jacobian is diagonal
W\begin{bmatrix}
\frac{\partial s^{t-1}_{1}}{\partial \eta^{t-1}_{1}}&0&\cdots&0
\\
0&\frac{\partial s^{t-1}_{2}}{\partial \eta^{t-1}_{2}}&\cdots&0
\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&\frac{\partial s^{t-1}_{n}}{\partial \eta^{t-1}_{n}}
\end{bmatrix}
\\
&=
W\begin{bmatrix}
f'(\eta^{t-1}_{1})&0&\cdots&0
\\
0&f'(\eta^{t-1}_{2})&\cdots&0
\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&f'(\eta^{t-1}_{n})
\end{bmatrix}
\end{split}$$

$$\begin{split}
\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}
&=
\frac{\partial E^t}{\partial \boldsymbol{\xi}^t}
\frac{\partial \boldsymbol{\xi}^t}{\partial \boldsymbol{s}^t}
\frac{\partial \boldsymbol{s}^t}{\partial \boldsymbol{\eta}^t}
\left(
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial \boldsymbol{\eta}^{t-2}}
\cdots
\frac{\partial \boldsymbol{\eta}^{k+1}}{\partial \boldsymbol{\eta}^{k}}
\right) \\
&=
\begin{bmatrix}\delta^{tt}_1&\delta^{tt}_2&\cdots&\delta^{tt}_n\end{bmatrix}
\prod_{i=t-1}^{k}\left(
W\begin{bmatrix}
f'(\eta^{i}_{1})&&0\\
&\ddots&\\
0&&f'(\eta^{i}_{n})
\end{bmatrix}
\right) \\
&=
\begin{bmatrix}\delta^{tk}_1&\delta^{tk}_2&\cdots&\delta^{tk}_n\end{bmatrix}
\qquad (t \ge k \ge 1)
\end{split}$$

The result of $\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}$ is denoted $\boldsymbol{\delta}^{tk}$ and called the error term of the recurrent layer's input at time $k$: the partial derivative of the network's output error at time $t$ with respect to the recurrent layer's input at time $k$.

In practice we compute $\boldsymbol{\delta}^{tt}, \boldsymbol{\delta}^{t(t-1)}, \ldots, \boldsymbol{\delta}^{t1}$ one step at a time rather than evaluating the chained product directly, as in the sketch after the equations below:

$$\begin{split}
\begin{bmatrix}\delta^{t(t-1)}_1&\delta^{t(t-1)}_2&\cdots&\delta^{t(t-1)}_n\end{bmatrix}
&=
\begin{bmatrix}\delta^{tt}_1&\delta^{tt}_2&\cdots&\delta^{tt}_n\end{bmatrix}
W\begin{bmatrix}
f'(\eta^{t-1}_{1})&&0\\
&\ddots&\\
0&&f'(\eta^{t-1}_{n})
\end{bmatrix} \\
\begin{bmatrix}\delta^{t(t-2)}_1&\delta^{t(t-2)}_2&\cdots&\delta^{t(t-2)}_n\end{bmatrix}
&=
\begin{bmatrix}\delta^{t(t-1)}_1&\delta^{t(t-1)}_2&\cdots&\delta^{t(t-1)}_n\end{bmatrix}
W\begin{bmatrix}
f'(\eta^{t-2}_{1})&&0\\
&\ddots&\\
0&&f'(\eta^{t-2}_{n})
\end{bmatrix} \\
&\;\;\vdots \\
\begin{bmatrix}\delta^{t1}_1&\delta^{t1}_2&\cdots&\delta^{t1}_n\end{bmatrix}
&=
\begin{bmatrix}\delta^{t2}_1&\delta^{t2}_2&\cdots&\delta^{t2}_n\end{bmatrix}
W\begin{bmatrix}
f'(\eta^{1}_{1})&&0\\
&\ddots&\\
0&&f'(\eta^{1}_{n})
\end{bmatrix}
\end{split}$$
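A sketch of this backward sweep, reusing the `etas` cache from the forward-pass sketch (`etas[k-1]` holds $\boldsymbol{\eta}^k$) and assuming $f=\tanh$; `all_deltas` is an illustrative helper, not the repository's API:

```python
def all_deltas(d_tt, W, etas):
    # Walk backward: delta^{t(k-1)} = delta^{tk} @ W * f'(eta^{k-1}).
    deltas = [d_tt]                          # start from delta^{tt}
    for eta_prev in reversed(etas[:-1]):     # eta^{t-1}, ..., eta^1
        deltas.append(deltas[-1] @ W * (1.0 - np.tanh(eta_prev) ** 2))
    return deltas[::-1]                      # [delta^{t1}, ..., delta^{tt}]
```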

###### Computing $\frac{\partial U\boldsymbol{x}^k}{\partial U}$

$$\frac{\partial U\boldsymbol{x}^k}{\partial U}=
\begin{bmatrix}
\begin{pmatrix}
\frac{\partial \eta^k_1}{\partial u_{11}}&\cdots&\frac{\partial \eta^k_1}{\partial u_{1m}}\\
\vdots&\ddots&\vdots\\
\frac{\partial \eta^k_1}{\partial u_{n1}}&\cdots&\frac{\partial \eta^k_1}{\partial u_{nm}}
\end{pmatrix}\\
\vdots\\
\begin{pmatrix}
\frac{\partial \eta^k_i}{\partial u_{11}}&\cdots&\frac{\partial \eta^k_i}{\partial u_{1m}}\\
\vdots&\ddots&\vdots\\
\frac{\partial \eta^k_i}{\partial u_{n1}}&\cdots&\frac{\partial \eta^k_i}{\partial u_{nm}}
\end{pmatrix}\\
\vdots\\
\begin{pmatrix}
\frac{\partial \eta^k_n}{\partial u_{11}}&\cdots&\frac{\partial \eta^k_n}{\partial u_{1m}}\\
\vdots&\ddots&\vdots\\
\frac{\partial \eta^k_n}{\partial u_{n1}}&\cdots&\frac{\partial \eta^k_n}{\partial u_{nm}}
\end{pmatrix}
\end{bmatrix}=
\begin{bmatrix}
\begin{pmatrix}
x^k_1&x^k_2&\cdots&x^k_m\\
0&0&\cdots&0\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&0
\end{pmatrix}\\
\vdots\\
\begin{pmatrix}
0&0&\cdots&0\\
\vdots&\vdots&&\vdots\\
x^k_1&x^k_2&\cdots&x^k_m\\
\vdots&\vdots&&\vdots\\
0&0&\cdots&0
\end{pmatrix}\\
\vdots\\
\begin{pmatrix}
0&0&\cdots&0\\
\vdots&\vdots&\ddots&\vdots\\
0&0&\cdots&0\\
x^k_1&x^k_2&\cdots&x^k_m
\end{pmatrix}
\end{bmatrix}
\qquad (t \ge k \ge 1)$$

where in the $i$-th block ($1 \le i \le n$) the row $(x^k_1, x^k_2, \ldots, x^k_m)$ sits in the $i$-th row and every other entry is zero.

###### Computing $\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}\frac{\partial U\boldsymbol{x}^k}{\partial U}$

Contracting the row vector $\boldsymbol{\delta}^{tk}$ with the block structure just computed leaves an outer product:

$$\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}
\frac{\partial U\boldsymbol{x}^k}{\partial U}=
\begin{bmatrix}\delta^{tk}_1&\delta^{tk}_2&\cdots&\delta^{tk}_n\end{bmatrix}
\frac{\partial U\boldsymbol{x}^k}{\partial U}=
\begin{bmatrix}\delta^{tk}_1\\\delta^{tk}_2\\\vdots\\\delta^{tk}_n\end{bmatrix}
\begin{bmatrix}x^k_1&x^k_2&\cdots&x^k_m\end{bmatrix}
\qquad (t \ge k \ge 1)$$

##### Final Result: the Gradient of $U$

$$\frac{\partial E^t}{\partial U}=
\sum_{k=1}^t\left(
\begin{bmatrix}\delta^{tk}_1\\\delta^{tk}_2\\\vdots\\\delta^{tk}_n\end{bmatrix}
\begin{bmatrix}x^k_1&x^k_2&\cdots&x^k_m\end{bmatrix}
\right)$$
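Using the `deltas` list from the `all_deltas` sketch (`deltas[k-1]` is $\boldsymbol{\delta}^{tk}$) together with the inputs `xs` (`xs[k-1]` is $\boldsymbol{x}^k$), this sum of outer products is one line of NumPy:

```python
# dE/dU = sum over k of outer(delta^{tk}, x^k), shape (n, m)
dE_dU = sum(np.outer(d, x) for d, x in zip(deltas, xs))
```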

#### Partial Derivatives with Respect to the Matrix $W$

$$\begin{split}
\frac{\partial E^t}{\partial W}
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{\eta}^t}{\partial W} \\
(\boldsymbol{\eta}^t = U\boldsymbol{x}^t + W\boldsymbol{s}^{t-1}+\boldsymbol{b}^R)\rightarrow
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\left(\frac{\partial W\boldsymbol{s}^{t-1}}{\partial W}\right) \\
(\text{product rule})\rightarrow
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\left(
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-1} +
W\frac{\partial \boldsymbol{s}^{t-1}}{\partial W}
\right) \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-1} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial W\boldsymbol{s}^{t-1}}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial W} \\
\left(\frac{\partial W\boldsymbol{s}^{t-1}}{\partial \boldsymbol{\eta}^{t-1}}=
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{\eta}^{t-1}}\right)\rightarrow
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-1} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial W} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-1} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial W}
\end{split}$$

Unrolling this recurrence gives

$$\begin{split}
\frac{\partial E^t}{\partial W}
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{\eta}^t}{\partial W} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-1} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial W} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-1} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-2} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-2}}
\frac{\partial \boldsymbol{\eta}^{t-2}}{\partial W} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-1} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-2} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-2}}
\frac{\partial W}{\partial W}\boldsymbol{s}^{t-3} +
\cdots +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{2}}
\frac{\partial W}{\partial W}\boldsymbol{s}^{1} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{1}}
\frac{\partial W}{\partial W}\boldsymbol{s}^{0}
\end{split}$$

##### Computing $\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}\frac{\partial W}{\partial W}\boldsymbol{s}^{k-1}$

###### Computing $\frac{\partial W}{\partial W}$

$$\frac{\partial W}{\partial W}=
\frac{\partial}{\partial W}
\begin{pmatrix}
w_{11}&\cdots&w_{1n}\\
\vdots&\ddots&\vdots\\
w_{n1}&\cdots&w_{nn}
\end{pmatrix}=
\begin{bmatrix}
\begin{pmatrix}1&\cdots&0\\\vdots&\ddots&\vdots\\0&\cdots&0\end{pmatrix}&
\cdots&
\begin{pmatrix}0&\cdots&1\\\vdots&\ddots&\vdots\\0&\cdots&0\end{pmatrix}\\
\vdots&\ddots&\vdots\\
\begin{pmatrix}0&\cdots&0\\\vdots&\ddots&\vdots\\1&\cdots&0\end{pmatrix}&
\cdots&
\begin{pmatrix}0&\cdots&0\\\vdots&\ddots&\vdots\\0&\cdots&1\end{pmatrix}
\end{bmatrix}$$

where the $(i,j)$ block is the elementary matrix with a single 1 in position $(i,j)$ and zeros elsewhere.

###### Computing $\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}\frac{\partial W}{\partial W}\boldsymbol{s}^{k-1}$

$$\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}
\frac{\partial W}{\partial W}\boldsymbol{s}^{k-1}=
\begin{bmatrix}\delta^{tk}_1&\delta^{tk}_2&\cdots&\delta^{tk}_n\end{bmatrix}
\frac{\partial W}{\partial W}
\begin{bmatrix}s^{k-1}_1\\s^{k-1}_2\\\vdots\\s^{k-1}_n\end{bmatrix}=
\begin{bmatrix}\delta^{tk}_1\\\delta^{tk}_2\\\vdots\\\delta^{tk}_n\end{bmatrix}
\begin{bmatrix}s^{k-1}_1&s^{k-1}_2&\cdots&s^{k-1}_n\end{bmatrix}
\qquad (t \ge k \ge 1)$$

##### Final Result: the Gradient of $W$

$$\frac{\partial E^t}{\partial W}=
\sum_{k=1}^t\left(
\begin{bmatrix}\delta^{tk}_1\\\delta^{tk}_2\\\vdots\\\delta^{tk}_n\end{bmatrix}
\begin{bmatrix}s^{k-1}_1&s^{k-1}_2&\cdots&s^{k-1}_n\end{bmatrix}
\right)$$
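Analogously to the gradient of $U$, each $\boldsymbol{\delta}^{tk}$ pairs with the previous state $\boldsymbol{s}^{k-1}$; in the sketch's notation `states[:-1]` is exactly $\boldsymbol{s}^0, \ldots, \boldsymbol{s}^{t-1}$:

```python
# dE/dW = sum over k of outer(delta^{tk}, s^{k-1}), shape (n, n)
dE_dW = sum(np.outer(d, s) for d, s in zip(deltas, states[:-1]))
```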

#### Partial Derivatives with Respect to the Bias $\boldsymbol{b}^R$

$$\begin{split}
\frac{\partial E^t}{\partial \boldsymbol{b}^R}
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{b}^R} \\
(\boldsymbol{\eta}^t = U\boldsymbol{x}^t + W\boldsymbol{s}^{t-1}+\boldsymbol{b}^R)\rightarrow
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\left(
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\frac{\partial W\boldsymbol{s}^{t-1}}{\partial \boldsymbol{b}^R}
\right) \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial W\boldsymbol{s}^{t-1}}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial \boldsymbol{b}^R} \\
\left(\frac{\partial W\boldsymbol{s}^{t-1}}{\partial \boldsymbol{\eta}^{t-1}}=
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{\eta}^{t-1}}\right)\rightarrow
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial \boldsymbol{b}^R} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial \boldsymbol{b}^R}
\end{split}$$

Unrolling this recurrence gives

$$\begin{split}
\frac{\partial E^t}{\partial \boldsymbol{b}^R}
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{\eta}^t}{\partial \boldsymbol{b}^R} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{\eta}^{t-1}}{\partial \boldsymbol{b}^R} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-2}}
\frac{\partial \boldsymbol{\eta}^{t-2}}{\partial \boldsymbol{b}^R} \\
&=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^t}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-1}}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{t-2}}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R} +
\cdots +
\frac{\partial E^t}{\partial \boldsymbol{\eta}^{1}}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R}
\end{split}$$

##### Computing $\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R}$

$$\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}
\frac{\partial \boldsymbol{b}^R}{\partial \boldsymbol{b}^R}=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}I_{n\times n}=
\frac{\partial E^t}{\partial \boldsymbol{\eta}^k}=
\begin{bmatrix}\delta^{tk}_1&\delta^{tk}_2&\cdots&\delta^{tk}_n\end{bmatrix}$$

##### Final Result: the Gradient of $\boldsymbol{b}^R$

$$\frac{\partial E^t}{\partial \boldsymbol{b}^R}=
\sum_{k=1}^t
\begin{bmatrix}\delta^{tk}_1&\delta^{tk}_2&\cdots&\delta^{tk}_n\end{bmatrix}$$
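And in the same sketch notation the error terms simply accumulate:

```python
# dE/db^R = sum of delta^{t1} .. delta^{tt}, shape (n,)
dE_db_R = np.sum(deltas, axis=0)
```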
