Commit a33197f (parent: 8d63cc3), showing 339 changed files with 200,822 additions and 20 deletions.
@@ -1,39 +1,51 @@
# SPEECH-CONDITIONED FACE GENERATION USING GENERATIVE ADVERSARIAL NETWORKS

# Speech2Face
## Introduction
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) on raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g. a reference image or a one-hot encoding). Our model is trained in a self-supervised fashion by exploiting the audio and visual signals naturally aligned in videos. To enable training from video data, we present a novel dataset collected for this work, with high-quality videos of ten YouTubers with notable expressiveness in both the speech and visual signals.
We used [this](https://github.com/franroldans/tfm-franroldan-wav2pix) project as baseline.

Image synthesis has been a trending task for the AI community in recent years. Many works have shown the potential of Generative Adversarial Networks (GANs) to deal with tasks such as text- or audio-to-image synthesis. In particular, recent advances in deep learning using audio have inspired many works involving both visual and auditory information. In this work we propose a face synthesis method that is trained end-to-end using audio and/or language representations as inputs. We used [this](https://github.com/aelnouby/Text-to-Image-Synthesis) project as baseline.

<figure><img src='images/a2i.png'></figure>

<figure><img src='assets/Architecture.png'></figure>
## Requirements

- Python 2.7
- PyTorch
- h5py
- PIL
- numpy
- matplotlib

This implementation only supports running with GPUs.
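As a minimal sketch (not part of the original `runtime.py`), a guard like the following could be placed before training starts:

```python
import torch

# This project assumes a CUDA-capable GPU; fail early if none is visible.
if not torch.cuda.is_available():
    raise RuntimeError("A CUDA-capable GPU is required to run this implementation.")
```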
## Usage
### Training

`python runtime.py`

**Arguments:**
- `lr_D` : The learning rate of the discriminator. default = `0.0004`
- `lr_G` : The learning rate of the generator. default = `0.0001`
- `type` : GAN architecture to use `(gan | wgan | vanilla_gan | vanilla_wgan)`. default = `gan`. The vanilla variants are not conditional.
- `dataset` : Dataset to use `(birds | flowers)`. default = `flowers`
- `split` : An integer indicating which split to use `(0 : train | 1 : valid | 2 : test)`. default = `0`
- `lr` : The learning rate. default = `0.0002`
- `diter` : Only for WGAN; number of discriminator iterations per generator iteration. default = `5`
- `vis_screen` : The visdom env name for visualization. default = `gan`
- `save_path` : Name of the directory (inside **checkpoints**) where the parameters of the model will be stored.
- `l1_coef` : L1 loss coefficient in the generator loss function for `gan` and `vanilla_gan`. default = `50`
- `l2_coef` : Feature matching coefficient in the generator loss function for `gan` and `vanilla_gan`. default = `100`
- `pre_trained_disc` : Path to a pre-trained discriminator model used for initializing training.
- `pre_trained_gen` : Path to a pre-trained generator model used for initializing training.
- `batch_size` : Batch size. default = `64`
- `num_workers` : Number of dataloader workers used for fetching data. default = `8`
- `epochs` : Number of training epochs. default = `200`
- `softmax_coef` : Parameter scaling the loss of the classifier on top of the embedding.
- `image_size` : Number of pixels per dimension; images are assumed to be square. Two possible values: `64 | 128`. default = `64`
- `inference` : Boolean for choosing whether to train or test. default = `False`
- `cls` : Boolean flag indicating whether to train with the CLS algorithm. default = `False`
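For illustration only, a 64x64 conditional GAN training run might be launched as follows; the `--name value` flag syntax and the concrete values are assumptions, since the original README only shows the bare command:

```
python runtime.py --type gan --lr_G 0.0001 --lr_D 0.0004 --batch_size 64 \
                  --epochs 200 --image_size 64 --vis_screen gan --save_path speech2face_64
```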
## References
[1] Generative Adversarial Text-to-Image Synthesis. https://arxiv.org/abs/1605.05396

[2] Improved Techniques for Training GANs. https://arxiv.org/abs/1606.03498

[3] Wasserstein GAN. https://arxiv.org/abs/1701.07875

[4] Improved Training of Wasserstein GANs. https://arxiv.org/pdf/1704.00028.pdf
@@ -0,0 +1,10 @@
# training pickle files path:
train_faces_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/faces/DA_balanced_faces_train.pkl'
train_audios_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/audios/DA_balanced_audios_train.pkl'

# inference pickle files path:
#inference_faces_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/faces/DA_balanced_faces_inf.pkl'
#inference_audios_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/audios/DA_balanced_audios_inf.pkl'

inference_faces_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/faces/DA_balanced_faces_train.pkl'
inference_audios_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/audios/DA_balanced_audios_train.pkl'
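A config like this is typically read with PyYAML. The sketch below is illustrative; the repository's actual loading code is not shown in this commit, and the `config.yaml` filename is an assumption:

```python
import yaml

with open("config.yaml") as f:   # filename is an assumption
    cfg = yaml.safe_load(f)

# paths to the pickled face crops and raw-audio chunks used for training
print(cfg["train_faces_path"])
print(cfg["train_audios_path"])
```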
@@ -0,0 +1,18 @@
import torch.nn as nn


class auxclassifier(nn.Module):
    # auxiliary classifier on top of the 128-dim speech embedding;
    # 10 output logits, one per YouTuber identity in the dataset

    def __init__(self):
        super(auxclassifier, self).__init__()

        self.latent_vector_dim = 128
        self.net = nn.Sequential(
            nn.Linear(in_features=self.latent_vector_dim, out_features=200),
            nn.BatchNorm1d(num_features=200),
            nn.LeakyReLU(negative_slope=0.2, inplace=True),
            nn.Linear(in_features=200, out_features=10),
            nn.BatchNorm1d(num_features=10),
        )

    def forward(self, x):
        return self.net(x)
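A quick sanity check of the auxiliary classifier above (an illustrative sketch, not part of the commit): it maps a batch of 128-dimensional speech embeddings to 10 logits, one per YouTuber identity.

```python
import torch

clf = auxclassifier()
clf.eval()                        # BatchNorm1d layers need eval mode (or a batch size > 1)
embeddings = torch.randn(4, 128)  # batch of 128-dim speech embeddings
logits = clf(embeddings)
print(logits.shape)               # torch.Size([4, 10])
```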
@@ -0,0 +1,58 @@
import torch
import torch.nn as nn
from models.spectral_norm import SpectralNorm
from scripts.utils import Concat_embed


class discriminator(nn.Module):
    def __init__(self, image_size):
        super(discriminator, self).__init__()
        self.image_size = image_size
        self.num_channels = 3
        self.latent_space = 128
        self.ndf = 64

        # common network for both architectures, when generating 64x64 or 128x128 images
        self.netD_1 = nn.Sequential(
            # input is (nc) x 64 x 64
            SpectralNorm(nn.Conv2d(self.num_channels, self.ndf, 4, 2, 1, bias=False)),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf) x 32 x 32
            SpectralNorm(nn.Conv2d(self.ndf, self.ndf * 2, 4, 2, 1, bias=False)),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*2) x 16 x 16
            SpectralNorm(nn.Conv2d(self.ndf * 2, self.ndf * 4, 4, 2, 1, bias=False)),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*4) x 8 x 8
            SpectralNorm(nn.Conv2d(self.ndf * 4, self.ndf * 8, 4, 2, 1, bias=False)),
            nn.LeakyReLU(0.2, inplace=True),
        )

        # if we are feeding D with 64x64 images:
        if self.image_size == 64:
            self.netD_2 = nn.Conv2d(self.ndf * 8 + self.latent_space, 1, 4, 1, 0, bias=False)

        # if we are feeding D with 128x128 images:
        elif self.image_size == 128:
            self.netD_1 = nn.Sequential(
                self.netD_1,
                SpectralNorm(nn.Conv2d(self.ndf * 8, self.ndf * 16, 4, 2, 1, bias=False)),
                nn.LeakyReLU(0.2, inplace=True),
            )
            self.netD_2 = nn.Conv2d(self.ndf * 16 + self.latent_space, 1, 4, 1, 0, bias=False)

    def forward(self, input_image, z_vector):

        # feeding input images to the first stack of conv layers
        x_intermediate = self.netD_1(input_image)

        # replicating the speech embedding (the audio after being fed into segan's D) spatially
        # and depth-concatenating it with the image features
        dimensions = list(x_intermediate.shape)
        x = torch.cat([x_intermediate, z_vector.repeat(1, 1, dimensions[2], dimensions[3])], 1)

        # feeding the result to the last conv layer
        x = self.netD_2(x)

        return x.view(-1, 1).squeeze(1), x_intermediate
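A shape walkthrough of the 64x64 discriminator above (an illustrative sketch, not from the original repository): four stride-2 convolutions reduce a 64x64 image to a 4x4, 512-channel feature map, the 128-dim speech embedding is tiled over that 4x4 grid, and the final 4x4 convolution yields one real/fake score per image.

```python
import torch

D = discriminator(image_size=64)
images = torch.randn(8, 3, 64, 64)  # real or generated RGB faces
z = torch.randn(8, 128, 1, 1)       # speech embedding, tiled spatially inside forward()
scores, features = D(images, z)
print(scores.shape)                 # torch.Size([8]) -- one score per image
print(features.shape)               # torch.Size([8, 512, 4, 4]) -- intermediate features (e.g. for feature matching)
```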
@@ -0,0 +1,83 @@
import torch.nn as nn
from models.auxiliary_classifier import auxclassifier
#from models.segan.segan_discriminator import Discriminator
from models.segan import Discriminator
from models.spectral_norm import SpectralNorm


class generator(nn.Module):
    def __init__(self, image_size, audio_samples):
        super(generator, self).__init__()

        # defining some useful variables
        self.audio_samples = audio_samples
        self.num_channels = 3
        self.latent_dim = 128
        self.ngf = 64
        self.image_size = image_size

        # defining segan's D, used here as the speech encoder
        self.d_fmaps = [16, 32, 128, 256, 512, 1024]
        self.audio_embedding = Discriminator(1, self.d_fmaps, 15, nn.LeakyReLU(0.3), self.audio_samples)
        # defining the auxiliary classifier
        self.aux_classifier = auxclassifier()

        # common network for both architectures when generating 64x64 or 128x128 images
        self.netG = nn.Sequential(
            # state size. (ngf*4) x 8 x 8
            SpectralNorm(nn.ConvTranspose2d(self.ngf * 8, self.ngf * 4, 4, 2, 1, bias=False)),
            nn.Dropout(),
            # nn.BatchNorm2d(self.ngf * 2),
            nn.ReLU(True),
            # state size. (ngf*2) x 16 x 16
            SpectralNorm(nn.ConvTranspose2d(self.ngf * 4, self.ngf * 2, 4, 2, 1, bias=False)),
            # nn.BatchNorm2d(self.ngf),
            nn.Dropout(),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            SpectralNorm(nn.ConvTranspose2d(self.ngf * 2, self.ngf, 4, 2, 1, bias=False)),
            nn.Dropout(),
            nn.ReLU(True),
            # if we added Dropout here, we would only generate noise rather than realistic faces
            SpectralNorm(nn.ConvTranspose2d(self.ngf, self.num_channels, 4, 2, 1, bias=False)),
            # state size. (num_channels) x 128 x 128
            nn.Tanh()
        )

        # if we want to generate 64x64 images:
        if self.image_size == 64:
            self.netG = nn.Sequential(
                SpectralNorm(nn.ConvTranspose2d(self.latent_dim, self.ngf * 8, 4, 1, 0, bias=False)),
                nn.Dropout(),
                # nn.BatchNorm2d(self.ngf * 4),
                nn.ReLU(True),
                self.netG
            )

        # if we want to generate 128 x 128 images:
        if self.image_size == 128:
            self.netG = nn.Sequential(
                SpectralNorm(nn.ConvTranspose2d(self.latent_dim, self.ngf * 16, 4, 1, 0, bias=False)),
                nn.Dropout(),
                nn.ReLU(True),
                SpectralNorm(nn.ConvTranspose2d(self.ngf * 16, self.ngf * 8, 4, 2, 1, bias=False)),
                nn.Dropout(),
                # nn.BatchNorm2d(self.ngf * 4),
                nn.ReLU(True),
                self.netG
            )

    def forward(self, raw_wav):

        # feeding the audio to segan's D
        y, wav_embedding = self.audio_embedding(raw_wav.unsqueeze(1))

        # storing scores after feeding the audio embedding to the classifier (softmax)
        softmax_scores = self.aux_classifier(y)

        # feeding the audio embedding to the GAN generator
        z_vector = y.unsqueeze(2).unsqueeze(3)
        output = self.netG(z_vector)

        return output, z_vector, softmax_scores
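A matching sketch for the generator above (illustrative only): the raw waveform is encoded by segan's discriminator into a speech embedding, the embedding is classified by the auxiliary classifier, and it is decoded into a face image. The waveform length and the assumption that the embedding is 128-dimensional cannot be verified from this commit alone.

```python
import torch

G = generator(image_size=64, audio_samples=16384)  # waveform length is an assumed example value
wav = torch.randn(2, 16384)                        # batch of raw speech waveforms
fake_faces, z_vector, softmax_scores = G(wav)
print(fake_faces.shape)      # torch.Size([2, 3, 64, 64]), assuming a 128-dim embedding
print(z_vector.shape)        # torch.Size([2, 128, 1, 1])
print(softmax_scores.shape)  # torch.Size([2, 10]) -- speaker logits from the aux classifier
```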