
Commit

code uploaded
miqueltubau committed Dec 18, 2018
1 parent 8d63cc3 commit a33197f
Showing 339 changed files with 200,822 additions and 20 deletions.
674 changes: 674 additions & 0 deletions LICENSE


52 changes: 32 additions & 20 deletions Readme.md
@@ -1,39 +1,51 @@
# SPEECH-CONDITIONED FACE GENERATION USING GENERATIVE ADVERSARIAL NETWORKS

# Speech2Face
## Introduction
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) with raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g. a reference image or one-hot encoding). Our model is trained in a self-supervised fashion by exploiting the audio and visual signals naturally aligned in videos. To enable training from video data, we present a novel dataset collected for this work, with high-quality videos of ten youtubers with notable expressiveness in both the speech and visual signals.

We used [this](https://github.com/franroldans/tfm-franroldan-wav2pix) project as baseline.
Image synthesis has been a trending task for the AI community in recent years. Many works have shown the potential of Generative Adversarial Networks (GANs) to deal with tasks such as text-to-image or audio-to-image synthesis. In particular, recent advances in deep learning using audio have inspired many works involving both visual and auditory information. In this work we propose a face synthesis method which is trained end-to-end using audio and/or language representations as inputs. We used [this](https://github.com/aelnouby/Text-to-Image-Synthesis) project as baseline.

<figure><img src='images/a2i.png'></figure>

<figure><img src='assets/Architecture.png'></figure>

## Requirements

- Python 2.7
- PyTorch
- h5py
- PIL
- numpy
- matplotlib

This implementation only supports running with GPUs.

## Usage
### Training

`python runtime.py`

**Arguments:**
- `lr_D` : The learning rate of the discriminator. default = `0.0004`
- `lr_G` : The learning rate of the generator. default = `0.0001`
- `type` : GAN architecture to use `(gan | wgan | vanilla_gan | vanilla_wgan)`. default = `gan`. Vanilla means not conditional.
- `dataset` : Dataset to use `(birds | flowers)`. default = `flowers`
- `split` : An integer indicating which split to use `(0 : train | 1: valid | 2: test)`. default = `0`
- `lr` : The learning rate. default = `0.0002`
- `diter` : Only for WGAN, number of discriminator iterations per generator iteration. default = `5`
- `vis_screen` : The visdom env name for visualization. default = `gan`
- `save_path` : Name of the directory (inside **checkpoints**) where the parameters of the model will be stored.
- `l1_coef` : L1 loss coefficient in the generator loss function for gan and vanilla_gan. default = `50`
- `l2_coef` : Feature matching coefficient in the generator loss function for gan and vanilla_gan. default = `100`
- `pre_trained_disc` : Path to a pre-trained discriminator model used for initializing training.
- `pre_trained_gen` : Path to a pre-trained generator model used for initializing training.
- `batch_size` : Batch size. default = `64`
- `num_workers` : Number of dataloader workers used for fetching data. default = `8`
- `epochs` : Number of training epochs. default = `200`
- `softmax_coef` : Parameter scaling the loss of the classifier on top of the embedding.
- `image_size` : Number of pixels per dimension; images are assumed to be square. Two possible values: `64 | 128`. default = `64`
- `inference` : Boolean for choosing whether to train or test. default = `False`
- `cls` : Boolean flag indicating whether to train with the cls algorithm or not. default = `False`
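
For instance, a training run at 128x128 resolution might be launched as follows (a hypothetical invocation, assuming `runtime.py` exposes the options above as standard `--flag value` command-line arguments):

```
python runtime.py --lr_D 0.0004 --lr_G 0.0001 --image_size 128 --batch_size 64 --epochs 200 --save_path speech2face_128
```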


## References
[1] Generative Adversarial Text-to-Image Synthesis https://arxiv.org/abs/1605.05396

[2] Improved Techniques for Training GANs https://arxiv.org/abs/1606.03498

[3] Wasserstein GAN https://arxiv.org/abs/1701.07875

[4] Improved Training of Wasserstein GANs https://arxiv.org/pdf/1704.00028.pdf

10 changes: 10 additions & 0 deletions config.yaml
@@ -0,0 +1,10 @@
# training pickle files path:
train_faces_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/faces/DA_balanced_faces_train.pkl'
train_audios_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/audios/DA_balanced_audios_train.pkl'

# inference pickle files path:
#inference_faces_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/faces/DA_balanced_faces_inf.pkl'
#inference_audios_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/audios/DA_balanced_audios_inf.pkl'

inference_faces_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/faces/DA_balanced_faces_train.pkl'
inference_audios_path: '/imatge/mtubau/projects/faces/Speech2Youtubers/balanced_pikles/audios/DA_balanced_audios_train.pkl'
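
A minimal sketch of how these paths could be read at training time (assuming the project loads `config.yaml` with PyYAML; the actual loading code is not part of this file):

```python
import yaml

# load the dataset paths defined above
with open('config.yaml') as f:
    config = yaml.safe_load(f)

train_faces_path = config['train_faces_path']    # pickle with the face crops
train_audios_path = config['train_audios_path']  # pickle with the raw audio chunks
```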
Empty file added models/__init__.py
Empty file.
18 changes: 18 additions & 0 deletions models/auxiliary_classifier.py
@@ -0,0 +1,18 @@
import torch.nn as nn

class auxclassifier(nn.Module):

    def __init__(self):
        super(auxclassifier, self).__init__()

        # the classifier takes the 128-dimensional speech embedding as input
        self.latent_vector_dim = 128
        self.net = nn.Sequential(
            nn.Linear(in_features=self.latent_vector_dim, out_features=200),
            nn.BatchNorm1d(num_features=200),
            nn.LeakyReLU(negative_slope=0.2, inplace=True),
            # one output logit per identity in the dataset
            nn.Linear(in_features=200, out_features=10),
            nn.BatchNorm1d(num_features=10),
        )

    def forward(self, x):
        return self.net(x)
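
As a quick sanity check, the classifier maps a batch of 128-dimensional speech embeddings to 10 identity logits (a minimal sketch with random inputs; in the real model the embeddings come from the generator's audio encoder):

```python
import torch
from models.auxiliary_classifier import auxclassifier

clf = auxclassifier()
embeddings = torch.randn(4, 128)   # batch of 4 speech embeddings
logits = clf(embeddings)           # shape: (4, 10), one logit per identity
```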
58 changes: 58 additions & 0 deletions models/discriminator.py
@@ -0,0 +1,58 @@
import torch
import torch.nn as nn
from models.spectral_norm import SpectralNorm
from scripts.utils import Concat_embed


class discriminator(nn.Module):
    def __init__(self, image_size):
        super(discriminator, self).__init__()
        self.image_size = image_size
        self.num_channels = 3
        self.latent_space = 128
        self.ndf = 64

        # common network for both architectures, when generating 64x64 or 128x128 images
        self.netD_1 = nn.Sequential(
            # input is (nc) x 64 x 64
            SpectralNorm(nn.Conv2d(self.num_channels, self.ndf, 4, 2, 1, bias=False)),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf) x 32 x 32
            SpectralNorm(nn.Conv2d(self.ndf, self.ndf * 2, 4, 2, 1, bias=False)),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*2) x 16 x 16
            SpectralNorm(nn.Conv2d(self.ndf * 2, self.ndf * 4, 4, 2, 1, bias=False)),
            nn.LeakyReLU(0.2, inplace=True),
            # state size. (ndf*4) x 8 x 8
            SpectralNorm(nn.Conv2d(self.ndf * 4, self.ndf * 8, 4, 2, 1, bias=False)),
            nn.LeakyReLU(0.2, inplace=True),
        )

        # if we are feeding D with 64x64 images:
        if self.image_size == 64:
            self.netD_2 = nn.Conv2d(self.ndf * 8 + self.latent_space, 1, 4, 1, 0, bias=False)

        # if we are feeding D with 128x128 images:
        elif self.image_size == 128:
            self.netD_1 = nn.Sequential(
                self.netD_1,
                SpectralNorm(nn.Conv2d(self.ndf * 8, self.ndf * 16, 4, 2, 1, bias=False)),
                nn.LeakyReLU(0.2, inplace=True),
            )
            self.netD_2 = nn.Conv2d(self.ndf * 16 + self.latent_space, 1, 4, 1, 0, bias=False)

    def forward(self, input_image, z_vector):

        # feeding input images to the first stack of conv layers
        x_intermediate = self.netD_1(input_image)

        # replicating the speech embedding spatially and concatenating it depth-wise
        # with the intermediate image features
        dimensions = list(x_intermediate.shape)
        x = torch.cat([x_intermediate, z_vector.repeat(1, 1, dimensions[2], dimensions[3])], 1)

        # feeding the result to the last conv layer
        x = self.netD_2(x)

        return x.view(-1, 1).squeeze(1), x_intermediate
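
The discriminator scores an image conditioned on the speech embedding. A minimal shape check (a sketch with random tensors; during training the images come from the dataset or the generator, and `z_vector` from the audio encoder):

```python
import torch
from models.discriminator import discriminator

D = discriminator(image_size=64)
images = torch.randn(4, 3, 64, 64)     # batch of RGB images
z_vector = torch.randn(4, 128, 1, 1)   # speech embedding with 1x1 spatial size
scores, features = D(images, z_vector)
print(scores.shape)                    # torch.Size([4])
print(features.shape)                  # torch.Size([4, 512, 4, 4])
```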
83 changes: 83 additions & 0 deletions models/generator.py
@@ -0,0 +1,83 @@
import torch.nn as nn
from models.auxiliary_classifier import auxclassifier
#from models.segan.segan_discriminator import Discriminator
from models.segan import Discriminator
from models.spectral_norm import SpectralNorm

class generator(nn.Module):
    def __init__(self, image_size, audio_samples):
        super(generator, self).__init__()

        # defining some useful variables
        self.audio_samples = audio_samples
        self.num_channels = 3
        self.latent_dim = 128
        self.ngf = 64
        self.image_size = image_size

        # defining segan's D, used here as the speech encoder
        self.d_fmaps = [16, 32, 128, 256, 512, 1024]
        self.audio_embedding = Discriminator(1, self.d_fmaps, 15, nn.LeakyReLU(0.3), self.audio_samples)
        # defining the auxiliary classifier
        self.aux_classifier = auxclassifier()

        # common network for both architectures when generating 64x64 or 128x128 images
        self.netG = nn.Sequential(
            # state size. (ngf*4) x 8 x 8
            SpectralNorm(nn.ConvTranspose2d(self.ngf * 8, self.ngf * 4, 4, 2, 1, bias=False)),
            nn.Dropout(),
            # nn.BatchNorm2d(self.ngf * 2),
            nn.ReLU(True),
            # state size. (ngf*2) x 16 x 16
            SpectralNorm(nn.ConvTranspose2d(self.ngf * 4, self.ngf * 2, 4, 2, 1, bias=False)),
            # nn.BatchNorm2d(self.ngf),
            nn.Dropout(),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            SpectralNorm(nn.ConvTranspose2d(self.ngf * 2, self.ngf, 4, 2, 1, bias=False)),
            nn.Dropout(),
            nn.ReLU(True),
            # adding Dropout here would make the network generate noise rather than realistic faces
            SpectralNorm(nn.ConvTranspose2d(self.ngf, self.num_channels, 4, 2, 1, bias=False)),
            # state size. (num_channels) x 128 x 128
            nn.Tanh()
        )

        # if we want to generate 64x64 images:
        if self.image_size == 64:
            self.netG = nn.Sequential(
                SpectralNorm(nn.ConvTranspose2d(self.latent_dim, self.ngf * 8, 4, 1, 0, bias=False)),
                nn.Dropout(),
                # nn.BatchNorm2d(self.ngf * 4),
                nn.ReLU(True),
                self.netG
            )

        # if we want to generate 128 x 128 images:
        if self.image_size == 128:
            self.netG = nn.Sequential(
                SpectralNorm(nn.ConvTranspose2d(self.latent_dim, self.ngf * 16, 4, 1, 0, bias=False)),
                nn.Dropout(),
                nn.ReLU(True),
                SpectralNorm(nn.ConvTranspose2d(self.ngf * 16, self.ngf * 8, 4, 2, 1, bias=False)),
                nn.Dropout(),
                # nn.BatchNorm2d(self.ngf * 4),
                nn.ReLU(True),
                self.netG
            )

    def forward(self, raw_wav):

        # feeding the audio to segan's D
        y, wav_embedding = self.audio_embedding(raw_wav.unsqueeze(1))

        # storing the scores obtained by feeding the audio embedding to the classifier (softmax)
        softmax_scores = self.aux_classifier(y)

        # feeding the audio embedding to the GAN generator
        z_vector = y.unsqueeze(2).unsqueeze(3)
        output = self.netG(z_vector)

        return output, z_vector, softmax_scores
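
Putting the two models together, a rough sketch of a single forward pass (the number of raw audio samples is an assumption, the segan-based audio encoder imported above must be available, and the actual training loop lives in `runtime.py`, which is not shown here):

```python
import torch
from models.generator import generator
from models.discriminator import discriminator

audio_samples = 65536  # assumed length of the raw speech waveform per sample

G = generator(image_size=64, audio_samples=audio_samples).cuda()
D = discriminator(image_size=64).cuda()

wav = torch.randn(8, audio_samples).cuda()      # batch of raw waveforms
fake_images, z_vector, softmax_scores = G(wav)  # faces, speech embedding, identity logits
fake_scores, _ = D(fake_images, z_vector)       # discriminator score per generated face
```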


