This repository hosts research code and experiments toward high-fidelity, zero-shot voice conversion (VC). The currently implemented focus is Stage 1: WavLM → waveform resynthesis (GAN vocoder-style training), with tools for inference, checkpoint evaluation, and ablation studies on WavLM layer usage.
Status
- ✅ Stage 1 (implemented): WavLM-to-Audio resynthesis with GAN training + inference + objective evaluation.
- 🧪 Ablations (implemented): layer weighting / last-N layers experiments (array training on Jean Zay).
- 🚧 Stage 2 (planned): diffusion/flow-based VC conditioned on WavLM representations.
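The "layer weighting" ablation presumably learns a convex combination of the per-layer WavLM outputs. A minimal sketch of that operation (a softmax-weighted sum is assumed here; see `models_ablation.py` for the actual variant):

```python
import numpy as np

def weighted_layer_sum(layer_feats, logits):
    """Combine per-layer WavLM features with softmax-normalized weights.

    layer_feats: array of shape (n_layers, n_frames, dim)
    logits: unnormalized layer weights, shape (n_layers,)
    """
    w = np.exp(logits - logits.max())
    w = w / w.sum()                               # softmax over layers
    return np.tensordot(w, layer_feats, axes=1)   # -> (n_frames, dim)
```

With equal logits this reduces to a plain average over layers, which is a common baseline for the weighted mode.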
Core training and inference scripts live under:
MIMIC-VC/wavlm_resynth/third_training/
This experiment folder includes:
- `train_gan.py` / `train.slurm` — GAN training entrypoints (local/DDP and HPC).
- `models_improved.py` — generator + WavLM adapter model.
- `discriminators.py` — HiFi-GAN style discriminators (MPD/MSD).
- `losses_gan.py` — adversarial + reconstruction losses.
- `inference.py` / `inference.slurm` — training-like chunking + overlap-add inference.
- `eval_chkp.py` (+ optional batch scripts) — bulk checkpoint evaluation + metrics export.
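`losses_gan.py` is described as holding the adversarial + reconstruction losses; HiFi-GAN-style training normally uses the least-squares (LSGAN) adversarial objective, which is assumed in this minimal sketch (the actual file may combine it with feature-matching and mel-spectrogram reconstruction terms):

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real scores to 1, fake to 0."""
    return np.mean((1.0 - d_real) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Least-squares generator loss: push fake scores toward 1."""
    return np.mean((1.0 - d_fake) ** 2)
```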
Ablation experiments are in:
MIMIC-VC/wavlm_resynth/ablation_study/
This folder contains:
- `train_gan_ablation.py` / `train_array.slurm` — Slurm array training for last-N / weighted modes.
- `models_ablation.py`, `discriminators.py`, `losses_gan.py` — ablation variants and GAN components.
- Analysis + paper utilities: `analyze_*`, `extract_*`, `fill_paper_tables_*`, `results*.csv`, `table*.tex`, `figures_ablation/`, etc.
KVC/
└── MIMIC-VC/
└── wavlm_resynth/
├── third_training/ # Stage 1: main GAN training/inference/eval
├── ablation_study/ # Stage 1: ablation experiments + analysis/paper tools
└── ... # other experiments (legacy / intermediate)
git clone https://github.com/NassimaOULDOUALI/KVC.git
cd KVC
If you use Conda:
conda env create -f environment.yml
conda activate kvc
You also need a working PyTorch + CUDA stack compatible with your cluster/GPU setup.
Stage 1 expects a pretrained WavLM model (path is typically configured in YAML, e.g. config_gan.yaml). Make sure the checkpoint is accessible from your runtime environment.
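For illustration, the relevant fields of `config_gan.yaml` might look like the fragment below; the key names are hypothetical, so check the actual file for the real schema:

```yaml
# Hypothetical excerpt — field names are illustrative, not the actual schema.
wavlm:
  checkpoint_path: /path/to/WavLM-Large.pt
  feature_mode: weighted    # or last_n, depending on the experiment
data:
  sample_rate: 16000
  segment_length: 32768
```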
A) Training (local / torchrun)
From: MIMIC-VC/wavlm_resynth/third_training/
cd MIMIC-VC/wavlm_resynth/third_training
torchrun --standalone --nnodes=1 --nproc_per_node=2 train_gan.py \
--config config_gan.yaml \
--output_dir outputs_gan/run1 \
--seed 1234
Dataset paths, sample rate, segment length, and WavLM settings are defined in config_gan.yaml. Adapt them to your filesystem.
B) Inference (training-like chunking + overlap-add)
cd MIMIC-VC/wavlm_resynth/third_training
python inference.py \
--config config_infer.yaml \
--checkpoint /path/to/checkpoint.pt \
--input_wav /path/to/input.wav \
--output_wav /path/to/output.wav
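The inference script is described as decoding with training-like chunking plus overlap-add. A minimal sketch of the overlap-add step (linear cross-fade ramps are assumed here; `inference.py` may use a different window):

```python
import numpy as np

def overlap_add(chunks, hop):
    """Reassemble fixed-length chunks decoded independently,
    cross-fading the overlapping regions with linear ramps."""
    n, L = len(chunks), len(chunks[0])
    out = np.zeros(hop * (n - 1) + L)
    norm = np.zeros_like(out)
    win = np.ones(L)
    ov = L - hop                       # overlap between adjacent chunks
    if ov > 0:
        ramp = np.linspace(0.0, 1.0, ov)
        win[:ov] = ramp                # fade-in
        win[-ov:] = ramp[::-1]         # fade-out
    for i, c in enumerate(chunks):
        out[i * hop : i * hop + L] += c * win
        norm[i * hop : i * hop + L] += win
    return out / np.maximum(norm, 1e-8)
```

Decoding in chunks that match the training segment length avoids the train/inference mismatch that long utterances would otherwise introduce.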
C) Bulk checkpoint evaluation
Use `eval_chkp.py` for checkpoint scanning + decoding + metrics export to CSV.
cd MIMIC-VC/wavlm_resynth/third_training
python eval_chkp.py \
--checkpoints_dir /path/to/checkpoints_dir \
--audio_dir /path/to/audios \
--out_dir outputs_eval/run1 \
--n_audios 3 \
--seed 0
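The scanning part of bulk evaluation can be sketched as follows (a `*.pt` naming scheme with the training step embedded in the filename is assumed; `eval_chkp.py` additionally decodes audio and computes metrics):

```python
import csv
import re
from pathlib import Path

def scan_checkpoints(ckpt_dir, out_csv):
    """List *.pt checkpoints, sort them by the training step parsed
    from the filename, and write a CSV skeleton for per-checkpoint metrics."""
    rows = []
    for p in Path(ckpt_dir).glob("*.pt"):
        m = re.search(r"(\d+)", p.stem)
        step = int(m.group(1)) if m else -1
        rows.append({"checkpoint": p.name, "step": step})
    rows.sort(key=lambda r: r["step"])
    with open(out_csv, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=["checkpoint", "step"])
        w.writeheader()
        w.writerows(rows)
    return rows
```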
A) Slurm array training (Jean Zay)
From: MIMIC-VC/wavlm_resynth/ablation_study/
cd MIMIC-VC/wavlm_resynth/ablation_study
sbatch train_array.slurm
This array typically sweeps --wavlm_last_n over a range (e.g., 1–12) with a selected --feature_mode.
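Inside the training script, each array task presumably maps its Slurm index to one hyperparameter setting. A minimal sketch of that mapping (the 0-based index → `last_n` scheme is an assumption; check `train_gan_ablation.py` / `train_array.slurm` for the real one):

```python
import os

def task_config(task_id, feature_mode="last_n"):
    """Map a 0-based Slurm array index to ablation hyperparameters:
    task 0 -> last_n=1, ..., task 11 -> last_n=12 (hypothetical mapping)."""
    return {"feature_mode": feature_mode, "wavlm_last_n": task_id + 1}

# The array index arrives via the environment set by Slurm:
tid = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
cfg = task_config(tid)
```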
This folder contains scripts that:
- parse logs/checkpoints,
- aggregate metrics into CSV,
- generate LaTeX tables (`table_ablation*.tex`, `layer_importance_table.tex`),
- export paper-ready summaries.
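The CSV → LaTeX step can be sketched as below (column names are illustrative; the `fill_paper_tables_*` scripts define the actual table layouts):

```python
def latex_rows(records, columns):
    """Format a list of metric dicts as LaTeX tabular rows."""
    lines = []
    for rec in records:
        cells = [str(rec[c]) for c in columns]
        lines.append(" & ".join(cells) + r" \\")
    return "\n".join(lines)
```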
- Keep configs (`*.yaml`) and scripts under version control.
- Keep heavy outputs out of Git: checkpoints, decoded audio, logs, wandb runs, etc.
- Use deterministic seeds where possible (the Slurm scripts typically pass --seed).
- Prefer a single canonical inference script for resynthesis (to avoid diverging pipelines).
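A seed helper along these lines keeps runs reproducible (a sketch; the repo's scripts may seed differently, and on GPU one would extend it with `torch.manual_seed` / `torch.cuda.manual_seed_all`):

```python
import os
import random

import numpy as np

def set_seed(seed):
    """Seed the Python, NumPy, and hash RNGs so reruns are deterministic."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
```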
Roadmap (planned)
- Stage 2: diffusion/flow-based VC conditioned on WavLM representations (speaker/style control).
- Streaming inference constraints (latency, chunked decoding, overlap strategies).
- Improved disentanglement and robustness (speaker leakage control, prosody transfer modules).
Authors
Nassima Ould Ouali — Hi! PARIS
Supervision: Éric Moulines
License
Apache-2.0
For questions or contributions: 📧 nassima.ould-ouali@ip-paris.fr
If you find this repository useful for research or development, please consider giving it a star on GitHub — it helps increase visibility and supports continued maintenance and improvements.