KVC — Research Codebase for Voice Conversion (WavLM Resynthesis → Diffusion VC)

This repository hosts research code and experiments toward high-fidelity, zero-shot voice conversion (VC). The current implemented focus is Stage 1: WavLM → waveform resynthesis (GAN vocoder-style training), with tools for inference, checkpoint evaluation, and ablation studies on WavLM layer usage.

Status

  • ✅ Stage 1 (implemented): WavLM-to-Audio resynthesis with GAN training + inference + objective evaluation.
  • 🧪 Ablations (implemented): layer weighting / last-N layers experiments (array training on Jean Zay).
  • 🚧 Stage 2 (planned): diffusion/flow-based VC conditioned on WavLM representations.

What is inside (implemented)

1) WavLM → Audio (GAN resynthesis)

Core training and inference scripts live under:

  • MIMIC-VC/wavlm_resynth/third_training/

This experiment folder includes:

  • train_gan.py / train.slurm — GAN training entrypoints (local/DDP and HPC).
  • models_improved.py — generator + WavLM adapter model.
  • discriminators.py — HiFi-GAN style discriminators (MPD/MSD).
  • losses_gan.py — adversarial + reconstruction losses.
  • inference.py / inference.slurm — training-like chunking + overlap-add inference.
  • eval_chkp.py (+ optional batch scripts) — bulk checkpoint evaluation + metrics export.

2) Ablation study (WavLM layer usage)

Ablation experiments are in:

  • MIMIC-VC/wavlm_resynth/ablation_study/

This folder contains:

  • train_gan_ablation.py / train_array.slurm — Slurm array training for last-N / weighted modes.
  • models_ablation.py, discriminators.py, losses_gan.py — ablation variants and GAN components.
  • Analysis + paper utilities: analyze_*, extract_*, fill_paper_tables_*, results*.csv, table*.tex, figures_ablation/, etc.

Repository structure (high-level)

KVC/
└── MIMIC-VC/
    └── wavlm_resynth/
        ├── third_training/   # Stage 1: main GAN training/inference/eval
        ├── ablation_study/   # Stage 1: ablation experiments + analysis/paper tools
        └── ...               # other experiments (legacy / intermediate)

⚙️ Installation

1. Clone the repository

git clone https://github.com/NassimaOULDOUALI/KVC.git
cd KVC

2. Create the Conda environment

If you use Conda:

conda env create -f environment.yml
conda activate kvc

You also need a working PyTorch + CUDA stack compatible with your cluster/GPU setup.

3. Pretrained models

Stage 1 expects a pretrained WavLM model (path is typically configured in YAML, e.g. config_gan.yaml). Make sure the checkpoint is accessible from your runtime environment.

Stage 1 — WavLM → Audio (GAN) usage

A) Training (local / torchrun)

From: MIMIC-VC/wavlm_resynth/third_training/

cd MIMIC-VC/wavlm_resynth/third_training

torchrun --standalone --nnodes=1 --nproc_per_node=2 train_gan.py \
  --config config_gan.yaml \
  --output_dir outputs_gan/run1 \
  --seed 1234

Dataset paths, sample rate, segment length, and WavLM settings are defined in config_gan.yaml. Adapt them to your filesystem.
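The exact schema of config_gan.yaml is repository-specific; as a rough illustration only (every key name below is an assumption, not the repository's actual schema), the settings mentioned above might be organized like this, shown here as a Python dict:

```python
# Hypothetical layout of config_gan.yaml, expressed as a Python dict.
# All key names are illustrative assumptions -- check the actual file.
config = {
    "data": {
        "train_manifest": "/path/to/train_list.txt",  # dataset paths
        "sample_rate": 16000,                         # WavLM expects 16 kHz audio
        "segment_length": 32000,                      # training segment, in samples
    },
    "wavlm": {
        "checkpoint": "/path/to/WavLM-Large.pt",      # pretrained WavLM weights
        "layer_mode": "weighted",                     # or "last_n" (see ablations)
    },
}
```

Whatever the real keys are, the point is that dataset location, sample rate, segment length, and the WavLM checkpoint path all live in one YAML file that you adapt to your filesystem.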

B) Inference (training-like chunking + overlap-add)

cd MIMIC-VC/wavlm_resynth/third_training

python inference.py \
  --config config_infer.yaml \
  --checkpoint /path/to/checkpoint.pt \
  --input_wav /path/to/input.wav \
  --output_wav /path/to/output.wav
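The actual chunking logic lives in inference.py; as a minimal sketch of the general overlap-add idea (the function name and the linear crossfade are my assumptions, not the repository's implementation), each fixed-length chunk is faded in over its head and out over its tail, so the weights in each overlap region sum to one:

```python
def overlap_add(chunks, hop):
    """Stitch equal-length chunks, spaced `hop` samples apart, with a
    linear crossfade over the (chunk_len - hop) overlapping samples.
    Sketch only -- not the repository's actual inference code."""
    chunk_len = len(chunks[0])
    overlap = chunk_len - hop
    out = [0.0] * (hop * (len(chunks) - 1) + chunk_len)
    for i, ch in enumerate(chunks):
        start = i * hop
        for j, sample in enumerate(ch):
            w = 1.0
            if overlap > 0:
                if i > 0 and j < overlap:             # fade in over the head
                    w *= j / overlap
                if i < len(chunks) - 1 and j >= hop:  # fade out over the tail
                    w *= (chunk_len - j) / overlap
            out[start + j] += sample * w
    return out
```

Because the fade-in and fade-out ramps are complementary, a constant signal passes through unchanged, which is the property that makes chunked decoding seam-free.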

C) Bulk checkpoint evaluation

Use eval_chkp.py for checkpoint scanning + decoding + metrics export to CSV.

cd MIMIC-VC/wavlm_resynth/third_training

python eval_chkp.py \
  --checkpoints_dir /path/to/checkpoints_dir \
  --audio_dir /path/to/audios \
  --out_dir outputs_eval/run1 \
  --n_audios 3 \
  --seed 0
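The scanning-and-export part of such a tool can be sketched with the standard library alone (function names and the CSV columns are assumptions for illustration, not eval_chkp.py's actual interface):

```python
import csv
from pathlib import Path

def scan_checkpoints(ckpt_dir, pattern="*.pt"):
    """List checkpoint files, sorted by the step number embedded in
    the filename (e.g. ckpt_00010.pt). Sketch, not the real script."""
    def step_of(path):
        digits = "".join(c for c in path.stem if c.isdigit())
        return int(digits) if digits else -1
    return sorted(Path(ckpt_dir).glob(pattern), key=step_of)

def export_metrics_csv(rows, out_csv):
    """Write one metrics row per (checkpoint, audio) pair.
    Column names here are hypothetical."""
    fieldnames = ["checkpoint", "audio", "metric"]
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

The real script additionally decodes each audio file with each checkpoint before computing metrics; the sketch only covers the bookkeeping around that inner loop.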

Ablation study (Stage 1)

A) Slurm array training (Jean Zay)

From: MIMIC-VC/wavlm_resynth/ablation_study/

cd MIMIC-VC/wavlm_resynth/ablation_study
sbatch train_array.slurm

This array typically sweeps --wavlm_last_n over a range (e.g., 1–12) with a selected --feature_mode.
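The two feature modes being swept can be illustrated abstractly. In this sketch (names, the uniform last-N average, and the one-value-per-frame simplification are all assumptions, not the repository's code), per-layer WavLM features are combined into a single feature track:

```python
def aggregate_layers(layer_feats, mode="last_n", last_n=4, weights=None):
    """Combine per-layer features (a list of layers, each a list of
    per-frame values) into one track. Illustrative sketch only.

    mode="last_n":   uniform average of the last N transformer layers.
    mode="weighted": normalized weighted sum over all layers.
    """
    if mode == "last_n":
        selected = layer_feats[-last_n:]
        weights = [1.0 / last_n] * last_n
    elif mode == "weighted":
        selected = layer_feats
        total = sum(weights)
        weights = [w / total for w in weights]
    else:
        raise ValueError(f"unknown mode: {mode}")
    n_frames = len(selected[0])
    return [sum(w * layer[t] for w, layer in zip(weights, selected))
            for t in range(n_frames)]
```

Sweeping last_n from 1 to 12 then asks how many of the topmost layers the generator actually needs, while the weighted mode lets the relative importance of each layer be estimated instead of imposed.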

B) Analysis & paper tables

This folder contains scripts that:

  • parse logs/checkpoints,
  • aggregate metrics into CSV,
  • generate LaTeX tables (table_ablation*.tex, layer_importance_table.tex),
  • export paper-ready summaries.

Reproducibility & housekeeping

  • Keep configs (*.yaml) and scripts under version control.
  • Keep heavy outputs (checkpoints, decoded audio, logs, wandb runs, etc.) out of Git.
  • Use deterministic seeds where possible (the Slurm scripts typically pass --seed).
  • Prefer a single canonical inference script for resynthesis, to avoid diverging pipelines.
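A minimal seeding helper, as a stdlib-only sketch (real training code here would also need to seed NumPy, PyTorch, and CUDA, which this version deliberately omits):

```python
import os
import random

def set_seed(seed: int) -> None:
    """Seed the stdlib RNG for repeatable runs.

    In actual training code you would additionally call
    numpy.random.seed, torch.manual_seed, and
    torch.cuda.manual_seed_all (assumption: those frameworks are
    in use, as the Slurm scripts suggest).
    """
    random.seed(seed)
    # PYTHONHASHSEED only takes effect for subprocesses, not this one.
    os.environ["PYTHONHASHSEED"] = str(seed)
```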

Roadmap (planned)

  • Stage 2: diffusion/flow-based VC conditioned on WavLM representations (speaker/style control).
  • Streaming inference constraints (latency, chunked decoding, overlap strategies).
  • Improved disentanglement and robustness (speaker leakage control, prosody transfer modules).

Authors

Nassima Ould Ouali — Hi! PARIS
Supervision: Éric Moulines

License

Apache-2.0

📬 Contact

For questions or contributions: 📧 nassima.ould-ouali@ip-paris.fr

⭐ Star the project

If you find this repository useful for research or development, please consider giving it a star on GitHub — it helps increase visibility and supports continued maintenance and improvements.
