KVC — Research Codebase for Voice Conversion (WavLM Resynthesis → Diffusion VC)

This repository hosts research code and experiments toward high-fidelity, zero-shot voice conversion (VC). The current implemented focus is Stage 1: WavLM → waveform resynthesis (GAN vocoder-style training), with tools for inference, checkpoint evaluation, and ablation studies on WavLM layer usage.

Status

  • ✅ Stage 1 (implemented): WavLM-to-Audio resynthesis with GAN training + inference + objective evaluation.
  • 🧪 Ablations (implemented): layer weighting / last-N layers experiments (array training on Jean Zay).
  • 🚧 Stage 2 (planned): diffusion/flow-based VC conditioned on WavLM representations.

What is inside (implemented)

1) WavLM → Audio (GAN resynthesis)

Core training and inference scripts live under:

  • MIMIC-VC/wavlm_resynth/third_training/

This experiment folder includes:

  • train_gan.py / train.slurm — GAN training entrypoints (local/DDP and HPC).
  • models_improved.py — generator + WavLM adapter model.
  • discriminators.py — HiFi-GAN style discriminators (MPD/MSD).
  • losses_gan.py — adversarial + reconstruction losses.
  • inference.py / inference.slurm — training-like chunking + overlap-add inference.
  • eval_chkp.py (+ optional batch scripts) — bulk checkpoint evaluation + metrics export.

2) Ablation study (WavLM layer usage)

Ablation experiments are in:

  • MIMIC-VC/wavlm_resynth/ablation_study/

This folder contains:

  • train_gan_ablation.py / train_array.slurm — Slurm array training for last-N / weighted modes.
  • models_ablation.py, discriminators.py, losses_gan.py — ablation variants and GAN components.
  • Analysis + paper utilities: analyze_*, extract_*, fill_paper_tables_*, results*.csv, table*.tex, figures_ablation/, etc.

Repository structure (high-level)

KVC/
└── MIMIC-VC/
    └── wavlm_resynth/
        ├── third_training/   # Stage 1: main GAN training/inference/eval
        ├── ablation_study/   # Stage 1: ablation experiments + analysis/paper tools
        └── ...               # other experiments (legacy / intermediate)

⚙️ Installation

1. Clone the repository

git clone https://github.com/NassimaOULDOUALI/KVC.git
cd KVC

2. Create the Conda environment

If you use Conda:

conda env create -f environment.yml
conda activate kvc

You also need a working PyTorch + CUDA stack compatible with your cluster/GPU setup.

3. Pretrained models

Stage 1 expects a pretrained WavLM model (path is typically configured in YAML, e.g. config_gan.yaml). Make sure the checkpoint is accessible from your runtime environment.

Stage 1 — WavLM → Audio (GAN) usage

A) Training (local / torchrun)

From: MIMIC-VC/wavlm_resynth/third_training/

cd MIMIC-VC/wavlm_resynth/third_training

torchrun --standalone --nnodes=1 --nproc_per_node=2 train_gan.py \
  --config config_gan.yaml \
  --output_dir outputs_gan/run1 \
  --seed 1234

Dataset paths, sample rate, segment length, and WavLM settings are defined in config_gan.yaml. Adapt them to your filesystem.
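The exact schema of config_gan.yaml is repository-specific; as a rough illustration only (every key name below is an assumption, not the repository's actual schema), the settings mentioned above might be organized like this, shown here as a Python dict:

```python
# Hypothetical layout of config_gan.yaml, expressed as a Python dict.
# All key names are illustrative assumptions -- check the actual file.
config = {
    "data": {
        "train_manifest": "/path/to/train_list.txt",  # dataset paths
        "sample_rate": 16000,                         # WavLM expects 16 kHz audio
        "segment_length": 32000,                      # training segment, in samples
    },
    "wavlm": {
        "checkpoint": "/path/to/WavLM-Large.pt",      # pretrained WavLM weights
        "layer_mode": "weighted",                     # or "last_n" (see ablations)
    },
}
```

Whatever the real keys are, the point is that dataset location, sample rate, segment length, and the WavLM checkpoint path all live in one YAML file that you adapt to your filesystem.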

B) Inference (training-like chunking + overlap-add)

cd MIMIC-VC/wavlm_resynth/third_training

python inference.py \
  --config config_infer.yaml \
  --checkpoint /path/to/checkpoint.pt \
  --input_wav /path/to/input.wav \
  --output_wav /path/to/output.wav
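The actual chunking logic lives in inference.py; as a minimal sketch of the general overlap-add idea (the function name and the linear crossfade are my assumptions, not the repository's implementation), each fixed-length chunk is faded in over its head and out over its tail, so the weights in each overlap region sum to one:

```python
def overlap_add(chunks, hop):
    """Stitch equal-length chunks, spaced `hop` samples apart, with a
    linear crossfade over the (chunk_len - hop) overlapping samples.
    Sketch only -- not the repository's actual inference code."""
    chunk_len = len(chunks[0])
    overlap = chunk_len - hop
    out = [0.0] * (hop * (len(chunks) - 1) + chunk_len)
    for i, ch in enumerate(chunks):
        start = i * hop
        for j, sample in enumerate(ch):
            w = 1.0
            if overlap > 0:
                if i > 0 and j < overlap:             # fade in over the head
                    w *= j / overlap
                if i < len(chunks) - 1 and j >= hop:  # fade out over the tail
                    w *= (chunk_len - j) / overlap
            out[start + j] += sample * w
    return out
```

Because the fade-in and fade-out ramps are complementary, a constant signal passes through unchanged, which is the property that makes chunked decoding seam-free.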

C) Bulk checkpoint evaluation

Use eval_chkp.py for checkpoint scanning + decoding + metrics export to CSV.

cd MIMIC-VC/wavlm_resynth/third_training

python eval_chkp.py \
  --checkpoints_dir /path/to/checkpoints_dir \
  --audio_dir /path/to/audios \
  --out_dir outputs_eval/run1 \
  --n_audios 3 \
  --seed 0
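The scanning-and-export part of such a tool can be sketched with the standard library alone (function names and the CSV columns are assumptions for illustration, not eval_chkp.py's actual interface):

```python
import csv
from pathlib import Path

def scan_checkpoints(ckpt_dir, pattern="*.pt"):
    """List checkpoint files, sorted by the step number embedded in
    the filename (e.g. ckpt_00010.pt). Sketch, not the real script."""
    def step_of(path):
        digits = "".join(c for c in path.stem if c.isdigit())
        return int(digits) if digits else -1
    return sorted(Path(ckpt_dir).glob(pattern), key=step_of)

def export_metrics_csv(rows, out_csv):
    """Write one metrics row per (checkpoint, audio) pair.
    Column names here are hypothetical."""
    fieldnames = ["checkpoint", "audio", "metric"]
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

The real script additionally decodes each audio file with each checkpoint before computing metrics; the sketch only covers the bookkeeping around that inner loop.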

Ablation study (Stage 1)

A) Slurm array training (Jean Zay)

From: MIMIC-VC/wavlm_resynth/ablation_study/

cd MIMIC-VC/wavlm_resynth/ablation_study
sbatch train_array.slurm

This array typically sweeps --wavlm_last_n over a range (e.g., 1–12) with a selected --feature_mode.
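The two feature modes being swept can be illustrated abstractly. In this sketch (names, the uniform last-N average, and the one-value-per-frame simplification are all assumptions, not the repository's code), per-layer WavLM features are combined into a single feature track:

```python
def aggregate_layers(layer_feats, mode="last_n", last_n=4, weights=None):
    """Combine per-layer features (a list of layers, each a list of
    per-frame values) into one track. Illustrative sketch only.

    mode="last_n":   uniform average of the last N transformer layers.
    mode="weighted": normalized weighted sum over all layers.
    """
    if mode == "last_n":
        selected = layer_feats[-last_n:]
        weights = [1.0 / last_n] * last_n
    elif mode == "weighted":
        selected = layer_feats
        total = sum(weights)
        weights = [w / total for w in weights]
    else:
        raise ValueError(f"unknown mode: {mode}")
    n_frames = len(selected[0])
    return [sum(w * layer[t] for w, layer in zip(weights, selected))
            for t in range(n_frames)]
```

Sweeping last_n from 1 to 12 then asks how many of the topmost layers the generator actually needs, while the weighted mode lets the relative importance of each layer be estimated instead of imposed.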

B) Analysis & paper tables

This folder contains scripts that:

  • parse logs/checkpoints,
  • aggregate metrics into CSV,
  • generate LaTeX tables (table_ablation*.tex, layer_importance_table.tex),
  • export paper-ready summaries.

Reproducibility & housekeeping

  • Keep configs (*.yaml) and scripts under version control.
  • Keep heavy outputs (checkpoints, decoded audio, logs, wandb runs, etc.) out of Git.
  • Use deterministic seeds where possible (the Slurm scripts typically pass --seed).
  • Prefer a single canonical inference script for resynthesis, to avoid diverging pipelines.
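A minimal seeding helper, as a stdlib-only sketch (real training code here would also need to seed NumPy, PyTorch, and CUDA, which this version deliberately omits):

```python
import os
import random

def set_seed(seed: int) -> None:
    """Seed the stdlib RNG for repeatable runs.

    In actual training code you would additionally call
    numpy.random.seed, torch.manual_seed, and
    torch.cuda.manual_seed_all (assumption: those frameworks are
    in use, as the Slurm scripts suggest).
    """
    random.seed(seed)
    # PYTHONHASHSEED only takes effect for subprocesses, not this one.
    os.environ["PYTHONHASHSEED"] = str(seed)
```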

Roadmap (planned)

  • Stage 2: diffusion/flow-based VC conditioned on WavLM representations (speaker/style control).
  • Streaming inference constraints (latency, chunked decoding, overlap strategies).
  • Improved disentanglement and robustness (speaker leakage control, prosody transfer modules).

Authors

Nassima Ould Ouali — Hi! PARIS
Supervision: Éric Moulines

License

Apache-2.0

📬 Contact

For questions or contributions: 📧 nassima.ould-ouali@ip-paris.fr

⭐ Star the project

If you find this repository useful for research or development, please consider giving it a star on GitHub — it helps increase visibility and supports continued maintenance and improvements.
