Quantum Reservoir Computing (QRC) for Molecular Activity Prediction

This repository contains the code and data processing scripts for reproducing the results presented in our paper:

"Robust Quantum Reservoir Computing for Molecular Property Prediction"
Daniel Beaulieu, Milan Kornjaca, Zoran Krunic, Michael Stivaktakis, Thomas Ehmer, Sheng-Tao Wang, Anh Pham
arXiv:2412.06758

Paper link: https://arxiv.org/abs/2412.06758

Overview

This project implements Quantum Reservoir Computing (QRC) and Classical Reservoir Computing (CRC) methods for molecular activity prediction using the Merck Molecular Activity Challenge dataset. The codebase includes data preprocessing, embedding generation, model training, and evaluation scripts.

Prerequisites

Python Environment

Python dependencies are managed using uv
Install dependencies: uv sync
Configuration files: pyproject.toml and uv.lock

Julia Environment

Julia packages are specified in Manifest.toml
All necessary Julia packages are included in the project manifest

Dataset

Download the Merck Molecular Activity Challenge dataset (2.1GB) from Kaggle and place the files in DATA/TrainingSet/. The dataset files are named as ACT{number}_competition_training.csv.

Usage

1. Data Preparation

Run qrc-dataprep.py to create subsamples for 100, 200, and 800 records, as well as 25 subsamples of 100 samples for cross-validation. The script implements the data preparation pipeline and classical regression models used in the paper. It incorporates data cleaning, outlier detection, feature scaling, dimensionality reduction (PCA), and feature selection using SHAP values. It generates multiple stratified subsamples and cross-validation splits, enabling robust evaluation of machine learning models on both original and QRC-embedded features. The workflow supports reproducible experiments by saving processed datasets and feature selections, facilitating systematic comparison of classical and quantum-inspired approaches in large-scale molecular regression tasks.

2. Generate Embeddings

QRC Embeddings

Run qrc_regression_merck.jl to create QRC embeddings and QRC one-body embeddings from the preprocessed data. The script encodes classical molecular descriptors into quantum states using quantum reservoir detuning layers and simulates their evolution under a Rydberg Hamiltonian, extracting quantum observables as high-dimensional embeddings. It processes subsamples for a given data set a a given number of records (100,200,800) and different experimental configurations. The QRC embedding outputs are used in qrc_runalgos_alltypes.py.

CRC Embeddings

Run crc_randforest_embeddingonly.jl to generate Classical Reservoir Computing (CRC) embeddings. Classical reservoir embeddings are generated from vector spin limit of the Rydberg Hamiltonian. The script processes molecular activity datasets by normalizing and projecting features, simulating their evolution through the integration of the spin dynamical system, and extracts time-dependent observables as embeddings. These embeddings are saved for downstream machine learning tasks to systematically compare CRC-based representations with quantum and classical baselines. The code is designed to process subsamples for a particular Merck molecular activity dataset, a number of data subsamples, a specified number of records (100,200,800), and various reservoir configurations. The CRC embedding outputs are used in qrc_runalgos_alltypes_crc.py.

Shot Noise Simulation Embeddings

Run qrc_regression_wavefunction_milan.jl to create embeddings using simulated shot noise through the wavefunction sampling. The script encodes classical molecular descriptors into quantum states via quantum reservoir detuning layers, simulates their evolution under a Rydberg Hamiltonian, and extracts measurement observables from a finite number of wavefunction samples as high-dimensional quantum embeddings, equivalently to qrc_regression_merck.jl. The user can specify the number of samples, with 0 representing exact sampling. The data collection for the paper is run repeatedly for each subsample in order to estimate shot noise related uncertainty.

3. Model Training and Evaluation

Standard Models

Run qrc_runalgos_alltypes.py to train and evaluate models on classical data, QRC embeddings, and QRC one-body embeddings. This script code runs the modeling pipeline and creates regression model diagnostics statistics such as mean-squared error, accuracy, AUC, F1-Score, recall, and precision for classical data, QRC two-body embeddings, and QRC one-body embeddings. It automates data postprocessing and model training using a range of scikit-learn regressors and neural networks. The script evaluates models on both raw molecular features and quantum reservoir computing (QRC) embeddings, systematically comparing performance metrics across multiple data splits and embedding types. Results are aggregated for analysis, enabling at scale evaluation of QRC and classical approaches.

CRC Models

Run qrc_runalgos_alltypes_crc.py to evaluate models specifically for CRC embeddings. This Python code runs the model pipeline for CRC embeddings and creates regression model diagnostics statistics equivalent to qrc_runalgos_alltypes.py.

Noise Simulation Models

Run qrc_runalgos_alltypes_noise.py to evaluate models for wavefunction sampling embeddings with various noise levels. This Python code runs the model pipeline for the shot noise simulated QRC embeddings and creates regression model diagnostics statistics equivalent to qrc_runalgos_alltypes.py.

4. Visualization and Analysis

UMAP Analysis

Run merck_activity_QRC_UMAP_recs200-sub4-act4_wbintargs_v3.ipynb to generate UMAP visualizations and statistical analysis. The notebook loads the QRC and classical embeddings created in the previous script, including QRC, classical reservoir, and classical reservoir computing (CRC) embedding. The program uses UMAP to performs topological data analysis and visualises low dimensional UMAP representations.

Figure Generation

Run Merck_boxplot.ipynb to create figures and tables used in the paper.

File Structure

Core Scripts

qrc-dataprep.py - Data preprocessing and subsampling
qrc_regression_merck.jl - QRC embedding generation
crc_randforest_embeddingonly.jl - CRC embedding generation
qrc_regression_wavefunction_milan.jl - Noise simulation embeddings
qrc_runalgos_alltypes.py - Model pipeline for standard embeddings
qrc_runalgos_alltypes_crc.py - Model pipeline for CRC embeddings
qrc_runalgos_alltypes_noise.py - Model pipeline for noise simulations

Notebooks

merck_activity_QRC_UMAP_recs200-sub4-act4_wbintargs_v3.ipynb - UMAP analysis
Merck_boxplot.ipynb - Paper figure generation

Generated Data Files

All processed data files are stored in the DATA/ folder with the following naming convention:

{x|y}_{train|test}_{N}rec_sub{M}act{K}v{V}.csv

Where:

x = input features, y = labels
N = number of records in subsample (e.g., 200)
M = subsample index (1-25 for cross-validation)
K = activity dataset number (e.g., ACT4)
V = version tag

Key Parameters

Important Variables

nfeats: Number of features selected for QRC by SHAP. Significantly impacts computational time. Reduce for testing.
n_cluster: Number of clusters for sampling. Ensures comprehensive coverage during sampling.
shapsample: Number of observations used in SHAP analysis. Affects computational cost and memory usage.
recs: Number of records in each subsample (e.g., 100, 200, 800).
actfile: Activity dataset identifier (e.g., ACT4, ACT5, etc.).
version: Version tag for generated data files.
subnum: Subsample index number (1-25 for cross-validation splits).

Notes

All Python and Julia files have been tested and verified to run successfully
Running scripts with identical parameters will generate different results due to random sampling
If you encounter "file not found" errors for CSV files, run the data generation scripts first
For parameter modifications or reruns, regenerate data using the source scripts before plotting
Performance optimization: If computations are too slow, reduce nfeats (affects QRC computational complexity) or shapsample (affects SHAP process computational complexity)

Citation

If you use this code in your research, please cite our paper:

@article{beaulieu2024robust,
  title={Robust Quantum Reservoir Computing for Molecular Property Prediction},
  author={Beaulieu, Daniel and Kornjaca, Milan and Krunic, Zoran and Stivaktakis, Michael and Ehmer, Thomas and Wang, Sheng-Tao and Pham, Anh},
  journal={arXiv preprint arXiv:2412.06758},
  year={2024},
  url={https://arxiv.org/abs/2412.06758}
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Quantum Reservoir Computing (QRC) for Molecular Activity Prediction

Overview

Prerequisites

Python Environment

Julia Environment

Dataset

Usage

1. Data Preparation

2. Generate Embeddings

QRC Embeddings

CRC Embeddings

Shot Noise Simulation Embeddings

3. Model Training and Evaluation

Standard Models

CRC Models

Noise Simulation Models

4. Visualization and Analysis

UMAP Analysis

Figure Generation

File Structure

Core Scripts

Notebooks

Generated Data Files

Key Parameters

Important Variables

Notes

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
DATA		DATA
.gitignore		.gitignore
Manifest.toml		Manifest.toml
Merck_boxplot.ipynb		Merck_boxplot.ipynb
Project.toml		Project.toml
README.md		README.md
crc_randforest_embedingonly.jl		crc_randforest_embedingonly.jl
main.py		main.py
merck_activity_QRC_UMAP_recs200-sub4-act4_wbintargs_v3.ipynb		merck_activity_QRC_UMAP_recs200-sub4-act4_wbintargs_v3.ipynb
pyproject.toml		pyproject.toml
qrc-dataprep.py		qrc-dataprep.py
qrc_regression_merck.jl		qrc_regression_merck.jl
qrc_regression_wavefunction_milan.jl		qrc_regression_wavefunction_milan.jl
qrc_runalgos_alltypes.py		qrc_runalgos_alltypes.py
qrc_runalgos_alltypes_crc.py		qrc_runalgos_alltypes_crc.py
qrc_runalgos_alltypes_noise.py		qrc_runalgos_alltypes_noise.py
uv.lock		uv.lock

QuEraComputing/robust-qrc-molecular-prediction

Folders and files

Latest commit

History

Repository files navigation

Quantum Reservoir Computing (QRC) for Molecular Activity Prediction

Overview

Prerequisites

Python Environment

Julia Environment

Dataset

Usage

1. Data Preparation

2. Generate Embeddings

QRC Embeddings

CRC Embeddings

Shot Noise Simulation Embeddings

3. Model Training and Evaluation

Standard Models

CRC Models

Noise Simulation Models

4. Visualization and Analysis

UMAP Analysis

Figure Generation

File Structure

Core Scripts

Notebooks

Generated Data Files

Key Parameters

Important Variables

Notes

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages