Pegasus: MD-derived protein flexibility prediction

PEGASUS is a sequence-based predictor of MD-derived information on protein flexibility. It generates embeddings using pre-trained models, predicts residue-wise real values of backbone fluctuation (RMSF), Phi & Psi dihedral angles standard deviation, and average Local Distance Difference Test (Mean LDDT) across the trajectory, and can optionally generate interactive result web pages for each protein sequence provided. PEGASUS accepts a FASTA (optionnaly aligned) file as input and allows users to specify the computation device (CPU or GPU).

Webserver

This is the repository of the standalone version of the corresponding webserver:
dsimb.inserm.fr/PEGASUS

Features

Embedding Generation: Generates protein embeddings using state-of-the-art pre-trained models:
- ANKH Base (ankh_base)
- ANKH Large (ankh_large)
- ProtT5-XL UniRef50 (prot_t5_xl_uniref50)
- ESM2 (esm2_t36_3B_UR50D)
Metric Predictions: Predicts per-residue values for:
- Root Mean Square Fluctuation (RMSF)
- Standard Deviation of Phi Angles (Std. Phi)
- Standard Deviation of Psi Angles (Std. Psi)
- Mean Local Distance Difference Test (Mean LDDT)
Interactive Results: Optionally generates interactive HTML pages for visualization, including a comprehensive results overview page with comparison functionality.
Aligned Sequence Support: Supports aligned protein sequences as input, enabling the analysis of multiple sequence alignments.
HTTP Server: Optionally starts an HTTP server to serve the result pages, making it easy to view results in a web browser.
Flexible Computation: Supports CPU and GPU devices, with customizable device allocation for each model.
Reproducibility: Allows setting a random seed for consistent results.
Docker Support: Provides a Dockerfile for containerized execution.

Installation

Prerequisites

Operating System: Linux (not tested on Windows and macOS not supported)
Python: Version 3.8 or higher
CUDA: For GPU support (optional)
Conda: Package and environment management
Docker (optional): For containerized execution

1. Clone the Repository

git clone https://github.com/DSIMB/PEGASUS.git
cd PEGASUS

2. Download Pegasus weights (1.3 Gb)

Download and extract Pegasus weights in the models directory.

aria2c -d ./models https://dsimb.inserm.fr/PEGASUS/models/pegasus_weights.tar.gz && tar -xzvf ~/models/pegasus_weights.tar.gz -C ./models

3. Conda Installation (not necessary if using Docker)

Create and Activate the Conda Environment

conda env create -f pegasus.yml
conda activate pegasus

3 (bis). Docker Installation

Install NVIDIA Container Toolkit for GPU support.
Setup running Docker as a non-root user.

3.1. Build or download the Docker Image

docker pull dsimb/pegasus

or download the bre-built latest docker image:

docker build -t dsimb/pegasus .

3.2. Run the Docker Container

docker run -e USER_ID=$(id -u) -e GROUP_ID=$(id -g) --gpus all -v /path/to/input:/input -v /path/to/output:/output -v /path/to/models:/models dsimb/pegasus -i /input/sequences.fasta --output_dir /output --models_dir /models

Replace /path/to/input, /path/to/output, and /path/to/models with your local directories.
The --gpus all flag enables GPU support. Omit it if you wish to run on CPU.
The -e USER_ID=$(id -u) -e GROUP_ID=$(id -g) flags will generate the results owned by you, not root user when running docker with sudo.

Note

If cuda device is not detected in the docker container, try adding --privileged to the docker run command

Usage

PEGASUS can be run via the command line. Below are the available command-line arguments and usage examples.

Warning

Models are loaded on the selected device sequentially and purged, so the maximum memory required to run Pegasus corresponds to the largest model size, which is ESM2 with 11 Gb.

Command-Line Arguments

usage: pegasus.py [-h] -i INPUT_FASTA [-d {cpu,gpu}] [-m MODELS_DIR] [-o OUTPUT_DIR]
                  [--model_device_map MODEL_DEVICE_MAP [MODEL_DEVICE_MAP ...]]
                  [-s SEED] [-t TOKS_PER_BATCH] [-g] [-k] [-a] [--serve] [--host HOST] [--port PORT]

Option	Description
`input_fasta`	(Required) Path to the input (multi)FASTA file containing protein sequences.
`default_device`	Default computation device (`cpu` or `gpu`). Default is `cpu`.
`models_dir`	Directory containing pre-trained models. Default is the environment variable `MODELS_DIR` or `models`.
`output_dir`	Directory to save output files. Default is the environment variable `OUTPUT_DIR` or `output`.
`model_device_map`	Specify device for each model in the format `model_name:device` (e.g., `prot_t5_xl_uniref50:gpu`). Accepted models: `ankh_base`, `ankh_large`, `prot_t5_xl_uniref50`, `esm2_t36_3B_UR50D`, `pegasus`.
`seed`	Random seed for reproducibility. Default is `42`.
`toks_per_batch`	Maximum tokens per batch to use during embedding generation. Default is `2048`.
`max_seq_length`	Maximum sequence length allowed for processing. Sequences longer than this will be skipped. Default is `2048`.
`generate_html`	Include this flag to generate result web pages for each protein as well as an overview page.
`keep_embeddings`	Keep the LLMs raw embeddings in `OUTPUT_EMBEDDINGS` directory after use. By default, the directory is deleted.
`aligned_fasta`	Input protein sequences are aligned or not.
`serve`	Start an HTTP server at the end to serve the result pages.
`host`	Hostname to use when serving the result pages. Default is `"localhost"`.
`port`	Port to use when serving the result pages. Default is `8000`.

Examples

Basic Usage

Generate embeddings and predictions using default settings:
```
python pegasus.py -i sequences.fasta
```

Using GPU for All Computations

python pegasus.py -i sequences.fasta -d gpu

Specify Devices for Specific Models

python pegasus.py -i sequences.fasta -d gpu --model_device_map esm2_t36_3B_UR50D:cpu prot_t5_xl_uniref50:cpu

Specify that input protein sequences (multifasta) are aligned
```
python pegasus.py -i sequences.fasta --aligned_fasta
```

Generate Interactive HTML Result Pages

python pegasus.py -i sequences.fasta --generate_html

Full Command with Custom Output and Models Directory

python pegasus.py -i sequences.fasta --output_dir /path/to/output --models_dir /path/to/models --generate_html

Full Command with Custom Output and Models Directory and serve the result pages

python pegasus.py -i sequences.fasta --output_dir /path/to/output --models_dir /path/to/models --generate_html --serve

Output Structure

After running PEGASUS, a unique output directory (e.g. output/0d6d0268/) will contain:

Embeddings: Stored in embeddings/ subdirectory, organized by model name, if requested to be kept with command line argument --keep-embeddings (deleted by default).
Predictions: Stored in predictions/ subdirectory, includes per-protein TSV files with predictions:
- {prot_id}_raw.tsv : contains all the predictions of the four LLMs for each predicted metric.
- {prot_id}_predictions.tsv : contains the mean and std. deviation values for each metric.
Result Pages: If --generate_html is used, interactive HTML pages are stored in result_pages/.
id_mapping.tsv: Two columns TSV file containing a mapping of the unique Generated_ID (P{1..n}) generated by Pegasus and the Original_ID fasta header of each input protein.

View html result pages

If --generate_html is used, interactive HTML pages are stored in result_pages/.

Using the Built-in HTTP Server

If you used the --serve flag, the result pages are automatically served via an HTTP server. By default, you can access the results at http://localhost:8000/results_overview.html.

Manual HTTP Server

cd output/{job_id}/result_pages
python -m http.server
Serving HTTP on 0.0.0.0 port 8000 (http://0.0.0.0:8000/) ...

Access the pages on your browser at this URL: http://0.0.0.0:8000

Models

PEGASUS utilizes several pre-trained models for embedding generation:

ANKH Base and Large paper
ProtT5-XL UniRef50 paper
ESM2 (esm2_t36_3B_UR50D) paper

Note: The models are automatically downloaded and saved to the --models_dir directory upon first use.

License

This project is licensed under the terms of the MIT license. See the LICENSE file for details.

Citation

If you use PEGASUS in your research, please cite:

Contact

For questions, feedback, or issues, please open an issue on the GitHub repository

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Pegasus: MD-derived protein flexibility prediction

Webserver

Table of Contents

Features

Installation

Prerequisites

1. Clone the Repository

2. Download Pegasus weights (1.3 Gb)

3. Conda Installation (not necessary if using Docker)

3 (bis). Docker Installation

Usage

Command-Line Arguments

Examples

Output Structure

View html result pages

Models

License

Citation

Contact

Files

README.md

Latest commit

History

README.md

File metadata and controls

Pegasus: MD-derived protein flexibility prediction

Webserver

Table of Contents

Features

Installation

Prerequisites

1. Clone the Repository

2. Download Pegasus weights (1.3 Gb)

3. Conda Installation (not necessary if using Docker)

3 (bis). Docker Installation

Usage

Command-Line Arguments

Examples

Output Structure

View html result pages

Models

License

Citation

Contact