Skip to content

This is the source code for the paper "Merino G., Raad J., Bugnon L.A., Yones C., Kamenetzky L., Claus J., Ariel F., Milone D.H. and Stegmayer G., Novel SARS-CoV-2 encoded small RNAs in the passage to humans, Bioinformatics, btaa1002, 2020

License

Notifications You must be signed in to change notification settings

sinc-lab/sarscov2-mirna-discovery

Repository files navigation

Pipeline for pre-miRNAs discovery in the SARS-CoV-2 virus

This readme contains a description of the steps to obtain the main results of:

G.A. Merino, J. Raad, L.A. Bugnon, C. Yones, L. Kamenetzky, J. Claus, F. Ariel, D.H. Milone, G. Stegmayer, “Novel SARS-CoV-2 encoded small RNAs in the passage to humans,” Bioinformatics, 2020, DOI.

The following sections contain the steps and the libraries used in each of them. All datasets and libraries are open sourced and free to use. If you use any of the following in your pipelines, please cite them properly.

Abstract

1. Layout

Several intermediate results are provided in this repository, thus you don’t need to run the whole pipeline.

sarscov2-mirna-discovery
│
├── genomes
│   └── NC_045512.2_Wuhan-Hu-1.fasta  -> The complete SARS-CoV2 isolate Wuhan-Hu-1 genome. 
│
├── sequences
│   ├── sars-cov2_hairpins.fasta      -> Hairpin sequences extracted from NC_045512.2_Wuhan-Hu-1.
│   ├── sars-cov2_hairpins.fold       -> sars-cov2_hairpins.fasta and its folding structure.
│   ├── pre-miRNAs_virus.fasta        -> Hairpin sequences extracted from known virus pre-miRNAs.
│   ├── pre-miRNAs_virus.fold         -> pre-miRNAs_virus.fasta and its folding structure. 
│   └── unlabeled_hairpins.fold       -> Hairpin sequences and its folding structure extracted from 
|                                           the human genome, sampled from sequences that do not code 
|   					    pre-miRNAs. Provided by external storage.
│
├── pre-miRNAs_features
│   ├── sars-cov2_hairpins.csv        -> Features extracted from sars-cov2_hairpins.fasta.
│   ├── pre-miRNAs_virus.csv          -> Features extracted from pre-miRNAs_virus.fasta.
│   └── unlabeled_hairpins.csv        -> Features extracted from unlabeled_hairpins.fasta. 
|					     Provided by external storage.
│
├── pre-miRNAs_models                 -> Trained models for pre-miRNA prediction.
│
├── pre-miRNAs_predictions            -> Predictions on SARS-CoV2 hairpin sequences by each model.
│
├── src
|   ├── train_pre-miRNA_models.ipynb  -> Source code to train the pre-miRNAs prediction models.
|   ├── predict_pre-miRNAs.ipynb      -> Source code to predict pre-miRNAs on SARS-CoV2 sequences.
|   ├── DifferentialExpression.Rmd    -> Notebook for differential expression analysis.
|   ├── deregulatedTargets.sh
|   ├── DETargetsExploration.Rmd
|   ├── DOWNDETargetsExploration.Rmd
|   ├── extractDianaIDs.sh
|   ├── joinTargets.sh
|   ├── mappingIDs.Rmd
|   ├── ORADEGenes.Rmd
|   ├── ORADETargets.Rmd
|   ├── targetsExploration.Rmd
|   └── link_Figs.R                   -> Notebook to generate manuscript figures
|
├── matures                           
|   └── miRNAs.csv                    -> Mature miRNAs sequences found. 
|
├── targets                           -> Target prediction files
|   ├── miRDB
|   ├── Diana
|   ├── TargetsDE
|   └── overlap_<mirna>.tab            -> Target overlap for each miRNA 
|
├── expression_data                    -> Gene expression data
|
└── functional_enrichment              -> Functional enrichment results

2. Data preparation

Note that all the results generated in this section are provided in sequences/ and pre-miRNAs_features/ directories.

2.1. Download complete genomes

SARS-CoV-2 genome was obtained from the NCBI GenBank on March 2020.

These ".fasta" files can be also found in this repository, in the genomes/ directory.

2.2. Extract hairpin-like sequences

The SARS-CoV-2 genome fasta file is cut in overlapping windows using the HExtractor toolbox.

Install HExtractor

The toolbox uses some external well-known packages. The following software must be installed:

  • Vienna RNA: this package is used to fold sequences.

Tested versions: 2.1.0 - 2.1.8, 2.4.18.

  • NCBI blast+: this package is used to identify known pre-miRNAs.

Tested versions: 2.2.30, 2.6.0-1.

  • Matlab Tested versions: R2014a, R2017a.

To run HExtractor, move to the HExtractor folder and run the following in Matlab:

HExtractor('NC_045512.2_Wuhan-Hu-1.fasta', 'sars-cov-2.yaml', 'pre-miRNAs_virus.fasta');

This function creates the sars-cov2_hairpins.fasta file in the "out" folder and automatically names them.

To model the positive class, well known pre-miRNAs from viruses hairpins were extracted from mirbase. These sequences were folded with RNAfold using the following commands:

RNAfold --noPS --infile=sequences/pre-miRNAs_virus.fasta --outfile=sequences/pre-miRNAs_virus.fold
RNAfold --noPS --infile=sequences/sars-cov2_hairpins.fasta --outfile=sequences/sars-cov2_hairpins.fold

Additional sequences were used to train the mirDNN. A set of 1M hairpin-like sequences from the human genome, which are not pre-miRNAs, were used to model the negative set. The folding structure (unlabeled_hairpins.fold) and features (unlabeled_hairpins.csv) for these sequences can be downloaded from this external repository. You will only need to download the unlabeled_hairpins.fold if you want to train the mirDNN. This data and further details are available in [3].

At this point, the sequence and folding prediction for each hairpin-like sequence in SARS-CoV-2, the known virus pre-miRNAs and some human non-pre-miRNA sequences are ready.

2.3. Hairpin features extraction

In order to extract meaningful features from the sequences, the miRNAfe package was used [4]. Once installed, you can run the feature extraction with:

miRNAfe('sars-cov2_hairpins.fasta', 'config/Example_prediction.yaml');
miRNAfe('pre-miRNAs_virus.fasta', 'config/Example_prediction.yaml');

The ".yaml" file is provided with the package.

3. Training pre-miRNAs prediction models

The training notebook is provided with instructions to train the OC-SVM [5], deeSOM [6] and mirDNN [7] classifiers. Doing so may take several hours.

If you want to directly predict the sequences, you can use the trained models that are provided in pre-miRNAs_models/ as explained in the following section.

4. Finding pre-miRNAs candidates in SARS-CoV-2

The prediction notebook is provided with instructions to use the trained models to rank the SARS-CoV-2 sequences (a high rank means a higher chance of being a pre-miRNA).

5. Predicting miRNAs targets

Once mature miRNAs were identified by combining MatureBayes predictions and the small RNA-seq reads profiles, their sequences (provided in matures/miRNAs.csv) must be submitted to Diana MR MicroT and miRDB (Custom prediction) for predicting human gene targets.

The prediction files from miRDB and Diana MR Micro T, one per SARS-CoV-2 miRNA, are provided in the targets/miRDB/ and targets/Diana/ directories, respectively. Once you have downloaded them, you can run the following bash scripts for extracting the Ensembl identifiers (ID) and the score for each transcript predicted by Diana as being targeted for the viral miRNAs, with a prediction score of 70 or higher.

sh ../src/extractIDs.sh

After running this script, a new directory targets/Diana/70 will be generated. Inside it, you should obtain the same files that are provided in this repository. Since Diana predicts transcripts and miRDB genes, Ensembl transcripts IDs were mapped to Gene names using the mapping IDs R notebook.

Once you have the Diana files ready (provided here in the targets/Diana/70/ folder, with the suffix gene), the following bash script will help you to combine them with miRDB predictions in order to obtain the set of targets that were predicted by both tools with scores predictions of 70 or higher.

sh ../src/jointargets.sh

After executing the code above, you will obtain eight files called overlap_<mirna>.tab with the set of human gene targets predicted for each mirna of the SARS-CoV-2 miRNA (provided here in the targets/ folder). You can explore the number of targets obtained for each miRNA and generate the Fig2A of [1] by using the targets exploration notebook.

6. Analyzing the down-regulation of the predicted targets

6.1 Differential expression and functional enrichment analyses

Expression data from RNA-seq experiments involving Calu3 cell-cultures infected with SARS-CoV-2 was downloaded from GEO-NCBI GSE148729. The differential expression notebook is provided with all the instructions to identify the set of differentially expressed genes.

After identifying the set of deregulated genes, you will be able to perform and analyze their functional enrichment by means of an overrepresentation analysis with PANTHER. For doing this, the list of deregulated genes should be used as the analyzed list and the list of expressed genes as the reference list. The overrepresentation analysis is conducted by Fisher's exact test with the Bonferroni adjustment for p-value correction. The results for PANTHER GO-slim biological process (BP), PANTHER Pathways and Reactome Pathways annotation datasets are provided in the ‘functional_enrichment/DEgenes’ folder. The overrepresentation analysis R notebook will guide you to obtain the graphical representation of the obtained results for the PANTHER GO-Slim BP set, shown in Fig2C of [1].

6.2 Identifying deregulated targets

Once DE analyses have been conducted, the following bash code is used to find the list of deregulated targets and the set of those potentially being silenced by the predicted viral miRNAs.

sh src/deregulatedTargets.sh

The previous bash script will create the ‘targets/targetsDE’ folder. Then, it will compare the miRNA targets and the RNA-seq differential expression (DE) results previously obtained (stored in the ‘expression_data/DE_SARS-CoV-2_Full_FC1.csv’ file). In the ‘targets/targetsDE’ folder, for each predicted SARS-CoV-2 miRNA, two files (‘overlapDE_overlap_SC2V-mir-.tab’ and ‘overlapDOWN_overlap_SC2V-mir-.tab’) will be generated, with the names of those genes that were found as deregulated and down-regulated for at least one of the differential expression analyses previously performed. The information contained in these files will be summarized, joining the results of all miRNA targets, in the files ‘targetsDE.txt’ and [‘targetsDEDOWN.txt’]](targets/targetsDE/targetsDEDOWN.txt). Finally, the script will also filter the ‘expression_data/DE_SARS-CoV-2_Full_FC1.csv’ file for keeping only those deregulated targets, generating the ‘targets/targetsDE/targetsDEgenes_SARS-CoV-2_Full_FC1.csv’ file.

The number of deregulated targets for each miRNA is explored to generate Fig3A of the manuscript by using the differentially expressed targets exploration notebook.

6.3 Functional characterization of deregulated targets

As for analyzing functional enrichment of deregulated genes, the overrepresentation analysis of deregulated targets is also conducted by using PANTHER. Particularly, in this case, the full set of deregulated genes is used as reference list (‘expression_data/DEgenes_SARS-CoV-2_Full_FC1.txt’) and the list of deregulated targets (‘targets/targetsDE/targetsDE.txt’) as analyzed list. The results for PANTHER GO-slim BP annotation dataset are provided in the ‘functionalEnrichment/DEtargets’ folder. The targets overrepresentation analysis R notebook will help you to obtain the graphical representation of the obtained results, shown in Fig3D of [1].

6.4 Analysis of down-regulated targets

The list of the 28 targets that were identified as down-regulated potentially being silenced by the novel SARS-CoV-2 miRNAs can be functionally characterized by using the functional classification tool of PANTHER. The results for this step, provided here in the ‘functional_enrichment/DEDOWNTargets/PANTHERClassification.txt’, were manually processed and combined with the DE results to generate a summary file, ‘functional_enrichment/DEDOWNtargets/FunctionalCharacterization.csv’.

The exploration of expression changes and functional characterization of down-regulated targets is carried out using the down-regulated targets exploration notebook. Following it, you will be able to obtain Fig3B and Fig3C of [1].

7. Multiple sequence alignment

The candidate sequence found in the previous point were aligned using the BLAST web platform against the following coronaviruses:

  • HKU1 (NCBI Reference Sequence DQ415897.1),
  • OC43 (NCBI Reference Sequence KY983585.1),
  • Bat-CoV RF1 (NCBI Reference Sequence DQ412042.1),
  • Bat-CoV RATG13 (NCBI Reference Sequence MN996532.1),
  • Malayan Pangolin (NCBI Reference Sequence MT072864.1),
  • SARS-CoV (NCBI Reference Sequence AY278741.1),
  • MERS-CoV (NCBI Reference Sequence MN120514.1),
  • Mice coronavirus (NCBI Reference Sequence GB KF294357.1),
  • Ferret coronavirus (NCBI Reference Sequence 1264898).

Select the mature miRNA extracted from the candidate sequence in the previous section. Then, perform the alignments with MUSCLE with ClustalW type output format and default settings.

8. Figures generation

The Figures notebook is provided with all the instructions to generate Figures 2, 3 and 4 of [1].

Fig3

References

[1] G.A. Merino, J. Raad, L.A. Bugnon, C. Yones, L. Kamenetzky, J. Claus, F. Ariel, D.H. Milone, G. Stegmayer, "Novel SARS-CoV-2 encoded small RNAs in the passage to humans," Bioinformatics, 2020, DOI.

[2] C. Yones, N. Macchiaroli, L. Kamenetzky, G. Stegmayer, D.H. Milone, "HextractoR: an R package for automatic extraction of hairpins from genome-wide data", 2020, DOI.

[3] L.A. Bugnon, C. Yones, J. Raad, D.H. Milone, G. Stegmayer, "Genome-wide hairpins datasets of animals and plants for novel miRNA prediction," Data in Brief, 2019, DOI.

[4] C. Yones, G. Stegmayer, L. Kamenetzky, D.H. Milone, "miRNAfe: a comprehensive tool for feature extraction in microRNA prediction," BioSystems, 2015, DOI.

[5] L.A. Bugnon, C. Yones, D.H. Milone, G. Stegmayer, "Genome-wide discovery of pre-miRNAs: comparison of recent approaches based on machine learning," Briefings in Bioinformatics, 2020, DOI.

[6] L.A. Bugnon, C. Yones, D.H. Milone, G. Stegmayer, "Deep neural architectures for highly imbalanced data in bioinformatics," IEEE Transactions on Neural Networks and Learning Systems, 2020, DOI.

[7] C. Yones, J. Raad, L.A. Bugnon, D.H. Milone, G. Stegmayer, "High precision in microRNA prediction: a novel genome-wide approach based on convolutional deep residual networks," 2020, DOI.

About

This is the source code for the paper "Merino G., Raad J., Bugnon L.A., Yones C., Kamenetzky L., Claus J., Ariel F., Milone D.H. and Stegmayer G., Novel SARS-CoV-2 encoded small RNAs in the passage to humans, Bioinformatics, btaa1002, 2020

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published