This readme contains a description of the steps to obtain the main results of:
G.A. Merino, J. Raad, L.A. Bugnon, C. Yones, L. Kamenetzky, J. Claus, F. Ariel, D.H. Milone, G. Stegmayer, “Novel SARS-CoV-2 encoded small RNAs in the passage to humans,” Bioinformatics, 2020, DOI.
The following sections contain the steps and the libraries used in each of them. All datasets and libraries are open sourced and free to use. If you use any of the following in your pipelines, please cite them properly.
Several intermediate results are provided in this repository, thus you don’t need to run the whole pipeline.
sarscov2-mirna-discovery
│
├── genomes
│ └── NC_045512.2_Wuhan-Hu-1.fasta -> The complete SARS-CoV2 isolate Wuhan-Hu-1 genome.
│
├── sequences
│ ├── sars-cov2_hairpins.fasta -> Hairpin sequences extracted from NC_045512.2_Wuhan-Hu-1.
│ ├── sars-cov2_hairpins.fold -> sars-cov2_hairpins.fasta and its folding structure.
│ ├── pre-miRNAs_virus.fasta -> Hairpin sequences extracted from known virus pre-miRNAs.
│ ├── pre-miRNAs_virus.fold -> pre-miRNAs_virus.fasta and its folding structure.
│ └── unlabeled_hairpins.fold -> Hairpin sequences and its folding structure extracted from
| the human genome, sampled from sequences that do not code
| pre-miRNAs. Provided by external storage.
│
├── pre-miRNAs_features
│ ├── sars-cov2_hairpins.csv -> Features extracted from sars-cov2_hairpins.fasta.
│ ├── pre-miRNAs_virus.csv -> Features extracted from pre-miRNAs_virus.fasta.
│ └── unlabeled_hairpins.csv -> Features extracted from unlabeled_hairpins.fasta.
| Provided by external storage.
│
├── pre-miRNAs_models -> Trained models for pre-miRNA prediction.
│
├── pre-miRNAs_predictions -> Predictions on SARS-CoV2 hairpin sequences by each model.
│
├── src
| ├── train_pre-miRNA_models.ipynb -> Source code to train the pre-miRNAs prediction models.
| ├── predict_pre-miRNAs.ipynb -> Source code to predict pre-miRNAs on SARS-CoV2 sequences.
| ├── DifferentialExpression.Rmd -> Notebook for differential expression analysis.
| ├── deregulatedTargets.sh
| ├── DETargetsExploration.Rmd
| ├── DOWNDETargetsExploration.Rmd
| ├── extractDianaIDs.sh
| ├── joinTargets.sh
| ├── mappingIDs.Rmd
| ├── ORADEGenes.Rmd
| ├── ORADETargets.Rmd
| ├── targetsExploration.Rmd
| └── link_Figs.R -> Notebook to generate manuscript figures
|
├── matures
| └── miRNAs.csv -> Mature miRNAs sequences found.
|
├── targets -> Target prediction files
| ├── miRDB
| ├── Diana
| ├── TargetsDE
| └── overlap_<mirna>.tab -> Target overlap for each miRNA
|
├── expression_data -> Gene expression data
|
└── functional_enrichment -> Functional enrichment results
Note that all the results generated in this section are provided in sequences/
and pre-miRNAs_features/
directories.
SARS-CoV-2 genome was obtained from the NCBI GenBank on March 2020.
These ".fasta" files can be also found in this repository, in the genomes/
directory.
The SARS-CoV-2 genome fasta file is cut in overlapping windows using the HExtractor toolbox.
Install HExtractor
The toolbox uses some external well-known packages. The following software must be installed:
- Vienna RNA: this package is used to fold sequences.
Tested versions: 2.1.0 - 2.1.8, 2.4.18.
- NCBI blast+: this package is used to identify known pre-miRNAs.
Tested versions: 2.2.30, 2.6.0-1.
- Matlab Tested versions: R2014a, R2017a.
To run HExtractor, move to the HExtractor folder and run the following in Matlab:
HExtractor('NC_045512.2_Wuhan-Hu-1.fasta', 'sars-cov-2.yaml', 'pre-miRNAs_virus.fasta');
This function creates the sars-cov2_hairpins.fasta file in the "out" folder and automatically names them.
To model the positive class, well known pre-miRNAs from viruses hairpins were extracted from mirbase. These sequences were folded with RNAfold using the following commands:
RNAfold --noPS --infile=sequences/pre-miRNAs_virus.fasta --outfile=sequences/pre-miRNAs_virus.fold
RNAfold --noPS --infile=sequences/sars-cov2_hairpins.fasta --outfile=sequences/sars-cov2_hairpins.fold
Additional sequences were used to train the mirDNN. A set of 1M hairpin-like sequences from the human genome, which are not pre-miRNAs, were used to model the negative set. The folding structure (unlabeled_hairpins.fold) and features (unlabeled_hairpins.csv) for these sequences can be downloaded from this external repository. You will only need to download the unlabeled_hairpins.fold if you want to train the mirDNN. This data and further details are available in [3].
At this point, the sequence and folding prediction for each hairpin-like sequence in SARS-CoV-2, the known virus pre-miRNAs and some human non-pre-miRNA sequences are ready.
In order to extract meaningful features from the sequences, the miRNAfe package was used [4]. Once installed, you can run the feature extraction with:
miRNAfe('sars-cov2_hairpins.fasta', 'config/Example_prediction.yaml');
miRNAfe('pre-miRNAs_virus.fasta', 'config/Example_prediction.yaml');
The ".yaml" file is provided with the package.
The training notebook is provided with instructions to train the OC-SVM [5], deeSOM [6] and mirDNN [7] classifiers. Doing so may take several hours.
If you want to directly predict the sequences, you can use the trained models that are provided in pre-miRNAs_models/
as explained in the following section.
The prediction notebook is provided with instructions to use the trained models to rank the SARS-CoV-2 sequences (a high rank means a higher chance of being a pre-miRNA).
Once mature miRNAs were identified by combining MatureBayes predictions and the small RNA-seq reads profiles, their sequences (provided in matures/miRNAs.csv
) must be submitted to Diana MR MicroT and miRDB (Custom prediction) for predicting human gene targets.
The prediction files from miRDB and Diana MR Micro T, one per SARS-CoV-2 miRNA, are provided in the targets/miRDB/
and targets/Diana/
directories, respectively. Once you have downloaded them, you can run the following bash scripts for extracting the Ensembl identifiers (ID) and the score for each transcript predicted by Diana as being targeted for the viral miRNAs, with a prediction score of 70 or higher.
sh ../src/extractIDs.sh
After running this script, a new directory targets/Diana/70
will be generated. Inside it, you should obtain the same files that are provided in this repository. Since Diana predicts transcripts and miRDB genes, Ensembl transcripts IDs were mapped to Gene names using the mapping IDs R notebook.
Once you have the Diana files ready (provided here in the targets/Diana/70/
folder, with the suffix gene
), the following bash script will help you to combine them with miRDB predictions in order to obtain the set of targets that were predicted by both tools with scores predictions of 70 or higher.
sh ../src/jointargets.sh
After executing the code above, you will obtain eight files called overlap_<mirna>.tab
with the set of human gene targets predicted for each mirna
of the SARS-CoV-2 miRNA (provided here in the targets/
folder). You can explore the number of targets obtained for each miRNA and generate the Fig2A of [1] by using the targets exploration notebook.
Expression data from RNA-seq experiments involving Calu3 cell-cultures infected with SARS-CoV-2 was downloaded from GEO-NCBI GSE148729. The differential expression notebook is provided with all the instructions to identify the set of differentially expressed genes.
After identifying the set of deregulated genes, you will be able to perform and analyze their functional enrichment by means of an overrepresentation analysis with PANTHER. For doing this, the list of deregulated genes should be used as the analyzed list and the list of expressed genes as the reference list. The overrepresentation analysis is conducted by Fisher's exact test with the Bonferroni adjustment for p-value correction. The results for PANTHER GO-slim biological process (BP), PANTHER Pathways and Reactome Pathways annotation datasets are provided in the ‘functional_enrichment/DEgenes’ folder. The overrepresentation analysis R notebook will guide you to obtain the graphical representation of the obtained results for the PANTHER GO-Slim BP set, shown in Fig2C of [1].
Once DE analyses have been conducted, the following bash code is used to find the list of deregulated targets and the set of those potentially being silenced by the predicted viral miRNAs.
sh src/deregulatedTargets.sh
The previous bash script will create the ‘targets/targetsDE’ folder. Then, it will compare the miRNA targets and the RNA-seq differential expression (DE) results previously obtained (stored in the ‘expression_data/DE_SARS-CoV-2_Full_FC1.csv’ file). In the ‘targets/targetsDE’ folder, for each predicted SARS-CoV-2 miRNA, two files (‘overlapDE_overlap_SC2V-mir-.tab’ and ‘overlapDOWN_overlap_SC2V-mir-.tab’) will be generated, with the names of those genes that were found as deregulated and down-regulated for at least one of the differential expression analyses previously performed. The information contained in these files will be summarized, joining the results of all miRNA targets, in the files ‘targetsDE.txt’ and [‘targetsDEDOWN.txt’]](targets/targetsDE/targetsDEDOWN.txt). Finally, the script will also filter the ‘expression_data/DE_SARS-CoV-2_Full_FC1.csv’ file for keeping only those deregulated targets, generating the ‘targets/targetsDE/targetsDEgenes_SARS-CoV-2_Full_FC1.csv’ file.
The number of deregulated targets for each miRNA is explored to generate Fig3A of the manuscript by using the differentially expressed targets exploration notebook.
As for analyzing functional enrichment of deregulated genes, the overrepresentation analysis of deregulated targets is also conducted by using PANTHER. Particularly, in this case, the full set of deregulated genes is used as reference list (‘expression_data/DEgenes_SARS-CoV-2_Full_FC1.txt’) and the list of deregulated targets (‘targets/targetsDE/targetsDE.txt’) as analyzed list. The results for PANTHER GO-slim BP annotation dataset are provided in the ‘functionalEnrichment/DEtargets’ folder. The targets overrepresentation analysis R notebook will help you to obtain the graphical representation of the obtained results, shown in Fig3D of [1].
The list of the 28 targets that were identified as down-regulated potentially being silenced by the novel SARS-CoV-2 miRNAs can be functionally characterized by using the functional classification tool of PANTHER. The results for this step, provided here in the ‘functional_enrichment/DEDOWNTargets/PANTHERClassification.txt’, were manually processed and combined with the DE results to generate a summary file, ‘functional_enrichment/DEDOWNtargets/FunctionalCharacterization.csv’.
The exploration of expression changes and functional characterization of down-regulated targets is carried out using the down-regulated targets exploration notebook. Following it, you will be able to obtain Fig3B and Fig3C of [1].
The candidate sequence found in the previous point were aligned using the BLAST web platform against the following coronaviruses:
- HKU1 (NCBI Reference Sequence DQ415897.1),
- OC43 (NCBI Reference Sequence KY983585.1),
- Bat-CoV RF1 (NCBI Reference Sequence DQ412042.1),
- Bat-CoV RATG13 (NCBI Reference Sequence MN996532.1),
- Malayan Pangolin (NCBI Reference Sequence MT072864.1),
- SARS-CoV (NCBI Reference Sequence AY278741.1),
- MERS-CoV (NCBI Reference Sequence MN120514.1),
- Mice coronavirus (NCBI Reference Sequence GB KF294357.1),
- Ferret coronavirus (NCBI Reference Sequence 1264898).
Select the mature miRNA extracted from the candidate sequence in the previous section. Then, perform the alignments with MUSCLE with ClustalW type output format and default settings.
The Figures notebook is provided with all the instructions to generate Figures 2, 3 and 4 of [1].
[1] G.A. Merino, J. Raad, L.A. Bugnon, C. Yones, L. Kamenetzky, J. Claus, F. Ariel, D.H. Milone, G. Stegmayer, "Novel SARS-CoV-2 encoded small RNAs in the passage to humans," Bioinformatics, 2020, DOI.
[2] C. Yones, N. Macchiaroli, L. Kamenetzky, G. Stegmayer, D.H. Milone, "HextractoR: an R package for automatic extraction of hairpins from genome-wide data", 2020, DOI.
[3] L.A. Bugnon, C. Yones, J. Raad, D.H. Milone, G. Stegmayer, "Genome-wide hairpins datasets of animals and plants for novel miRNA prediction," Data in Brief, 2019, DOI.
[4] C. Yones, G. Stegmayer, L. Kamenetzky, D.H. Milone, "miRNAfe: a comprehensive tool for feature extraction in microRNA prediction," BioSystems, 2015, DOI.
[5] L.A. Bugnon, C. Yones, D.H. Milone, G. Stegmayer, "Genome-wide discovery of pre-miRNAs: comparison of recent approaches based on machine learning," Briefings in Bioinformatics, 2020, DOI.
[6] L.A. Bugnon, C. Yones, D.H. Milone, G. Stegmayer, "Deep neural architectures for highly imbalanced data in bioinformatics," IEEE Transactions on Neural Networks and Learning Systems, 2020, DOI.
[7] C. Yones, J. Raad, L.A. Bugnon, D.H. Milone, G. Stegmayer, "High precision in microRNA prediction: a novel genome-wide approach based on convolutional deep residual networks," 2020, DOI.