@nakib103 nakib103 commented Feb 11, 2025

ENSVAR-6567

Extends the ProteinFunction Nextflow pipeline so that it can write results to a SQLite db. It introduces the following params:

params.offline - if set, no Ensembl database connection is made and no data is stored in the MySQL db
params.sqlite - if set, a SQLite db is created with the results (if params.offline is set, params.sqlite is set automatically; otherwise there would be no output)
params.sqlite_dir - directory where the SQLite db is stored. Defaults to params.outdir. The db name is set to ${params.species}_PolyPhen_SIFT.db
params.sqlite_db - full path of the SQLite db, including the file name. An alternative way to set the db name.
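The way these params interact can be sketched in a few lines. This is an illustrative Python sketch of the rules described above, not the pipeline's Nextflow code; the function name `resolve_sqlite_target` is hypothetical, and the precedence of `sqlite_db` over `sqlite_dir` is an assumption.

```python
from pathlib import Path


def resolve_sqlite_target(params):
    """Return the SQLite db path implied by the params, or None if SQLite output is off."""
    offline = bool(params.get("offline"))
    # Offline mode implies SQLite output, since nothing is written to MySQL.
    use_sqlite = bool(params.get("sqlite")) or offline
    if not use_sqlite:
        return None
    # Assumption: an explicit sqlite_db takes precedence over sqlite_dir.
    if params.get("sqlite_db"):
        return Path(params["sqlite_db"])
    # sqlite_dir falls back to outdir; the file name is derived from the species.
    db_dir = Path(params.get("sqlite_dir") or params["outdir"])
    return db_dir / f"{params['species']}_PolyPhen_SIFT.db"


print(resolve_sqlite_target({"offline": 1, "outdir": "temp", "species": "felis_catus"}))
```

With only `offline`, `outdir` and `species` set, the resolved file is `felis_catus_PolyPhen_SIFT.db` inside the output directory.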

Test:

Nextflow command -

nextflow run $ENSEMBL_ROOT_DIR/ensembl-variation/nextflow/ProteinFunction \
-profile $profile \
--sift_run_type FULL \
--outdir $PWD/temp \
--species felis_catus \
--offline 1 \
--gtf <gtf file> \
--fasta <fasta file>

Test the generated SQLite db ($PWD/temp/felis_catus_PolyPhen_SIFT.db) against the predictions stored in the Ensembl database.
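Before comparing values, it can help to confirm what the generated file actually contains. The following is a minimal Python sketch using only the standard `sqlite3` module; the helper name `describe_tables` is hypothetical, and nothing here assumes a particular schema.

```python
import sqlite3


def describe_tables(db_path):
    """Map each table name in the SQLite file to its list of column names."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            schema[table] = [
                row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')
            ]
        return schema
    finally:
        conn.close()
```

Calling `describe_tables("/some/path/felis_catus_PolyPhen_SIFT.db")` returns a dict of table names to column names. Note that `sqlite3.connect` creates an empty file if the path does not exist, so an empty result usually means the path is wrong.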

Test script
use DBI;
use Data::Dumper;
use Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix;
use Bio::EnsEMBL::Registry;

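my $sqlite_db = "/some/path/felis_catus_PolyPhen_SIFT.db";

# RaiseError makes DBI die on connection or query failures instead of failing silently
my $dbh = DBI->connect("dbi:SQLite:dbname=$sqlite_db", "", "", { RaiseError => 1 });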

my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
    -host => 'HOST',
    -user => 'USER',
    -port => PORT,
    -pass => '',
    -db_version   => 110
);
$registry->add_alias('felis_catus', 'cat');

my $vdb = $registry->get_DBAdaptor('cat', 'variation');
my $pfpma = $vdb->get_ProteinFunctionPredictionMatrixAdaptor();

my $ITER = 5;
my $sth = $dbh->prepare("SELECT * from predictions LIMIT $ITER;");
$sth->execute();

# Iterate over the rows actually returned, so the loop stops cleanly if the table holds fewer than $ITER rows
while (my $row = $sth->fetchrow_arrayref) {
    my $pfpm = Bio::EnsEMBL::Variation::ProteinFunctionPredictionMatrix->new(
        -analysis     => 'sift',
        -matrix       => $row->[2],
    );

    my $pfpm_test = $pfpma->fetch_sift_predictions_by_translation_md5($row->[0]);


    my $AAs = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y'];
    # keys on a bare reference is not supported in recent Perl versions; dereference explicitly
    my $POS = scalar keys %{ $pfpm->deserialize };
    foreach my $aa (@{ $AAs }) {
        foreach my $pos (1..$POS) {
            my ($pred, $score) = $pfpm->prediction_from_matrix($pos, $aa);
            my ($pred_test, $score_test) = $pfpm_test->prediction_from_matrix($pos, $aa);

            # Report the cell if either the prediction or the score differs
            if ($pred ne $pred_test || $score ne $score_test) {
                my $pred_match = $pred eq $pred_test ? "pred_match" : "pred_nomatch";
                print "$pos - $aa; $pred_match ; ", $score - $score_test, "\n";
            }
        }
    }
}
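The comparison at the heart of the script can be summarised in a language-neutral way. This is a Python sketch, assuming the intent is to flag a cell when either the prediction or the score differs; the function name `find_mismatches` and the dict-based matrix representation are hypothetical and are not part of the Ensembl API.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def find_mismatches(generated, reference, n_positions):
    """Yield (pos, aa, pred_match, score_diff) for every cell where the matrices differ.

    Each matrix is a dict mapping (pos, aa) -> (prediction, score).
    """
    for aa in AMINO_ACIDS:
        for pos in range(1, n_positions + 1):
            pred, score = generated.get((pos, aa), (None, None))
            pred_ref, score_ref = reference.get((pos, aa), (None, None))
            # Flag the cell when either value differs, not only when both do.
            if pred != pred_ref or score != score_ref:
                if score is None or score_ref is None:
                    diff = None  # one side has no score, so no difference can be computed
                else:
                    diff = score - score_ref
                yield pos, aa, pred == pred_ref, diff
```

An empty result over all positions means the generated matrix agrees with the reference for every amino acid.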

if ( params.sift_run_type != "NONE" ) {
  errors = errors.concat(run_sift_pipeline( translated, sqlite_db_prep ))
} else {
  sift_run = "done"

@likhitha-surapaneni likhitha-surapaneni Mar 11, 2025

Hi @nakib103, we may need to define sift_run when params.sift_run_type != "NONE".

@nakib103 (Contributor, Author)

Thanks! It has been fixed.
