Calibrate DNA built profiles #19

Open
Ebedthan opened this issue Mar 23, 2021 · 10 comments

@Ebedthan

Hello,

I am trying to build PSSM profiles for DNA sequences. The profile construction ran smoothly. Now I need to calibrate the profile with a database and I really cannot find a way to do that. Can you please show me a way or a database to use?

P.S. The profile was built with partial bacterial DNA sequences from NCBI.

Thanks in advance.

@smoretti
Member

Hi

To calibrate PSSM profiles built from DNA sequences, we recommend shuffling your partial bacterial DNA sequences with a method such as "windows 20" shuffling, and then using the shuffled sequences as the calibration database.

@Ebedthan
Author

Okay, thanks @smoretti. I have taken the time to read the article by Pagni and Jongeneel, but it is still not clear to me which steps or tools to use to shuffle DNA sequences with a method like "windows 20". Could you please help me with the process for creating such a shuffled database, or point me to relevant resources? I really need it. Thanks in advance.

@smoretti
Member

The distribution provides a script (src/Perl/scramble_fasta.pl) that does this.
It supports several types of shuffling.
For more information, run
perl scramble_fasta.pl -h

The "windows 20" method should be run with
perl scramble_fasta.pl -m window -P 20 a_file_with_all_your_partial_bacterial_DNA_sequences_in_fasta_format
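
To make the idea concrete, here is a minimal Python sketch of window shuffling. It assumes that "window" shuffling means permuting the residues inside consecutive, non-overlapping windows of 20 positions, which preserves the local composition of each window. That interpretation, and the helper names `read_fasta` and `window_shuffle`, are assumptions for illustration; the sketch is not a port of scramble_fasta.pl, and the script remains the reference for the actual calibration database.

```python
import random
from typing import Iterable, Iterator, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def window_shuffle(seq: str, window: int = 20,
                   rng: Optional[random.Random] = None) -> str:
    """Shuffle residues within consecutive non-overlapping windows.

    Assumption: this is what a "window 20" shuffle does; the last
    window may be shorter than `window`.
    """
    rng = rng or random.Random()
    out = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])
        rng.shuffle(block)
        out.append("".join(block))
    return "".join(out)


if __name__ == "__main__":
    import sys
    # Usage (hypothetical file names):
    #   python window_shuffle.py < bacterial_dna.fa > shuffled.fa
    for name, sequence in read_fasta(sys.stdin):
        print(f">{name}")
        print(window_shuffle(sequence, window=20))
```

Because only positions inside each window are permuted, the shuffled sequences keep the length and the local base composition of the originals while destroying the signal the profile is meant to detect.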

@Ebedthan
Author

Great thanks to you for your help! I'm trying it.

@Ebedthan
Author

Thank you again, @smoretti, for the help and for pointing me to the Perl script. I'll explore the other files in the pftools2 package further.

@Ebedthan
Author

Hello @smoretti,

I ran perl scramble_fasta.pl -m window -P 20 bacterial_dna.fa > mywindow20.seq and obtained the database for the profile calibration. However, the calibrated profiles have SCORE=-2147483648 for both cut-off values, and when I run pfscanV3 or pfsearchV3 I get the following error:

Error: Inconsistent alignment found in alignment 3 - no list produced.
       Alignement should be from 1431 to 1!
Thread 0 : Internal error xalip reported no possible alignment for sequence 0(0) (nali=-1)!

This is the first time I have seen a negative SCORE, and I'm trying to work out what I'm doing wrong.

Thanks in advance for the help.

@smoretti
Member

Negative SCOREs are possible, mainly when global (not local) profiles are used.

Your case is trickier.
Such extreme SCORE values look like a memory storage issue:
to optimize speed and memory usage, matches in pftoolsv3 are stored on 32 bits in memory. With very large profiles, that storage is exceeded.

Could you retry with shorter sequences (and profiles)?
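
As a quick sanity check on the reported value: -2147483648 is exactly the smallest number a signed 32-bit integer can represent, which is consistent with the storage explanation above. The short Python sketch below only illustrates the integer ranges involved; the helper `fits_signed` is hypothetical, and the sketch does not establish whether pftools produces this value through an overflow or uses it as a "no score" placeholder.

```python
# Ranges of signed two's-complement integers.
INT16_MIN, INT16_MAX = -(2 ** 15), 2 ** 15 - 1
INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1


def fits_signed(value: int, bits: int) -> bool:
    """Return True if `value` is representable as a signed `bits`-bit integer."""
    low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return low <= value <= high


if __name__ == "__main__":
    reported = -2147483648  # SCORE value quoted in this thread
    print(reported == INT32_MIN)
    print(fits_signed(reported, 32), fits_signed(reported, 16))
```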

@smoretti smoretti reopened this Mar 26, 2021
@Ebedthan
Author

Ebedthan commented Apr 2, 2021

I would like to, but I would lose important gene information. I am already using partial gene sequences that are shorter than the full gene. Is there no other way? Or would it be possible to increase the memory storage for DNA profiles?

@smoretti
Member

Sorry, I missed your message.

In fact, by default profiles should be stored on 16 bits. If you rebuild pftools3 with this option
cmake -DUSE_32BIT_INTEGER=ON
profiles will be stored on 32 bits. That may solve your issue.

If it does not, you can try shorter profiles: split the long profiles and build overlapping ones.
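
One way to plan such a split is to compute overlapping coordinate ranges over the alignment and build one profile per range. The Python sketch below is only an illustration: the function name `overlapping_ranges`, the segment length of 500 and the overlap of 100 are hypothetical choices, and 1431 is borrowed from the error message earlier in this thread purely as an example length.

```python
from typing import List, Tuple


def overlapping_ranges(length: int, segment: int,
                       overlap: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of at most `segment` positions
    that cover [0, length), with consecutive ranges sharing `overlap` positions."""
    if segment <= 0 or not 0 <= overlap < segment:
        raise ValueError("need segment > 0 and 0 <= overlap < segment")
    if length <= 0:
        return []
    step = segment - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + segment, length)
        ranges.append((start, end))
        if end >= length:
            break
        start += step
    return ranges


if __name__ == "__main__":
    # Illustrative values only; choose them to suit your own alignment.
    for start, end in overlapping_ranges(1431, segment=500, overlap=100):
        print(start, end)
```

Each range can then be used to cut the corresponding columns out of the alignment before building a profile, so that a match spanning a boundary is still covered by at least one of the overlapping profiles, provided it is no longer than the overlap.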

@Ebedthan
Author

Ebedthan commented Apr 21, 2021

Thanks for your response. While waiting for it, I started splitting the sequences to build shorter, overlapping profiles, but I have not gotten far yet. I will definitely try both options and see which one leads to meaningful results. I'll let you know.
