Megabyte edited this page Jun 9, 2023 · 10 revisions

Welcome to the Phylogenetic-tree-study wiki!

We are trying to figure out the similarities and differences between microbial sequences, specifically the 16S ribosomal ribonucleic acid (rRNA) region. This region is very important in bacteria because it contains both conserved and variable regions. It is approximately 1500 nucleotides long, but the length can vary.

In the past, @Shuyib, who was doing the study, was mostly interested in comparing sequences using conventional methods: character-based methods and distance-based methods. These are commonly used in bioinformatics and genomic data science. However, @Shuyib noticed that a machine learning method called hierarchical clustering could be substituted to help estimate the phylogenetic tree. In addition, we noticed that the resulting tree came close to matching the tree of life, although only for the few sequences we had.
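To make the hierarchical clustering idea concrete, here is a minimal sketch in plain Python. It is not the code used in the study: it assumes the sequences are already aligned to the same length, uses the proportion of differing sites (p-distance) as the distance, and merges clusters by average linkage. The names `p_distance` and `average_linkage` are hypothetical helpers introduced for illustration.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def average_linkage(sequences: dict) -> object:
    """Cluster named, aligned sequences bottom-up with average linkage.

    Returns the tree as nested tuples of sequence names.
    """
    if not sequences:
        raise ValueError("at least one sequence is required")

    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in sequences]

    def cluster_distance(c1, c2) -> float:
        # Average pairwise distance between the members of two clusters.
        pairs = [(m, n) for m in c1[1] for n in c2[1]]
        total = sum(p_distance(sequences[m], sequences[n]) for m, n in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Find the two closest clusters and merge them.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)

    return clusters[0][0]


# Toy input (made-up sequences, not data from the study).
toy = {"a": "AAAAAAAA", "b": "AAAAAAAT", "c": "CCCCCCCC"}
tree = average_linkage(toy)
print(tree)
```

In this toy input, `a` and `b` differ at one site out of eight, so they are merged first and `c` joins last. For real 16S data, a library implementation such as SciPy's hierarchical clustering would be the more practical choice; the sketch only shows the mechanics.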

At the moment, we have updated the sequences to meet a constraint: only drug-resistant sequences, preferably obtained from human beings, are used. These have been used to find common metrics, e.g. GC content, GA content and motifs. Surprisingly, the motifs annotated by MEME were of plant ancestors, which is quite intriguing. On the other hand, UniProt gave us more clues that the genes related to the sequences were not annotated yet.

Another objective we'd like to achieve is finding motifs with natural language processing (NLP) based techniques, for example DNABERT, embeddings with visualization, and similarity matching algorithms, e.g. cosine similarity, applied after NLP processing pipelines.
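As a rough illustration of similarity matching with cosine similarity, the sketch below compares two sequences through their k-mer count vectors. This is an assumption made for the example: k-mer counts stand in for the learned embeddings a model such as DNABERT would produce, and `kmer_profile` and `cosine_similarity` are hypothetical helper names.

```python
import math
from collections import Counter


def kmer_profile(sequence: str, k: int = 3) -> Counter:
    """Count every overlapping k-mer in a sequence (a simple stand-in for an embedding)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty profile has no direction; treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy sequences (made up for illustration).
first = kmer_profile("ATGCATGCATGC")
second = kmer_profile("ATGCATGGATGC")
print(round(cosine_similarity(first, second), 3))
```

A value near 1 means the two profiles point in almost the same direction, and 0 means they share no k-mers. With real embeddings the same `cosine_similarity` idea applies, only the vectors come from the model instead of from counts.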

FAQ

Q: What is a phylogenetic tree (aka phylogeny)? A: According to Baum in Nature, this is a diagram that shows lines of evolutionary descent of different species, organisms or genes from a common ancestor. It is useful for organizing knowledge of biological diversity, for structuring classifications, or understanding evolutionary events.

Q: What is antimicrobial resistance? A: This is when a microorganism is no longer sensitive to the action of antibiotics. Antibiotics are compounds produced by the natural metabolic processes of microbes, for instance fungi, that kill or inhibit other microbes. Key people who discovered the first ones and named them are Alexander Fleming and Selman Waksman. Resistant microbes are no longer affected by these antibiotics: they can lose or gain function through mutations ("misspellings") in their genetic material, which lets them bypass the antibiotics' mechanism of action. Some microbes can also share genes and may acquire this resistance via horizontal gene transfer.

Q: What are nitrogenous bases? A: Adenine, guanine, cytosine and thymine (uracil replaces thymine in ribonucleic acid, for example messenger RNA). These are the building blocks of DNA (deoxyribonucleic acid) and of its variant RNA (ribonucleic acid). Notice that the sugar is how we distinguish the two: deoxyribose in DNA and ribose in RNA.

Q: What is 16S rRNA (ribosomal ribonucleic acid)? A: It is a component of the 30S subunit of the prokaryotic ribosome. It consists of about 1500 nucleotides. It possesses conserved and variable regions, which makes it important in phylogenetic studies because the conserved regions evolve slowly. Similar regions of interest in other organisms are the internal transcribed spacer (ITS) for fungi, 18S for microbial eukaryotes, and, for you and me, the 28S and 18S fragments.

Q: What are GC and GA content? A: These are common bioinformatics metrics that can help us work out where a DNA sequence of unknown origin came from. GC content is the proportion of bases in a sequence that are guanine or cytosine; GA content is the corresponding proportion for guanine and adenine. Higher GC content could mean contamination of a sample with a microbe. For most organisms it's about 50 percent, with some exceptions in some areas. Read more.
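Both metrics reduce to counting bases, as the minimal sketch below shows. It assumes GA content means the fraction of positions that are G or A, in the same way that GC content is the fraction that are G or C. The function names are hypothetical and not taken from the project's code.

```python
def base_content(sequence: str, bases: str) -> float:
    """Fraction of positions in `sequence` whose base is one of `bases`."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return sum(1 for base in seq if base in bases) / len(seq)


def gc_content(sequence: str) -> float:
    """Fraction of G and C bases."""
    return base_content(sequence, "GC")


def ga_content(sequence: str) -> float:
    """Fraction of G and A bases (assumed meaning of GA content)."""
    return base_content(sequence, "GA")


# Toy sequence: 3 G, 2 C, 3 A and 2 T out of 10 bases.
example = "ATGCGCGATA"
print(gc_content(example))  # 0.5
print(ga_content(example))  # 0.6
```

Multiply by 100 to report the values as percentages. Ambiguity codes such as N count towards the length but not towards either metric in this sketch.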

Q: What are k-mers? A: A k-mer is a nucleotide subsequence of a fixed length k. Read more here & here. They are important for the read assembly problem, that is, starting from short reads and joining them into longer sequences using graphs.
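Extracting k-mers amounts to sliding a window of length k along the sequence one base at a time. The sketch below is a minimal illustration of that idea; `kmers` and `kmer_counts` are hypothetical helper names.

```python
from collections import Counter


def kmers(sequence: str, k: int) -> list:
    """Return every overlapping k-mer in order of appearance."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence shorter than k yields an empty list.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count how many times each k-mer occurs."""
    return Counter(kmers(sequence, k))


print(kmers("ATGCA", 3))  # ['ATG', 'TGC', 'GCA']
print(kmer_counts("AAAA", 2))  # Counter({'AA': 3})
```

A sequence of length n contains n - k + 1 overlapping k-mers. In graph-based assembly, consecutive k-mers that overlap by k - 1 bases are the pieces that get linked together.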

Q: What are motifs? A: Patterns or regions that are conserved over evolutionary time and are presumed to be functionally or biologically important, for example a region of a protein. N.B. the sequences we are using are especially useful for searching for these regions, which we can then study further.
