Question about Post-Translational Modifications (PTMs) in Protein Prediction #54

wtni-gidle · 2024-11-15T18:13:45Z

Thanks for providing the AF3 source!
To test AlphaFold 3 using the 7BBV example provided by the AlphaFold Server, I used the following JSON file as input:

{
    "name": "7BBV",
    "modelSeeds": [1],
    "sequences": [
        {
            "protein": {
                "id": "A",
                "sequence": "TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFTKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH",
                "modifications": [
                    {"ptmType": "MAN", "ptmPosition": 22},
                    {"ptmType": "MAN", "ptmPosition": 44},
                    {"ptmType": "MAN", "ptmPosition": 45},
                    {"ptmType": "MAN", "ptmPosition": 46},
                    {"ptmType": "MAN", "ptmPosition": 48},
                    {"ptmType": "MAN", "ptmPosition": 54}
                ]
            }
        },
        {
            "ligand": {
                "id": "B",
                "ccdCodes": ["CA"]
            }
        }

    ],
    "dialect": "alphafold3",
    "version": 1
}

However, during the run_inference stage, the following error occurred:

ValueError: First MSA sequence TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFTKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH is not the query_sequence='TPTPTIQEDGSPALIAKRASVXESCNIGYASTNGGTTGGKGGAXXXVXTLAQFXKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH'

Upon inspecting the error, I observed that:

Modified residues in the query sequence are converted to X.
However, the first MSA retains the original unmodified sequence.
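
The positions that differ can be listed by comparing the two strings from the error message. The sketch below is illustrative only: `mismatch_positions` is a hypothetical helper, not part of AlphaFold 3, and both sequences are truncated to the first 55 residues of the 7BBV example.

```python
def mismatch_positions(msa_first: str, query: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, MSA residue, query residue) for each difference."""
    if len(msa_first) != len(query):
        raise ValueError("sequences must have equal length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(msa_first, query), start=1)
        if a != b
    ]


# First 55 residues of the two sequences quoted in the error message.
msa_first = "TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFTK"
query = "TPTPTIQEDGSPALIAKRASVXESCNIGYASTNGGTTGGKGGAXXXVXTLAQFXK"

diffs = mismatch_positions(msa_first, query)
print([pos for pos, _, _ in diffs])
```

The differing positions are 22, 44, 45, 46, 48 and 54, which are exactly the ptmPosition values in the input JSON.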

Is it necessary for me to modify the input file, or is there a bug in the source code?

Thanks.

wtni-gidle · 2024-11-16T14:39:00Z

I spent some time analyzing the source code and found that:

During the run_inference stage, the query_sequence is generated as follows:

alphafold3/src/alphafold3/model/features.py

Lines 444 to 445 in 2ffe43f

    
    if chain_type in mmcif_names.POLYMER_CHAIN_TYPES:
      sequence = substruct.chain_single_letter_sequence()[b_chain_id]

This relies on struct:

alphafold3/src/alphafold3/model/pipeline/pipeline.py

Line 166 in 2ffe43f

struct = fold_input.to_structure(ccd=ccd)

The sequence stored in struct is built here:

alphafold3/src/alphafold3/common/folding_input.py

Lines 912 to 914 in 2ffe43f

    
    match chain:
      case ProteinChain():
        sequences.append('(' + ')('.join(chain.to_ccd_sequence()) + ')')

to_ccd_sequence replaces the original residues with the CCD codes given in the PTMs:

alphafold3/src/alphafold3/common/folding_input.py

Lines 235 to 236 in 2ffe43f

    
    for ptm_code, ptm_index in self.ptms:
      ccd_coded_seq[ptm_index - 1] = ptm_code

This means that the query_sequence is not exactly the same as the original input. It first replaces the original residues based on the provided modifications. Then, the 'actual' sequence is extracted using the chain_single_letter_sequence function.

In the chain_single_letter_sequence function, the residue_names.CCD_NAME_TO_ONE_LETTER dictionary is used to convert CCD residues to their corresponding one-letter codes.

alphafold3/src/alphafold3/structure/structure.py

Lines 1940 to 1945 in 2ffe43f

    
    chain_res = string_array.remap(
        chain_res,
        mapping=residue_names.CCD_NAME_TO_ONE_LETTER,
        inplace=False,
        default_value=unknown_default,
    )

So, if I understand correctly, if a CCD code from the modifications is present in the CCD_NAME_TO_ONE_LETTER dictionary and the converted residue matches the original residue, there will be no sequence mismatch error. However, if the converted residue does not match the original (i.e. the CCD code is transformed into X, N, or some other residue), this results in the error reported in this issue.
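
The flow described above can be mimicked in a few lines. This is a simplified sketch, not the AlphaFold 3 code: `TOY_CCD_TO_ONE_LETTER` is a tiny stand-in for residue_names.CCD_NAME_TO_ONE_LETTER (the HY3 and P1L entries follow the mapping reported later in this thread, and MAN is deliberately absent), and both helper functions are hypothetical re-implementations.

```python
# Toy stand-in for residue_names.CCD_NAME_TO_ONE_LETTER (assumption: subset only).
TOY_CCD_TO_ONE_LETTER = {
    "ALA": "A", "CYS": "C", "PRO": "P", "SER": "S", "THR": "T",
    "HY3": "P",  # modified proline
    "P1L": "C",  # modified cysteine
}
ONE_TO_THREE = {"A": "ALA", "C": "CYS", "P": "PRO", "S": "SER", "T": "THR"}


def to_ccd_sequence(sequence, ptms):
    """Expand one-letter codes to CCD codes, then overwrite PTM positions."""
    ccd_seq = [ONE_TO_THREE[res] for res in sequence]
    for ptm_code, ptm_index in ptms:  # ptm_index is 1-based
        ccd_seq[ptm_index - 1] = ptm_code
    return ccd_seq


def single_letter_sequence(ccd_seq, unknown_default="X"):
    """Map CCD codes back to one letter; unmapped codes become unknown_default."""
    return "".join(TOY_CCD_TO_ONE_LETTER.get(c, unknown_default) for c in ccd_seq)


original = "APTSC"
# HY3 on a proline and P1L on a cysteine map back to the same letters.
ok = single_letter_sequence(to_ccd_sequence(original, [("HY3", 2), ("P1L", 5)]))
# MAN has no one-letter code, so the threonine at position 3 becomes X.
bad = single_letter_sequence(to_ccd_sequence(original, [("MAN", 3)]))
print(ok == original, bad)
```

In this toy setup the first case round-trips to the original sequence, while the second yields a sequence with X at the modified position, which is the kind of mismatch the ValueError reports.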

I then checked CCD_NAME_TO_ONE_LETTER and found that MAN has no corresponding single-letter code. When I ran the example given in https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#protein, it ran successfully, and its CCD codes do have corresponding single-letter codes in CCD_NAME_TO_ONE_LETTER. These experiments appear to confirm my reasoning.

Therefore, I think there should be a restriction on which CCD codes can be used in PTM (e.g. they must be recorded in CCD_NAME_TO_ONE_LETTER), or the source code should be improved to better handle the identification of query_sequence.
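
One possible shape for such a restriction is a check that runs before inference. The sketch below is only an illustration under stated assumptions: `check_ptms` is a hypothetical function, and the mapping is passed in as an argument (here a two-entry toy subset) rather than imported from AlphaFold 3.

```python
def check_ptms(sequence, ptms, ccd_to_one_letter):
    """Return human-readable problems for PTMs that would alter the query sequence.

    sequence: one-letter protein sequence.
    ptms: iterable of (ccd_code, 1-based position).
    ccd_to_one_letter: mapping such as CCD_NAME_TO_ONE_LETTER.
    """
    problems = []
    for code, pos in ptms:
        if not 1 <= pos <= len(sequence):
            problems.append(f"{code}@{pos}: position outside sequence")
            continue
        letter = ccd_to_one_letter.get(code)
        if letter is None:
            problems.append(f"{code}@{pos}: no one-letter code, would become X")
        elif letter != sequence[pos - 1]:
            problems.append(
                f"{code}@{pos}: maps to {letter}, but sequence has {sequence[pos - 1]}"
            )
    return problems


toy_mapping = {"HY3": "P", "P1L": "C"}  # illustrative subset only
problems = check_ptms("APTSC", [("HY3", 2), ("MAN", 3), ("P1L", 4)], toy_mapping)
print(problems)
```

Here HY3 at position 2 passes, MAN is flagged because it has no one-letter code, and P1L at position 4 is flagged because it maps to C while the sequence has S there.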

Please let me know if I’m wrong.

Thanks.

joshabramson · 2024-11-18T10:02:33Z

"MAN" is a glycan and should be defined as a bonded ligand, see https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#bonds

Please note that converting AlphaFold-Server JSONs containing glycans is not currently supported, see https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#glycans

For PDB examples where a cif already exists, one can create the input json from the cif using from_mmcif in the folding_input class: https://github.com/google-deepmind/alphafold3/blob/main/src/alphafold3/common/folding_input.py#L795C7-L795C17 (we will add this info to the input docs soon)
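
As a rough illustration of the bonded-ligand form, the sketch below builds an input with a single O-linked mannose. It is not a validated 7BBV input: the short placeholder sequence, the ligand ID "C", and the choice of atoms (threonine OG1 bonded to mannose C1) are assumptions made for the example, and the bondedAtomPairs layout assumes the format described in the input documentation linked above.

```python
import json

fold_input = {
    "name": "man_bonded_ligand_sketch",
    "modelSeeds": [1],
    "sequences": [
        # Placeholder sequence (assumption); residue 3 is the glycosylated Thr.
        {"protein": {"id": "A", "sequence": "GSTAG"}},
        # The glycan is its own ligand entity rather than a protein modification.
        {"ligand": {"id": "C", "ccdCodes": ["MAN"]}},
    ],
    # Each bond is a pair of [entity id, 1-based residue index, atom name].
    "bondedAtomPairs": [
        [["A", 3, "OG1"], ["C", 1, "C1"]],
    ],
    "dialect": "alphafold3",
    "version": 1,
}

print(json.dumps(fold_input, indent=2))
```

The key difference from the original input is that MAN no longer appears under "modifications"; the covalent link is expressed through bondedAtomPairs instead.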

joshabramson · 2024-11-18T10:05:19Z

On your follow-up message: thanks for digging into the code!

"I think there should be a restriction on which CCD codes can be used in PTM (e.g. they must be recorded in CCD_NAME_TO_ONE_LETTER), or the source code should be improved to better handle the identification of query_sequence"

Good find - we will look into this.

zhaisilong · 2024-11-28T13:47:35Z

@wtni-gidle I encountered a similar question while working on my modified sequence. Following your guidance, I successfully ran a modified example. Here’s the case:

If you want to apply modifications at positions 2 and 5, changing them to HY3 and P1L for a sequence like "S(N)AD(E)VTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS" (from PDB entry 6DAE, chain C), you might encounter issues. Specifically, this approach would result in an incorrect sequence.

To resolve this, you can refer to the reversed mapping dictionary from CCD_NAME_TO_ONE_LETTER. This dictionary shows that HY3 maps back to P and P1L maps back to C. Thus, the correct input sequence should be "S(P)AD(C)VTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS". When provided in this format, AF3 works correctly.
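
The adjustment described above can be sketched as a small helper that sets each modified position to the letter its PTM code maps back to. `parent_letters_for_ptms` is a hypothetical name, the mapping holds only the two codes mentioned here, and the sequence is shortened to its first six residues.

```python
def parent_letters_for_ptms(sequence, ptms, ccd_to_one_letter):
    """Return the sequence with each PTM position set to the PTM's parent letter."""
    residues = list(sequence)
    for code, pos in ptms:  # pos is 1-based
        residues[pos - 1] = ccd_to_one_letter[code]  # KeyError if code is unmapped
    return "".join(residues)


# Mapping for the two codes discussed above (subset only).
mapping = {"HY3": "P", "P1L": "C"}
fixed = parent_letters_for_ptms("SNADEV", [("HY3", 2), ("P1L", 5)], mapping)
print(fixed)
```

For the shortened sequence this gives SPADCV, matching the start of the corrected sequence above.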

Daniel-Halvey · 2024-11-30T14:13:46Z

I am just looking at this for the first time today. I am confident we will have an elegant solution for the X mutation in run_inference in due time. I am just good at typing creative solutions; my Old Man is an M.D. Let me load up PyCharm.

joshabramson · 2024-12-09T10:57:40Z

Small update: thanks again for reporting this. We are working on a fix, which should land by the end of the week.

shevsea · 2024-12-10T15:36:13Z

If one manually adds an entry to the dictionary (e.g. 'TYC': 'Y'), will the error be fixed?

ackbar03 · 2024-12-15T14:52:00Z

Any luck on this? Thanks for looking into it, by the way; I am also facing this issue.

Augustin-Zidek · 2024-12-17T10:14:21Z

This has been fixed in 44dee65, sorry for the delay!

Please clone the latest commit and rebuild your AlphaFold 3 container to get the fix.

YoavShamir5 · 2024-12-24T18:36:33Z

Hello @Augustin-Zidek, I would appreciate your input - I just tried to predict the structure of a protein with a single PTM (my user-provided CCD) and no additional molecules, leading to the same error message:

File "/alphafold3_venv/lib/python3.11/site-packages/alphafold3/data/msa.py", line 220, in from_a3m
    return cls(
           ^^^^
  File "/alphafold3_venv/lib/python3.11/site-packages/alphafold3/data/msa.py", line 108, in __init__
    raise ValueError(
ValueError: First MSA sequence GKNAPSTAGLGYGSWEIDPKDLTFLKELGTGQFGVVKYGKWRGQYDVAIKMIKEGSMSEDEFIEEAKVMMNLSHEKLVQLYGVCTKQRPIFIITEYMANGCLLNYLREMRHRFQTQQLLEMCKDVCEAMEYLESKQFLHRDLAARNCLVNDQGVVKVSDFGLSRYVLDDEYTSSVGSKFPVRWSPPEVLMYSKFSSKSDIWAFGVLMWEIYSLGKMPYERFTNSETAEHIAQGLRLYRPHLASEKVYTIMYSCWHEKADERPTFKILLSNILDVMDEES is not the query_sequence='GKNAPSTAGLGYGSWEIDPKDLTFLKELGTGQFGVVKYGKWRGQYDVAIKMIKEGSMSEDEFIEEAKVMMNLSHEKLVQLYGVCTKQRPIFIITEYMANGXLLNYLREMRHRFQTQQLLEMCKDVCEAMEYLESKQFLHRDLAARNCLVNDQGVVKVSDFGLSRYVLDDEYTSSVGSKFPVRWSPPEVLMYSKFSSKSDIWAFGVLMWEIYSLGKMPYERFTNSETAEHIAQGLRLYRPHLASEKVYTIMYSCWHEKADERPTFKILLSNILDVMDEES'

I am not sure if there is an issue with my syntax, or if this issue persists. Here is the JSON I provide to my inference-only script, which fails with the above error (I deleted the MSA and template segments here due to size):

{
    "dialect": "alphafold3",
    "version": 1,
    "name": "5P9J",
    "sequences": [
      {
        "protein": {
          "id": "A",
          "sequence": "GKNAPSTAGLGYGSWEIDPKDLTFLKELGTGQFGVVKYGKWRGQYDVAIKMIKEGSMSEDEFIEEAKVMMNLSHEKLVQLYGVCTKQRPIFIITEYMANGCLLNYLREMRHRFQTQQLLEMCKDVCEAMEYLESKQFLHRDLAARNCLVNDQGVVKVSDFGLSRYVLDDEYTSSVGSKFPVRWSPPEVLMYSKFSSKSDIWAFGVLMWEIYSLGKMPYERFTNSETAEHIAQGLRLYRPHLASEKVYTIMYSCWHEKADERPTFKILLSNILDVMDEES",
          "modifications": [ {"ptmType": "M1", "ptmPosition": 101}]
        }
      }
    ],
    "modelSeeds": [
      1
    ],
    "bondedAtomPairs": null,
    "userCCD": "data_M1\n#\n_chem_comp.id          M1\n_chem_comp.name    af3_enrich\n_chem_comp.type    non-polymer\n_chem_comp.formula    ?\n_chem_comp.mon_nstd_parent_comp_id    ?\n_chem_comp.pdbx_synonyms    ?\n_chem_comp.formula_weight    ?\n_chem_comp.pdbx_smiles '[H]OC(=O)C([H])(N([H])[H])C([H])([H])SC([H])([H])C([H])([H])C(=O)N1C([H])([H])C([H])([H])C([H])([H])[C@@]([H])(n2nc(-c3c([H])c([H])c(Oc4c([H])c([H])c([H])c([H])c4[H])c([H])c3[H])c3c(N([H])[H])nc([H])nc32)C1([H])[H]'\n#\nloop_\n_chem_comp_atom.atom_id\n_chem_comp_atom.charge\n_chem_comp_atom.pdbx_leaving_atom_flag\n_chem_comp_atom.comp_id\n_chem_comp_atom.pdbx_model_Cartn_x_ideal\n_chem_comp_atom.pdbx_model_Cartn_y_ideal\n_chem_comp_atom.pdbx_model_Cartn_z_ideal\n_chem_comp_atom.type_symbol\nC1 0 N M1 -4.963 -0.007 1.670 C\nCov 0 N M1 -6.443 0.329 1.834 C\nSG 0 N M1 -7.505 -0.328 0.492 S\nCB 0 N M1 -7.027 0.750 -0.914 C\nCA 0 N M1 -7.374 2.235 -0.729 C\nC 0 N M1 -7.097 2.949 -2.052 C\nOXT 0 N M1 -5.792 3.234 -2.234 O\nO 0 N M1 -7.915 3.223 -2.918 O\nN 0 N M1 -8.782 2.460 -0.320 N\nC2 0 N M1 -4.621 -1.440 2.057 C\nO1 0 N M1 -5.447 -2.169 2.607 O\nN1 0 N M1 -3.312 -1.871 1.810 N\nC3 0 N M1 -3.074 -3.312 1.938 C\nC4 0 N M1 -2.382 -1.142 0.932 C\nC5 0 N M1 -1.615 -3.647 2.227 C\nC6 0 N M1 -0.908 -1.473 1.216 C\nC7 0 N M1 -0.678 -2.987 1.220 C\nN2 0 N M1 0.048 -0.795 0.350 N\nC8 0 N M1 -0.052 -0.526 -0.987 C\nN3 0 N M1 1.225 -0.399 0.884 N\nC9 0 N M1 1.135 0.094 -1.347 C\nN4 0 N M1 -1.092 -0.795 -1.796 N\nC10 0 N M1 1.918 0.101 -0.154 C\nC11 0 N M1 1.241 0.519 -2.678 C\nC12 0 N M1 -0.866 -0.373 -3.050 C\nC13 0 N M1 3.293 0.561 0.036 C\nN5 0 N M1 2.315 1.220 -3.208 N\nN6 0 N M1 0.217 0.271 -3.521 N\nC14 0 N M1 3.624 1.347 1.149 C\nC15 0 N M1 4.302 0.205 -0.870 C\nC16 0 N M1 4.937 1.788 1.338 C\nC17 0 N M1 5.614 0.648 -0.684 C\nC18 0 N M1 5.935 1.429 0.430 C\nO2 0 N M1 7.191 1.938 0.673 O\nC19 0 N M1 8.270 1.160 0.322 C\nC20 0 N M1 9.208 1.710 -0.550 C\nC21 0 N M1 8.471 -0.107 0.872 C\nC22 0 N M1 10.335 
0.969 -0.910 C\nC23 0 N M1 9.599 -0.846 0.512 C\nC24 0 N M1 10.527 -0.309 -0.383 C\nH1 0 N M1 -4.391 0.641 2.345 H\nH2 0 N M1 -4.639 0.198 0.647 H\nH3 0 N M1 -6.579 1.413 1.884 H\nH4 0 N M1 -6.828 -0.072 2.777 H\nH5 0 N M1 -5.960 0.622 -1.116 H\nH6 0 N M1 -7.551 0.356 -1.793 H\nH7 0 N M1 -6.737 2.695 0.034 H\nH8 0 N M1 -5.777 3.693 -3.102 H\nH9 0 N M1 -9.059 1.693 0.297 H\nH10 0 N M1 -9.379 2.361 -1.147 H\nH11 0 N M1 -3.697 -3.735 2.732 H\nH12 0 N M1 -3.389 -3.777 0.995 H\nH13 0 N M1 -2.679 -1.407 -0.087 H\nH14 0 N M1 -2.523 -0.063 1.044 H\nH15 0 N M1 -1.360 -3.301 3.236 H\nH16 0 N M1 -1.473 -4.733 2.214 H\nH17 0 N M1 -0.692 -1.100 2.229 H\nH18 0 N M1 -0.848 -3.404 0.221 H\nH19 0 N M1 0.361 -3.215 1.486 H\nH20 0 N M1 -1.654 -0.567 -3.770 H\nH21 0 N M1 2.048 1.670 -4.076 H\nH22 0 N M1 2.809 1.808 -2.549 H\nH23 0 N M1 2.861 1.617 1.877 H\nH24 0 N M1 4.080 -0.424 -1.731 H\nH25 0 N M1 5.182 2.403 2.201 H\nH26 0 N M1 6.372 0.376 -1.414 H\nH27 0 N M1 9.057 2.707 -0.953 H\nH28 0 N M1 7.758 -0.521 1.581 H\nH29 0 N M1 11.063 1.389 -1.598 H\nH30 0 N M1 9.757 -1.836 0.933 H\nH31 0 N M1 11.406 -0.884 -0.662 H\n#\nloop_\n_chem_comp_bond.atom_id_1\n_chem_comp_bond.atom_id_2\n_chem_comp_bond.comp_id\n_chem_comp_bond.pdbx_aromatic_flag\n_chem_comp_bond.pdbx_stereo_config\n_chem_comp_bond.value_order\nC1  Cov M1 N N SING\nCov SG  M1 N N SING\nSG  CB  M1 N N SING\nCB  CA  M1 N N SING\nCA  C   M1 N N SING\nC   OXT M1 N N SING\nC   O   M1 N N DOUB\nCA  N   M1 N N SING\nC1  C2  M1 N N SING\nC2  O1  M1 N N DOUB\nC2  N1  M1 N N SING\nN1  C3  M1 N N SING\nC4  N1  M1 N N SING\nC3  C5  M1 N N SING\nC6  C4  M1 N N SING\nC5  C7  M1 N N SING\nC7  C6  M1 N N SING\nC6  N2  M1 N N SING\nN2  C8  M1 Y N SING\nN3  N2  M1 Y N SING\nC8  C9  M1 Y N DOUB\nN4  C8  M1 Y N SING\nC10 N3  M1 Y N DOUB\nC9  C11 M1 Y N SING\nC9  C10 M1 Y N SING\nC12 N4  M1 Y N DOUB\nC10 C13 M1 N N SING\nC11 N5  M1 N N SING\nC11 N6  M1 Y N DOUB\nN6  C12 M1 Y N SING\nC13 C14 M1 Y N DOUB\nC15 C13 M1 Y N SING\nC14 C16 M1 Y N 
SING\nC17 C15 M1 Y N DOUB\nC16 C18 M1 Y N DOUB\nC18 C17 M1 Y N SING\nC18 O2  M1 N N SING\nO2  C19 M1 N N SING\nC19 C20 M1 Y N DOUB\nC21 C19 M1 Y N SING\nC20 C22 M1 Y N SING\nC23 C21 M1 Y N DOUB\nC22 C24 M1 Y N DOUB\nC24 C23 M1 Y N SING\nC1  H1  M1 N N SING\nC1  H2  M1 N N SING\nCov H3  M1 N N SING\nCov H4  M1 N N SING\nCB  H5  M1 N N SING\nCB  H6  M1 N N SING\nCA  H7  M1 N N SING\nOXT H8  M1 N N SING\nN   H9  M1 N N SING\nN   H10 M1 N N SING\nC3  H11 M1 N N SING\nC3  H12 M1 N N SING\nC4  H13 M1 N N SING\nC4  H14 M1 N N SING\nC5  H15 M1 N N SING\nC5  H16 M1 N N SING\nC6  H17 M1 N N SING\nC7  H18 M1 N N SING\nC7  H19 M1 N N SING\nC12 H20 M1 N N SING\nN5  H21 M1 N N SING\nN5  H22 M1 N N SING\nC14 H23 M1 N N SING\nC15 H24 M1 N N SING\nC16 H25 M1 N N SING\nC17 H26 M1 N N SING\nC20 H27 M1 N N SING\nC21 H28 M1 N N SING\nC22 H29 M1 N N SING\nC23 H30 M1 N N SING\nC24 H31 M1 N N SING\n#\n"
  }

Augustin-Zidek · 2025-01-06T11:36:17Z

@YoavShamir5 thanks for reporting, this is a bug I will fix.

The workaround for now is to change the query sequence in position 101 from C to X.
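
A minimal sketch of that workaround, applied to the input JSON loaded as a Python dict, is shown below. `mask_ptm_positions` is a hypothetical helper rather than an AlphaFold 3 function, and the example uses a seven-residue fragment with the cysteine at position 5 instead of the full sequence.

```python
import copy


def mask_ptm_positions(fold_input):
    """Return a copy of the input dict with every PTM position set to 'X'."""
    patched = copy.deepcopy(fold_input)
    for entity in patched["sequences"]:
        protein = entity.get("protein")
        if not protein:
            continue
        residues = list(protein["sequence"])
        for mod in protein.get("modifications", []):
            residues[mod["ptmPosition"] - 1] = "X"  # ptmPosition is 1-based
        protein["sequence"] = "".join(residues)
    return patched


example = {
    "sequences": [
        {"protein": {"id": "A", "sequence": "MANGCLL",
                     "modifications": [{"ptmType": "M1", "ptmPosition": 5}]}}
    ]
}
masked = mask_ptm_positions(example)
print(masked["sequences"][0]["protein"]["sequence"])
```

The original dict is left untouched, so the unmodified sequence remains available for comparison.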

alephreish · 2025-01-21T10:41:19Z

@Augustin-Zidek This does not sound like a good workaround, as the substitution to 'X' has to be made before the search phase and will thus influence the scoring. The better alternatives are either to change the residue to 'X' in the alignments (the search phase has to be done as a separate call) or to name the PTM with one of the CCD names in CCD_NAME_TO_ONE_LETTER that maps to the amino acid you are modifying. To be entirely honest, though, it feels like a very straightforward bug to fix.
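
The first alternative could look roughly like the sketch below: run the search on the unmodified sequence, then set the modified position to 'X' only in the query row of the resulting A3M text (alongside the same change in the input sequence). `mask_query_row_in_a3m` is a hypothetical helper, it assumes the query occupies a single line directly after the first header, and whether this fully avoids the error in a given AlphaFold 3 version is not confirmed here.

```python
def mask_query_row_in_a3m(a3m, positions):
    """Set the given 1-based positions to 'X' in the first A3M sequence only."""
    lines = a3m.splitlines()
    # Assumes line 0 is the query header and line 1 holds the whole query sequence.
    if len(lines) < 2 or not lines[0].startswith(">"):
        raise ValueError("unexpected A3M layout")
    query = list(lines[1])
    for pos in positions:
        query[pos - 1] = "X"
    lines[1] = "".join(query)
    return "\n".join(lines) + "\n"


a3m = ">query\nMANGCLL\n>hit_1\nMANGCLI\n"
patched_a3m = mask_query_row_in_a3m(a3m, [5])
print(patched_a3m)
```

Only the query row changes; the hits found by the search keep their original residues, so the search itself is not influenced by the 'X'.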

Augustin-Zidek added the question and bug labels Nov 18, 2024

Augustin-Zidek mentioned this issue Nov 29, 2024

First MSA sequence is not the {query_sequence=}' #132

Closed

This was referenced Dec 8, 2024

Error occurred when modeling post-translational modification #192

Closed

Covalent ligands: (non)leaving atoms and unrealistic bond lengths #159

Closed

snufoodbiochem mentioned this issue Dec 11, 2024

Json file for Protein lactylation snufoodbiochem/Alphafold3_tools#2

Closed

Augustin-Zidek closed this as completed Dec 17, 2024

Augustin-Zidek mentioned this issue Dec 17, 2024

Question: Is AF3 temporarily not supported for input involving glycosylation modifications? #216

Closed

google-deepmind deleted a comment from 408404405501 Jan 6, 2025

Augustin-Zidek reopened this Jan 21, 2025
