Question about Post-Translational Modifications (PTMs) in Protein Prediction #54

Open
wtni-gidle opened this issue Nov 15, 2024 · 12 comments
Labels
bug Something isn't working question Further information is requested

Comments

@wtni-gidle

Thanks for providing the AF3 source!
To test AlphaFold3 using the example 7BBV provided by AlphaFold3 server, I used the following JSON file as input:

{
    "name": "7BBV",
    "modelSeeds": [1],
    "sequences": [
        {
            "protein": {
                "id": "A",
                "sequence": "TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFTKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH",
                "modifications": [
                    {"ptmType": "MAN", "ptmPosition": 22},
                    {"ptmType": "MAN", "ptmPosition": 44},
                    {"ptmType": "MAN", "ptmPosition": 45},
                    {"ptmType": "MAN", "ptmPosition": 46},
                    {"ptmType": "MAN", "ptmPosition": 48},
                    {"ptmType": "MAN", "ptmPosition": 54}
                ]
            }
        },
        {
            "ligand": {
                "id": "B",
                "ccdCodes": ["CA"]
            }
        }

    ],
    "dialect": "alphafold3",
    "version": 1
}

However, during the run_inference stage, the following error occurred:

ValueError: First MSA sequence TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFTKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH is not the query_sequence='TPTPTIQEDGSPALIAKRASVXESCNIGYASTNGGTTGGKGGAXXXVXTLAQFXKAAESSGKLNIVVKGKISGGAKVRVQSDKTIIGQKGSELVGTGLYINKVKNVIVRNMKISKVKDSNGDAIGIQASKNVWVDHCDLSSDLKSGKDYYDGLLDITHGSDWVTVSNTFLHDHFKASLIGHTDSNAKEDKGKLHVTYANNYWYNVNSRNPSVRFGTVHIYNNYYLEVGSSAVNTRMGAQVRVESTVFDKSTKNGIISVDSKEKGYATVGDISWGSSTNTAPKGTLGSSNIPYSYNLYGKNNVKARVYGTAGQTLGFAAASFLEQKLISEEDLNSAVDHHHHHH'

Upon inspecting the error, I observed that:

  1. Modified residues in the query sequence are converted to X.
  2. However, the first MSA retains the original unmodified sequence.

Is it necessary for me to modify the input file, or is there a bug in the source code?

Thanks.

@wtni-gidle
Author

I spent some time analyzing the source code and found that:

During the run_inference stage, the query_sequence is generated as follows:

if chain_type in mmcif_names.POLYMER_CHAIN_TYPES:
    sequence = substruct.chain_single_letter_sequence()[b_chain_id]

This relies on struct:

struct = fold_input.to_structure(ccd=ccd)

The sequence stored in struct is built as follows:

match chain:
    case ProteinChain():
        sequences.append('(' + ')('.join(chain.to_ccd_sequence()) + ')')

to_ccd_sequence replaces the original residues with the CCD codes from the PTMs:

for ptm_code, ptm_index in self.ptms:
    ccd_coded_seq[ptm_index - 1] = ptm_code

This means that the query_sequence is not exactly the same as the original input. It first replaces the original residues based on the provided modifications. Then, the 'actual' sequence is extracted using the chain_single_letter_sequence function.

In the chain_single_letter_sequence function, the residue_names.CCD_NAME_TO_ONE_LETTER dictionary is used to convert CCD residues to their corresponding one-letter codes.

chain_res = string_array.remap(
    chain_res,
    mapping=residue_names.CCD_NAME_TO_ONE_LETTER,
    inplace=False,
    default_value=unknown_default,
)

So, if I understand correctly, if a CCD code from the modifications is present in the CCD_NAME_TO_ONE_LETTER dictionary and the converted residue matches the original residue, there will be no sequence mismatch error. However, if the converted residue does not match the original (i.e. CCD code is transformed into X, N, or some other residue), this would result in the same error as mentioned in the issue.
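
The behaviour described above can be reproduced with a small stand-alone sketch. It does not import AlphaFold 3: `CCD_TO_ONE_LETTER` is a hand-written stand-in for `residue_names.CCD_NAME_TO_ONE_LETTER` with only a few entries, and `derive_query_sequence` is a hypothetical helper that mimics the substitute-then-remap steps. The `HY3` and `P1L` entries follow the mapping reported later in this thread, and `X` as the default follows the error message above.

```python
# Stand-in for residue_names.CCD_NAME_TO_ONE_LETTER (assumption: tiny subset,
# the real dictionary is much larger).
CCD_TO_ONE_LETTER = {
    "SER": "S", "PRO": "P", "ALA": "A", "ASP": "D", "CYS": "C",
    "VAL": "V", "THR": "T", "GLY": "G", "LYS": "K",
    "HY3": "P",  # modified residue that maps back to proline
    "P1L": "C",  # modified residue that maps back to cysteine
    # "MAN" is deliberately absent, as observed in this thread.
}
ONE_LETTER_TO_CCD = {
    "S": "SER", "P": "PRO", "A": "ALA", "D": "ASP", "C": "CYS",
    "V": "VAL", "T": "THR", "G": "GLY", "K": "LYS",
}


def derive_query_sequence(sequence, ptms, unknown="X"):
    """Swap in PTM CCD codes (1-based positions), then remap to one letter."""
    ccd_seq = [ONE_LETTER_TO_CCD[res] for res in sequence]
    for ptm_code, ptm_index in ptms:
        ccd_seq[ptm_index - 1] = ptm_code
    return "".join(CCD_TO_ONE_LETTER.get(code, unknown) for code in ccd_seq)


original = "SPADCVTVGK"
matching = derive_query_sequence(original, [("HY3", 2), ("P1L", 5)])
mismatching = derive_query_sequence(original, [("MAN", 7)])

print(matching == original)     # codes map back to the parent residues
print(mismatching == original)  # MAN has no one-letter code, so T7 becomes X
```

In this toy setup the first case round-trips to the input sequence, while the second differs from it at the modified position, which is the condition that triggers the ValueError.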

Then I further checked CCD_NAME_TO_ONE_LETTER and found that there is no corresponding single letter for MAN. When I ran the example given in https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#protein, it succeeded, and the CCD codes used there do have corresponding single letters in CCD_NAME_TO_ONE_LETTER. These experiments are consistent with my analysis.

Therefore, I think there should be a restriction on which CCD codes can be used in PTM (e.g. they must be recorded in CCD_NAME_TO_ONE_LETTER), or the source code should be improved to better handle the identification of query_sequence.

Please let me know if I’m wrong.

Thanks.

@joshabramson
Collaborator

"MAN" is a glycan and should be defined as a bonded ligand, see https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#bonds

Please note that converting AlphaFold-Server JSONs containing glycans is not currently supported, see https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md#glycans
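
As a rough illustration of the bonded-ligand form, the sketch below builds one MAN ligand entry per glycosylated position together with a matching `bondedAtomPairs` entry. It is an example under stated assumptions, not a verified input: the ligand IDs `C` to `H`, the helper name `glycans_as_bonded_ligands`, and the choice of atoms (Thr `OG1` or Ser `OG` bonded to `C1` of MAN) are illustrative, and only the first 54 residues of the 7BBV sequence are used to keep it short. The field layout follows the input documentation linked above.

```python
import json

# Side-chain oxygen assumed to carry an O-linked glycan (standard atom names).
LINK_ATOM = {"S": "OG", "T": "OG1"}


def glycans_as_bonded_ligands(sequence, positions, chain_id, ligand_ids, ccd="MAN"):
    """Return (ligand entries, bondedAtomPairs) for single-residue glycans."""
    if len(positions) != len(ligand_ids):
        raise ValueError("need exactly one ligand id per glycosylated position")
    ligands, bonds = [], []
    for pos, lig_id in zip(positions, ligand_ids):
        residue = sequence[pos - 1]  # positions are 1-based
        if residue not in LINK_ATOM:
            raise ValueError(f"position {pos} is {residue}, expected S or T")
        ligands.append({"ligand": {"id": lig_id, "ccdCodes": [ccd]}})
        # Assumption: the glycan's anomeric carbon C1 bonds to the side chain.
        bonds.append([[chain_id, pos, LINK_ATOM[residue]], [lig_id, 1, "C1"]])
    return ligands, bonds


# First 54 residues of the 7BBV chain from this issue (truncated for brevity).
seq = "TPTPTIQEDGSPALIAKRASVTESCNIGYASTNGGTTGGKGGATTTVSTLAQFT"
ligands, bonds = glycans_as_bonded_ligands(
    seq, [22, 44, 45, 46, 48, 54], "A", ["C", "D", "E", "F", "G", "H"]
)
print(json.dumps({"bondedAtomPairs": bonds}, indent=2))
```

With this form the protein entry would carry no `modifications` for these positions; the ligand entries are appended to `sequences` and the bonds go into the top-level `bondedAtomPairs` field.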

For PDB examples where a cif already exists, one can create the input json from the cif using from_mmcif in the folding_input class: https://github.com/google-deepmind/alphafold3/blob/main/src/alphafold3/common/folding_input.py#L795C7-L795C17 (we will add this info to the input docs soon)

@joshabramson
Collaborator

On your follow-up message: thanks for digging into the code!

I think there should be a restriction on which CCD codes can be used in PTM (e.g. they must be recorded in CCD_NAME_TO_ONE_LETTER), or the source code should be improved to better handle the identification of query_sequence

Good find - we will look into this.

@Augustin-Zidek Augustin-Zidek added question Further information is requested bug Something isn't working labels Nov 18, 2024
@zhaisilong

zhaisilong commented Nov 28, 2024

@wtni-gidle I encountered a similar question while working on my modified sequence. Following your guidance, I successfully ran a modified example. Here’s the case:

If you want to apply modifications at positions 2 and 5, changing them to HY3 and P1L for a sequence like "S(N)AD(E)VTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS" (from PDB entry 6DAE, chain C), you might encounter issues. Specifically, this approach would result in an incorrect sequence.

To resolve this, you can refer to the reversed mapping dictionary from CCD_NAME_TO_ONE_LETTER. This dictionary shows that HY3 maps back to P and P1L maps back to C. Thus, the correct input sequence should be "S(P)AD(C)VTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS". When provided in this format, AF3 works correctly.
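
A check like this can be automated before running inference. The following is a minimal sketch, assuming a mapping shaped like `CCD_NAME_TO_ONE_LETTER`; the helper `check_ptm_parents` is hypothetical and the two-entry `mapping` only contains the codes mentioned in this comment.

```python
def check_ptm_parents(sequence, ptms, ccd_to_one_letter):
    """Return a list of problems; an empty list means the input is consistent."""
    problems = []
    for ptm_code, position in ptms:  # position is 1-based
        parent = ccd_to_one_letter.get(ptm_code)
        actual = sequence[position - 1]
        if parent is None:
            problems.append(f"{ptm_code}@{position}: no one-letter code, would become X")
        elif parent != actual:
            problems.append(f"{ptm_code}@{position}: expects {parent}, sequence has {actual}")
    return problems


# Only the two codes discussed above (assumption: values as reported here).
mapping = {"HY3": "P", "P1L": "C"}
ptms = [("HY3", 2), ("P1L", 5)]
tail = "VTVGKFYATFLIQEYFRKFKKRKEQGLVGKPS"

problems_before = check_ptm_parents("SNADE" + tail, ptms, mapping)
problems_after = check_ptm_parents("SPADC" + tail, ptms, mapping)
for line in problems_before:
    print(line)
print(problems_after)
```

The first call flags both positions, and the second returns an empty list because the sequence already carries the parent residues P and C.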

@Daniel-Halvey

I am just looking at this for the first time today. I am confident we will have an elegant solution for the X mutation in run_inference in due time. I am just good at typing creative solutions, my Old Man is M.D.. Let me load up PyCharm.

@joshabramson
Collaborator

Small update: thanks again for reporting this. We are working on a fix, which should land by the end of the week.

@shevsea

shevsea commented Dec 10, 2024

If one manually adds an entry to the dictionary (e.g. 'TYC': 'Y'), will the error be fixed?

@ackbar03

Any luck on this? Thanks for looking at this btw, I am also facing this issue.

@Augustin-Zidek
Collaborator

This has been fixed in 44dee65, sorry for the delay!

Please clone the latest commit and rebuild your AlphaFold 3 container to get the fix.

@YoavShamir5

YoavShamir5 commented Dec 24, 2024

Hello @Augustin-Zidek, I would appreciate your input - I just tried to predict the structure of a protein with a single PTM (my user-provided CCD) and no additional molecules, leading to the same error message:

File "/alphafold3_venv/lib/python3.11/site-packages/alphafold3/data/msa.py", line 220, in from_a3m
    return cls(
           ^^^^
  File "/alphafold3_venv/lib/python3.11/site-packages/alphafold3/data/msa.py", line 108, in __init__
    raise ValueError(
ValueError: First MSA sequence GKNAPSTAGLGYGSWEIDPKDLTFLKELGTGQFGVVKYGKWRGQYDVAIKMIKEGSMSEDEFIEEAKVMMNLSHEKLVQLYGVCTKQRPIFIITEYMANGCLLNYLREMRHRFQTQQLLEMCKDVCEAMEYLESKQFLHRDLAARNCLVNDQGVVKVSDFGLSRYVLDDEYTSSVGSKFPVRWSPPEVLMYSKFSSKSDIWAFGVLMWEIYSLGKMPYERFTNSETAEHIAQGLRLYRPHLASEKVYTIMYSCWHEKADERPTFKILLSNILDVMDEES is not the query_sequence='GKNAPSTAGLGYGSWEIDPKDLTFLKELGTGQFGVVKYGKWRGQYDVAIKMIKEGSMSEDEFIEEAKVMMNLSHEKLVQLYGVCTKQRPIFIITEYMANGXLLNYLREMRHRFQTQQLLEMCKDVCEAMEYLESKQFLHRDLAARNCLVNDQGVVKVSDFGLSRYVLDDEYTSSVGSKFPVRWSPPEVLMYSKFSSKSDIWAFGVLMWEIYSLGKMPYERFTNSETAEHIAQGLRLYRPHLASEKVYTIMYSCWHEKADERPTFKILLSNILDVMDEES'

I am not sure if there is an issue with my syntax, or if this issue persists. Here is the JSON I provide from my inference-only script, which fails with the above error (I deleted the MSA and template segments here due to size):

{
    "dialect": "alphafold3",
    "version": 1,
    "name": "5P9J",
    "sequences": [
      {
        "protein": {
          "id": "A",
          "sequence": "GKNAPSTAGLGYGSWEIDPKDLTFLKELGTGQFGVVKYGKWRGQYDVAIKMIKEGSMSEDEFIEEAKVMMNLSHEKLVQLYGVCTKQRPIFIITEYMANGCLLNYLREMRHRFQTQQLLEMCKDVCEAMEYLESKQFLHRDLAARNCLVNDQGVVKVSDFGLSRYVLDDEYTSSVGSKFPVRWSPPEVLMYSKFSSKSDIWAFGVLMWEIYSLGKMPYERFTNSETAEHIAQGLRLYRPHLASEKVYTIMYSCWHEKADERPTFKILLSNILDVMDEES",
          "modifications": [{"ptmType": "M1", "ptmPosition": 101}]
        }
      }
    ],
    "modelSeeds": [
      1
    ],
    "bondedAtomPairs": null,
    "userCCD": "data_M1\n#\n_chem_comp.id          M1\n_chem_comp.name    af3_enrich\n_chem_comp.type    non-polymer\n_chem_comp.formula    ?\n_chem_comp.mon_nstd_parent_comp_id    ?\n_chem_comp.pdbx_synonyms    ?\n_chem_comp.formula_weight    ?\n_chem_comp.pdbx_smiles '[H]OC(=O)C([H])(N([H])[H])C([H])([H])SC([H])([H])C([H])([H])C(=O)N1C([H])([H])C([H])([H])C([H])([H])[C@@]([H])(n2nc(-c3c([H])c([H])c(Oc4c([H])c([H])c([H])c([H])c4[H])c([H])c3[H])c3c(N([H])[H])nc([H])nc32)C1([H])[H]'\n#\nloop_\n_chem_comp_atom.atom_id\n_chem_comp_atom.charge\n_chem_comp_atom.pdbx_leaving_atom_flag\n_chem_comp_atom.comp_id\n_chem_comp_atom.pdbx_model_Cartn_x_ideal\n_chem_comp_atom.pdbx_model_Cartn_y_ideal\n_chem_comp_atom.pdbx_model_Cartn_z_ideal\n_chem_comp_atom.type_symbol\nC1 0 N M1 -4.963 -0.007 1.670 C\nCov 0 N M1 -6.443 0.329 1.834 C\nSG 0 N M1 -7.505 -0.328 0.492 S\nCB 0 N M1 -7.027 0.750 -0.914 C\nCA 0 N M1 -7.374 2.235 -0.729 C\nC 0 N M1 -7.097 2.949 -2.052 C\nOXT 0 N M1 -5.792 3.234 -2.234 O\nO 0 N M1 -7.915 3.223 -2.918 O\nN 0 N M1 -8.782 2.460 -0.320 N\nC2 0 N M1 -4.621 -1.440 2.057 C\nO1 0 N M1 -5.447 -2.169 2.607 O\nN1 0 N M1 -3.312 -1.871 1.810 N\nC3 0 N M1 -3.074 -3.312 1.938 C\nC4 0 N M1 -2.382 -1.142 0.932 C\nC5 0 N M1 -1.615 -3.647 2.227 C\nC6 0 N M1 -0.908 -1.473 1.216 C\nC7 0 N M1 -0.678 -2.987 1.220 C\nN2 0 N M1 0.048 -0.795 0.350 N\nC8 0 N M1 -0.052 -0.526 -0.987 C\nN3 0 N M1 1.225 -0.399 0.884 N\nC9 0 N M1 1.135 0.094 -1.347 C\nN4 0 N M1 -1.092 -0.795 -1.796 N\nC10 0 N M1 1.918 0.101 -0.154 C\nC11 0 N M1 1.241 0.519 -2.678 C\nC12 0 N M1 -0.866 -0.373 -3.050 C\nC13 0 N M1 3.293 0.561 0.036 C\nN5 0 N M1 2.315 1.220 -3.208 N\nN6 0 N M1 0.217 0.271 -3.521 N\nC14 0 N M1 3.624 1.347 1.149 C\nC15 0 N M1 4.302 0.205 -0.870 C\nC16 0 N M1 4.937 1.788 1.338 C\nC17 0 N M1 5.614 0.648 -0.684 C\nC18 0 N M1 5.935 1.429 0.430 C\nO2 0 N M1 7.191 1.938 0.673 O\nC19 0 N M1 8.270 1.160 0.322 C\nC20 0 N M1 9.208 1.710 -0.550 C\nC21 0 N M1 8.471 -0.107 0.872 C\nC22 0 N M1 10.335 
0.969 -0.910 C\nC23 0 N M1 9.599 -0.846 0.512 C\nC24 0 N M1 10.527 -0.309 -0.383 C\nH1 0 N M1 -4.391 0.641 2.345 H\nH2 0 N M1 -4.639 0.198 0.647 H\nH3 0 N M1 -6.579 1.413 1.884 H\nH4 0 N M1 -6.828 -0.072 2.777 H\nH5 0 N M1 -5.960 0.622 -1.116 H\nH6 0 N M1 -7.551 0.356 -1.793 H\nH7 0 N M1 -6.737 2.695 0.034 H\nH8 0 N M1 -5.777 3.693 -3.102 H\nH9 0 N M1 -9.059 1.693 0.297 H\nH10 0 N M1 -9.379 2.361 -1.147 H\nH11 0 N M1 -3.697 -3.735 2.732 H\nH12 0 N M1 -3.389 -3.777 0.995 H\nH13 0 N M1 -2.679 -1.407 -0.087 H\nH14 0 N M1 -2.523 -0.063 1.044 H\nH15 0 N M1 -1.360 -3.301 3.236 H\nH16 0 N M1 -1.473 -4.733 2.214 H\nH17 0 N M1 -0.692 -1.100 2.229 H\nH18 0 N M1 -0.848 -3.404 0.221 H\nH19 0 N M1 0.361 -3.215 1.486 H\nH20 0 N M1 -1.654 -0.567 -3.770 H\nH21 0 N M1 2.048 1.670 -4.076 H\nH22 0 N M1 2.809 1.808 -2.549 H\nH23 0 N M1 2.861 1.617 1.877 H\nH24 0 N M1 4.080 -0.424 -1.731 H\nH25 0 N M1 5.182 2.403 2.201 H\nH26 0 N M1 6.372 0.376 -1.414 H\nH27 0 N M1 9.057 2.707 -0.953 H\nH28 0 N M1 7.758 -0.521 1.581 H\nH29 0 N M1 11.063 1.389 -1.598 H\nH30 0 N M1 9.757 -1.836 0.933 H\nH31 0 N M1 11.406 -0.884 -0.662 H\n#\nloop_\n_chem_comp_bond.atom_id_1\n_chem_comp_bond.atom_id_2\n_chem_comp_bond.comp_id\n_chem_comp_bond.pdbx_aromatic_flag\n_chem_comp_bond.pdbx_stereo_config\n_chem_comp_bond.value_order\nC1  Cov M1 N N SING\nCov SG  M1 N N SING\nSG  CB  M1 N N SING\nCB  CA  M1 N N SING\nCA  C   M1 N N SING\nC   OXT M1 N N SING\nC   O   M1 N N DOUB\nCA  N   M1 N N SING\nC1  C2  M1 N N SING\nC2  O1  M1 N N DOUB\nC2  N1  M1 N N SING\nN1  C3  M1 N N SING\nC4  N1  M1 N N SING\nC3  C5  M1 N N SING\nC6  C4  M1 N N SING\nC5  C7  M1 N N SING\nC7  C6  M1 N N SING\nC6  N2  M1 N N SING\nN2  C8  M1 Y N SING\nN3  N2  M1 Y N SING\nC8  C9  M1 Y N DOUB\nN4  C8  M1 Y N SING\nC10 N3  M1 Y N DOUB\nC9  C11 M1 Y N SING\nC9  C10 M1 Y N SING\nC12 N4  M1 Y N DOUB\nC10 C13 M1 N N SING\nC11 N5  M1 N N SING\nC11 N6  M1 Y N DOUB\nN6  C12 M1 Y N SING\nC13 C14 M1 Y N DOUB\nC15 C13 M1 Y N SING\nC14 C16 M1 Y N 
SING\nC17 C15 M1 Y N DOUB\nC16 C18 M1 Y N DOUB\nC18 C17 M1 Y N SING\nC18 O2  M1 N N SING\nO2  C19 M1 N N SING\nC19 C20 M1 Y N DOUB\nC21 C19 M1 Y N SING\nC20 C22 M1 Y N SING\nC23 C21 M1 Y N DOUB\nC22 C24 M1 Y N DOUB\nC24 C23 M1 Y N SING\nC1  H1  M1 N N SING\nC1  H2  M1 N N SING\nCov H3  M1 N N SING\nCov H4  M1 N N SING\nCB  H5  M1 N N SING\nCB  H6  M1 N N SING\nCA  H7  M1 N N SING\nOXT H8  M1 N N SING\nN   H9  M1 N N SING\nN   H10 M1 N N SING\nC3  H11 M1 N N SING\nC3  H12 M1 N N SING\nC4  H13 M1 N N SING\nC4  H14 M1 N N SING\nC5  H15 M1 N N SING\nC5  H16 M1 N N SING\nC6  H17 M1 N N SING\nC7  H18 M1 N N SING\nC7  H19 M1 N N SING\nC12 H20 M1 N N SING\nN5  H21 M1 N N SING\nN5  H22 M1 N N SING\nC14 H23 M1 N N SING\nC15 H24 M1 N N SING\nC16 H25 M1 N N SING\nC17 H26 M1 N N SING\nC20 H27 M1 N N SING\nC21 H28 M1 N N SING\nC22 H29 M1 N N SING\nC23 H30 M1 N N SING\nC24 H31 M1 N N SING\n#\n"
  }

@google-deepmind google-deepmind deleted a comment from 408404405501 Jan 6, 2025
@Augustin-Zidek
Collaborator

@YoavShamir5 thanks for reporting, this is a bug I will fix.

The workaround for now is to change the query sequence in position 101 from C to X.
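
A minimal sketch of applying this workaround to an input JSON programmatically is shown below. The helper `mask_modified_positions` is hypothetical, and it masks every listed modification, which is only appropriate when each PTM code would otherwise be converted to X. The short sequence is the stretch around the modified cysteine taken from the error message, not the full chain.

```python
import copy


def mask_modified_positions(fold_input, placeholder="X"):
    """Return a copy with each modified residue replaced by `placeholder`."""
    patched = copy.deepcopy(fold_input)  # leave the caller's dict untouched
    for entry in patched["sequences"]:
        protein = entry.get("protein")
        if not protein:
            continue  # ligands, RNA, DNA are left as they are
        residues = list(protein["sequence"])
        for mod in protein.get("modifications", []):
            residues[mod["ptmPosition"] - 1] = placeholder  # 1-based position
        protein["sequence"] = "".join(residues)
    return patched


example = {
    "sequences": [
        {
            "protein": {
                "id": "A",
                "sequence": "EYMANGCLLNYL",  # fragment only, for illustration
                "modifications": [{"ptmType": "M1", "ptmPosition": 7}],
            }
        }
    ]
}
patched = mask_modified_positions(example)
print(patched["sequences"][0]["protein"]["sequence"])
```

For the full 5P9J input above, the same call would replace position 101.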

@alephreish

alephreish commented Jan 21, 2025

@Augustin-Zidek This does not sound like a good workaround, as the substitution to 'X' has to be made before the search phase and will thus influence the scoring. The better alternatives are either to change the residue to 'X' in the alignments (the search phase has to be done as a separate call) or to name the PTM with one of the CCD names from CCD_NAME_TO_ONE_LETTER that maps to the amino acid you are modifying. To be entirely honest though, it feels like a very straightforward bug to fix.
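
The first alternative could look roughly like the sketch below, which edits only the query row of an A3M string produced by a separate search run. It rests on assumptions that are not confirmed in this thread: that each record's sequence sits on a single line, that the query row contains no insertions, and that only the first row is compared with the query sequence (as the error message suggests). `mask_query_row` is a hypothetical helper.

```python
def mask_query_row(a3m, position, placeholder="X"):
    """Replace the residue at 1-based `position` in the first A3M record only."""
    lines = a3m.splitlines()
    if len(lines) < 2 or not lines[0].startswith(">"):
        raise ValueError("expected an A3M string starting with a '>' header")
    query = lines[1]
    if not 1 <= position <= len(query):
        raise ValueError(f"position {position} is outside the query row")
    lines[1] = query[: position - 1] + placeholder + query[position:]
    return "\n".join(lines) + "\n"


# Toy alignment: the hit row keeps its gap and lowercase insertion unchanged.
a3m = ">query\nMANGCLL\n>hit\nMA-GCLLa\n"
patched_a3m = mask_query_row(a3m, 5)
print(patched_a3m)
```

The search itself still runs on the unmodified sequence, so the alignment scoring is not affected by the placeholder.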
