By searching the UniProtKB database with keywords such as "neuropeptide" and filtering out entries that lack a precursor flag or a signal peptide annotation, we collected 1194 complete, reviewed neuropeptide precursors. We selected 31 precursors that were added to UniProt after 2014 as the independent test dataset. To guarantee a fair comparison on this independent test dataset, collected sequences sharing more than 40% identity with any test precursor (computed with CD-HIT) were removed. After these steps, the training dataset contained 717 precursors, which were split into training and validation subsets at a ratio of 4:1. All training and test data are freely available at https://github.com/isyslab-hust/DeepNeuropePred.
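The redundancy-removal and splitting steps can be reproduced with standard tools. The sketch below illustrates them under stated assumptions: the FASTA file names are hypothetical, and the simple random 4:1 split stands in for the authors' exact partitioning; cd-hit-2d is the CD-HIT program that compares one dataset against another.

```python
# Minimal sketch of redundancy removal against the test set and a 4:1 split.
# File names and the random split are illustrative assumptions.
import random
import subprocess

# Keep only candidate training sequences that share <40% identity with any
# test precursor; cd-hit-2d retains sequences in -i2 not similar to -i.
# (-n 2 is the word size CD-HIT requires for a 0.4 identity threshold.)
subprocess.run([
    "cd-hit-2d",
    "-i", "test_precursors.fasta",        # hypothetical test-set FASTA
    "-i2", "candidate_precursors.fasta",  # hypothetical candidate pool
    "-o", "train_pool.fasta",
    "-c", "0.4", "-n", "2",
], check=True)

def read_fasta(path):
    """Return a list of (header, sequence) pairs from a FASTA file."""
    records, header, seq = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records.append((header, "".join(seq)))
                header, seq = line[1:], []
            elif line:
                seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records

precursors = read_fasta("train_pool.fasta")
random.seed(0)
random.shuffle(precursors)

# 4:1 split of the remaining precursors into training and validation sets.
n_train = int(len(precursors) * 0.8)
train_set, valid_set = precursors[:n_train], precursors[n_train:]
print(f"training: {len(train_set)}, validation: {len(valid_set)}")
```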
The model architecture consists of four parts: a pre-trained protein language model, convolutional layers, an average pooling layer, and position-wise fully connected layers. The pre-trained language model (ESM-12) provides a global feature representation of the precursor, because its input is the full-length precursor sequence rather than a window around each cleavage site. Convolutional layers with two kernel sizes (1 and 3) extract local features of each 18-amino-acid window at two different scales, and the average pooling layer yields a global representation of each window. Finally, position-wise fully connected layers map the embeddings of cleavage and non-cleavage sites into the classification space.
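The classification head downstream of the language model can be sketched in PyTorch as follows. The layer widths, the embedding dimension (which depends on the ESM variant), and the exact activation choices are illustrative assumptions rather than the published configuration; the sketch only reflects the structure described above (two convolution scales, average pooling over an 18-residue window, and position-wise fully connected layers).

```python
# Sketch of a cleavage-site classification head over per-residue ESM embeddings.
# Dimensions and activations are assumptions for illustration.
import torch
import torch.nn as nn

class CleavageSiteHead(nn.Module):
    """For each candidate cleavage site, an 18-residue window of embeddings
    (window_size x embed_dim) is passed through two parallel 1-D convolutions
    (kernel sizes 1 and 3), average-pooled over the window, concatenated, and
    fed to position-wise fully connected layers that output class logits."""

    def __init__(self, embed_dim=768, hidden_dim=64):
        super().__init__()
        # Two convolution branches over the window dimension (kernel sizes 1 and 3).
        self.conv1 = nn.Conv1d(embed_dim, hidden_dim, kernel_size=1)
        self.conv3 = nn.Conv1d(embed_dim, hidden_dim, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(1)  # average pooling over the window
        # Position-wise fully connected layers mapping to the two classes.
        self.fc = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),
        )

    def forward(self, window_embeddings):
        # window_embeddings: (batch, window_size=18, embed_dim)
        x = window_embeddings.transpose(1, 2)          # (batch, embed_dim, window)
        local1 = self.pool(torch.relu(self.conv1(x)))  # (batch, hidden_dim, 1)
        local3 = self.pool(torch.relu(self.conv3(x)))  # (batch, hidden_dim, 1)
        feats = torch.cat([local1, local3], dim=1).squeeze(-1)
        return self.fc(feats)                          # (batch, 2) logits

# Example: score a batch of 4 candidate-site windows of ESM embeddings.
head = CleavageSiteHead()
logits = head(torch.randn(4, 18, 768))
print(logits.shape)  # torch.Size([4, 2])
```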