Arbitrary output when inferencing on hg_38 #1

Open

DreamAtDawn opened this issue Jan 13, 2025 · 3 comments

DreamAtDawn commented Jan 13, 2025

Hello there. I was interested in your work and tried to run an inference baseline. However, after finishing and debugging the script according to the given .ipynb tutorial, I found that the model would output a .csv file full of NaNs. Confusingly, after rerunning the code, it would sometimes produce seemingly valid output (i.e., a .csv with floats inside). Since I'm still a rookie in this area, I'm wondering if you could offer some advice.

I'm working in a conda environment with python=3.8.20, pytorch=2.4.1+cu118 (since the old version does not support the sm_86 architecture of the RTX 3090), and pytorch-lightning=1.2.10.

Thank you so much!

Here is my script:

# prediction_hg38.py

import torch
import os
from pyfaidx import Fasta
import numpy as np
import pandas as pd
os.environ['CUDA_VISIBLE_DEVICES']='5'
from model import EpiGePT
from model.config import *
from model.utils import *
from pytorch_lightning.callbacks.early_stopping import EarlyStopping
from pytorch_lightning.callbacks.model_checkpoint import ModelCheckpoint

# Allowlist the globals
torch.serialization.add_safe_globals([EarlyStopping])
torch.serialization.add_safe_globals([ModelCheckpoint])

model = EpiGePT.EpiGePT(WORD_NUM,TF_DIM,BATCH_SIZE)
model.load_state_dict(torch.load('pretrain_models_hg38/pretrained_epigept.ckpt', 
                                 map_location='cuda:0', weights_only=False)['state_dict'], strict=False)
genome = Fasta('hg38.fa')
model.eval()
model = model.cuda()

acgt2num = {'A': 0,
            'C': 1,
            'G': 2,
            'T': 3}

def seq2mat(seq):
    seq = seq.upper()
    h = 4
    w = len(seq)
    mat = np.zeros((h, w), dtype=bool)  # boolean one-hot matrix; 'N' bases stay all-zero
    for i in range(w):
        if seq[i] != 'N':
            mat[acgt2num[seq[i]], i] = 1.
    return mat

def model_predict(model, chrom, start, end, tf_feature):
    seq = genome[chrom][start:end].seq
    
    seq_embeds = seq2mat(seq)
    seq_embeds = np.array(seq_embeds, dtype='float16')  # one-hot matrix, shape (4, seq_len)
    seq_embeds = np.expand_dims(seq_embeds, axis=0)  # add batch dimension
    seq_embeds = torch.from_numpy(seq_embeds).to(torch.float32)
    
    tf_feature = np.pad(tf_feature, ((0, 0), (0, 1)), 'constant', constant_values=(0, 0))
    tf_feature = np.expand_dims(tf_feature, axis=0)
    tf_feature = torch.from_numpy(tf_feature)
    
    seq_embeds = seq_embeds.type(torch.FloatTensor)
    tf_feature = tf_feature.type(torch.FloatTensor)
    seq_embeds = seq_embeds.to('cuda')
    tf_feature = tf_feature.to('cuda')
    print(seq_embeds.dtype, tf_feature.dtype)
    
    signals = model(seq_embeds, tf_feature)
    np_signals = signals.cpu().detach().numpy()
    return np_signals

# get information about 711 TFs
tf_list, motif2tf, isContain = get_tf_motif_match()
print(tf_list[:10])

ref_tf_expression = pd.read_csv('reference_TPM_tf_expr.csv', header=0, sep='\t', index_col=[0])

def quantile_norm(df, ref_exp):
    """Quantile normalization against a reference distribution.
    Args:
        df: pandas DataFrame of TFs x cells to be normalized
        ref_exp: pandas DataFrame holding the reference expression values
    Returns:
        normalized pandas DataFrame
    """

    rank_mean = ref_exp.stack().groupby(ref_exp.rank(method='first').stack().astype(int)).mean()
    return df.rank(method='min').stack().astype(int).map(rank_mean).unstack()

cell_id = 0  # index of the cell type
N = 64  # number of cell types
tf_TPM_expression = pd.read_csv('EXAMPLE_TPM.csv', header=0, sep='\t', index_col=[0])  # TPM values of the 711 TFs

tf_expression = quantile_norm(tf_TPM_expression,ref_tf_expression)
tf_expression = np.array(tf_expression.iloc[:,cell_id])  # quantile-transformed TPM values

if not os.path.exists('homer_motifscore.txt'):
    # scan the motif binding score using Homer tool
    os.system('findMotifsGenome.pl bins.bed hg38.fa ./ -find motifdata/all_motif_rmdup.motif -p 32 -cache 320000 > homer_motifscore.txt')

"""parse the motifscan results by homer tool.
    Args:
        bin_file: path to bed file that was used for motif scanning
        motifscan_file: path to the motifscan file from Homer tool (e.g. homer_motifscore.txt)
"""
bin_file = "bins.bed"
motifscan_file = "homer_motifscore.txt"

tf_motifscore = get_motif_score(tf_list, motif2tf, isContain, bin_file, motifscan_file, save=True)

# TF feature
tf_feature = tf_motifscore * np.log(tf_expression + 1)
print(tf_feature.shape)

chrom, start, end = 'chr1', 768000, 896000
region = [chrom, start, end]
bins = [chrom+':'+str(start+i*128)+'-'+str(start+(i+1)*128) for i in range(1000)]

# bin level prediction
predict = model_predict(model, region[0], region[1], region[2], tf_feature)
pd_predict = pd.DataFrame(predict[0, :, :], index=bins, 
                          columns = ['DNase', 'CTCF', 'H3K27ac', 'H3K4me3', 'H3K36me3', 'H3K27me3', 'H3K9me3', 'H3K4me1'])
print(pd_predict)

pd_predict.to_csv('demo_region_prediction.csv')
print('Done!')

Here are the outputs from several reruns of the code:

anaconda3/envs/epigept/lib/python3.8/site-packages/torch/__init__.py:955: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:432.)
  _C._set_default_tensor_type(t)
['AFP', 'AHR', 'AIRE', 'ALX1', 'ALX3', 'ALX4', 'AR', 'ARHGEF12', 'ARID3A', 'ARID3B']
PositionID      Offset  Sequence        Motif Name      Strand  MotifScore

1000    51      CAGTCAATAC      HNF6(Homeobox)/Liver-Hnf6-ChIP-Seq(ERP000394)/Homer     -       4.867507

(1000, 711)
torch.float32
                       DNase      CTCF   H3K27ac   H3K4me3  H3K36me3  H3K27me3   H3K9me3   H3K4me1
chr1:768000-768128  0.171321  0.072641  0.000000  0.000000  0.042667  0.023973  0.002948  0.000000
chr1:768128-768256  1.558629  0.825079  0.640414  0.556037  1.014686  0.652755  0.547468  0.654196
chr1:768256-768384  0.989475  0.706329  0.600062  0.482069  0.788652  0.576407  0.408723  0.566097
chr1:768384-768512  1.184669  0.517184  0.362211  0.221935  0.466018  0.347607  0.224645  0.388150
chr1:768512-768640  0.934541  0.675771  0.604801  0.435896  0.778163  0.528984  0.396869  0.582172
...                      ...       ...       ...       ...       ...       ...       ...       ...
chr1:895360-895488  1.978261  0.766934  0.661205  0.535452  0.850547  0.580888  0.540998  0.649658
chr1:895488-895616  2.286676  1.030839  0.925649  0.719848  1.244636  0.855910  0.708852  0.932244
chr1:895616-895744  1.876076  0.847776  0.728561  0.620082  0.965098  0.641261  0.480885  0.680833
chr1:895744-895872  1.712706  0.846297  0.763886  0.661768  1.026974  0.699856  0.510227  0.694445
chr1:895872-896000  1.057157  0.882725  0.747024  0.694877  1.120599  0.746271  0.572334  0.751222

[1000 rows x 8 columns]
Done!

The output above looks reasonable, but after rerunning the code I got:

anaconda3/envs/epigept/lib/python3.8/site-packages/torch/__init__.py:955: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:432.)
  _C._set_default_tensor_type(t)
['AFP', 'AHR', 'AIRE', 'ALX1', 'ALX3', 'ALX4', 'AR', 'ARHGEF12', 'ARID3A', 'ARID3B']
PositionID      Offset  Sequence        Motif Name      Strand  MotifScore

1000    51      CAGTCAATAC      HNF6(Homeobox)/Liver-Hnf6-ChIP-Seq(ERP000394)/Homer     -       4.867507

(1000, 711)
torch.float32 torch.float32
                    DNase  CTCF  H3K27ac  H3K4me3  H3K36me3  H3K27me3  H3K9me3  H3K4me1
chr1:768000-768128    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN
chr1:768128-768256    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN
chr1:768256-768384    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN
chr1:768384-768512    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN
chr1:768512-768640    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN
...                   ...   ...      ...      ...       ...       ...      ...      ...
chr1:895360-895488    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN
chr1:895488-895616    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN
chr1:895616-895744    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN
chr1:895744-895872    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN
chr1:895872-896000    NaN   NaN      NaN      NaN       NaN       NaN      NaN      NaN

[1000 rows x 8 columns]
Done!
@ZjGaothu (Owner)

Thank you for this issue. Based on your description, it seems that rerunning the same code occasionally produces different results. During our own testing, we did not encounter such behavior. Given the code you shared, the data inputs should remain consistent across runs. Therefore, I suspect that the issue may arise from the initialization of the model weights.

I noticed that the strict=False parameter was used when loading the model weights. This suggests that the loaded weights might not fully align with the model's architecture. Specifically, the pretrained models for hg19 and hg38 versions have slight differences in the final Linear layer. For example, the EpiGePT.py file in the model (hg19) folder is structurally different from the one in the model_hg38 folder.

Currently, your code imports EpiGePT from the model folder (hg19 version) but loads pretrained weights for the hg38 version. Using strict=False may cause some weights to remain uninitialized, defaulting to random values. This could lead to inconsistencies, such as NaN outputs.

To address this issue, I recommend the following changes:

  1. Replace:

    from model import EpiGePT
    from model.config import *
    from model.utils import *

    with:

    from model_hg38 import EpiGePT
    from model_hg38.config import *
    from model_hg38.utils import *
  2. When loading the model weights, set strict=True to ensure all parameters are properly initialized:

    model.load_state_dict(torch.load('pretrain_models_hg38/pretrained_epigept.ckpt', 
                                     map_location='cuda:0', weights_only=False)['state_dict'], strict=True)

These changes should ensure that the model architecture matches the pretrained weights, minimizing the likelihood of inconsistent behavior.
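
As a quick sanity check (a minimal sketch on our side, assuming the hg38 model is constructed as in your script), you can inspect the keys returned by load_state_dict before trusting any strict=False load; any parameter listed under missing_keys keeps its random initialization:

import torch
from model_hg38 import EpiGePT
from model_hg38.config import *  # WORD_NUM, TF_DIM, BATCH_SIZE

model = EpiGePT.EpiGePT(WORD_NUM, TF_DIM, BATCH_SIZE)
state_dict = torch.load('pretrain_models_hg38/pretrained_epigept.ckpt',
                        map_location='cpu', weights_only=False)['state_dict']

# load_state_dict returns an IncompatibleKeys named tuple:
#   missing_keys    -> parameters left at their random initialization
#   unexpected_keys -> checkpoint entries the model silently ignored
result = model.load_state_dict(state_dict, strict=False)
print('missing keys:', result.missing_keys)
print('unexpected keys:', result.unexpected_keys)
assert not result.missing_keys, 'checkpoint does not cover all model parameters'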

Thank you again for bringing this to our attention, and we appreciate your interest in our work!

@xiaoxiao6630

Hello,

I have also run into some problems when testing the inference pipeline based on tutorial_prediction_hg38.ipynb.
I am working with python=3.9.18, pytorch=2.5.1+cu124, pytorch-lightning=2.4.0, pyfasta=0.5.2.

1. Cannot import GenomicData from dataset
My code: from model_hg38 import EpiGePT
The error:

File "EpiGePT/model_hg38/EpiGePT.py", line 22, in <module>
    from dataset import GenomicData
ImportError: cannot import name 'GenomicData' from 'dataset'

I checked model_hg38/dataset.py and downloaded the required datasets, but I still got this error when importing EpiGePT.
So I commented out the line from dataset import GenomicData in model_hg38/EpiGePT.py.

2. Cannot load pretrained model weights
Then I got a new error when loading the pretrained weights.

My code:

import torch
import os
from pyfasta import Fasta
import numpy as np
import pandas as pd
from model_hg38 import EpiGePT
from model_hg38.config import *
from model_hg38.utils import *

model = EpiGePT.EpiGePT(WORD_NUM,TF_DIM,BATCH_SIZE)
model = load_weights(model,'pretrain_models_hg38/pretrained_epigept.ckpt')

The error:

model.load_state_dict(torch.load('pretrain_models_hg38/pretrained_epigept.ckpt',map_location='cuda:0')['state_dict'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "EpiGePT/model_hg38/utils.py", line 40, in load_weights
  model.load_state_dict(torch.load('pretrain_models_hg38/pretrained_epigept.ckpt',map_location='cuda:0')['state_dict'])
File "python3.9/site-packages/torch/nn/modules/module.py", line 2584, in load_state_dict
  raise RuntimeError(
RuntimeError: Error(s) in loading state_dict for EpiGePT:
  Unexpected key(s) in state_dict: "encoder.embeddings.position_ids". 

Then I changed the weight loading to:

model.load_state_dict(torch.load('pretrain_models_hg38/pretrained_epigept.ckpt', 
                                 map_location='cuda:0', weights_only=False)['state_dict'], strict=False)

This works for loading the weights.
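
A stricter alternative (just a sketch on my side, assuming the stale buffer is the only mismatch and that model is constructed as above) is to pop the unexpected key from the state_dict and keep strict=True, so that any real architecture mismatch still raises an error:

import torch

state_dict = torch.load('pretrain_models_hg38/pretrained_epigept.ckpt',
                        map_location='cuda:0', weights_only=False)['state_dict']

# 'encoder.embeddings.position_ids' was a registered buffer in older
# transformers releases; newer releases no longer expect it, so it shows
# up as an unexpected key. Dropping it lets strict loading succeed.
state_dict.pop('encoder.embeddings.position_ids', None)

model.load_state_dict(state_dict, strict=True)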

3. model_predict() error
My code:

SEQ_LENGTH = 128000
input_tf_feature = np.random.rand(1000, 711) # 711 TFs
input_seq_feature = np.zeros((1,4,SEQ_LENGTH))
predict = model_predict(model,input_seq_feature,input_tf_feature)

The error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "EpiGePT/model_hg38/utils.py", line 34, in model_predict
    signals = model(seq_embeds,tf_feature)
  File "python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "EpiGePT/model_hg38/EpiGePT.py", line 99, in forward
    x = self.convmodule(batch_inputs_seq)
  File "python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "EpiGePT/model_hg38/EpiGePT.py", line 49, in forward
    x = self.pool1(self.relu(self.conv1(x)))
  File "python3.9/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "python3.9/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "python3.9/site-packages/torch/nn/modules/conv.py", line 375, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "python3.9/site-packages/torch/nn/modules/conv.py", line 370, in _conv_forward
    return F.conv1d(
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Could you give some suggestions about these problems?

Thanks a lot.

@ZjGaothu (Owner)

@xiaoxiao6630 Hello, thank you for raising these problems. We have rechecked and tested the tutorial_prediction_hg38.ipynb notebook, and we did not encounter the issues you mentioned. Therefore, it seems like the problems might be related to the versions of Python, PyTorch-Lightning, Transformers, or other Python packages you’re using.

  1. The issue seems to indicate that you were unable to successfully import GenomicData from dataset. This might be because the import statement did not correctly reference the dataset.py file in the model_hg38 directory and instead pointed to a script from another path. However, you can safely comment out the line from dataset import GenomicData without affecting the functionality of the code.

  2. This issue might be related to the version of the Transformers library. In newer versions, the structure of BertModel and the parameter (key) names may differ from older versions, which can result in a mismatch when loading weights from some pretrained models. The versions we used were python=3.6.9, pytorch-lightning=1.2.10, and transformers==4.5.1. I believe this version mismatch might be the cause of the problem.

  3. The last issue might be related to point 2, or it could be because the GPU device wasn’t properly specified. The tutorial we provided loads the model weights and data onto the GPU by default, but based on the error message, it seems some model weights might still be on the CPU. If that’s the case, the input data being moved to the GPU would result in a device mismatch. Please double-check that both the model and the input data are on the same device (a quick check is sketched below).
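
For example, here is a minimal sketch of that check (assuming model, seq_embeds, and tf_feature are defined as in the tutorial code):

import torch

model = model.cuda()  # move all model weights to the GPU before inference

# collect the set of devices across all parameters; a correctly placed
# model reports a single CUDA device here
param_devices = {p.device for p in model.parameters()}
print('parameter devices:', param_devices)
print('input devices:', seq_embeds.device, tf_feature.device)

assert all(d.type == 'cuda' for d in param_devices), 'some weights are still on the CPU'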

I hope this helps you in using EpiGePT effectively.
