Different performance between model saved as fine-tuned PLM and state_dict #389

Open
zedavid opened this issue Feb 29, 2024 · 3 comments

Labels
bug Something isn't working

Comments
zedavid commented Feb 29, 2024

Version
PyABSA = 2.3.4rc0
Torch = 2.1.1
Transformers = 4.35.2

Describe the bug
I fine-tuned a model with the FAST_LSA_S_V2 config on the same dataset, using the APCTrainer, in two separate runs. In one run I saved the result as a state_dict file and in the other as a fine-tuned PLM. I then ran the model on sample data using APC.SentimentClassifier and the HF text-classification pipeline, but I get different results even though the model was trained the same way on the same data.
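
For reference, the two runs differed only in the checkpoint save mode, roughly like this (a minimal sketch assuming PyABSA's ModelSaveOption flags; the dataset name is a placeholder for my custom dataset):

from pyabsa import DatasetItem, ModelSaveOption
from pyabsa import AspectPolarityClassification as APC

config = APC.APCConfigManager.get_apc_config_english()
config.model = APC.APCModelList.FAST_LSA_S_V2

dataset = DatasetItem("uber")  # placeholder for the custom dataset

# Run 1: save the trained model as a state_dict checkpoint
APC.APCTrainer(
    config=config,
    dataset=dataset,
    checkpoint_save_mode=ModelSaveOption.SAVE_MODEL_STATE_DICT,
    auto_device=True,
)

# Run 2: save the underlying PLM in huggingface format
APC.APCTrainer(
    config=config,
    dataset=dataset,
    checkpoint_save_mode=ModelSaveOption.SAVE_FINE_TUNED_PLM,
    auto_device=True,
)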

Code To Reproduce

Loading and testing the state_dict version:

from pyabsa import AspectPolarityClassification as APC

# Load the state_dict checkpoint produced by the trainer
sentiment_model = APC.SentimentClassifier('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/')
examples = [
    "ty images city officials are standing their ground in defense of the law delivery workers like all workers deserve fair pay for their labor and we are disappointed that [B-ASP]Uber[E-ASP] doordash grubhub and relay disagree vilda vera mayuga head of the city's department of consumer and worker protection said in",
    "as prepared just a mile away for it to be delivered to me the food arrived stone cold i complained to [B-ASP]Uber[E-ASP] and [B-ASP]Uber[E-ASP] told me to go pound sand readers purporting to work for [B-ASP]Uber[E-ASP] left dozens of comments castigating me and others for not preemptively tipping arguing that such deliveries are not worth"
]

sentiment_model.predict(
    text=examples,
    eval_batch_size=32,
)

Output:

[{'text': "ty images city officials are standing their ground in defense of the law delivery workers like all workers deserve fair pay for their labor and we are disappointed that Uber doordash grubhub and relay disagree vilda vera mayuga head of the city's department of consumer and worker protection said in",
  'aspect': ['Uber'],
  'sentiment': ['Negative'],
  'confidence': [0.9339152574539185],
  'probs': [array([0.93391526, 0.05876274, 0.007322  ], dtype=float32)],
  'ref_sentiment': ['-100'],
  'ref_check': [''],
  'perplexity': 'N.A.'},
 {'text': 'as prepared just a mile away for it to be delivered to me the food arrived stone cold i complained to Uber and Uber told me to go pound sand readers purporting to work for Uber left dozens of comments castigating me and others for not preemptively tipping arguing that such deliveries are not worthwh',
  'aspect': ['Uber', 'Uber', 'Uber'],
  'sentiment': ['Negative', 'Negative', 'Negative'],
  'confidence': [0.9557020664215088, 0.9557020664215088, 0.9557020664215088],
  'probs': [array([0.95570207, 0.03284235, 0.01145565], dtype=float32),
   array([0.95570207, 0.03284235, 0.01145565], dtype=float32),
   array([0.95570207, 0.03284235, 0.01145565], dtype=float32)],
  'ref_sentiment': ['-100', '-100', '-100'],
  'ref_check': ['', '', ''],
  'perplexity': 'N.A.'}]

With the HF text-classification pipeline:

import re

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_tokenizer = AutoTokenizer.from_pretrained('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/fine-tuned-pretrained-model/')
model = AutoModelForSequenceClassification.from_pretrained('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/fine-tuned-pretrained-model/')
sentiment_pipeline = pipeline('text-classification', model=model, tokenizer=model_tokenizer, device=1)

# Strip the [B-ASP]...[E-ASP] markers and pass the aspect as text_pair instead
examples_no_tag = [{'text': re.sub(r"\[B-ASP\](.+?)\[E-ASP\]", r"\1", ex), 'text_pair': 'Uber'} for ex in examples]
sentiment_pipeline(examples_no_tag, top_k=3)

Output:

[[{'label': 'Neutral', 'score': 0.38777175545692444},
  {'label': 'Positive', 'score': 0.3418353199958801},
  {'label': 'Negative', 'score': 0.27039292454719543}],
 [{'label': 'Neutral', 'score': 0.3863997459411621},
  {'label': 'Positive', 'score': 0.3450266420841217},
  {'label': 'Negative', 'score': 0.2685735821723938}]]

Expected behavior
I would expect the output probabilities from the two saved versions of the model to correspond, at least approximately.

Thanks!

zedavid added the bug label Feb 29, 2024
@yangheng95
Owner

The model saved in huggingface format is not intended for instant inference but for further fine-tuning; the state_dict is the recommended save mode. If you want to run a model with the pipeline, one has been released at: https://huggingface.co/yangheng/deberta-v3-base-absa-v1.1
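
That released model can be called directly from the pipeline (a minimal sketch based on its model card; the example sentence is illustrative and the label names may differ from your fine-tuned checkpoint):

from transformers import pipeline

# The released ABSA model linked above
classifier = pipeline("text-classification", model="yangheng/deberta-v3-base-absa-v1.1")

# The sentence goes in `text` and the aspect term in `text_pair`
classifier({"text": "The food arrived stone cold and support was unhelpful", "text_pair": "Uber"})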

@zedavid
Author

zedavid commented Mar 4, 2024

I see. What would be required to make that model runnable with the huggingface pipeline?
Also, is there a checkpoint for the huggingface model? I would like to replicate, with PyABSA, the results I get from the pipeline.

@yangheng95
Owner

I am sorry about that. It is tricky to train models that are compatible with the huggingface pipeline, and I have cleaned up the original materials, such as the code, so I am afraid I cannot provide detailed help with that.
