Different performance between model saved as fine-tuned PLM and state_dict #389

Open
zedavid opened this issue Feb 29, 2024 · 3 comments

Labels
bug Something isn't working

Comments
zedavid commented Feb 29, 2024

Version
PyABSA = 2.3.4rc0
Torch = 2.1.1
Transformers = 4.35.2

Describe the bug
I fine-tuned a model with the FAST_LSA_S_V2 config on the same dataset, using the APCTrainer, in two separate runs. In one run I saved the result as a state_dict file and in the other as a fine-tuned PLM. I then ran the model on sample data using APC.SentimentClassifier and the HF text-classification pipeline, but I get different results even though the model was trained the same way on the same data.
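
For reference, the two runs differed only in the checkpoint save mode, roughly like this (a minimal sketch assuming PyABSA's ModelSaveOption flags; the dataset name is a placeholder for my custom dataset):

from pyabsa import DatasetItem, ModelSaveOption
from pyabsa import AspectPolarityClassification as APC

config = APC.APCConfigManager.get_apc_config_english()
config.model = APC.APCModelList.FAST_LSA_S_V2

dataset = DatasetItem("uber")  # placeholder for the custom dataset

# Run 1: save the trained model as a state_dict checkpoint
APC.APCTrainer(
    config=config,
    dataset=dataset,
    checkpoint_save_mode=ModelSaveOption.SAVE_MODEL_STATE_DICT,
    auto_device=True,
)

# Run 2: save the underlying PLM in huggingface format
APC.APCTrainer(
    config=config,
    dataset=dataset,
    checkpoint_save_mode=ModelSaveOption.SAVE_FINE_TUNED_PLM,
    auto_device=True,
)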

Code To Reproduce

Loading and testing the state_dict version:

from pyabsa import AspectPolarityClassification as APC

# Load the state_dict checkpoint produced by the trainer
sentiment_model = APC.SentimentClassifier('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/')
examples = [
    "ty images city officials are standing their ground in defense of the law delivery workers like all workers deserve fair pay for their labor and we are disappointed that [B-ASP]Uber[E-ASP] doordash grubhub and relay disagree vilda vera mayuga head of the city's department of consumer and worker protection said in",
    "as prepared just a mile away for it to be delivered to me the food arrived stone cold i complained to [B-ASP]Uber[E-ASP] and [B-ASP]Uber[E-ASP] told me to go pound sand readers purporting to work for [B-ASP]Uber[E-ASP] left dozens of comments castigating me and others for not preemptively tipping arguing that such deliveries are not worth"
]

sentiment_model.predict(
    text=examples,
    eval_batch_size=32,
)

Output:

[{'text': "ty images city officials are standing their ground in defense of the law delivery workers like all workers deserve fair pay for their labor and we are disappointed that Uber doordash grubhub and relay disagree vilda vera mayuga head of the city's department of consumer and worker protection said in",
  'aspect': ['Uber'],
  'sentiment': ['Negative'],
  'confidence': [0.9339152574539185],
  'probs': [array([0.93391526, 0.05876274, 0.007322  ], dtype=float32)],
  'ref_sentiment': ['-100'],
  'ref_check': [''],
  'perplexity': 'N.A.'},
 {'text': 'as prepared just a mile away for it to be delivered to me the food arrived stone cold i complained to Uber and Uber told me to go pound sand readers purporting to work for Uber left dozens of comments castigating me and others for not preemptively tipping arguing that such deliveries are not worthwh',
  'aspect': ['Uber', 'Uber', 'Uber'],
  'sentiment': ['Negative', 'Negative', 'Negative'],
  'confidence': [0.9557020664215088, 0.9557020664215088, 0.9557020664215088],
  'probs': [array([0.95570207, 0.03284235, 0.01145565], dtype=float32),
   array([0.95570207, 0.03284235, 0.01145565], dtype=float32),
   array([0.95570207, 0.03284235, 0.01145565], dtype=float32)],
  'ref_sentiment': ['-100', '-100', '-100'],
  'ref_check': ['', '', ''],
  'perplexity': 'N.A.'}]

With the HF text-classification pipeline:

import re

from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

model_tokenizer = AutoTokenizer.from_pretrained('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/fine-tuned-pretrained-model/')
model = AutoModelForSequenceClassification.from_pretrained('checkpoints/fast_lsa_s_v2_uber_acc_98.2_f1_98.14/fine-tuned-pretrained-model/')
sentiment_pipeline = pipeline('text-classification', model=model, tokenizer=model_tokenizer, device=1)

# Strip the [B-ASP]...[E-ASP] markers and pass the aspect as text_pair instead
examples_no_tag = [{'text': re.sub(r"\[B-ASP\](.+?)\[E-ASP\]", r"\1", ex), 'text_pair': 'Uber'} for ex in examples]
sentiment_pipeline(examples_no_tag, top_k=3)

Output:

[[{'label': 'Neutral', 'score': 0.38777175545692444},
  {'label': 'Positive', 'score': 0.3418353199958801},
  {'label': 'Negative', 'score': 0.27039292454719543}],
 [{'label': 'Neutral', 'score': 0.3863997459411621},
  {'label': 'Positive', 'score': 0.3450266420841217},
  {'label': 'Negative', 'score': 0.2685735821723938}]]

Expected behavior
I would expect the output probabilities from the two saved versions of the model to correspond, at least approximately.

Thanks!

zedavid added the bug label Feb 29, 2024
@yangheng95
Owner

The model saved in huggingface format is not intended for instant inference but for further fine-tuning; the state_dict is the recommended save mode. If you want to run a model with the pipeline, one has been released at: https://huggingface.co/yangheng/deberta-v3-base-absa-v1.1
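
That released model can be called directly from the pipeline (a minimal sketch based on its model card; the example sentence is illustrative and the label names may differ from your fine-tuned checkpoint):

from transformers import pipeline

# The released ABSA model linked above
classifier = pipeline("text-classification", model="yangheng/deberta-v3-base-absa-v1.1")

# The sentence goes in `text` and the aspect term in `text_pair`
classifier({"text": "The food arrived stone cold and support was unhelpful", "text_pair": "Uber"})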

@zedavid
Author

zedavid commented Mar 4, 2024

I see. What would be required to make that model runnable with the huggingface pipeline?
Also, is there a checkpoint for the huggingface model? I would like to replicate, with PyABSA, the results I get from the pipeline.

@yangheng95
Owner

I am sorry about that. It is tricky to train models that are compatible with the huggingface pipeline, and I have cleaned up the original materials, such as the code, so I am afraid I cannot provide detailed help with that.
