Weird cuda OutOfMemoryError error #561

Open
Nevermetyou65 opened this issue Feb 14, 2025 · 0 comments
@Nevermetyou65

Hi

I need some help.

I am trying to evaluate my pre-trained model on the Thai FineTasks. Here is my command:

export CUDA_VISIBLE_DEVICES="0,1"
echo "Running lighteval for model: meta-llama/Llama-3.2-3B"
lighteval accelerate \
"pretrained=meta-llama/Llama-3.2-3B,dtype=bfloat16,model_parallel=True" \
"examples/tasks/fine_tasks/mcf/th.txt" \
--custom-tasks "src/lighteval/tasks/multilingual/tasks.py" \
--dataset-loading-processes 8 \
--cache-dir "./le_cache" \
--no-use-chat-template \
--override-batch-size 4

When I ran this command, I got the following error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 6.02 GiB. GPU 0 has a total capacity of 39.59 GiB of which 5.52 GiB is free. 
Process 35878 has 674.00 MiB memory in use. Process 32242 has 33.41 GiB memory in use. Of the allocated memory 28.39 GiB is allocated by 
PyTorch, and 3.70 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting 
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
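As a quick sanity check on the figures in the message above, here is a minimal sketch that only redoes the arithmetic. The variable names are illustrative, and nothing here queries the GPU or calls PyTorch; the numbers are copied from the error text.

```python
# Figures copied verbatim from the OutOfMemoryError message above.
MIB_PER_GIB = 1024

total_gib = 39.59                        # GPU 0 total capacity
free_gib = 5.52                          # free when the allocation failed
requested_gib = 6.02                     # size of the failed allocation
proc_35878_gib = 674.00 / MIB_PER_GIB    # reported in MiB in the message
proc_32242_gib = 33.41
reserved_unallocated_gib = 3.70          # held by PyTorch's cache, not in use

# Memory held by the two listed processes, versus total minus free.
used_by_processes_gib = proc_35878_gib + proc_32242_gib
used_by_difference_gib = total_gib - free_gib

# How far the request overshoots the free memory.
shortfall_gib = requested_gib - free_gib

print(f"held by listed processes: {used_by_processes_gib:.2f} GiB")
print(f"total minus free:         {used_by_difference_gib:.2f} GiB")
print(f"request exceeds free by:  {shortfall_gib:.2f} GiB")
print(f"reserved but unallocated: {reserved_unallocated_gib:.2f} GiB")
```

The two listed processes account for essentially all of the used memory on GPU 0, and the failed request overshoots the free memory by about 0.5 GiB, which is less than the 3.70 GiB that PyTorch has reserved but not allocated. That may be why the message suggests the fragmentation setting, although the arithmetic alone cannot show whether it would help.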

The error seems strange to me, because my batch size is very small. Here are the specs of my GPU machine:

Image

My only guess is that it happens because I have 7 datasets to evaluate, but I still have no idea what is going on. Here is the task file:

# mcf.th.txt
# General Knowledge (GK)
lighteval|meta_mmlu_tha_mcf|5|1
lighteval|m3exams_tha_mcf|5|1

# Reading Comprehension (RC)
lighteval|belebele_tha_Thai_mcf|5|1
lighteval|thaiqa_tha|5|1
lighteval|xquad_tha|5|1

# Natural Language Understanding (NLU)
lighteval|community_hellaswag_tha_mcf|5|1
lighteval|xnli2.0_tha_mcf|5|1

Any ideas?

I am using lighteval 0.6.0.dev0 and torch 2.2.2+cu121. I cloned this repo and ran pip install -e .[dev]
