Weird cuda OutOfMemoryError error #561

Closed
@Nevermetyou65

Description


Hi,

I need some help.

I am trying to evaluate my pre-trained model on the Thai FineTasks. Here is my command:

export CUDA_VISIBLE_DEVICES="0,1"
echo "Running lighteval for model: meta-llama/Llama-3.2-3B"
lighteval accelerate \
"pretrained=meta-llama/Llama-3.2-3B,dtype=bfloat16,model_parallel=True" \
"examples/tasks/fine_tasks/mcf/th.txt" \
--custom-tasks "src/lighteval/tasks/multilingual/tasks.py" \
--dataset-loading-processes 8 \
--cache-dir "./le_cache" \
--no-use-chat-template \
--override-batch-size 4

When I ran this command I got this error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 6.02 GiB. GPU 0 has a total capacity of 39.59 GiB of which 5.52 GiB is free. 
Process 35878 has 674.00 MiB memory in use. Process 32242 has 33.41 GiB memory in use. Of the allocated memory 28.39 GiB is allocated by 
PyTorch, and 3.70 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting 
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  
(https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
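Following the allocator hint in the traceback, and since the message shows another process (32242) already holding 33.41 GiB on GPU 0, I was planning to try something like this before rerunning (just a sketch; it assumes nvidia-smi is available on the machine, and I'm not sure it addresses the root cause):

```shell
# Per the error message's suggestion: ask PyTorch's caching allocator to use
# expandable segments, which can reduce fragmentation-related OOMs.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

# The OOM also reports other processes holding GPU memory, so list any
# compute processes still occupying the GPUs before launching the eval.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-compute-apps=pid,used_memory --format=csv
fi
```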

This is kind of strange to me: my batch size is very small. Here are my GPU machine's specs:

[screenshot of GPU specs]

The only thing I can guess is that I have 7 datasets to evaluate, but I still have no idea.

# mcf.th.txt
# General Knowledge (GK)
lighteval|meta_mmlu_tha_mcf|5|1
lighteval|m3exams_tha_mcf|5|1

# Reading Comprehension (RC)
lighteval|belebele_tha_Thai_mcf|5|1
lighteval|thaiqa_tha|5|1
lighteval|xquad_tha|5|1

# Natural Language Understanding (NLU)
lighteval|community_hellaswag_tha_mcf|5|1
lighteval|xnli2.0_tha_mcf|5|1
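To narrow it down, I could also run the task specs one at a time and see which one triggers the OOM. A rough sketch (it assumes lighteval accepts a single suite|task|fewshot|truncation string in place of the .txt file, which is how I read the docs):

```shell
# Run each non-comment, non-blank task spec from the file individually.
TASK_FILE="examples/tasks/fine_tasks/mcf/th.txt"
grep -v '^#' "$TASK_FILE" 2>/dev/null | grep -v '^[[:space:]]*$' | while read -r task; do
  echo "Evaluating: $task"
  lighteval accelerate \
    "pretrained=meta-llama/Llama-3.2-3B,dtype=bfloat16,model_parallel=True" \
    "$task" \
    --custom-tasks "src/lighteval/tasks/multilingual/tasks.py" \
    --no-use-chat-template \
    --override-batch-size 4
done
```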

Any ideas?

I am using lighteval 0.6.0.dev0 and torch 2.2.2+cu121. I cloned this repo and installed it with pip install -e .[dev]
