A problem when the sample_size is not divisible by the batch_size #5

Open
MounirB opened this issue Sep 10, 2020 · 5 comments

MounirB commented Sep 10, 2020

I adapted the DataGenerator to my deep learning pipeline.
When the sample size is not divisible by the batch_size, the DataGenerator seems to wrap around to the first batch instead of yielding the last (smaller) batch.

Example
Let A be an array of training samples, with batch_size = 4.
A = [4,7,8,7,9,78,8,4,78,51,6,5,1,0]. Here A.size = 14.
Clearly, A.size is not divisible by batch_size in this situation.

The batches the DataGenerator yields during training are the following:

  • Batch_0 = [4,7,8,7]
  • Batch_1 = [9,78,8,4]
  • Batch_2 = [78,51,6,5]
  • Batch_3 = [4,7,8,7]. This is where the problem lies: instead of yielding Batch_3 = [1,0], the generator goes back to the first batch.

Here is a situation where another generator behaves well when the sample_size is not divisible by the batch_size: https://stackoverflow.com/questions/54159034/what-if-the-sample-size-is-not-divisible-by-batch-size-in-keras-model

For your information, I kept the following instruction as is:

int(np.floor(len(self.list_IDs) / self.batch_size))

If I change np.floor to np.ceil, it seems to break during the training/validation phases.
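
For reference, here is a minimal sketch of the __len__ that contains this instruction, assuming the Sequence-based DataGenerator layout quoted later in this thread; switching np.floor to np.ceil only changes how many batches Keras requests per epoch, so __getitem__ and __data_generation also have to be adjusted (see the comments below), which is presumably why the naive swap fails.

import numpy as np

def __len__(self):
    'Number of batches per epoch'
    # np.floor: 14 samples / batch_size 4 -> 3 batches; the last 2 samples are never yielded
    # np.ceil:  14 samples / batch_size 4 -> 4 batches; __getitem__ and __data_generation
    #           must then handle the smaller final batch instead of wrapping around
    return int(np.ceil(len(self.list_IDs) / self.batch_size))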


gwirn commented Apr 14, 2022

As far as I know, the problem lies in __data_generation: it creates X and y of size self.batch_size. This isn't a problem when you use np.floor, because then all batches are the same size, but you lose the last batch when the size of your data is not divisible by the batch_size. With np.ceil, when the data is not divisible, X and y get "filled" with only as much data as the last batch provides, and the rest stays uninitialized. That uninitialized remainder is the problem, because the model can't fit this arbitrary data. To prevent this from happening when using np.ceil, use e.g.
true_size = len(list_IDs_temp)
X = np.empty((true_size, *self.dim, self.n_channels))
y = np.empty((true_size), dtype=int)
instead of the original initialization in

def __data_generation(self, list_IDs_temp):
    'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
    # Initialization
    X = np.empty((self.batch_size, *self.dim, self.n_channels))
    y = np.empty((self.batch_size), dtype=int)

so that you always get a np.empty array of the matching size.
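
Putting that into the method, a sketch of the adjusted __data_generation might look like the following; the loop body is illustrative and assumes samples can be loaded per ID (e.g. from .npy files keyed by ID, with self.labels mapping IDs to integer labels), roughly as in the generator from the blog post.

def __data_generation(self, list_IDs_temp):
    'Generates data containing up to batch_size samples' # X : (n_samples, *dim, n_channels)
    # Initialization: size the arrays to the actual number of IDs in this batch,
    # so a smaller final batch contains no uninitialized rows
    true_size = len(list_IDs_temp)
    X = np.empty((true_size, *self.dim, self.n_channels))
    y = np.empty((true_size,), dtype=int)

    # Generate data
    for i, ID in enumerate(list_IDs_temp):
        # Store sample (illustrative: adapt to however your samples are stored)
        X[i,] = np.load('data/' + ID + '.npy')
        # Store class label
        y[i] = self.labels[ID]

    return X, y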


mbrukman commented Dec 6, 2022

FWIW, I think you have 2 options:

  1. use math.floor(len(...) / batch_size) — this keeps all batch sizes the same, but you skip over the data past the last full-size batch (if you shuffle the indices between every epoch, you'll probably cover them eventually)
  2. use math.ceil(len(...) / batch_size) — if so, you want the last batch to be smaller, rather than index past the end of the array and wrap around to the beginning of the array (or cause out-of-bounds errors)

If you use approach (2), then you have to cap the upper bound of the batch, i.e., instead of:

# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

which can be rewritten as:

low = index * self.batch_size
high = (index+1) * self.batch_size
indexes = self.indexes[low:high]

you may want to use something like:

low = index * self.batch_size
high = min((index + 1) * self.batch_size, len(self.list_IDs))
indexes = self.indexes[low:high]

to cap at the length of the array; the last batch may be smaller
if the total number of items is not a multiple of batch size.

And if you want to micro-optimize, you can replace a multiplication with an addition:

low = index * self.batch_size
high = min(low + self.batch_size, len(self.list_IDs))
indexes = self.indexes[low:high]

I wrote some helper classes (with tests) which can be dropped in to provide the data slicing, or you can incorporate them into your Sequence subclasses directly.
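
For completeness, a sketch of how the capped slice fits into __getitem__, assuming the usual structure of this generator (self.indexes holds shuffled positions into self.list_IDs, and __data_generation sizes X and y to the IDs it receives, as discussed above):

def __getitem__(self, index):
    'Generates one batch of data'
    # Cap the slice at the end of the data so the final batch is simply
    # smaller instead of wrapping around to the start
    low = index * self.batch_size
    high = min(low + self.batch_size, len(self.list_IDs))
    indexes = self.indexes[low:high]

    # Find the list of IDs for this batch
    list_IDs_temp = [self.list_IDs[k] for k in indexes]

    # Generate data (X and y must be sized to len(list_IDs_temp))
    X, y = self.__data_generation(list_IDs_temp)

    return X, y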


gwirn commented Dec 6, 2022

@mbrukman Thank you for the nice indexing suggestion!
I have two questions:

  • Is the calculation of high really necessary, since slicing over the end of an array/list like
a = np.array([1, 2, 3, 4])
print(a[2:6])

would still result in [2 3]

  • Aren't you still creating an X and y that are too big in
    X = np.empty((self.batch_size, *self.dim, self.n_channels))
    y = np.empty((self.batch_size), dtype=int)

    when your last batch is smaller than self.batch_size? The high and low don't change the batch size, so when your last batch is smaller than self.batch_size, X and y will still be filled with (some) entries from the initial np.empty, which is a problem as pointed out here: numpy empty
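
As a small illustration of that point, assuming a batch_size of 4 with only 2 samples left for the final batch: the rows of an np.empty array that never get filled hold arbitrary leftover memory, and would be passed to the model as if they were real samples.

import numpy as np

X = np.empty((4, 3))  # allocated but not initialized: contents are arbitrary
X[:2] = 1.0           # only the first 2 rows receive real data
print(X)              # rows 2 and 3 still contain garbage values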


mbrukman commented Jan 4, 2023

@gwirn wrote:

  • Is the calculation of high really necessary, since slicing over the end of an array/list like
a = np.array([1, 2, 3, 4])
print(a[2:6])

would still result in [2 3]

I think it returns [3, 4], but your point is valid: slicing past the end of a list in Python or a NumPy array is allowed. However, I'm used to computing my array bounds precisely because I use a variety of languages, in some of which indexing outside of array bounds is invalid (and either throws an exception, crashes, or causes undefined behavior), so I aim to be precise everywhere, for consistency (and peace of mind).

  • Aren't you still creating an X and y that are too big in

    X = np.empty((self.batch_size, *self.dim, self.n_channels))
    y = np.empty((self.batch_size), dtype=int)

    when your last batch is smaller than self.batch_size? The high and low don't change the batch size, so when your last batch is smaller than self.batch_size, X and y will still be filled with (some) entries from the initial np.empty, which is a problem as pointed out here: numpy empty

Just to clarify, I am not the one creating the too-big X or y, because this is not my repo and it's not my code; but you're correct: if we're going to make the last batch smaller than the rest, then we also have to fix the initialization to create an array of the correct size.

My implementation of Sequence creates a smaller last batch and doesn't initialize X and y to be of size batch_size; please see the code references I linked to in my earlier response.

I think some of the sample code here and in the blog post needs to be updated to handle these cases; I opened a separate issue #7 to ask about the license for this repo (since there isn't one now) so that we can contribute some fixes for the code.


gwirn commented Jan 5, 2023

@mbrukman Yes, you are right: [3, 4], not [2, 3]. That's a typo.
Thank you for your detailed answer!
