A problem when the sample_size is not divisible by the batch_size #5

Open
MounirB opened this issue Sep 10, 2020 · 5 comments

MounirB commented Sep 10, 2020

I adapted the DataGenerator to my deep learning pipeline.
When the sample size is not divisible by the batch_size, the DataGenerator seems to wrap around to the first batch instead of yielding the last (smaller) batch.

Example
Let A be an array of training samples, with batch_size = 4.
A = [4,7,8,7,9,78,8,4,78,51,6,5,1,0]. Here A.size = 14.
Clearly, A.size is not divisible by batch_size in this situation.

The batches the DataGenerator yields during training are the following:

  • Batch_0 = [4,7,8,7]
  • Batch_1 = [9,78,8,4]
  • Batch_2 = [78,51,6,5]
  • Batch_3 = [4,7,8,7]. This is where the problem lies: instead of yielding Batch_3 = [1,0], the generator goes back to the first batch.

Here is a situation where another generator behaves well when the sample_size is not divisible by the batch_size: https://stackoverflow.com/questions/54159034/what-if-the-sample-size-is-not-divisible-by-batch-size-in-keras-model

For your information, I kept the following instruction as is:

int(np.floor(len(self.list_IDs) / self.batch_size))

If I change np.floor to np.ceil, it seems to break during the training/validation phases.
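
For reference, here is a minimal sketch of the __len__ that contains this instruction, assuming the Sequence-based DataGenerator layout quoted later in this thread; switching np.floor to np.ceil only changes how many batches Keras requests per epoch, so __getitem__ and __data_generation also have to be adjusted (see the comments below), which is presumably why the naive swap fails.

import numpy as np

def __len__(self):
    'Number of batches per epoch'
    # np.floor: 14 samples / batch_size 4 -> 3 batches; the last 2 samples are never yielded
    # np.ceil:  14 samples / batch_size 4 -> 4 batches; __getitem__ and __data_generation
    #           must then handle the smaller final batch instead of wrapping around
    return int(np.ceil(len(self.list_IDs) / self.batch_size))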


gwirn commented Apr 14, 2022

As far as I know, the problem lies in __data_generation: it creates X and y of size self.batch_size. This isn't a problem when you use np.floor, because then all batches are the same size, but you lose the last batch when the size of your data is not divisible by the batch_size. With np.ceil, when the data is not divisible, X and y get "filled" with only as much data as the last batch provides, and the rest stays uninitialized. That uninitialized remainder is the problem, because the model can't fit this arbitrary data. To prevent this from happening when using np.ceil, use e.g.
true_size = len(list_IDs_temp)
X = np.empty((true_size, *self.dim, self.n_channels))
y = np.empty((true_size), dtype=int)
instead of the original initialization in

def __data_generation(self, list_IDs_temp):
    'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
    # Initialization
    X = np.empty((self.batch_size, *self.dim, self.n_channels))
    y = np.empty((self.batch_size), dtype=int)

so that you always get a np.empty array of the matching size.
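
Putting that into the method, a sketch of the adjusted __data_generation might look like the following; the loop body is illustrative and assumes samples can be loaded per ID (e.g. from .npy files keyed by ID, with self.labels mapping IDs to integer labels), roughly as in the generator from the blog post.

def __data_generation(self, list_IDs_temp):
    'Generates data containing up to batch_size samples' # X : (n_samples, *dim, n_channels)
    # Initialization: size the arrays to the actual number of IDs in this batch,
    # so a smaller final batch contains no uninitialized rows
    true_size = len(list_IDs_temp)
    X = np.empty((true_size, *self.dim, self.n_channels))
    y = np.empty((true_size,), dtype=int)

    # Generate data
    for i, ID in enumerate(list_IDs_temp):
        # Store sample (illustrative: adapt to however your samples are stored)
        X[i,] = np.load('data/' + ID + '.npy')
        # Store class label
        y[i] = self.labels[ID]

    return X, y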


mbrukman commented Dec 6, 2022

FWIW, I think you have 2 options:

  1. use math.floor(len(...) / batch_size) — this keeps all batch sizes the same, but you skip over the data past the last full-size batch (if you shuffle the indices between every epoch, you'll probably cover them eventually)
  2. use math.ceil(len(...) / batch_size) — if so, you want the last batch to be smaller, rather than index past the end of the array and wrap around to the beginning of the array (or cause out-of-bounds errors)

If you use approach (2), then you have to cap the upper bound of the batch, i.e., instead of:

# Generate indexes of the batch
indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

which can be rewritten as:

low = index * self.batch_size
high = (index+1) * self.batch_size
indexes = self.indexes[low:high]

you may want to use something like:

low = index * self.batch_size
high = min((index + 1) * self.batch_size, len(self.list_IDs))
indexes = self.indexes[low:high]

to cap at the length of the array; the last batch may be smaller
if the total number of items is not a multiple of batch size.

And if you want to micro-optimize, you can replace a multiplication with an addition:

low = index * self.batch_size
high = min(low + self.batch_size, len(self.list_IDs))
indexes = self.indexes[low:high]

I wrote some helper classes (with tests) which can be dropped in to provide the data slicing, or you can incorporate them into your Sequence subclasses directly.
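
For completeness, a sketch of how the capped slice fits into __getitem__, assuming the usual structure of this generator (self.indexes holds shuffled positions into self.list_IDs, and __data_generation sizes X and y to the IDs it receives, as discussed above):

def __getitem__(self, index):
    'Generates one batch of data'
    # Cap the slice at the end of the data so the final batch is simply
    # smaller instead of wrapping around to the start
    low = index * self.batch_size
    high = min(low + self.batch_size, len(self.list_IDs))
    indexes = self.indexes[low:high]

    # Find the list of IDs for this batch
    list_IDs_temp = [self.list_IDs[k] for k in indexes]

    # Generate data (X and y must be sized to len(list_IDs_temp))
    X, y = self.__data_generation(list_IDs_temp)

    return X, y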


gwirn commented Dec 6, 2022

@mbrukman Thank you for the nice indexing suggestion!
I have two questions:

  • Is the calculation of high really necessary, since slicing over the end of an array/list like
a = np.array([1, 2, 3, 4])
print(a[2:6])

would still result in [2 3]

  • Aren't you still creating an X and y that are too big in
    X = np.empty((self.batch_size, *self.dim, self.n_channels))
    y = np.empty((self.batch_size), dtype=int)

    when your last batch is smaller than self.batch_size? The high and low don't change the batch size, so when your last batch is smaller than self.batch_size, X and y will still be filled with (some) entries from the initial np.empty, which is a problem as pointed out here: numpy empty
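
As a small illustration of that point, assuming a batch_size of 4 with only 2 samples left for the final batch: the rows of an np.empty array that never get filled hold arbitrary leftover memory, and would be passed to the model as if they were real samples.

import numpy as np

X = np.empty((4, 3))  # allocated but not initialized: contents are arbitrary
X[:2] = 1.0           # only the first 2 rows receive real data
print(X)              # rows 2 and 3 still contain garbage values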


mbrukman commented Jan 4, 2023

@gwirn wrote:

  • Is the calculation of high really necessary, since slicing over the end of an array/list like
a = np.array([1, 2, 3, 4])
print(a[2:6])

would still result in [2 3]

I think it returns [3, 4], but your point is valid: slicing past the end of a list in Python or a NumPy array is allowed. However, I'm used to computing my array bounds precisely because I use a variety of languages, in some of which indexing outside of array bounds is invalid (and either throws an exception, crashes, or causes undefined behavior), so I aim to be precise everywhere, for consistency (and peace of mind).

  • Aren't you still creating an X and y that are too big in

    X = np.empty((self.batch_size, *self.dim, self.n_channels))
    y = np.empty((self.batch_size), dtype=int)

    when your last batch is smaller than self.batch_size? The high and low don't change the batch size, so when your last batch is smaller than self.batch_size, X and y will still be filled with (some) entries from the initial np.empty, which is a problem as pointed out here: numpy empty

Just to clarify, I am not the one creating the too-big X or y, because this is not my repo and it's not my code; but you're correct: if we're going to make the last batch smaller than the rest, then we also have to fix the initialization to create an array of the correct size.

My implementation of Sequence creates a smaller last batch and doesn't initialize X and y to be of size batch_size; please see the code references I linked to in my earlier response.

I think some of the sample code here and in the blog post needs to be updated to handle these cases; I opened a separate issue #7 to ask about the license for this repo (since there isn't one now) so that we can contribute some fixes for the code.


gwirn commented Jan 5, 2023

@mbrukman Yes, you are right: [3, 4], not [2, 3]. That's a typo.
Thank you for your detailed answer!
