Compute unary and some binary functions faster for PooledDataArray #127

davidavdav · 2014-10-16T18:41:51Z

This PR implements unary and some binary functions more efficiently for PooledDataArray by operating only on the .pool rather than on every entry of the array. The difference is noticeable for very large arrays, e.g.,

using MFCC ## for warp(): this gives us quantized floating point values
x = warp(randn(100000,10)); ## a relatively slow operation, unfortunately
p = PooledDataArray(x)
@time exp(p);
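The idea can be sketched without DataArrays at all. The `Pooled` type and the `expand` and `mappool` helpers below are hypothetical stand-ins written for illustration, not the PR's code or the package API; the sketch only shows why applying a function to the pool costs `length(pool)` calls instead of `length(refs)`.

```julia
# Hypothetical minimal stand-in for a pooled array; NOT the DataArrays type.
struct Pooled{T}
    refs::Vector{Int}   # for every element, its index into pool
    pool::Vector{T}     # the distinct values
end

# Build a Pooled from a plain vector by collecting its distinct values.
function Pooled(x::AbstractVector{T}) where {T}
    pool = unique(x)
    index = Dict(v => i for (i, v) in enumerate(pool))
    return Pooled{T}([index[v] for v in x], pool)
end

# Expand back to a plain vector.
expand(p::Pooled) = p.pool[p.refs]

# Apply f to the pool only; refs are copied unchanged.
mappool(f, p::Pooled) = Pooled(copy(p.refs), map(f, p.pool))

x = rand([0.25, 0.5, 0.75, 1.0], 100_000)  # heavily quantized data
p = Pooled(x)
q = mappool(exp, p)                        # at most 4 calls to exp
```

Note that `mappool` does nothing to keep the resulting pool free of duplicates, so it is only safe as written when `f` maps distinct pool values to distinct results.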

johnmyleswhite · 2014-10-20T00:46:45Z

I think this is potentially a really good idea, but I'm a little troubled by breaking the invariant that the pool isn't redundant. Right now, this trick breaks that invariant whenever the applied operator isn't one-to-one.

nalimilan · 2014-10-20T07:35:36Z

> Right now, this trick breaks that invariant whenever the applied operator isn't one-to-one.

You mean, when the function is not pure? This issue was already raised when dealing with sparse arrays, about how structural zeros could be handled when calling e.g. exp(sparray). Regarding PDAs, the performance gain is so massive that this assumption seems hard to avoid.

simonster · 2014-10-20T12:59:47Z

I think the problem @johnmyleswhite is referring to is that, if you run round(@pda [1.01, 1.02]), the pool of the resulting PDA should be [1.] and not [1., 1.]. In fact a lot of operators aren't necessarily one-to-one because of floating point rounding.
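A small sketch of both cases, using plain vectors as a stand-in for the pool and refs (the variable names and the `has_duplicates` helper are made up for illustration): `round` is not injective mathematically, and `exp` stops being injective in floating point once its results underflow.

```julia
# Plain-vector stand-ins for the pool and refs of a pooled array.
pool = [1.01, 1.02]
refs = [1, 2, 2, 1]

# Hypothetical helper: true when a pool holds the same value more than once.
has_duplicates(v) = length(unique(v)) < length(v)

# round is not one-to-one: both pool entries collapse to 1.0.
rounded_pool = map(round, pool)

# exp is one-to-one mathematically, but both results underflow to 0.0.
underflow_pool = map(exp, [-800.0, -801.0])

# The expanded data is still correct; only the pool invariant is broken.
rounded_data = rounded_pool[refs]
```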

nalimilan · 2014-10-20T13:14:44Z

Ah, OK. Basically any function which isn't an injection will have this problem. What you're saying is that even mathematically injective functions might not be injective in practice because of rounding? That would mean for all functions the pool must be checked for duplicates, and compacted if needed -- which comes with a large cost since the whole PDA will need to be adjusted. Hopefully in practice this situation should be rare enough.
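A rough sketch of what such a check-and-compact step could look like, again on plain vectors; `needs_compaction` and `compact` are hypothetical names, not functions from DataArrays. The check touches only the pool, while the compaction itself has to rewrite every ref, which is where the cost comes from.

```julia
# Cheap check on the pool alone, so the expensive pass can be skipped.
needs_compaction(pool) = !allunique(pool)

# Deduplicate the pool and remap refs to the new pool positions.
function compact(refs::Vector{Int}, pool::Vector{T}) where {T}
    newpool = T[]
    position = Dict{T,Int}()                  # value => index in newpool
    remap = Vector{Int}(undef, length(pool))  # old pool index => new pool index
    for (i, v) in enumerate(pool)
        remap[i] = get!(position, v) do
            push!(newpool, v)
            length(newpool)
        end
    end
    # This pass is O(length(refs)): the whole array has to be adjusted.
    return remap[refs], newpool
end

refs = [1, 2, 3, 2]
pool = [1.0, 1.0, 2.0]   # e.g. the result of round on [1.01, 1.02, 2.0]
newrefs, newpool = compact(refs, pool)
```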

johnmyleswhite · 2014-10-20T14:47:30Z

I actually only meant to refer to the lack of mathematical injectiveness, but @simonster's point that computational injectiveness is even more stringent is a really good one.

For me, the trick here is that this pool request pushes hard on one conception of PDA's that I'd like to eventually see split into a separate package: a compressed array in which repeated values are not represented repeatedly. I'd call this a UniqueArray or a CompressedArray. This is a really useful tool, but, given that its defining quality refers to uniqueness, it seems that one might want to ensure that all operations keep the pool minimal. In that case, this PR needs a slight revision to provide that guarantee.

On the other hand, you have PDA's being used as CategoricalArray objects in statistics. In that context, you also care about the minimality of the pool because the pool defines the valid categories the entries can take on. But you also want to unconditionally prohibit arithmetic operations on categories (since they're categories, not numbers), so you wouldn't need these operations at all.

davidavdav · 2014-10-21T08:20:58Z

> For me, the trick here is that this pool request

pun intended?

davidavdav · 2014-10-21T08:35:36Z

OK, I agree that the injectiveness wasn't well thought out. I did leave out min(pda, ::Real) for that reason, and I also left out minimum(pda), because I wasn't sure if the maximum of the pool would actually also be used in the refs. (the pda could be a subset / slice, I assume you don't recompute the pool and refs for that).
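To illustrate the concern about reductions, here is a sketch on plain vectors, assuming (as the comment does) that a slice keeps the full pool. The names are illustrative only. A reduction over the pool can return a value that no ref points to, so it has to be restricted to the entries that are in use.

```julia
pool = [3.0, 1.0, 2.0]
refs = [1, 3, 3]          # a slice that never references pool[2] == 1.0

# Reducing over the whole pool picks up the unused entry.
min_over_pool = minimum(pool)

# Mark the pool entries that actually occur, then reduce over those only.
used = falses(length(pool))
used[refs] .= true
min_over_data = minimum(pool[used])
```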

Well, this all started as a new type WarpedArray in MFCC, when I realized that the functionality might have been covered by PooledDataArray. But the use of such quantized values in an array is completely different from PooledDataArrays indeed.

johnmyleswhite · 2014-10-21T16:10:25Z

No pun intended. Just typing fast.

I do think we should support something like UniqueArray. If we do, I think we need to formalize its semantics. Does a subset of a UniqueArray have a pool that guarantees uniqueness? Or does it guarantee connection to the pool for the whole? Is the pool always minimal? Can you include values in the pool that don't occur in the data?

nalimilan · 2014-10-21T16:34:05Z

If the goal is to use numeric values in UniqueArray, then for efficiency it may be better to allow the pool to contain duplicate values, i.e. to avoid a pass over the whole vector whenever applying a (computationally) non-injective function generates duplicate values.

The question of whether to allow keeping unused values in the pool is less specific to UniqueArray, so I'm going to present the same argument as for PDAs: if you want array views without any allocation, then you cannot modify the pool when slicing, thus unused values must be allowed.
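The slicing argument can be sketched with plain vectors and Base's `view`; the variable names are illustrative and this is not how DataArrays implements slicing. A view of the refs shares memory with the parent and reuses the pool untouched, so pool entries the slice never references simply remain.

```julia
pool = [10.0, 20.0, 30.0]
refs = [1, 1, 3, 3, 2]

# A view of the refs: no copy, and the pool is shared as-is.
sub_refs = view(refs, 3:4)
sub_values = pool[sub_refs]

# Pool indices that the slice never uses; they stay in the shared pool.
unused = setdiff(eachindex(pool), sub_refs)
```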


Comput unary and some binary functions faster for PooledDataArray

ef188c8

davidavdav changed the title ~~Comput unary and some binary functions faster for PooledDataArray~~ Compute unary and some binary functions faster for PooledDataArray Oct 16, 2014

nalimilan added a commit that referenced this pull request Jul 3, 2016

Adapt to removal of _deleteat_(beg/end)!() and _growat_(beg/end)!() f…

9428f16

…rom Julia (#127) Update splice!() to reflect changes in the corresponding Base function. This fixes the tests on recent Julia 0.5 master.
