This repository was archived by the owner on May 4, 2019. It is now read-only.

Compute unary and some binary functions faster for PooledDataArray #127

Open: wants to merge 1 commit into base: master


davidavdav

This PR implements unary and some binary functions more efficiently for PooledDataArray by operating only on the .pool rather than on every entry of the array. The speedup is noticeable for very large arrays, e.g.,

using MFCC ## for warp(): this gives us quantized floating point values
x = warp(randn(100000,10)); ## a relatively slow operation, unfortunately
p = PooledDataArray(x)
@time exp(p);
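
A minimal sketch of the idea, written against a hypothetical `ToyPooled` type rather than the real `PooledDataArray` (the field names `refs` and `pool` mirror the description above; the type, `expand`, and `pooledmap` are assumptions made for illustration). The function is applied once per pool entry and the refs are reused unchanged:

```julia
# Hypothetical stand-in for a pooled array: `pool` holds the distinct values,
# `refs[i]` is the index into `pool` of element i. Not the DataArrays.jl type.
struct ToyPooled{T}
    refs::Vector{Int}
    pool::Vector{T}
end

# Build a pooled representation from a plain vector.
function ToyPooled(x::AbstractVector{T}) where {T}
    pool = unique(x)
    index = Dict(v => i for (i, v) in enumerate(pool))
    return ToyPooled{T}([index[v] for v in x], pool)
end

# Expand back to a plain vector.
expand(p::ToyPooled) = p.pool[p.refs]

# Apply f to the pool only: length(pool) calls instead of length(refs).
pooledmap(f, p::ToyPooled) = ToyPooled(copy(p.refs), map(f, p.pool))

x = [0.5, 1.5, 0.5, 0.5, 1.5]
p = ToyPooled(x)
q = pooledmap(exp, p)   # exp is evaluated twice, not five times
```

Note that `pooledmap` does nothing to keep the new pool free of duplicates, so it is only safe as written when `f` does not map two pool values to the same result.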

@davidavdav changed the title from "Comput unary and some binary functions faster for PooledDataArray" to "Compute unary and some binary functions faster for PooledDataArray" on Oct 16, 2014
@johnmyleswhite
Member

I think this is potentially a really good idea, but I'm a little troubled by breaking the invariant that the pool isn't redundant. Right now, this trick breaks that invariant whenever the applied operator isn't one-to-one.

@nalimilan
Member

> Right now, this trick breaks that invariant whenever the applied operator isn't one-to-one.

You mean, when the function is not pure? This issue was already raised when dealing with sparse arrays, about how structural zeros could be handled when calling e.g. exp(sparray). Regarding PDAs, the performance gain is so massive that this assumption seems hard to avoid.

@simonster
Member

I think the problem @johnmyleswhite is referring to is that, if you run round(@pda [1.01, 1.02]), the pool of the resulting PDA should be [1.] and not [1., 1.]. In fact a lot of operators aren't necessarily one-to-one because of floating point rounding.

@nalimilan
Member

Ah, OK. Basically any function which isn't an injection will have this problem. What you're saying is that even mathematically injective functions might not be injective in practice because of rounding? That would mean for all functions the pool must be checked for duplicates, and compacted if needed -- which comes with a large cost since the whole PDA will need to be adjusted. Hopefully in practice this situation should be rare enough.
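
To make that cost concrete, here is a sketch of the check-and-compact step on plain `refs`/`pool` vectors. The helper `compact` is hypothetical, not a DataArrays.jl function. Detecting duplicates only touches the pool, but repairing them requires rewriting every ref:

```julia
# Hypothetical helper: deduplicate `pool` and remap `refs` accordingly.
# `refs[i]` indexes into `pool`; returns a (refs, pool) pair.
function compact(refs::Vector{Int}, pool::Vector{T}) where {T}
    newpool = unique(pool)
    # Cheap case: the pool is already duplicate-free, nothing to remap.
    if length(newpool) == length(pool)
        return refs, pool
    end
    position = Dict(v => i for (i, v) in enumerate(newpool))
    remap = [position[v] for v in pool]        # old pool index -> new pool index
    return [remap[r] for r in refs], newpool   # full pass over refs
end

# The rounding example from above: 1.01 and 1.02 collapse to the same value.
refs = [1, 2, 1]
pool = map(round, [1.01, 1.02])   # a redundant pool with two equal entries
newrefs, newpool = compact(refs, pool)
```

The duplicate check costs time proportional to the pool size, so it is cheap when the pool is small relative to the array; only the rare collapsing case pays for the pass over `refs`.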

@johnmyleswhite
Member

I actually only meant to refer to the lack of mathematical injectiveness, but @simonster's point that computational injectiveness is even more stringent is a really good one.

For me, the trick here is that this pool request pushes hard on one conception of PDAs that I'd like to eventually see split into a separate package: a compressed array in which repeated values are not represented repeatedly. I'd call this a UniqueArray or a CompressedArray. This is a really useful tool, but, given that its defining quality refers to uniqueness, it seems that one might want to ensure that all operations keep the pool minimal. In that case, this PR needs a slight revision to provide that guarantee.

On the other hand, you have PDAs being used as CategoricalArray objects in statistics. In that context, you also care about the minimality of the pool because the pool defines the valid categories the entries can take on. But you also want to unconditionally prohibit arithmetic operations on categories (since they're categories, not numbers), so you wouldn't need these operations at all.

@davidavdav
Author

> For me, the trick here is that this pool request

Pun intended?

@davidavdav
Author

OK, I agree that the injectiveness wasn't well thought out. I did leave out min(pda, ::Real) for that reason, and I also left out minimum(pda), because I wasn't sure whether the minimum of the pool would actually be used in the refs (the PDA could be a subset or slice, and I assume you don't recompute the pool and refs for that).
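
The concern about `minimum(pda)` can be illustrated on plain vectors. This is a sketch with hypothetical variable names, and it assumes, as above, that a slice keeps its parent's pool. Reducing over the pool alone can return a value that no element refers to, so the reduction has to be restricted to the pool entries that `refs` actually uses:

```julia
pool = [3.0, 1.0, 2.0]
refs = [1, 3, 3, 1]      # a slice that never refers to pool[2] == 1.0

naive = minimum(pool)                    # ignores which entries are in use
used  = unique(refs)                     # one pass over refs, integers only
safe  = minimum(pool[i] for i in used)   # minimum over values that occur
```

Here `naive` picks up the unused pool entry, while `safe` agrees with the minimum of the expanded data.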

Well, this all started as a new type WarpedArray in MFCC, when I realized that the functionality might already be covered by PooledDataArray. But the use of such quantized values in an array is indeed completely different from how PooledDataArrays are used.

@johnmyleswhite
Member

No pun intended. Just typing fast.

I do think we should support something like UniqueArray. If we do, I think we need to formalize its semantics. Does a subset of a UniqueArray have a pool that guarantees uniqueness? Or does it guarantee connection to the pool for the whole? Is the pool always minimal? Can you include values in the pool that don't occur in the data?

@nalimilan
Member

If the goal is to use numeric values in UniqueArray, then for efficiency it may be better to allow the pool to contain duplicate values, so that the whole vector need not be traversed whenever applying a (computationally) non-injective function generates duplicates.

The question of whether to allow keeping unused values in the pool is less specific to UniqueArray, so I'm going to present the same argument as for PDAs: if you want array views without any allocation, then you cannot modify the pool when slicing, thus unused values must be allowed.
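
A small sketch of that argument on plain vectors (the names are illustrative, and `view` here is Julia's `Base.view`, not a pooled-array API): slicing only the refs avoids copying, and the shared pool then necessarily contains values the slice never uses.

```julia
pool = ["a", "b", "c"]
refs = [1, 2, 3, 1, 2]

subrefs = view(refs, 1:2)   # no copy of refs; the pool is shared untouched
vals    = pool[subrefs]     # values of the slice; "c" stays in the pool, unused
```

Shrinking the pool to the used values would mean allocating a new pool and renumbering the refs, which is exactly the work a view is meant to avoid.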

nalimilan added a commit that referenced this pull request Jul 3, 2016
…rom Julia (#127)

Update splice!() to reflect changes in the corresponding Base function.
This fixes the tests on recent Julia 0.5 master.