Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Allow JIT compilation with an internal API #61032
ENH: Allow JIT compilation with an internal API #61032
Changes from all commits
7ec827e
bc2a178
8b420cc
6a9ee5a
444de67
7e1e855
58fb30d
c239fc9
9567152
2ff333f
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Numba doesn't support heterogeneous lists but does support heterogeneous tuples. I would say that it is so important to avoid 2D object arrays here that it should go in the first revision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, how is the mapping from column name in "func" to index in the tuple supposed to happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments, and for catching the typos.
Fully agree with your first comment. I don't want to change the existing behavior in the same PR I'm implementing a new interface, but converting the whole dataframe to a single type doesn't seem a great option. The raw parameter and this behavior is as old as pandas, I don't think we would design it this way as of today. It won't still be trivial, as passing
Series
is easy, but the internal data can be different things, in the simple case a single NumPy array, but also other structured such as an Arrow column, two NumPy arrays... I don't think it's trivial to decide how to pass the data in those cases.If I understand you correctly, this is great question, but also a bit tricky. For the simple case, the function is called for every column. In the case the same function needs to be applied to every column, I don't think there is an issue. If the function receives Series (
raw=False
), then this can be done:But when
raw=True
, in general the input of the function will be a numpy array, and it's not known which was the name of the column or the index. Nothing in this PR is preventing the engines from passing extra parameters to the function, but I don't think we should encourage that behavior, since users will in general expect that changing the value of engine doesn't require changing the function, and in particular their signature.This is when a column at a time is passed. When
axis=1
, a row at a time is passed. Whenraw=False
, then a Series is passed, which again is not ideal, since an upcast to probably object will happen. But this is how it works now. Forraw=True
and Numpy objects, the case it's exactly the same as before, just the axis changes.I think historical reasons made the signatures or
.map()
and.apply()
functions inconsistent, and in some cases too complex and with strange behaviors. There have been several improvements, like removingapplymap
for example. But several other things can be improved. I personally think the internal API here makes things simpler, and public APIs should follow the same idea. But this will take several iterations, and I think we don't want to make this PR too complex with any of the many possible improvements. I don't think you are proposing it, I really appreciate your feedback and question. Just commenting how I think we should move forward, given the amount of technical debt and complexity.