Fix TableScan.update to exclude cached properties #2178

Merged
merged 6 commits into from
Jul 18, 2025

Conversation

@smaheshwar-pltr
Contributor

smaheshwar-pltr commented Jul 7, 2025

Rationale for this change

Closes #2179.

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes, the scenario shown in the test and described in the issue now works.

return type(self)(**{**self.__dict__, **overrides})
data = {**self.__dict__, **overrides}

# Cached properties are also stored in the __dict__, so must be removed
Contributor Author

Note that methods decorated with property (not cached_property) aren't stored in __dict__.
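
A minimal sketch of the distinction made in this comment, using a hypothetical `Scan` class rather than pyiceberg's `TableScan`: a `cached_property` writes its value into the instance `__dict__` on first access, a plain `property` does not, and the cached entry then breaks a constructor call built from `__dict__`.

```python
from functools import cached_property


class Scan:
    """Hypothetical stand-in for a table scan; not the pyiceberg class."""

    def __init__(self, limit: int) -> None:
        self.limit = limit

    @property
    def doubled(self) -> int:
        # Recomputed on every access; never stored on the instance.
        return self.limit * 2

    @cached_property
    def tripled(self) -> int:
        # Computed once, then stored in the instance __dict__ as "tripled".
        return self.limit * 3


scan = Scan(limit=10)
keys_before = set(scan.__dict__)  # only the constructor argument

scan.doubled
scan.tripled
keys_after = set(scan.__dict__)  # now also holds the cached value

# Rebuilding from __dict__ passes "tripled" to __init__, which rejects it.
error = None
try:
    Scan(**scan.__dict__)
except TypeError as exc:
    error = exc
```

Here `keys_after` gains `tripled` but never `doubled`, which is why only cached properties need special handling.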

Contributor Author

I don't think this is the nicest solution, though it feels like a minimally viable one

Contributor Author

See #2178 (comment), I'm no longer convinced

@smaheshwar-pltr smaheshwar-pltr marked this pull request as ready for review July 7, 2025 15:25
@jayceslesar
Contributor

You could add **kwargs to the TableScan constructor too, right, as a way to "fix" this? I don't really think there is a good way to deal with this hahahaha

@smaheshwar-pltr
Contributor Author

you could add **kwargs in the TableScan constructor too right as a way to "fix"?

Yes, I think you're right.

I was a bit hesitant to change the constructor. Technically, subclasses would also need to take **kwargs in their own constructors to avoid this issue, which feels easy to miss (though admittedly, without #2031, only DataScan subclasses TableScan). But inspection doesn't feel nice either 🤔
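
To illustrate the concern, here is a rough sketch with hypothetical classes (`BaseScan` and `StrictScan` are made up for this example and are not pyiceberg's classes). The base constructor swallows unknown keyword arguments, so `update` works there, but a subclass that forgets `**kwargs` reintroduces the failure.

```python
from functools import cached_property
from typing import Any


class BaseScan:
    def __init__(self, limit: int = 0, **kwargs: Any) -> None:
        # Unrecognised keyword arguments (e.g. cached values) are ignored.
        self.limit = limit

    @cached_property
    def plan(self) -> list[int]:
        return list(range(self.limit))

    def update(self, **overrides: Any) -> "BaseScan":
        return type(self)(**{**self.__dict__, **overrides})


class StrictScan(BaseScan):
    # Subclass whose constructor does not accept **kwargs.
    def __init__(self, limit: int = 0, row_filter: str = "true") -> None:
        super().__init__(limit=limit)
        self.row_filter = row_filter


base = BaseScan(limit=3)
base.plan  # populate the cache, adding "plan" to __dict__
updated = base.update(limit=5)  # works: "plan" is swallowed by **kwargs

strict = StrictScan(limit=3)
strict.plan
try:
    strict.update(limit=5)  # "plan" reaches StrictScan.__init__
    strict_failed = False
except TypeError:
    strict_failed = True
```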

@jayceslesar
Contributor

Maybe it's fine to duplicate the update method in the subclasses and leave a comment about why it's there for the cached properties?

@smaheshwar-pltr
Contributor Author

Maybe its fine to duplicate the update method in the subclasses and leave a comment about why its there for the cached properties?

Yeah, I think maybe we should just make a larger change here, and having subclasses implement update sounds good. I'll need to think about how to go about that, though.

I realised that the current PR (that removes only cached properties) doesn't guard against the case where a subclass has a private member.
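
A small sketch of that gap, under stated assumptions: the classes are hypothetical, and the way cached properties are stripped here (checking the class attribute with `isinstance`) is only a guess at what an approach that removes cached properties might look like, not the PR's actual code.

```python
from functools import cached_property
from typing import Any


class BaseScan:
    def __init__(self, limit: int = 0) -> None:
        self.limit = limit

    @cached_property
    def plan(self) -> list[int]:
        return list(range(self.limit))

    def update(self, **overrides: Any) -> "BaseScan":
        data = {**self.__dict__, **overrides}
        # Drop entries that were written by a cached_property (assumed approach).
        for name in list(data):
            if isinstance(getattr(type(self), name, None), cached_property):
                del data[name]
        return type(self)(**data)


class CountingScan(BaseScan):
    def __init__(self, limit: int = 0) -> None:
        super().__init__(limit)
        self._calls = 0  # private member, not a constructor parameter


base = BaseScan(limit=3)
base.plan
base_updated = base.update(limit=5)  # fine: only "plan" had to be removed

counting = CountingScan(limit=3)
try:
    counting.update(limit=5)  # "_calls" is still passed to __init__
    counting_failed = False
except TypeError:
    counting_failed = True
```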

@kevinjqliu
Contributor

interesting! good catch.

I think implicitly the update function only allows updating the constructor parameters. the cached_property is a side-effect of using self.__dict__

what if we just filter for only the constructor params? wdyt?

import inspect
from typing import Any

def update(self: S, **overrides: Any) -> S:
    """Create a copy of this table scan with updated fields."""
    # Keep only attributes named after __init__ parameters; this drops cached
    # properties and any other non-argument members stored in __dict__.
    init_params = inspect.signature(type(self).__init__).parameters
    valid_keys = set(init_params) - {"self"}
    filtered_dict = {k: v for k, v in self.__dict__.items() if k in valid_keys}
    return type(self)(**{**filtered_dict, **overrides})

@smaheshwar-pltr
Contributor Author

smaheshwar-pltr commented Jul 10, 2025

what if we just filter for only the constructor params? wdyt?

Yes, I thought about this and think it's a good idea! We achieve something similar to that too with @jayceslesar's **kwargs idea that I think is also worth considering.

Was also toying with a polymorphic solution (#2199) that doesn't assume that init parameters and members are named the same for TableScan's subclasses. (Hat-tip to #2178 (comment))

@smaheshwar-pltr smaheshwar-pltr changed the title from "Remove cached properties before updating table scans" to "Allow updating table scans with cached properties and non-argument members" Jul 10, 2025
@smaheshwar-pltr
Contributor Author

After some thought, we can maybe narrow down to a few approaches:

  1. Inspect the constructor of the subclass and match attribute names to parameter names; I've implemented this here now (9edb166), in the spirit of #2178 (comment)
  2. Lift the attribute name matching assumption by introducing a separate arguments property that subclasses can override: #2199 ([Draft] Fix TableScan updating with arguments method)
  3. Use **kwargs as in #2178 (comment) so the constructor ignores unrecognised arguments
  4. Spitballing: a larger change to the design to make update more fluid (when reading the code, my mind goes to https://docs.python.org/3/library/dataclasses.html#dataclasses.replace), though this might be hard, and even harder to do in a non-breaking way.

@kevinjqliu @jayceslesar, happy to take suggestions / hear thoughts! The first three work; I'm leaning towards (1) currently.
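
For reference, approach (4) might look something like the sketch below. It assumes a hypothetical frozen dataclass `Scan`; `TableScan` is not a dataclass in this discussion, so this is only an illustration of how `dataclasses.replace` sidesteps the problem: it copies declared fields only, so a cached value in `__dict__` is never forwarded to the constructor.

```python
from dataclasses import dataclass, replace
from functools import cached_property


@dataclass(frozen=True)
class Scan:
    """Hypothetical dataclass-based scan; not how TableScan is defined."""

    row_filter: str = "true"
    limit: int = 0

    @cached_property
    def plan(self) -> list[int]:
        # Not annotated as a field, so dataclasses ignores it.
        return list(range(self.limit))


scan = Scan(limit=3)
scan.plan  # cached on the original instance

# replace() rebuilds the object from its declared fields plus the overrides.
updated = replace(scan, limit=5)
```

The copy starts with an empty cache and recomputes `plan` from its own `limit` on first access.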

Comment on lines +1696 to +1697
params = signature(type(self).__init__).parameters.keys() - {"self"} # Skip "self" parameter
kwargs = {param: getattr(self, param) for param in params} # Assume parameters are attributes
Contributor Author

@kevinjqliu, I wrote things slightly differently to #2178 (comment), LMKWYT of this.

Preferred getattr over self.__dict__ since it feels less low-level.
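
One practical difference between the two lookups, sketched with a hypothetical `Scan` class: when a constructor argument is stored under a different attribute name and exposed through a property, filtering `__dict__` silently drops it, while `getattr` still finds it. The final `type(self)(**{**kwargs, **overrides})` call is an assumption about how the two quoted lines are used.

```python
from inspect import signature
from typing import Any


class Scan:
    def __init__(self, limit: int = 0) -> None:
        self._limit = limit  # stored under a different name than the parameter

    @property
    def limit(self) -> int:
        return self._limit

    def update_via_dict(self, **overrides: Any) -> "Scan":
        valid = set(signature(type(self).__init__).parameters) - {"self"}
        # "limit" is not a key of __dict__ (only "_limit" is), so it is dropped.
        kept = {k: v for k, v in self.__dict__.items() if k in valid}
        return type(self)(**{**kept, **overrides})

    def update_via_getattr(self, **overrides: Any) -> "Scan":
        params = signature(type(self).__init__).parameters.keys() - {"self"}
        # getattr goes through the property, so the value is preserved.
        kwargs = {param: getattr(self, param) for param in params}
        return type(self)(**{**kwargs, **overrides})


scan = Scan(limit=7)
via_dict = scan.update_via_dict()        # limit falls back to the default
via_getattr = scan.update_via_getattr()  # limit is carried over
```

If no attribute or property named after the parameter existed at all, the `getattr` version would raise `AttributeError` instead of silently resetting the value.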

Contributor

LGTM! This is great. Let's add some comments to explain why we're doing this.

Contributor Author

Good point! 776c433

@kevinjqliu
Contributor

Thanks for the thoughtful explanation @jayceslesar

I prefer option 1 too. We're making the intent of **self.__dict__ more explicit by only using the parameters of __init__.

Option 2 introduces some maintenance burden; we'd have to remember to update _arguments whenever a new param is added to __init__.
As you pointed out, option 3 requires subclasses to follow the same signature. A subclass that omits **kwargs would bypass it.
Option 4 is interesting. Pydantic's model_dump ignores properties by default.
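
A rough sketch of what option 2 could look like, to make the maintenance burden concrete. Everything here is hypothetical (`BaseScan`, `FilteredScan`, and the shape of `_arguments` are made up and not taken from #2199): each class lists its constructor arguments explicitly, so attribute names no longer have to match parameter names, but every new `__init__` parameter must also be added to `_arguments`.

```python
from functools import cached_property
from typing import Any


class BaseScan:
    def __init__(self, limit: int = 0) -> None:
        self.limit = limit

    @cached_property
    def plan(self) -> list[int]:
        return list(range(self.limit))

    @property
    def _arguments(self) -> dict[str, Any]:
        # Explicit list of constructor arguments; cached values never appear here.
        return {"limit": self.limit}

    def update(self, **overrides: Any) -> "BaseScan":
        return type(self)(**{**self._arguments, **overrides})


class FilteredScan(BaseScan):
    def __init__(self, limit: int = 0, row_filter: str = "true") -> None:
        super().__init__(limit)
        self._row_filter = row_filter  # attribute name differs from the parameter

    @property
    def _arguments(self) -> dict[str, Any]:
        # Must be kept in sync with __init__ by hand.
        return {**super()._arguments, "row_filter": self._row_filter}


scan = FilteredScan(limit=3, row_filter="x > 1")
scan.plan  # populate the cache
updated = scan.update(limit=5)
```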

Contributor

@kevinjqliu kevinjqliu left a comment

I'm in favor of option 1, the current approach.
Looks like there's a merge conflict too.

@smaheshwar-pltr
Contributor Author

smaheshwar-pltr commented Jul 18, 2025

@kevinjqliu, merged and added comment! Are we good to merge?

@kevinjqliu kevinjqliu changed the title from "Allow updating table scans with cached properties and non-argument members" to "Fix TableScan.update to exclude cached properties" Jul 18, 2025
@kevinjqliu kevinjqliu merged commit 5c4b3b4 into apache:main Jul 18, 2025
10 checks passed
@kevinjqliu
Contributor

Thanks for fixing this bug, @smaheshwar-pltr. And thanks for the great discussion and review, @jayceslesar.

Development

Successfully merging this pull request may close these issues.

[BUG] scan.filter after reading it as an Arrow table throws
3 participants