
ENH: more joins for DataFrame.update #21855

@h-vetinari

This should be non-controversial, as even the source code for DataFrame.update literally says (https://github.com/pandas-dev/pandas/blob/v0.23.3/pandas/core/frame.py#L5054):
# TODO: Support other joins
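For reference, here is a minimal illustration of the current left-join-only behaviour. The frames `df1` and `df2` are made up for this example. Labels that exist only in `df2` (index `3`, column `b`) are silently dropped, and the call modifies `df1` in place.

```python
import pandas as pd

# Illustrative frames (not from the pandas source or test suite).
df1 = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
df2 = pd.DataFrame(
    {"a": [20.0, 30.0, 40.0], "b": [5.0, 6.0, 7.0]},
    index=[1, 2, 3],
)

# update() works in place and returns None.
returned = df1.update(df2)

# Only labels already present in df1 survive: the index stays [0, 1, 2]
# and the columns stay ["a"]; rows 1 and 2 take their values from df2.
print(returned)
print(df1)
```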

I looked for an existing issue on this, but did not find one.

Some thoughts that arise:

  • default for join should clearly be 'left'
  • df1.update(df2, join='right', overwrite=some_boolean) would be the same as df2.update(df1, join='left', overwrite=not some_boolean). IMO this is not a terrible redundancy, as it allows each user to choose the order that more easily fits their thought pattern.
  • df1.combine_first(df2) would be the same as df1.update(df2, join='outer', overwrite=False), except that combine_first has far fewer options and controls (i.e. it lacks filter_func and raise_conflict). Actually, I'd very much like to deprecate combine_first before pandas 1.0. The only difference is that update returns None, which IMO should be changed as well. Relevant xrefs: "ENH: add inplace-kwarg to update" (#21858) and "DEPR: combine_first (replace with update(..., join='outer'); for both Series/DF)" (#21859).
  • this should IMO also support a way to control which axes are joined in what way (edit: the below was the original proposal; better variants are discussed in a later comment on this issue, #21855).
    • The first way that came to mind would be with an axis=0|1|None-keyword, like in DataFrame.align. However, upon further investigation, I don't believe this to be a good choice, as anything other than axis=None would implicitly have to choose a join for the other axis to actually decide the index/columns of the result.
    • Since "explicit is better than implicit", I'd like to propose a version with just one kwarg, namely:
    join=['left', 'left']  # same as 'left' (and so on for 'inner'|'outer'|'right')
    join=['left', 'inner'] | ['left', 'outer'] etc. (for all other 12 combinations)
    
    • I'd say list and tuple would be reasonable to allow as containers, but not more.
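
To make the proposed semantics concrete, here is a rough sketch that emulates them on top of the existing left-join `update`. The helper names `update_with_join` and `_join_labels` are hypothetical and not part of pandas. The sketch returns a new frame rather than modifying in place, and it assumes that `join` is either a single string or a two-element list/tuple of `[index_join, columns_join]`. It reindexes the caller to the joined labels and then lets the current `update` do the rest.

```python
import pandas as pd


def _join_labels(left, right, how):
    """Combine two Index objects according to a join type (hypothetical helper)."""
    if how == "left":
        return left
    if how == "right":
        return right
    if how == "inner":
        return left.intersection(right)
    if how == "outer":
        return left.union(right)
    raise ValueError(f"unknown join type: {how!r}")


def update_with_join(df, other, join="left", overwrite=True):
    """Sketch of the proposal; returns a new frame instead of None."""
    if isinstance(join, str):
        join = (join, join)  # 'left' is shorthand for ['left', 'left']
    elif not isinstance(join, (list, tuple)) or len(join) != 2:
        raise TypeError("join must be a str or a list/tuple of length 2")
    how_index, how_columns = join

    index = _join_labels(df.index, other.index, how_index)
    columns = _join_labels(df.columns, other.columns, how_columns)

    # Reshape the caller to the joined labels, then apply the existing
    # left-join update, which now sees every label it needs.
    result = df.reindex(index=index, columns=columns).copy()
    result.update(other, overwrite=overwrite)
    return result


# Illustrative frames (made up for this example).
df1 = pd.DataFrame({"a": [1.0, float("nan"), 3.0]}, index=[0, 1, 2])
df2 = pd.DataFrame(
    {"a": [20.0, 30.0, 40.0], "b": [5.0, 6.0, 7.0]},
    index=[1, 2, 3],
)

# join='outer', overwrite=False should match combine_first.
outer = update_with_join(df1, df2, join="outer", overwrite=False)
pd.testing.assert_frame_equal(outer, df1.combine_first(df2), check_like=True)

# join='right' with overwrite=b versus swapped operands with overwrite=not b.
right = update_with_join(df1, df2, join="right", overwrite=True)
swapped = update_with_join(df2, df1, join="left", overwrite=False)
pd.testing.assert_frame_equal(right, swapped, check_like=True)

# Mixed joins: keep df1's rows, but take the union of the columns.
mixed = update_with_join(df1, df2, join=["left", "outer"])
print(mixed)
```

This only checks the claimed equivalences on one small float-valued example; dtype handling, `filter_func`, and `raise_conflict` are deliberately left out.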
