Skip to content

Add defaults for parameters #1293

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

yangdanny97
Copy link
Contributor

@yangdanny97 yangdanny97 commented Jul 27, 2025

Closes #1292

This PR uses a semi-automated/LLM-assisted method to add a few hundred parameter defaults to pandas-stubs, replacing what was previously ....

There were 3 annotations that had to change because their type signatures are not compatible with the default.

The codemod that made these transformations is here: https://gist.github.com/yangdanny97/0d81e16f6374582413dd5bcf90d902fd

How it works is:

  1. parse pandas-stubs and find the module, method/function, & parameter name for any params w/ default ...
  2. parse the docstring of the method/function in pandas, and if the parameter's docstring contains "default" extract the part after that keyword
  3. look up the type annotation in a mapping of the extracted snippet to type annotation
  4. use libCST to apply the type annotation to the parameter

The main thing that needs careful reviewing is the default_type_map which maps the docstring snippet to the type annotation we replace the ... with. That part was generated with chatGPT's help (I gave it every snippet I got after step 2), and I commented out the mappings that seemed wrong or unclear. So, even though this was "LLM-assisted" the script is deterministic and will always give you the same thing each time, and the LLM is only involved in a one-time manual step.

start: TimedeltaConvertibleTypes = ...,
end: TimedeltaConvertibleTypes = ...,
periods: int | None = ...,
start: TimedeltaConvertibleTypes | None = None,
Copy link
Contributor Author

@yangdanny97 yangdanny97 Jul 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docstring says the default is None so I had to change the annotation

https://github.com/pandas-dev/pandas/blob/e4a03b6e47a8ef9cd045902916289cbc976d3d33/pandas/core/indexes/timedeltas.py#L256-L259

Based on the docstring, this should probably be rewritten to have overloads:

Of the four parameters start, end, periods, and freq,
exactly three must be specified.

@@ -35,13 +35,22 @@ class TimelikeOps:
def unit(self) -> TimeUnit: ...
def as_unit(self, unit: TimeUnit) -> Self: ...
def round(
self, freq, ambiguous: TimeAmbiguous = ..., nonexistent: TimeNonexistent = ...
self,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why your PR would change the formatting? I don't see a change in the argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what happened there, either libCST changed the whitespace (which is weird but it's happened before) or the pre-commit did.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as the pre-commit passes that's not a big deal, I was trying to find a change and kept comparing the line but could not find any!

@loicdiridollou
Copy link
Member

Just for my own knowledge, what is adding the defaults instead of the ellipsis trying to achieve? Is it to help when looking at the docs in the IDE, for pyrefly, or something else?

@@ -802,8 +808,8 @@ class DataFrame(NDFrame, OpsMixin, _GetItemHack):
self,
other: NDFrameT,
join: AlignJoin = ...,
axis: Axis | None = ...,
level: Level | None = ...,
axis: Axis | None = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This annotation changed

Copy link
Contributor Author

@yangdanny97 yangdanny97 Jul 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the kind of weird thing is, I can't actually find the docstring anywhere in the pandas repo. But at runtime pandas.core.frame.DataFrame.shift.__doc__ is:

Shift index by desired number of periods with an optional time `freq`.

When `freq` is not passed, shift the index without realigning the data.
If `freq` is passed (in this case, the index must be date or datetime,
or it will raise a `NotImplementedError`), the index will be
increased using the periods and the `freq`. `freq` can be inferred
when specified as "infer" as long as either freq or inferred_freq
attribute is set in the index.

Parameters
----------
periods : int or Sequence
    Number of periods to shift. Can be positive or negative.
    If an iterable of ints, the data will be shifted once by each int.
    This is equivalent to shifting by one value at a time and
    concatenating all resulting frames. The resulting columns will have
    the shift suffixed to their column names. For multiple periods,
    axis must not be 1.
freq : DateOffset, tseries.offsets, timedelta, or str, optional
    Offset to use from the tseries module or time rule (e.g. \'EOM\').
    If `freq` is specified then the index values are shifted but the
    data is not realigned. That is, use `freq` if you would like to
    extend the index when shifting and preserve the original data.
    If `freq` is specified as "infer" then it will be inferred from
    the freq or inferred_freq attributes of the index. If neither of
    those attributes exist, a ValueError is thrown.
axis : {0 or \'index\', 1 or \'columns\', None}, default None
    Shift direction. For `Series` this parameter is unused and defaults to 0.
fill_value : object, optional
    The scalar value to use for newly introduced missing values.
    the default depends on the dtype of `self`.
    For numeric data, ``np.nan`` is used.
    For datetime, timedelta, or period data, etc. :attr:`NaT` is used.
    For extension dtypes, ``self.dtype.na_value`` is used.
suffix : str, optional
    If str and periods is an iterable, this is added after the column
    name and before the shift value for each shifted column name.

Returns
-------
DataFrame
    Copy of input object, shifted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually when the functions are not in the correct file for example pd.Series.shift does not exist in the series.py file, that often means that they are in the class it inherits from (NDFrame which sits in the generic.py file).

): ...
def floor(
self, freq, ambiguous: TimeAmbiguous = ..., nonexistent: TimeNonexistent = ...
self,
Copy link
Contributor Author

@yangdanny97 yangdanny97 Jul 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK why the pre-commit did these whitespace changes

@@ -1133,7 +1141,7 @@ class Series(IndexOpsMixin[S1], NDFrame):
self,
periods: int | Sequence[int] = ...,
freq: DateOffset | timedelta | _str | None = ...,
axis: Axis = ...,
axis: Axis | None = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this annotation changed

@yangdanny97
Copy link
Contributor Author

yangdanny97 commented Jul 27, 2025

Just for my own knowledge, what is adding the defaults instead of the ellipsis trying to achieve? Is it to help when looking at the docs in the IDE, for pyrefly, or something else?

It's to help with looking at docs/signature in the IDE. This work is unrelated to Pyrefly.

Here's a thread on typeshed w/ some discussion on the pros and cons of defaults: python/typeshed#8988

The consensus seems to be that if the default is simple (like None, or some literal) then it's fine, but if it's complex we use .... Mypy's stub generator has been updated to emit stubs containing simple defaults.

The second reason I made the PR is just to test out this way of doing LLM-assisted codemods. I have another PR in the works that uses the same approach to add a few hundred types for un-annotated parameters and returns, but that one requires more manual filtering to make sure the generated types are good.

@yangdanny97 yangdanny97 marked this pull request as ready for review July 27, 2025 03:05
@yangdanny97 yangdanny97 marked this pull request as draft July 27, 2025 18:21
@yangdanny97 yangdanny97 marked this pull request as ready for review July 27, 2025 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do we want to add trivial defaults to the stubs
2 participants