[EHN] Add jointly option for min_max_scale #1112

Merged (19 commits, merged Jun 14, 2022)
Changes from 11 commits
133 changes: 96 additions & 37 deletions janitor/functions/min_max_scale.py
@@ -23,23 +23,18 @@ def min_max_scale(
df: pd.DataFrame,
feature_range: tuple[int | float, int | float] = (0, 1),
column_name: str | int | list[str | int] | pd.Index = None,
entire_data: bool = False,
Member Author (@Zeroto521, Jun 2, 2022)

I don't think `entire_data` is a good name.
It also needs to be added to the changelog file.

Member

I agree, though I also need a bit of time and space to think of a better name. I wonder if others on the dev team have ideas? @pyjanitor-devs/core-devs

Collaborator

Not sure of a good name to use. Maybe `keep`, like in pandas?

Member Author

You mean `keep`, whose value could be `'column'` or `'all'`?

But this parameter works better as a boolean type.

Collaborator

yea, horrible parameter name. maybe scale_all or scale_all_columns?

Member

Hmmm, @Zeroto521, on second thought, I think we need a bit better definition of the semantics. min_max_scale currently operates with the assumption of operating on one column, so the use of column_name makes sense here. Below is my attempt at reasoning through the multiple ways we could use min_max_scale.

  • Scale one column.
  • Scale multiple columns independently.
  • Scale multiple columns, but jointly (so they are all scaled to the same min and max)

entire_data is a special case of scaling multiple columns jointly.
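The distinction between the three cases can be made concrete in plain pandas. This is an illustrative sketch, not pyjanitor code; the frame values are arbitrary:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [0, 1]})

# Independently: each column is scaled with its own min and max.
independent = (df - df.min()) / (df.max() - df.min())

# Jointly: all columns share one global min and max.
global_min = df.min().min()
global_max = df.max().max()
jointly = (df - global_min) / (global_max - global_min)

print(independent)  # a and b both become [0.0, 1.0]
print(jointly)      # a -> [0.5, 1.0], b -> [0.0, 0.5]
```

Scaling the whole frame (`entire_data`) is exactly the jointly case with all columns selected.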

Since we're working on this function, it's a good chance to change the API to be flexible yet also sensible. What if the API was, instead the following?

def min_max_scale(df, feature_range, column_names: Iterable[Hashable] | Callable, jointly: bool):

If jointly is True, then the column_names provided are jointly scaled; otherwise, they are not.

I wanted to point out a new behaviour that we might be able to support across the rest of the API -- by making column_names accept a Callable that has the signature:

def column_names_callable(df) -> Iterable[Hashable]:

we can enable min_max_scale on all columns by doing:

df.min_max_scale(feature_range=(0, 1), column_names = lambda df: df.columns, jointly=True)

This is pretty concise without resorting to needing to maintain string mappings for special-case behaviour.
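A minimal sketch of the proposed signature might look like the following. This is hypothetical code: `min_max_scale_sketch` and its internals are illustrative, not pyjanitor's actual implementation.

```python
import pandas as pd


def min_max_scale_sketch(
    df: pd.DataFrame,
    feature_range: tuple = (0, 1),
    column_names=None,  # Iterable[Hashable] | Callable[[pd.DataFrame], Iterable[Hashable]]
    jointly: bool = False,
) -> pd.DataFrame:
    # Resolve a callable selector into concrete column names.
    if callable(column_names):
        column_names = list(column_names(df))
    elif column_names is None:
        column_names = list(df.columns)

    new_min, new_max = feature_range
    df = df.copy()
    sub = df[column_names]
    if jointly:
        # Scalars: one global range shared by all selected columns.
        old_min, old_max = sub.min().min(), sub.max().max()
    else:
        # Series: pandas aligns on labels, so each column uses its own range.
        old_min, old_max = sub.min(), sub.max()
    df[column_names] = (sub - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
    return df


# All columns, jointly scaled, selected explicitly via a callable:
min_max_scale_sketch(
    pd.DataFrame({"a": [1, 2], "b": [0, 1]}),
    column_names=lambda df: df.columns,
    jointly=True,
)
```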

I'm glad we didn't rush to merge this PR, giving us the time and space to think clearly about the semantics of the API.

@samukweku and @Zeroto521 what do you think about this?

Member Author (@Zeroto521, Jun 13, 2022)

> I wanted to point out a new behaviour that we might be able to support across the rest of the API -- by making column_names accept a Callable that has the signature:
>
> def column_names_callable(df) -> Iterable[Hashable]:

I'm sorry, I still don't see why column_names should accept a callable argument.

column_names is used to select the dataframe's columns, like df[column_names].
So if column_names contains names that are not in df.columns,
an error will be raised either way, whether column_names is an Iterable[Hashable] or a callable returning an Iterable[Hashable].

> we can enable min_max_scale on all columns by doing:
>
> df.min_max_scale(feature_range=(0, 1), column_names = lambda df: df.columns, jointly=True)

To scale all columns, I thought we could simply make None the default for column_names, with no extra input needed.

df.min_max_scale(feature_range=(0, 1), column_names=None, jointly=True)

Are there more examples to show the importance of the callback type?

Member

Thanks for the comments, @Zeroto521!

Regarding why we might want to allow column_names to be a Callable, I had the idea that it helps support being explicit over implicit, which is in the Zen of Python. Setting column_names=None makes selecting all columns implicit, whereas setting column_names=lambda df: df.columns makes selecting all columns explicit. In addition, it allows the selection of arbitrary subsets of column names programmatically, without needing to hard-code those names.

On further thought, I can see how column_names=None actually follows the pattern established in other places in the library, so I think, for now, we can:

  1. Use column_names=None to imply selection of all columns, and
  2. Talk more about column_names: Callable in the issue tracker, deferring the implementation till later.

What do you all think about jointly=True as the keyword for triggering whether to independently scale each column or to jointly scale all columns specified in column_names? @Zeroto521 if you're in agreement with the keyword argument, then I think, let's get that specified in this PR, then we can close out the PR!

Member Author (@Zeroto521, Jun 13, 2022)

I totally agree with using jointly.

About whether column_names could receive a callable type or not:
I understand now. column_names=lambda df: df.columns is an explicit style and also a handy trick.

| Select columns | Using callable type | Using Iterable type | df.columns |
| --- | --- | --- | --- |
| Select the first three columns | `lambda df: df.columns[:3]` | `['a', 'b', 'c']` | `pd.Index(list('abcde'))` |
| Select str type columns | `lambda df: [i for i in df.columns if isinstance(i, str)]` | `['a', 'b', 'c']` | `pd.Index(['a', 'b', 'c', 1])` |

As you said, we can set this aside for now.
Once the column_names parameter of min_max_scale accepts a callable, the other functions will need to do the same.
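The two callable selectors from the table can be checked directly. An illustrative sketch; the frame and column labels are made up:

```python
import pandas as pd

# A frame with mixed str and int column labels.
df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4], 1: [5]})


def first_three(frame):
    # A callable selector receives the frame and returns column labels,
    # so the same selection logic works on any frame without hard-coding names.
    return frame.columns[:3]


def str_columns(frame):
    # Keep only the columns whose labels are strings.
    return [c for c in frame.columns if isinstance(c, str)]


print(list(first_three(df)))  # ['a', 'b', 'c']
print(str_columns(df))        # ['a', 'b', 'c', 'd']
```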

Member Author

Further discussion moved to #1115.

) -> pd.DataFrame:
"""
Scales data to between a minimum and maximum value.
Scales DataFrame to between a minimum and maximum value.

This method mutates the original DataFrame.
One can optionally set a new target **minimum** and **maximum** value
using the `feature_range` keyword argument.

If `minimum` and `maximum` are provided, the true min/max of the
`DataFrame` or column is ignored in the scaling process and replaced with
these values, instead.

One can optionally set a new target minimum and maximum value using the
`feature_range[0]` and `feature_range[1]` keyword arguments.
This will result in the transformed data being bounded between
`feature_range[0]` and `feature_range[1]`.

If a particular column name is specified, then only that column of data
are scaled. Otherwise, the entire dataframe is scaled.
If `column_name` is specified, then only that column (or those columns) of data is scaled.
Otherwise, the entire dataframe is scaled.
If `entire_data` is `True`, the entire dataframe is recognized as a single block to scale.
Otherwise, each column of data is scaled separately.

Example: Basic usage.

@@ -48,6 +43,10 @@ def min_max_scale(
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
>>> df.min_max_scale()
a b
0 0.0 0.0
1 1.0 1.0
>>> df.min_max_scale(entire_data=True)
a b
0 0.5 0.0
1 1.0 0.5

@@ -57,6 +56,10 @@
>>> import janitor
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
>>> df.min_max_scale(feature_range=(0, 100))
a b
0 0.0 0.0
1 100.0 100.0
>>> df.min_max_scale(feature_range=(0, 100), entire_data=True)
a b
0 50.0 0.0
1 100.0 50.0
@@ -65,15 +68,26 @@

>>> import pandas as pd
>>> import janitor
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1]})
>>> df.min_max_scale(feature_range=(0, 100), column_name=['a', 'b'])
a b
0 0.0 0.0
1 100.0 100.0
>>> df = pd.DataFrame({'a':[1, 2], 'b':[0, 1], 'c': [1, 0]})
>>> df.min_max_scale(
... feature_range=(0, 100),
... column_name=["a", "c"],
... )
a b c
0 0.0 0 100.0
1 100.0 1 0.0
>>> df.min_max_scale(
... feature_range=(0, 100),
... column_name=["a", "c"],
... entire_data=True,
... )
a b c
0 50.0 0 50.0
1 100.0 1 0.0
>>> df.min_max_scale(feature_range=(0, 100), column_name='a')
a b
0 0.0 0
1 100.0 1
a b c
0 0.0 0 1
1 100.0 1 0

The aforementioned example might be applied to something like scaling the
isoelectric points of amino acids. While technically they range from
@@ -84,6 +98,7 @@
:param df: A pandas DataFrame.
:param feature_range: (optional) Desired range of transformed data.
:param column_name: (optional) The column on which to perform scaling.
:param entire_data: (bool) If `True`, scale the entire dataframe as a whole.
:returns: A pandas DataFrame with scaled data.
:raises ValueError: if `feature_range` isn't tuple type.
:raises ValueError: if the length of `feature_range` isn't equal to two.
@@ -102,23 +117,67 @@
"the first element must be greater than the second one"
)

new_min, new_max = feature_range
new_range = new_max - new_min

if column_name is not None:
old_min = df[column_name].min()
old_max = df[column_name].max()
old_range = old_max - old_min

df = df.copy()
df[column_name] = (
df[column_name] - old_min
) * new_range / old_range + new_min
else:
old_min = df.min().min()
old_max = df.max().max()
old_range = old_max - old_min
df = df.copy()  # Avoid changing the original DataFrame.

df = (df - old_min) * new_range / old_range + new_min
old_feature_range = df[column_name].pipe(min_max_value, entire_data)
df[column_name] = df[column_name].pipe(
apply_min_max,
*old_feature_range,
*feature_range,
)
else:
old_feature_range = df.pipe(min_max_value, entire_data)
df = df.pipe(
apply_min_max,
*old_feature_range,
*feature_range,
)

return df


def min_max_value(df: pd.DataFrame, entire_data: bool) -> tuple:
"""
Return the minimum and maximum of a DataFrame.

Use the `entire_data` flag to control whether the minimum and maximum are
computed over the entire data or per column.

.. # noqa: DAR101
.. # noqa: DAR201
"""

if entire_data:
mmin = df.min().min()
mmax = df.max().max()
else:
mmin = df.min()
mmax = df.max()

return mmin, mmax


def apply_min_max(
df: pd.DataFrame,
old_min: int | float | pd.Series,
old_max: int | float | pd.Series,
new_min: int | float | pd.Series,
new_max: int | float | pd.Series,
) -> pd.DataFrame:
"""
Apply minimax scaler to DataFrame.

Notes
-----
- Inputting minimum and maximum type
- int or float : It will apply minimax to the entire DataFrame.
- Series : It will apply minimax to each column.

.. # noqa: DAR101
.. # noqa: DAR201
"""

old_range = old_max - old_min
new_range = new_max - new_min

return (df - old_min) * new_range / old_range + new_min
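To illustrate the broadcasting behaviour the Notes describe, here is a sketch in plain pandas (not the library code, though the arithmetic mirrors `apply_min_max` above): scalar bounds scale the frame jointly, while Series bounds align on column labels and scale each column on its own.

```python
import pandas as pd

df = pd.DataFrame({"a": [5, 10], "b": [0, 5]})

# Scalar min/max: the whole frame shares one range (joint scaling).
scalar_scaled = (df - 0) * (1 - 0) / (10 - 0) + 0

# Series min/max: pandas aligns on column labels, so each column
# is shifted and divided by its own min and max (independent scaling).
series_scaled = (df - df.min()) * (1 - 0) / (df.max() - df.min()) + 0

print(scalar_scaled)  # a: [0.5, 1.0], b: [0.0, 0.5]
print(series_scaled)  # a: [0.0, 1.0], b: [0.0, 1.0]
```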
47 changes: 44 additions & 3 deletions tests/functions/test_min_max_scale.py
@@ -4,42 +4,83 @@

@pytest.mark.functions
@pytest.mark.parametrize(
"df, column_name, excepted",
"df, column_name, entire_data, excepted",
[
# test default parameter
(
pd.DataFrame({"a": [5, 10], "b": [0, 5]}),
None,
True,
pd.DataFrame({"a": [0.5, 1], "b": [0, 0.5]}),
),
# test default parameter
(
pd.DataFrame({"a": [5, 10], "b": [0, 5]}),
None,
False,
pd.DataFrame({"a": [0, 1.0], "b": [0, 1.0]}),
),
# test list condition
(
pd.DataFrame({"a": [5, 10], "b": [0, 5]}),
["a", "b"],
True,
pd.DataFrame({"a": [0.5, 1.0], "b": [0, 0.5]}),
),
# test list condition
(
pd.DataFrame({"a": [5, 10], "b": [0, 5]}),
["a", "b"],
False,
pd.DataFrame({"a": [0, 1.0], "b": [0, 1.0]}),
),
# test Index condition
(
pd.DataFrame({"a": [5, 10], "b": [0, 5]}),
pd.Index(["a", "b"]),
False,
pd.DataFrame({"a": [0, 1.0], "b": [0, 1.0]}),
),
# test Index condition
(
pd.DataFrame({"a": [5, 10], "b": [0, 5]}),
pd.Index(["a", "b"]),
True,
pd.DataFrame({"a": [0.5, 1], "b": [0, 0.5]}),
),
# test str condition
(
pd.DataFrame({"a": [5, 10], "b": [0, 5]}),
"a",
True,
pd.DataFrame({"a": [0, 1.0], "b": [0, 5]}),
),
(
pd.DataFrame({"a": [5, 10], "b": [0, 5]}),
"a",
False,
pd.DataFrame({"a": [0, 1.0], "b": [0, 5]}),
),
# test int condition
(
pd.DataFrame({1: [5, 10], "b": [0, 5]}),
1,
True,
pd.DataFrame({1: [0, 1.0], "b": [0, 5]}),
),
# test int condition
(
pd.DataFrame({1: [5, 10], "b": [0, 5]}),
1,
False,
pd.DataFrame({1: [0, 1.0], "b": [0, 5]}),
),
],
)
def test_min_max_scale_column_name(df, column_name, excepted):
result = df.min_max_scale(column_name=column_name)
def test_min_max_scale_column_name_type(
df, column_name, entire_data, excepted
):
result = df.min_max_scale(column_name=column_name, entire_data=entire_data)

assert result.equals(excepted)
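As an editorial aside, not part of the PR: `pd.testing.assert_frame_equal` raises with a description of the first mismatch, which is often easier to debug than a bare `assert result.equals(...)` that only reports False. A sketch:

```python
import pandas as pd

result = pd.DataFrame({"a": [0.5, 1.0], "b": [0.0, 0.5]})
expected = pd.DataFrame({"a": [0.5, 1.0], "b": [0.0, 0.5]})

# Passes silently; on a mismatch it raises an AssertionError that
# names the differing column, positions, and values.
pd.testing.assert_frame_equal(result, expected)
```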
