Commit 8445dc0

samukweku, pre-commit-ci[bot], and ericmjl authored
[ENH] select_rows function implementation (#1173)
* add changelog
* select_rows implementation
* multiindex level selection implementation
* tests added
* updates to docs and tests
* Merge branch 'samukweku/select_rows' of https://github.com/pyjanitor-devs/pyjanitor into samukweku/select_rows
* updates to changelog
* Update select_columns.ipynb
* remove unnecessary file
* add select_rows to janitor/__init__.py
* update select_rows docs
* updates to select links
* add more tests
* move utils/test__select_columns to functions/test_select_columns
* change columns_to_select to cols
* remove print
* updates
* spelling fix
* Update CHANGELOG.md
* Update utils.py
* more tests
* explicit label selection in pivot_longer and pivot_wider
* spelling fix
* tuple selection added
* update logic for pivot_wider
* improve performance when single value passed to select_*
* fix for boolean array for single select_*
* dict support for MultiIndex indexing
* changelog
* changelog
* changelog
* changelog
* fix column selection via dictionary in conditional_join
* Update pivot.py
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* updates to docs
* simplify logic
* simplify level_labels logic
* add regex and callable options to dict
* cleanup
* test for callable errors
* callable applied across entire dataframe for performance
* add tests for MultiIndex dictionary
* explicit support for pandas/numpy objects
* add test for boolean callable length mismatch
* fix test fails for conditional_join
* Update select_columns.ipynb
* edit on conditional join; improve on Pandas/numpy object selection on a multiindex
* update
* spelling fix
* strip irrelevance from slice dispatch
* fix for IndexLabel and dict
* use loc directly if possible, else pass to _select_index
* keep dict as-is in conditional_join
* logic for when dictionary is used
* logic for fnmatch/regex selection on multiindex
* add tests for regex/fnmatch on multiindex
* remove shortcut to loc
* pass responsibility of slice to pandas
* remove print
* keys for dict for multiindex should be strings/integers only
* remove IndexLabel class
* changelog
* improve error reporting for fnmatch
* cleanup docs
* cleanup docs
* fix links
* add notes for users
* fix grammar
* shortcut to get_indexer for performance, if possible
* undo last commit
* add dispatch for range
* fix grammar
* update docs

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Ma <[email protected]>
1 parent 5ebf799 commit 8445dc0

14 files changed: +1249 -819 lines changed

CHANGELOG.md

+4 -4
@@ -6,21 +6,20 @@
 - [DOC] Updated developer guide docs.
 - [ENH] Allow column selection/renaming within conditional_join. Issue #1102. Also allow first or last match. Issue #1020 @samukweku.
 - [ENH] New decorator `deprecated_kwargs` for breaking API. #1103 @Zeroto521
-- [ENH] Extend select_columns to support non-string columns. Also allow selection on MultiIndex columns via level parameter. Issue #1105 @samukweku
+- [ENH] Extend select_columns to support non-string columns. Issue #1105 @samukweku
 - [ENH] Performance improvement for groupby_topk. Issue #1093 @samukweku
 - [ENH] `min_max_scale` drop `old_min` and `old_max` to fit sklearn's method API. Issue #1068 @Zeroto521
 - [ENH] Add `jointly` option for `min_max_scale` support to transform each column values or entire values. Default transform each column, similar behavior to `sklearn.preprocessing.MinMaxScaler`. (Issue #1067, PR #1112, PR #1123) @Zeroto521
 - [INF] Require pyspark minimal version is v3.2.0 to cut duplicates codes. Issue #1110 @Zeroto521
-- [ENH] Added support for extension arrays in `expand_grid`. Issue #1121 @samukweku
+- [ENH] Add support for extension arrays in `expand_grid`. Issue #1121 @samukweku
 - [ENH] Add `names_expand` and `index_expand` parameters to `pivot_wider` for exposing missing categoricals. Issue #1108 @samukweku
-- [ENH] Add fix for slicing error when selecting columns in `pivot_wider`. Issue #1134 @samukweku
+- [ENH] Add fix for slicing error when selecting columns in `pivot_wider`. Issue #1134 @samukweku
 - [ENH] `dropna` parameter added to `pivot_longer`. Issue #1132 @samukweku
 - [INF] Update `mkdocstrings` version and to fit its new coming features. PR #1138 @Zeroto521
 - [BUG] Force `math.softmax` returning `Series`. PR #1139 @Zeroto521
 - [INF] Set independent environment for building documentation. PR #1141 @Zeroto521
 - [DOC] Add local documentation preview via github action artifact. PR #1149 @Zeroto521
 - [ENH] Enable `encode_categorical` handle 2 (or more ) dimensions array. PR #1153 @Zeroto521
-- [ENH] Faster computation for a single non-equi join, with a numba engine. Issue #1102 @samukweku
 - [TST] Fix testcases failing on Window. Issue #1160 @Zeroto521, and @samukweku
 - [INF] Cancel old workflow runs via Github Action `concurrency`. PR #1161 @Zeroto521
 - [ENH] Faster computation for non-equi join, with a numba engine. Speed improvement for left/right joins when `sort_by_appearance` is False. Issue #1102 @samukweku
@@ -29,6 +28,7 @@
 - [ENH] Fix error when `sort_by_appearance=True` is combined with `dropna=True`. Issue #1168 @samukweku
 - [ENH] Add explicit default parameter to `case_when` function. Issue #1159 @samukweku
 - [BUG] pandas 1.5.x `_MergeOperation` doesn't have `copy` keyword anymore. Issue #1174 @Zeroto521
+- [ENH] `select_rows` function added for flexible row selection. Add support for MultiIndex selection via dictionary. Issue #1124 @samukweku
 - [TST] Compat with macos and window, to fix `FailedHealthCheck` Issue #1181 @Zeroto521
 - [INF] Merge two docs CIs (`docs-preview.yml` and `docs.yml`) to one. And add `documentation` pytest mark. PR #1183 @Zeroto521
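
The new changelog entry mentions MultiIndex selection via a dictionary. As a rough illustration of the idea only, the sketch below uses plain pandas rather than pyjanitor's own code: `select_rows_by_level` and the sample data are hypothetical, and the real `select_rows` may behave differently (the commit message notes that it also accepts regexes and callables as dictionary values). A dictionary that maps index levels to labels can be turned into a boolean row mask:

```python
import numpy as np
import pandas as pd


def select_rows_by_level(df: pd.DataFrame, spec: dict) -> pd.DataFrame:
    """Hypothetical helper: keep rows whose index levels match `spec`.

    `spec` maps a level name (or integer position) to one label
    or a list of labels.
    """
    mask = np.ones(len(df), dtype=bool)
    for level, labels in spec.items():
        if not isinstance(labels, list):
            labels = [labels]
        # Index.isin returns a boolean array aligned with the rows.
        mask &= df.index.get_level_values(level).isin(labels)
    return df.loc[mask]


index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2]], names=["letter", "number"]
)
df = pd.DataFrame({"value": [10, 20, 30, 40]}, index=index)

by_letter = select_rows_by_level(df, {"letter": "b"})
by_both = select_rows_by_level(df, {"letter": ["a", "b"], "number": 2})
```

Each dictionary entry narrows the mask further, so multiple keys combine with a logical AND.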

examples/notebooks/select_columns.ipynb

+1 -1
@@ -433,7 +433,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.10"
+   "version": "3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21) \n[GCC 10.3.0]"
   },
   "orig_nbformat": 4
  },

janitor/functions/__init__.py

+1 -1
@@ -64,7 +64,7 @@
 from .reorder_columns import reorder_columns
 from .round_to_fraction import round_to_fraction
 from .row_to_names import row_to_names
-from .select_columns import select_columns
+from .select import select_columns, select_rows
 from .shuffle import shuffle
 from .sort_column_value_order import sort_column_value_order
 from .sort_naturally import sort_naturally

janitor/functions/coalesce.py

+4 -3
@@ -4,7 +4,7 @@
 import pandas_flavor as pf

 from janitor.utils import check, deprecated_alias
-from janitor.functions.utils import _select_column_names
+from janitor.functions.utils import _select_index


 @pf.register_dataframe_method
@@ -95,7 +95,8 @@ def coalesce(
             "The number of columns to coalesce should be a minimum of 2."
         )

-    column_names = _select_column_names([*column_names], df)
+    indices = _select_index([*column_names], df, axis="columns")
+    column_names = df.columns[indices]

     if target_column_name:
         check("target_column_name", target_column_name, [str])
@@ -106,7 +107,7 @@ def coalesce(
     if target_column_name is None:
         target_column_name = column_names[0]

-    outcome = df.filter(column_names).bfill(axis="columns").iloc[:, 0]
+    outcome = df.loc(axis=1)[column_names].bfill(axis="columns").iloc[:, 0]
     if outcome.hasnans and (default_value is not None):
         outcome = outcome.fillna(default_value)
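
To make the changed `coalesce` line easier to follow, here is a minimal sketch of the same expression in plain pandas. The sample frame and `default_value` are made up for illustration, and the integer positions stand in for whatever `_select_index` returns, which is an assumption since its return type is not shown in this diff.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [np.nan, 2.0, np.nan],
        "c": [7.0, 8.0, np.nan],
    }
)

# Assumed stand-in for the positions produced by _select_index.
indices = [0, 1]
column_names = df.columns[indices]

# Back-fill across the selected columns, then keep the first column:
# each row ends up with its first non-null value among "a" and "b".
outcome = df.loc(axis=1)[column_names].bfill(axis="columns").iloc[:, 0]

default_value = 0.0
if outcome.hasnans and (default_value is not None):
    outcome = outcome.fillna(default_value)
```

Column "c" is never consulted because it is not among the selected labels, so the last row falls back to `default_value`.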

janitor/functions/conditional_join.py

+5 -4
@@ -47,7 +47,7 @@ def conditional_join(
     especially if the intervals do not overlap.

     Column selection in `df_columns` and `right_columns` is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.

     For strictly non-equi joins,
     involving either `>`, `<`, `>=`, `<=` operators,
@@ -143,7 +143,7 @@ def conditional_join(
     :param keep: Choose whether to return the first match,
         last match or all matches. Default is `all`.
     :param use_numba: Use numba, if installed, to accelerate the computation.
-        Default is `False`.
+        Applicable only to strictly non-equi joins. Default is `False`.
     :returns: A pandas DataFrame of the two merged Pandas objects.
     """

@@ -1214,10 +1214,11 @@ def _cond_join_select_columns(columns: Any, df: pd.DataFrame):
     Returns a Pandas DataFrame.
     """

-    df = df.select_columns(columns)
-
     if isinstance(columns, dict):
+        df = df.select_columns([*columns])
         df.columns = [columns.get(name, name) for name in df]
+    else:
+        df = df.select_columns(columns)

     return df
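
The reworked `_cond_join_select_columns` treats a dictionary as "select the keys, rename to the values". A minimal sketch of that pattern follows, using `df.loc` as a plain-pandas stand-in for `select_columns`; `select_and_rename` and the sample frame are hypothetical names for illustration, not part of pyjanitor.

```python
import pandas as pd


def select_and_rename(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Hypothetical stand-in for the dict handling in the diff above."""
    if isinstance(columns, dict):
        # [*columns] unpacks the dictionary keys into a list of labels.
        out = df.loc[:, [*columns]]
        # Map each selected label to its new name, if one was given.
        out.columns = [columns.get(name, name) for name in out]
    else:
        labels = columns if isinstance(columns, list) else [columns]
        out = df.loc[:, labels]
    return out


df = pd.DataFrame({"id": [1, 2], "x": [0.5, 1.5], "y": ["p", "q"]})

renamed = select_and_rename(df, {"id": "key", "x": "value"})
plain = select_and_rename(df, ["y", "id"])
```

Passing only the keys to the selection step keeps the dictionary itself from being interpreted as a selector.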

janitor/functions/pivot.py

+107 -23
@@ -15,7 +15,7 @@
 from pandas.core.dtypes.concat import concat_compat

 from janitor.functions.utils import (
-    _select_column_names,
+    _select_index,
     _computations_expand_grid,
 )
 from janitor.utils import check
@@ -52,7 +52,7 @@ def pivot_longer(
     row axis.

     Column selection in `index` and `column_names` is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.

     Example:

@@ -382,17 +382,35 @@ def _data_checks_pivot_longer(
             "when the columns are a MultiIndex."
         )

+    is_multi_index = isinstance(df.columns, pd.MultiIndex)
+    indices = None
     if column_names is not None:
-        if is_list_like(column_names):
-            column_names = list(column_names)
-        column_names = _select_column_names(column_names, df)
-        column_names = list(column_names)
+        if is_multi_index:
+            column_names = _check_tuples_multiindex(
+                df.columns, column_names, "column_names"
+            )
+        else:
+            if is_list_like(column_names):
+                column_names = list(column_names)
+            indices = _select_index(column_names, df, axis="columns")
+            column_names = df.columns[indices]
+            if not is_list_like(column_names):
+                column_names = [column_names]
+            else:
+                column_names = list(column_names)

     if index is not None:
-        if is_list_like(index):
-            index = list(index)
-        index = _select_column_names(index, df)
-        index = list(index)
+        if is_multi_index:
+            index = _check_tuples_multiindex(df.columns, index, "index")
+        else:
+            if is_list_like(index):
+                index = list(index)
+            indices = _select_index(index, df, axis="columns")
+            index = df.columns[indices]
+            if not is_list_like(index):
+                index = [index]
+            else:
+                index = list(index)

     if index is None:
         if column_names is None:
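
The `is_list_like` check after `df.columns[indices]` presumably exists because indexing an `Index` with a single integer returns a scalar label, while a list, slice, or boolean array returns another `Index`. The sketch below shows that normalisation step in isolation; `labels_at` is a hypothetical helper, and the idea that `_select_index` can return a scalar position is an assumption drawn from the shape of the new code.

```python
import pandas as pd
from pandas.api.types import is_list_like


def labels_at(columns: pd.Index, indices) -> list:
    """Hypothetical helper: always return the selected labels as a list."""
    labels = columns[indices]
    if not is_list_like(labels):
        # A scalar position yields a single label, so wrap it.
        return [labels]
    return list(labels)


columns = pd.Index(["id", "x", "y"])

single = labels_at(columns, 1)
several = labels_at(columns, [0, 2])
sliced = labels_at(columns, slice(1, None))
```

Downstream code can then treat `index` and `column_names` uniformly as lists, whichever form the selection took.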
@@ -1181,7 +1199,7 @@ def pivot_wider(

     Column selection in `index`, `names_from` and `values_from`
     is possible using the
-    [`select_columns`][janitor.functions.select_columns.select_columns] syntax.
+    [`select_columns`][janitor.functions.select.select_columns] syntax.

     A ValueError is raised if the combination
     of the `index` and `names_from` is not unique.
@@ -1455,27 +1473,69 @@ def _data_checks_pivot_wider(
     checking happens.
     """

+    is_multi_index = isinstance(df.columns, pd.MultiIndex)
+    indices = None
     if index is not None:
-        if is_list_like(index):
-            index = list(index)
-        index = _select_column_names(index, df)
-        index = list(index)
+        if is_multi_index:
+            if not isinstance(index, list):
+                raise TypeError(
+                    "For a MultiIndex column, pass a list of tuples "
+                    "to the index argument."
+                )
+            index = _check_tuples_multiindex(df.columns, index, "index")
+        else:
+            if is_list_like(index):
+                index = list(index)
+            indices = _select_index(index, df, axis="columns")
+            index = df.columns[indices]
+            if not is_list_like(index):
+                index = [index]
+            else:
+                index = list(index)

     if names_from is None:
         raise ValueError(
             "pivot_wider() is missing 1 required argument: 'names_from'"
         )

-    if is_list_like(names_from):
-        names_from = list(names_from)
-    names_from = _select_column_names(names_from, df)
-    names_from = list(names_from)
+    if is_multi_index:
+        if not isinstance(names_from, list):
+            raise TypeError(
+                "For a MultiIndex column, pass a list of tuples "
+                "to the names_from argument."
+            )
+        names_from = _check_tuples_multiindex(
+            df.columns, names_from, "names_from"
+        )
+    else:
+        if is_list_like(names_from):
+            names_from = list(names_from)
+        indices = _select_index(names_from, df, axis="columns")
+        names_from = df.columns[indices]
+        if not is_list_like(names_from):
+            names_from = [names_from]
+        else:
+            names_from = list(names_from)

     if values_from is not None:
-        if is_list_like(values_from):
-            values_from = list(values_from)
-        out = _select_column_names(values_from, df)
-        out = list(out)
+        if is_multi_index:
+            if not isinstance(values_from, list):
+                raise TypeError(
+                    "For a MultiIndex column, pass a list of tuples "
+                    "to the values_from argument."
+                )
+            out = _check_tuples_multiindex(
+                df.columns, values_from, "values_from"
+            )
+        else:
+            if is_list_like(values_from):
+                values_from = list(values_from)
+            indices = _select_index(values_from, df, axis="columns")
+            out = df.columns[indices]
+            if not is_list_like(out):
+                out = [out]
+            else:
+                out = list(out)
         # hack to align with pd.pivot
         if values_from == out[0]:
             values_from = out[0]
@@ -1550,3 +1610,27 @@ def _expand(indexer, retain_categories):
             ordered=indexer.ordered,
         )
     return indexer
+
+
+def _check_tuples_multiindex(indexer, args, param):
+    """
+    Check entries for tuples,
+    if indexer is a MultiIndex.
+
+    Returns a list of tuples.
+    """
+    all_tuples = (isinstance(arg, tuple) for arg in args)
+    if not all(all_tuples):
+        raise TypeError(
+            f"{param} must be a list of tuples "
+            "when the columns are a MultiIndex."
+        )
+
+    not_found = set(args).difference(indexer)
+    if any(not_found):
+        raise KeyError(
+            f"Tuples {*not_found,} in the {param} "
+            "argument do not exist in the dataframe's columns."
+        )
+
+    return args
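
To show how the new helper behaves, the sketch below reproduces its logic as a standalone function and runs it against a small MultiIndex. The copy is renamed `check_tuples_multiindex`, and the KeyError message is built with `tuple(not_found)` instead of the original f-string unpacking; both are adaptations for illustration, and the sample columns are made up.

```python
import pandas as pd


def check_tuples_multiindex(indexer, args, param):
    """Standalone adaptation of the helper added in this commit."""
    if not all(isinstance(arg, tuple) for arg in args):
        raise TypeError(
            f"{param} must be a list of tuples "
            "when the columns are a MultiIndex."
        )
    # Iterating a MultiIndex yields tuples, so a set difference
    # finds any requested tuples that are not actual columns.
    not_found = set(args).difference(indexer)
    if any(not_found):
        raise KeyError(
            f"Tuples {tuple(not_found)} in the {param} "
            "argument do not exist in the dataframe's columns."
        )
    return args


columns = pd.MultiIndex.from_tuples([("A", "x"), ("A", "y"), ("B", "x")])

ok = check_tuples_multiindex(columns, [("A", "x"), ("B", "x")], "index")

errors = []
for bad in (["A"], [("C", "z")]):
    try:
        check_tuples_multiindex(columns, bad, "index")
    except (TypeError, KeyError) as exc:
        errors.append(type(exc).__name__)
```

A list of plain strings fails the tuple check first, while a well-formed tuple that is absent from the columns reaches the KeyError branch.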
