Tags: #pandas #snippet #datacleaning #operations
Author: Sunny Chugh
Description: This notebook shows how to drop duplicate rows and duplicate columns from a pandas DataFrame.
import pandas as pd
subset: column label or sequence of labels. Only consider certain columns for identifying duplicates; by default, use all of the columns.
keep: {'first', 'last', False}, default 'first'. Determines which duplicates (if any) to keep:
- 'first': drop duplicates except for the first occurrence
- 'last': drop duplicates except for the last occurrence
- False: drop all duplicates
subset = None
keep = "first"
# create DataFrame with a duplicate row (rows 0 and 1) and a duplicate column (points1 and points2)
df = pd.DataFrame(
{
"team": ["A", "A", "A", "A", "B", "B", "B", "B"],
"points1": [25, 25, 15, 14, 19, 23, 25, 29],
"points2": [25, 25, 15, 14, 19, 23, 25, 29],
"rebounds": [11, 11, 10, 6, 6, 5, 9, 12],
}
)
# view DataFrame
print("Columns fetched:", len(df.columns))
print("Rows fetched:", len(df))
df
# drop duplicate rows; ignore_index=True renumbers the remaining rows from 0
df1 = df.drop_duplicates(subset=subset, keep=keep, ignore_index=True)
print("Rows fetched:", len(df1))
df1
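To make the effect of subset and keep concrete, here is a minimal sketch on a small frame. The data and the variable names (demo, first, last, none_kept, by_subset) are hypothetical and only reuse the column names from this notebook.

```python
import pandas as pd

# Hypothetical sample data, reusing column names from the notebook
demo = pd.DataFrame(
    {
        "team": ["A", "A", "A", "B", "B"],
        "points1": [25, 25, 15, 19, 19],
        "rebounds": [11, 11, 10, 6, 7],
    }
)

# keep="first": rows 0 and 1 are identical, so row 1 is dropped
first = demo.drop_duplicates(keep="first", ignore_index=True)

# keep="last": the earlier copy (row 0) is dropped instead
last = demo.drop_duplicates(keep="last")

# keep=False: every row that has a duplicate is dropped (rows 0 and 1)
none_kept = demo.drop_duplicates(keep=False)

# subset: compare only "team" and "points1", so rows 3 and 4 also count as duplicates
by_subset = demo.drop_duplicates(subset=["team", "points1"], keep="first")

print("keep='first':", len(first), "rows")
print("keep='last' index:", list(last.index))
print("keep=False index:", list(none_kept.index))
print("subset index:", list(by_subset.index))
```

Without ignore_index=True the surviving rows keep their original index labels, which makes it easy to see which occurrences were retained.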
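# drop duplicate columns: transpose, drop duplicate rows, then transpose back
# note: transposing a mixed-dtype DataFrame leaves the result with object dtype
df2 = df.T.drop_duplicates().T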
print("Columns fetched:", len(df2.columns))
df2
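Transposing a frame that mixes strings and numbers produces object columns, so the numeric columns lose their dtype after the round trip. One possible alternative, sketched here with the same sample data, builds a boolean mask with duplicated() and selects columns from the original frame. The names via_transpose, dup_cols and df3 are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same sample data as in the notebook
df = pd.DataFrame(
    {
        "team": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "points1": [25, 25, 15, 14, 19, 23, 25, 29],
        "points2": [25, 25, 15, 14, 19, 23, 25, 29],
        "rebounds": [11, 11, 10, 6, 6, 5, 9, 12],
    }
)

# Round trip through the transpose, as in the cell above
via_transpose = df.T.drop_duplicates().T

# Boolean Series indexed by column name: True where a column repeats an earlier one
dup_cols = df.T.duplicated()

# Select the non-duplicated columns from the original frame, keeping their dtypes
df3 = df.loc[:, ~dup_cols]

print("Columns kept:", list(df3.columns))
print(via_transpose.dtypes)
print(df3.dtypes)
```

Both approaches keep the same columns; the mask version only avoids the dtype change. It still transposes once to compare columns, so it is not meant as a performance optimization.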