Pandas_Drop_duplicates.md


Tags: #pandas #snippet #datacleaning #operations

Author: Sunny Chugh

Description: This notebook shows how to drop duplicates in a DataFrame.


Input

Import libraries

import pandas as pd

Setup Variables

  • subset: column label or sequence of labels, optional. Only these columns are considered when identifying duplicates; by default (None), all columns are used.
  • keep: {‘first’, ‘last’, False}, default ‘first’. Determines which duplicates (if any) to keep. ‘first’: drop duplicates except for the first occurrence. ‘last’: drop duplicates except for the last occurrence. False: drop all duplicates.
subset = None
keep = "first"
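
To make the effect of `keep` and `subset` concrete, here is a minimal sketch on a small toy DataFrame. The name `demo` and its values are hypothetical and not part of this notebook's data; the surviving index labels show which rows each setting keeps.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical; rows 0, 1 and 2 share team "A"
demo = pd.DataFrame({"team": ["A", "A", "A", "B"], "points": [25, 25, 15, 19]})

first = demo.drop_duplicates(keep="first")       # keeps row 0, drops row 1
last = demo.drop_duplicates(keep="last")         # keeps row 1, drops row 0
neither = demo.drop_duplicates(keep=False)       # drops both row 0 and row 1
by_team = demo.drop_duplicates(subset=["team"])  # compares only "team": one row per team

print(list(first.index))    # [0, 2, 3]
print(list(last.index))     # [1, 2, 3]
print(list(neither.index))  # [2, 3]
print(list(by_team.index))  # [0, 3]
```

Passing `ignore_index=True` would relabel the surviving rows 0, 1, 2, ... instead of preserving the original index.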

Model

Create DataFrame

# create DataFrame with duplicate rows (rows 0 and 1) and duplicate columns (points1 and points2)
df = pd.DataFrame(
    {
        "team": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "points1": [25, 25, 15, 14, 19, 23, 25, 29],
        "points2": [25, 25, 15, 14, 19, 23, 25, 29],
        "rebounds": [11, 11, 10, 6, 6, 5, 9, 12],
    }
)

# view DataFrame
print("Columns fetched:", len(df.columns))
print("Rows fetched:", len(df))
df

Output

Drop duplicate rows

df1 = df.drop_duplicates(subset=subset, keep=keep, ignore_index=True)
print("Rows fetched:", len(df1))
df1
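
Before dropping anything, it can help to see which rows pandas treats as duplicates. The sketch below uses `DataFrame.duplicated`, which returns a boolean mask and accepts the same `subset` and `keep` arguments. It rebuilds the same DataFrame so that it runs on its own; the names `mask`, `dupes` and `n_dropped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "points1": [25, 25, 15, 14, 19, 23, 25, 29],
        "points2": [25, 25, 15, 14, 19, 23, 25, 29],
        "rebounds": [11, 11, 10, 6, 6, 5, 9, 12],
    }
)

# keep=False flags every member of a duplicate group, not just the later copies
mask = df.duplicated(keep=False)
dupes = df[mask]
print(dupes)

# With keep="first", the number of True values is the number of rows that would be dropped
n_dropped = int(df.duplicated(keep="first").sum())
print("Rows that would be dropped:", n_dropped)
```

Here only rows 0 and 1 are flagged, so dropping with `keep="first"` removes a single row.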

Drop duplicate columns, even if their names differ

# transpose so columns become rows, drop duplicate rows, then transpose back
df2 = df.T.drop_duplicates().T
print("Columns fetched:", len(df2.columns))
df2
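
One side effect worth knowing: transposing a DataFrame with mixed column types produces object-dtype data, so the columns of the result are no longer numeric. A possible follow-up, sketched here under the assumption that the original dtypes should be recovered, is to call `infer_objects()` on the result. The names `df2_raw` and `df2_typed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "points1": [25, 25, 15, 14, 19, 23, 25, 29],
        "points2": [25, 25, 15, 14, 19, 23, 25, 29],
        "rebounds": [11, 11, 10, 6, 6, 5, 9, 12],
    }
)

# Transposing mixed string/integer columns yields object dtype throughout
df2_raw = df.T.drop_duplicates().T
print(df2_raw.dtypes)

# infer_objects() attempts a soft conversion of object columns to better dtypes
df2_typed = df2_raw.infer_objects()
print(df2_typed.dtypes)
```

The transpose approach compares full column contents, so it may be slow on large DataFrames.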