# 45 Fundamental Pandas Interview Questions

<div>
<p align="center">
<a href="https://devinterview.io/questions/machine-learning-and-data-science/">
<img src="https://firebasestorage.googleapis.com/v0/b/dev-stack-app.appspot.com/o/github-blog-img%2Fmachine-learning-and-data-science-github-img.jpg?alt=media&token=c511359d-cb91-4157-9465-a8e75a0242fe" alt="machine-learning-and-data-science" width="100%">
</a>
</p>
</div>

#### You can also find all 45 answers here 👉 [Devinterview.io - Pandas](https://devinterview.io/questions/machine-learning-and-data-science/pandas-interview-questions)

<br>
## 1. What is _Pandas_ in _Python_ and why is it used for data analysis?

**Pandas** is a powerful Python library for data analysis. In a nutshell, it's designed to make the manipulation and analysis of structured data intuitive and efficient.

### Key Features

- **Data Structures:** Offers two primary data structures: `Series` for one-dimensional data and `DataFrame` for two-dimensional tabular data.

- **Data Munging Tools:** Provides rich toolsets for data cleaning, transformation, and merging.

- **Time Series Support:** Extensive functionality for working with time-series data, including date range generation and frequency conversion.

- **Data Input/Output:** Reads from and writes to a variety of data sources, such as CSV, Excel, SQL databases, and JSON.

- **Flexible Indexing:** Aligns and joins data automatically based on row and column index labels.

### Ecosystem Integration

Pandas integrates closely with several other Python libraries:

- **Visualization Libraries**: Seamlessly integrates with Matplotlib and Seaborn for data visualization.

- **Statistical Libraries**: Works in tandem with statsmodels and SciPy for advanced data analysis and statistics.

### Performance and Scalability

Pandas is optimized for in-memory analytics and performs well on small to medium-sized datasets. For larger data, it provides tools to work with the data in chunks rather than loading it all at once.
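
For instance, a minimal sketch of chunked processing (the file name and column name are placeholders):

```python
import pandas as pd

# Process a large CSV in 100k-row chunks instead of loading it all at once
total = 0
for chunk in pd.read_csv('big_file.csv', chunksize=100_000):  # placeholder path
    total += chunk['amount'].sum()                            # hypothetical column
print(total)
```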

### Common Data Operations

- **Loading Data**: Read data from files like CSV, Excel, or databases using the built-in reader functions.

- **Data Exploration**: Get a quick overview of the data using methods like `describe`, `head`, and `tail`.

- **Filtering and Sorting**: Use boolean indexing to filter data or the `sort_values` method to order it.

- **Missing Data**: Offers methods like `isnull`, `fillna`, and `dropna` to handle missing data efficiently.

- **Grouping and Aggregating**: Group data by specific variables and apply aggregations like sum, mean, or count.

- **Merging and Joining**: Provides SQL-style `merge` and `join` operations to combine datasets.

- **Pivoting**: Reshape data, often for easier visualization or reporting.

- **Time Series Operations**: Includes functionality for date manipulations, resampling, and time-based queries.

- **Data Export**: Save processed data back to files or databases.

### Code Example

Here is the Python code:

```python
import pandas as pd

# Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 40],
    'Department': ['HR', 'Finance', 'IT', 'Marketing']
}
df = pd.DataFrame(data)

# Explore the data
print(df)
print(df.describe())  # Numerical summary

# Filter and sort the data
filtered_df = df[df['Department'].isin(['HR', 'IT'])]
sorted_df = df.sort_values(by='Age', ascending=False)

# Handle missing data
df.at[2, 'Age'] = None   # Simulate a missing age for 'Charlie' (upcasts 'Age' to float)
df.dropna(inplace=True)  # Drop rows with any missing data

# Group, aggregate, and visualize (plotting requires matplotlib)
grouped_df = df.groupby('Department')['Age'].mean()
grouped_df.plot(kind='bar')

# Export the processed data
df.to_csv('processed_data.csv', index=False)
```
<br>
## 2. Explain the difference between a _Series_ and a _DataFrame_ in _Pandas_.

**Pandas**, a popular data manipulation and analysis library, primarily operates on two data structures: **Series**, for one-dimensional data, and **DataFrame**, for two-dimensional data.

### Series Structure

- **Data Model**: Each Series consists of a one-dimensional array of values and an associated array of labels, known as the index.
- **Memory Representation**: Data is stored in a single ndarray.
- **Indexing**: Series offers simple, labeled indexing.
- **Homogeneity**: Data is homogeneous, meaning it's of a consistent data type.

### DataFrame Structure

- **Data Model**: A DataFrame is a two-dimensional tabular structure with labeled axes: an index for its rows and another for its columns.
- **Memory Representation**: Internally, a DataFrame is a collection of Series (one per column) that share a common row index.
- **Indexing**: Data can be accessed by row index or column name; `loc` (label-based) and `iloc` (position-based) support multi-axis indexing.
- **Columnar Data**: Columns can be of different, heterogeneous data types.
- **Missing or NaN Values**: DataFrames can accommodate missing or NaN entries.

### Common Features of Series and DataFrame

Both Series and DataFrame share some common characteristics:

- **Mutability**: The values held in both structures can be modified in place.
- **Size and Shape Changes**: Both can change in size: elements can be added to or dropped from a Series, and a DataFrame can gain or lose rows and columns.
- **Sliceability**: Both structures support slicing operations and allow slicing with different styles of indexers.

### Practical Distinctions

- **Initial Construction**: Series can be built from a scalar, list, tuple, or dictionary. DataFrame, on the other hand, is commonly constructed using a dictionary of lists or another DataFrame.
- **External Data Source Interaction**: DataFrames can more naturally interact with external data sources, like CSV or Excel files, due to their tabular nature. The sketch below makes the contrast concrete.
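
Here is a minimal sketch contrasting the two structures (the labels and values are illustrative):

```python
import pandas as pd

# A Series: one-dimensional, homogeneous values with a labeled index
s = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')
print(s['Bob'])       # label-based access -> 30

# A DataFrame: two-dimensional; columns may have different dtypes
df = pd.DataFrame({
    'Age': [25, 30, 35],                # int64 column
    'Department': ['HR', 'IT', 'HR']    # object (string) column
}, index=['Alice', 'Bob', 'Charlie'])

print(df['Age'])      # selecting a column returns a Series
print(df.loc['Bob'])  # selecting a row also returns a Series
print(df.dtypes)      # per-column dtypes show the heterogeneity
```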
<br>
## 3. How can you read and write data from and to a _CSV file_ in _Pandas_?

**Pandas** makes reading from and writing to **CSV files** straightforward.

### Reading a CSV File

You can read a CSV file into a DataFrame using the `pd.read_csv()` function. Here is the code:

```python
import pandas as pd
df = pd.read_csv('filename.csv')
```

### Configuring the Read Operation

- **Header**: By default, the first row of the file is used as column names. If your file doesn't have a header, set `header=None`.

- **Index Column**: Select a column to be used as the row index by passing the column name or position to the `index_col` parameter.

- **Data Types**: Let Pandas infer data types or specify them explicitly through the `dtype` parameter.

- **Text Parsing**: Handle non-standard delimiters or separators using `sep` (or its alias `delimiter`).

- **Date Parsing**: Have columns parsed as datetimes using the `parse_dates` parameter.
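
A sketch combining several of these options (the file name and column names are placeholders):

```python
import pandas as pd

# Hypothetical file: ';'-separated, no header row, with an order-date column
df = pd.read_csv(
    'orders.csv',                         # placeholder path
    sep=';',
    header=None,
    names=['order_id', 'order_date', 'amount'],
    index_col='order_id',
    dtype={'amount': 'float64'},
    parse_dates=['order_date'],
)
```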
### Writing to a CSV File

You can export your DataFrame to a CSV file using the `.to_csv()` method.

```python
df.to_csv('output.csv', index=False)
```

- **Index**: If you don't want to save the index, set `index` to `False`.

- **Specifying Delimiters**: If you need to use a different delimiter, e.g., tabs, use the `sep` parameter.

- **Handling Missing Values**: Choose a representation for missing values, such as `na_rep='NA'`.

- **Encoding**: Use the `encoding` parameter to specify the file encoding, such as 'utf-8' or 'latin1'.

- **Date Format**: Use the `date_format` parameter to control how datetime values are written, e.g., `date_format='%Y-%m-%d'`.

- **Compression**: If your data is large, use the `compression` parameter to save disk space, e.g., `compression='gzip'` for compressed files.

#### Example: Writing a DataFrame to CSV

Here is a code example:

```python
import pandas as pd

data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 22, 28]
}

df = pd.DataFrame(data)

df.to_csv('people.csv', index=False, header=True)
```

<br>
## 4. What are _Pandas indexes_, and how are they used?

In **Pandas**, an index labels the rows of a DataFrame, making it powerful for data retrieval, alignment, and manipulation.

### Types of Indexes

1. **Default, Implicit Index**: Automatically generated for each row (e.g., 0, 1, 2, ...).

2. **Explicit Index**: A user-defined index whose labels don't need to be unique.

3. **Unique Index**: All labels are unique, analogous to the primary key column of a database table.

4. **Non-Unique Index**: Labels may repeat, so several rows can share the same label.

5. **Hierarchical (MultiIndex)**: Uses multiple levels to form an identifier for each row.

### Key Functions for Indexing

- **`loc`**: Uses labels to retrieve rows and columns.
- **`iloc`**: Uses integer positions for row and column selection.

### Operations with Indexes

- **Reindexing**: Changing the index while maintaining data integrity.
- **Set operations**: Union, intersection, difference, etc.
- **Alignment**: Matches rows based on index labels.

### Code Example: Index Types

```python
# Creating DataFrames with different types of indexes
import pandas as pd

# Implicit index
df_implicit = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 2, 3]})

# Explicit index
df_explicit = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 2, 3]}, index=['X', 'Y', 'Z'])

# Unique index
df_unique = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [1, 2, 3]}, index=['1st', '2nd', '3rd'])

# Non-unique index
df_non_unique = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [1, 2, 3, 4]}, index=['E', 'E', 'F', 'G'])

# Hierarchical (MultiIndex): multiple levels jointly identify each row
df_hierarchical = pd.DataFrame(
    {'value': [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples(
        [('x', 1), ('x', 2), ('y', 1), ('y', 2)], names=['letter', 'number'])
)

print(df_implicit, df_explicit, df_unique, df_non_unique, df_hierarchical, sep='\n\n')
```
### Code Example: Key Operations

```python
# Setting up a DataFrame for operation examples
import pandas as pd

data = {
    'A': [1, 2, 3, 4, 5],
    'B': [10, 20, 30, 40, 50],
    'C': [100, 200, 300, 400, 500]
}

df = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e'])

# Reindexing
new_index = ['a', 'e', 'c', 'b', 'd']  # Reordering index labels
df_reindexed = df.reindex(new_index)

# Set operations
index1 = pd.Index(['a', 'b', 'c', 'f'])
index2 = pd.Index(['b', 'd', 'c'])
print(index1.intersection(index2))  # Intersection
print(index1.difference(index2))    # Difference

# Data alignment
data2 = {'A': [6, 7], 'B': [60, 70], 'C': [600, 700]}
df2 = pd.DataFrame(data2, index=['a', 'c'])
print(df + df2)  # Aligned on index labels; rows missing from df2 become NaN
```

<br>
## 5. How do you handle _missing data_ in a _DataFrame_?

Dealing with **missing data** is a common challenge in data analysis. **Pandas** offers flexible tools for handling missing data in DataFrames.

### Common Techniques for Handling Missing Data in Pandas

#### Dropping Missing Values

This is the most straightforward option, but it might lead to data loss.

- **Drop NaN Rows**: `df.dropna()`
- **Drop NaN Columns**: `df.dropna(axis=1)`

#### Filling Missing Values

Instead of dropping missing data, you can choose to fill it with a certain value.

- **Fill with a Constant Value**: `df.fillna(value)`
- **Forward Filling**: `df.ffill()` (the older `df.fillna(method='ffill')` is deprecated)
- **Backward Filling**: `df.bfill()`

#### Interpolation

Pandas gives you the option to use various interpolation methods like linear, time-based, or polynomial.

- **Linear Interpolation**: `df.interpolate(method='linear')`

#### Masking

You can create a mask to locate and replace missing data.

- **Create a Mask**: `mask = df.isna()`
- **Replace Where the Mask Is True**: `df.mask(mask, other=replacement_value)`

#### Tagging

Categorize or "tag" missing data to handle it separately.

- **Select Missing Data**: `missing_data = df[df['column'].isna()]`

#### Advanced Techniques

Pandas supports more refined strategies:

- **Apply a Function**: `df.apply()`
- **Use External Libraries for Advanced Imputation**: Libraries like `scikit-learn` offer sophisticated imputation techniques.

#### Visualizing Missing Data

The `missingno` library provides a quick way to visualize missing data in pandas DataFrames using matrix and heatmap plots. This can help to identify patterns in missing data that might not be immediately obvious from summary statistics.
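
A minimal sketch (assuming `missingno` and matplotlib are installed):

```python
import missingno as msno
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, np.nan, 3, 4], 'B': [np.nan, 2, np.nan, 4]})

msno.matrix(df)    # per-cell view of present vs. missing values
msno.heatmap(df)   # heatmap of nullity correlation between columns
plt.show()
```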
<br>
## 6. Discuss the use of `groupby` in _Pandas_ and provide an example.

**GroupBy** in Pandas lets you **aggregate** and **analyze** data based on specific features or categories. This enables powerful split-apply-combine operations, especially when combined with **`agg`** to define multiple aggregations at once.

### Core Functions in GroupBy

- **Split-Apply-Combine**: This technique segments data into groups, applies a designated function to each group, and then merges the results.

- **Lazy Evaluation**: Calling `groupby()` does little work by itself; the actual computation happens only when an aggregation or transformation is requested. The group structure is computed once and reused for subsequent operations on the same `GroupBy` object, which keeps memory use efficient.

### Key Methods

- **`groupby()`**: Divides a DataFrame into groups based on specified keys, often column names. This returns a `GroupBy` object to which aggregations or other functions can be applied.

- **Aggregation Functions**: These generate aggregated summaries once the data has been divided into groups. Commonly used aggregation functions include `sum`, `mean`, `median`, `count`, and `std`.

- **Chaining GroupBy Methods**: `.filter()`, `.apply()`, `.transform()`.

### Practical Applications

- **Statistical Summary by Category**: Quickly compute metrics such as average or median for segregated data.

- **Data Quality**: Pinpoint categories with certain characteristics, like groups with more missing values.

- **Splitting by Predicate**: Employ `.filter()` to keep only the groups that match user-specified criteria.

- **Normalized Data**: Deploy `.transform()` to standardize or normalize data within group partitions.

### Code Example: GroupBy & Aggregation

Consider this dataset of car sales:

| Car    | Category | Price | Units Sold |
|--------|----------|-------|------------|
| Honda  | Sedan    | 25000 | 120        |
| Honda  | SUV      | 30000 | 100        |
| Toyota | Sedan    | 23000 | 150        |
| Toyota | SUV      | 28000 | 90         |
| Ford   | Sedan    | 24000 | 110        |
| Ford   | Pickup   | 35000 | 80         |

We can compute the following using **GroupBy** and **aggregation**, as the sketch below shows:

1. **Category-Wise Sales**:
   - Sum of "Units Sold"
   - Average "Price"

2. General computations:
   - Total car sales
   - Maximum car price
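
One way to express these computations:

```python
import pandas as pd

df = pd.DataFrame({
    'Car':        ['Honda', 'Honda', 'Toyota', 'Toyota', 'Ford', 'Ford'],
    'Category':   ['Sedan', 'SUV', 'Sedan', 'SUV', 'Sedan', 'Pickup'],
    'Price':      [25000, 30000, 23000, 28000, 24000, 35000],
    'Units Sold': [120, 100, 150, 90, 110, 80],
})

# 1. Category-wise sales: total units and average price per category
per_category = df.groupby('Category').agg(
    total_units=('Units Sold', 'sum'),
    avg_price=('Price', 'mean'),
)
print(per_category)

# 2. General computations over the whole table
print(df['Units Sold'].sum())  # total cars sold
print(df['Price'].max())       # maximum car price
```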
<br>
## 7. Explain the concept of _data alignment_ and _broadcasting_ in _Pandas_.

**Data alignment** and **broadcasting** are two mechanisms that enable pandas to manipulate datasets with differing shapes efficiently.

### Data Alignment

Pandas operations, such as addition, produce a result indexed by the union of the source indexes: values are matched by index label, and labels present in only one operand yield NaN. This label-based matching is known as **data alignment**.

This behavior is particularly useful when handling data where values may be missing.

#### How It Works

Take two DataFrames, `df1` and `df2`, sharing only some of their index labels:

- `df1`:
  - Index: A, B, C
  - Column: X
  - Values: 1, 2, 3

- `df2`:
  - Index: B, C, D
  - Column: X
  - Values: 4, 5, 6

When you perform an addition (`df1 + df2`), pandas adds values that share the same index label.
The resulting DataFrame has:
- Index: A, B, C, D
- Column: X
- Values: NaN, 6, 8, NaN

### Broadcasting

Pandas efficiently manages operations between objects of different dimensions or shapes through **broadcasting**.

It employs a set of rules that allow operations to be performed on datasets even if they don't perfectly align in shape or size.

Key broadcasting rules:
1. **Scalar**: Any scalar value can operate on an entire Series or DataFrame.
2. **Vector-Vector**: Operations occur pairwise: each element in one dataset aligns with the element in the same position in the other dataset.
3. **Vector-Scalar**: The scalar is applied to each element of the vector.

#### Example: Add a Scalar to a Series

Consider a Series `s`:

```plaintext
Index: 0, 1, 2
Value: 3, 4, 5
```

Now, perform `s + 2`. This adds 2 to each element of the Series, resulting in:

```plaintext
Index: 0, 1, 2
Value: 5, 6, 7
```
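
Both behaviors in a short, self-contained sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['A', 'B', 'C'])
s2 = pd.Series([4, 5, 6], index=['B', 'C', 'D'])

# Alignment: union index A-D; NaN where a label exists in only one operand
print(s1 + s2)

# Broadcasting: the scalar is applied to every element
print(s1 + 2)
```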
<br>
## 8. What is _data slicing_ in _Pandas_, and how does it differ from _filtering_?

**Data slicing** and **filtering** are distinct techniques used for extracting subsets of data in **Pandas**.

### Key Distinctions

- **Slicing** selects contiguous rows and columns based on their order or position within the DataFrame. It is a locational reference.

- **Filtering**, however, selects rows conditionally, based on specific criteria or labels.

### Code Example: Slicing vs Filtering

Here is the Python code:

```python
import pandas as pd

# Create a simple DataFrame
data = {'A': [1, 2, 3, 4, 5],
        'B': [10, 20, 30, 40, 50],
        'C': ['foo', 'bar', 'baz', 'qux', 'quux']}
df = pd.DataFrame(data)

# Slicing: 'iloc' is a positional selection method
sliced_df = df.iloc[1:4, 0:2]
print("Sliced DataFrame:")
print(sliced_df)

# Filtering: a boolean expression inside the brackets selects rows
filtered_df = df[df['A'] > 2]
print("\nFiltered DataFrame:")
print(filtered_df)
```
<br>
## 9. Describe how _joining_ and _merging_ data works in _Pandas_.

**Pandas** provides versatile methods for combining and linking datasets, with the two main approaches being **`join`** and **`merge`**. Let's explore how they operate.

### Join: Relating DataFrames on Index or Column

The `join` method is a convenient way to link DataFrames, aligning the caller's index (or a stated column) against the other frame's index.

#### Types of Joins

- **Left Join** (the default, `how='left'`): Keeps every row of the calling DataFrame.
- **Inner Join**: Retains only the labels common to both DataFrames.
- **Outer Join**: Keeps all labels from either DataFrame.

Joins can be performed on the indexes (`df1.join(df2)`) or between a column of the caller and the other frame's index (`df1.join(df2, on='column_key')`).

#### Example: Joins on Index and Column

A sketch with small illustrative frames:

```python
import pandas as pd

df1 = pd.DataFrame({'value1': [1, 2, 3]}, index=['a', 'b', 'c'])
df2 = pd.DataFrame({'value2': [10, 20]}, index=['a', 'b'])

# Join on the indexes, keeping only labels present in both frames
result_inner_index = df1.join(df2, how='inner')

# Join df1's 'key' column against df2's index
df1['key'] = ['a', 'b', 'x']
result_inner_col = df1.join(df2, on='key', how='inner')
```

### Merge: Handling More Complex Join Scenarios

The `merge` function in pandas provides greater flexibility than `join`, accommodating a range of keys to combine DataFrames.

#### Join Types

- **Left Merge**: All entries from the left DataFrame are kept, with matching entries from the right DataFrame. Unmatched entries from the right get `NaN` values.
- **Right Merge**: Correspondingly keeps all entries of the right DataFrame.
- **Outer Merge**: Unites all entries from both DataFrames, filling any mismatches with `NaN` values.
- **Inner Merge** (the default): Selects only entries with matching keys in both DataFrames.

You can use the `how` parameter to specify the type of merge.

#### Code Example

A sketch with illustrative frames and key columns:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['a', 'b', 'c'], 'v1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'another_key': ['a', 'c', 'b'],
                    'v2': [10, 20, 30]})

# Left merge aligning on the 'key' column
left_merge = df1.merge(df2, on='key', how='left')

# Outer merge matching df1's 'key' against df2's 'another_key'
outer_merge = df1.merge(df2, left_on='key', right_on='another_key', how='outer')
```
<br>
## 10. How do you _apply_ a function to all elements in a _DataFrame_ column?

You can apply a function to all elements in a DataFrame column using **Pandas' `.apply()`** method, often with a **lambda function** for quick transformations.

### Using .apply() and Lambda Functions

The `.apply()` method runs a function on each element of a column. It is especially useful for transformation and calculation steps that have no built-in vectorized equivalent.

Here is the generic structure:

```python
# Assuming 'df' is your DataFrame and 'col_name' is the name of your column
df['col_name'] = df['col_name'].apply(lambda x: your_function(x))
```

You can tailor this approach to your particular transformation function.

### Example: Doubling Values in a Column

Let's say you want to double all values in a `scores` column of your DataFrame:

```python
import pandas as pd

# Sample DataFrame
data = {'names': ['Alice', 'Bob', 'Charlie'], 'scores': [80, 90, 85]}
df = pd.DataFrame(data)

# Doubling the scores using .apply() and a lambda function
df['scores'] = df['scores'].apply(lambda x: x * 2)

# Verify the changes
print(df)
```

### Considerations

- **Efficiency**: `.apply()` is essentially a Python-level loop, so for simple operations a truly vectorized expression is usually much faster; reserve `.apply()` for logic with no vectorized equivalent (see the sketch below).

- **No In-Place Option**: `.apply()` returns a new Series; assign the result back to the column, as there is no `inplace` parameter.
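
For the doubling example above, the vectorized form avoids `.apply()` altogether:

```python
import pandas as pd

df = pd.DataFrame({'names': ['Alice', 'Bob', 'Charlie'], 'scores': [80, 90, 85]})

# Vectorized arithmetic on the whole column; no Python-level loop
df['scores'] = df['scores'] * 2
print(df)
```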
<br>
## 11. Demonstrate how to handle _duplicate rows_ in a _DataFrame_.

Dealing with **duplicate rows** in a DataFrame is a common data cleaning task. The Pandas library provides simple, yet powerful, methods to identify and handle this issue.

### Identifying Duplicate Rows

You can use the `duplicated()` method to **identify rows that are duplicated**.

```python
import pandas as pd

# Create a sample DataFrame
data = {'A': [1, 1, 2, 2, 3, 3], 'B': ['a', 'a', 'b', 'b', 'c', 'c']}
df = pd.DataFrame(data)

# Identify duplicate rows
print(df.duplicated())  # False, True, False, True, False, True
```

### Dropping Duplicate Rows

You can use the `drop_duplicates()` method to **remove duplicated rows**.

```python
# Drop duplicates
unique_df = df.drop_duplicates()

# Alternatively, you can keep the last occurrence
last_occurrence_df = df.drop_duplicates(keep='last')

# To drop in place, use the `inplace` parameter
df.drop_duplicates(inplace=True)
```

### Carrying Out Aggregations

For numerical data, you can **aggregate** duplicates using functions such as mean or sum. This is beneficial when rows sharing a key carry varying values in other columns.

```python
# Rows sharing a key collapse to their mean
df_num = pd.DataFrame({'A': [1, 1, 2], 'value': [10.0, 14.0, 30.0]})
mean_df = df_num.groupby('A').mean()  # A=1 -> 12.0, A=2 -> 30.0
```

### Counting Duplicates

To **count the duplicate rows**, combine the `duplicated()` method with `sum()`. This gives the total number of rows flagged as duplicates.

```python
# Count duplicates
num_duplicates = df.duplicated().sum()
```

### Keeping the First or Last Occurrence

By default, `drop_duplicates()` keeps the **first occurrence** of a duplicated row.

If you prefer to keep the **last occurrence**, you can use the `keep` parameter.

```python
# Keep the last occurrence
df_last = df.drop_duplicates(keep='last')
```

### Leverage Unique Identifiers

Identifying duplicates might require considering only a subset of columns. For instance, in an orders dataset, two orders with the same order date might still be distinct because they involve different products. **Set** the subset of columns to consider with the `subset` parameter.

```python
# Consider only the 'A' column to identify duplicates
df_unique_A = df.drop_duplicates(subset=['A'])
```
<br>
## 12. Describe how you would convert _categorical data_ into _numeric format_.

**Converting categorical data to a numeric format**, a standard data pre-processing step, is fundamental for the many machine learning algorithms that can only handle numerical inputs.

There are **two common approaches**: **Label Encoding** and **One-Hot Encoding**.

### Label Encoding

Label Encoding replaces each category with a unique numerical label. This method is often used with **ordinal data**, where the categories have an inherent order.

For instance, the categories "Low", "Medium", and "High" can be encoded as 1, 2, and 3.

Here's the Python code using scikit-learn's `LabelEncoder`:

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = {'Size': ['S', 'M', 'L', 'XL', 'M']}
df = pd.DataFrame(data)

le = LabelEncoder()
df['Size_LabelEncoded'] = le.fit_transform(df['Size'])
print(df)
```

Note that `LabelEncoder` assigns labels in alphabetical order, so for truly ordinal data you may prefer an explicit mapping, e.g., `df['Size'].map({'S': 1, 'M': 2, 'L': 3, 'XL': 4})`.

### One-Hot Encoding

One-Hot Encoding creates a new binary column for each category in the dataset. For every row, only one of these columns will have a value of 1, indicating the presence of that category.

This method is ideal for **nominal data** that doesn't indicate any order.

Here's the Python code using pandas' `get_dummies`:

```python
# One-hot encode the 'Size' column; each category becomes a binary column
df = pd.get_dummies(df, prefix=['Size'], columns=['Size'])

print(df)
```
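
If you need a reusable fit/transform object (e.g., inside a scikit-learn pipeline), `OneHotEncoder` is the equivalent tool. A sketch, assuming scikit-learn >= 1.2 for the `sparse_output` argument:

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

df = pd.DataFrame({'Size': ['S', 'M', 'L', 'XL', 'M']})

enc = OneHotEncoder(sparse_output=False)  # use sparse=False on older versions
encoded = enc.fit_transform(df[['Size']])
encoded_df = pd.DataFrame(encoded, columns=enc.get_feature_names_out(['Size']))
print(encoded_df)
```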
<br>
## 13. How can you _pivot_ data in a _DataFrame_?

In **Pandas**, you can pivot data using the `pivot` or `pivot_table` methods. These functions restructure data in various ways, such as converting unique values into column headers or aggregating values where duplicates exist for specific combinations of row and column indices.

### Key Methods

- **`pivot`**: Works well with a simple DataFrame but raises an error if there are duplicate entries for the same combination of index and column labels.

- **`pivot_table`**: Provides more flexibility and robustness. It handles duplicates by aggregating them and supports varying levels of aggregation.

### Code Example: Pivoting with `pivot`

Here is the Python code:

```python
import pandas as pd

# Sample data
data = {
    'Date': ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'Category': ['A', 'B', 'A', 'B'],
    'Value': [10, 20, 30, 40]
}

df = pd.DataFrame(data)

# Pivot the DataFrame
pivot_df = df.pivot(index='Date', columns='Category', values='Value')
print(pivot_df)
```

#### Output

The pivoted DataFrame is a transformation of the original one:

| Date       | A  | B  |
|------------|----|----|
| 2020-01-01 | 10 | 20 |
| 2020-01-02 | 30 | 40 |

### Code Example: Pivoting with `pivot_table`

Here is the Python code; note the duplicate ('2020-01-01', 'A') rows, which `pivot` would reject:

```python
import pandas as pd

# Sample data with a duplicate Date/Category combination
data = {
    'Date': ['2020-01-01', '2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02'],
    'Category': ['A', 'A', 'B', 'A', 'B'],
    'Value': [10, 5, 20, 30, 40]
}

df = pd.DataFrame(data)

# Pivoting the DataFrame using pivot_table
pivot_table_df = df.pivot_table(index='Date', columns='Category', values='Value', aggfunc='sum')
print(pivot_table_df)
```

#### Output

The pivot table presents the sum of values for unique combinations of 'Date' and 'Category'. **Notice how the duplicate entries for ('2020-01-01', 'A') have been automatically aggregated into 15**.

| Date       | A  | B  |
|------------|----|----|
| 2020-01-01 | 15 | 20 |
| 2020-01-02 | 30 | 40 |
<br>
## 14. Show how to apply _conditional logic_ to columns using the `where()` method.

The `where()` method in **Pandas** enables conditional logic on columns, providing a more streamlined alternative to `loc` or `if-else` statements.

The method keeps values in a **DataFrame** or **Series** where a condition holds and replaces them where it does not.

### `where()` Basics

Here are some important key points:
- **Semantics**: `df.where(cond)` keeps each value for which `cond` is True and sets the rest to `NaN`.
- **Parameters**: `cond` is the boolean condition; the optional `other` supplies replacement values used where the condition is False (default `NaN`).

### Code Example: `where()`

Here is the Python code:

```python
import pandas as pd

# Sample data
data = {'A': [1, 2, 3, 4, 5], 'B': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Without 'other': cells failing the condition become NaN
print(df.where(df > 2))
#      A   B
# 0  NaN  10
# 1  NaN  20
# 2  3.0  30
# 3  4.0  40
# 4  5.0  50

# With 'other': failing cells are replaced instead
print(df.where(df > 2, df * 10))
#     A   B
# 0  10  10
# 1  20  20
# 2   3  30
# 3   4  40
# 4   5  50
```

In this example:
- With only the condition, values less than or equal to 2 become `NaN`.
- With `other=df * 10`, those same cells are replaced by ten times their original value, and all other cells are kept unchanged.
<br>
## 15. What is the purpose of the `apply()` function in _Pandas_?

The **`apply()`** function in **Pandas** applies a given function along the rows or columns of a DataFrame, supporting advanced data manipulation tasks.

### Practical Applications

- **Row- or Column-Wise Operations**: Suitable for applying element-wise or aggregative functions, like sum, mean, or custom-defined operations, across rows or columns.

- **Aggregations**: Ideal for multiple computations, for example calculating totals and averages simultaneously.

- **Custom Operations**: Provides versatility for applying custom functions across data; for example, calculating the interquartile range of two columns.

- **Conciseness**: More compact than an explicit loop over rows or columns, though still slower than truly vectorized operations on large datasets.

### Syntax

The `apply()` function takes an `axis` argument that controls what the function receives:

```python
# Apply func to each row (axis=1, or equivalently axis='columns')
df.apply(func, axis=1)

# Apply func to each column (the default: axis=0, or equivalently axis='index')
df.apply(func, axis=0)
```

Here `func` denotes the function to apply, which can be a built-in function, a lambda, or a user-defined function.

### Code Example: `apply()`

Let's look at a simple code example to show how `apply()` can be used to calculate the difference between two columns.

```python
import pandas as pd

# Sample data
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Define the function to calculate the difference
def diff_calc(row):
    return row['B'] - row['A']

# Apply the function to each row (axis='columns' passes one row at a time)
df['Diff'] = df.apply(diff_calc, axis='columns')

print(df)
```

The output will be:

```
   A  B  Diff
0  1  4     3
1  2  5     3
2  3  6     3
```
<br>
#### Explore all 45 answers here 👉 [Devinterview.io - Pandas](https://devinterview.io/questions/machine-learning-and-data-science/pandas-interview-questions)

<br>

<a href="https://devinterview.io/questions/machine-learning-and-data-science/">
<img src="https://firebasestorage.googleapis.com/v0/b/dev-stack-app.appspot.com/o/github-blog-img%2Fmachine-learning-and-data-science-github-img.jpg?alt=media&token=c511359d-cb91-4157-9465-a8e75a0242fe" alt="machine-learning-and-data-science" width="100%">
</a>
