Skip to content

Latest commit

 

History

History
59 lines (38 loc) · 2.12 KB

Pandasql_Query_Parquet_Using_SQL.md

File metadata and controls

59 lines (38 loc) · 2.12 KB



Template request | Bug report | Generate Data Product

Tags: #pandas #parquet #snippet #read #dataframe #sql #pandasql #operations

Author: Minura Punchihewa

Description: This notebook demonstrates how to use Pandasql to query Parquet files as if they were relational databases, using SQL syntax. The aim is to provide an alternative to traditional Pandas methods for filtering, grouping, and aggregating data, and make it easier for users who are familiar with SQL to perform these tasks.

Input

Import libraries

import pandas as pd
try:
    from pandasql import sqldf
except:
    !pip install pandasql --user
    from pandasql import sqldf

Setup Variables

# Inputs
file_path = "/home/minura/Documents/data/iris.parquet"
query = """SELECT sepal_length, sepal_width, variety FROM df;"""  # query to be executed

Model

Read Parquet file into DataFrame

# Read Parquet file into DataFrame
df = pd.read_parquet(file_path)
df

Use sqldf to execute the query

# Use sqldf to execute the query
output_df = sqldf(query)

Output

Display the resulting DataFrame

The output of the code will be a DataFrame containing only the A column from the original DataFrame df.

output_df