Tags: #aws #cloud #storage #S3bucket #operations #snippet #dataframe
Author: Maxime Jublou
Description: This notebook demonstrates how to send a pandas DataFrame to an AWS S3 bucket using awswrangler.
try:
    import awswrangler as wr
except ModuleNotFoundError:
    # Install awswrangler if it is not already available, then import it
    !pip install awswrangler --user
    import awswrangler as wr
import pandas as pd
from datetime import date
# Credentials
AWS_ACCESS_KEY_ID = "YOUR_AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "YOUR_AWS_SECRET_ACCESS_KEY"
AWS_DEFAULT_REGION = "YOUR_AWS_DEFAULT_REGION"
# Bucket
BUCKET_PATH = f"s3://naas-data-lake/dataset/"
# Export the credentials as environment variables so awswrangler/boto3 can pick them up
%env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
%env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
%env AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
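As an optional sanity check, you can confirm which AWS identity the credentials resolve to. This is a minimal sketch using boto3 (installed as a dependency of awswrangler); the call only succeeds if the credentials above are valid.
import boto3

# Confirm which AWS account/identity the configured credentials belong to
boto3.client("sts").get_caller_identity()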
# Create a sample dataframe to send to S3
df = pd.DataFrame(
{
"id": [1, 2],
"value": ["foo", "boo"],
"date": [date(2020, 1, 1), date(2020, 1, 2)],
}
)
# Display dataframe
df
awswrangler provides three different write modes for storing Parquet datasets on Amazon S3:
- append (default): only adds new files, without deleting anything.
- overwrite: deletes everything in the target directory, then writes the new files.
- overwrite_partitions (partition upsert): only deletes the paths of the partitions that need to be updated, then writes the new partition files (see the sketch below).
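For example, a partition upsert could look like the following minimal sketch. It assumes the same df and BUCKET_PATH as above and partitions on the "date" column; adjust partition_cols to match your own schema.
# Partition the dataset by "date" and upsert only the affected partitions
wr.s3.to_parquet(
    df=df,
    path=BUCKET_PATH,
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["date"],
)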
# Write the dataframe to the bucket as a Parquet dataset, replacing any existing files
wr.s3.to_parquet(df=df, path=BUCKET_PATH, dataset=True, mode="overwrite")
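To check that the files landed in the bucket, you can read the dataset back into a dataframe; this is a minimal sketch assuming the same BUCKET_PATH and read access to the bucket.
# Read the Parquet dataset back from S3 into a dataframe
df_check = wr.s3.read_parquet(path=BUCKET_PATH, dataset=True)
df_check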