Tags: #aws #cloud #storage #S3bucket #operations #snippet #dataframe
Author: Maxime Jublou
Description: This notebook demonstrates how to send a pandas DataFrame to an AWS S3 bucket using awswrangler.
try:
    import awswrangler as wr
except ModuleNotFoundError:
    # Install awswrangler if it is not already available, then import it
    !pip install awswrangler --user
    import awswrangler as wr
import pandas as pd
from datetime import date
# Credentials
AWS_ACCESS_KEY_ID = "YOUR_AWS_ACCESS_KEY_ID"
AWS_SECRET_ACCESS_KEY = "YOUR_AWS_SECRET_ACCESS_KEY"
AWS_DEFAULT_REGION = "YOUR_AWS_DEFAULT_REGION"
# Bucket
BUCKET_PATH = f"s3://naas-data-lake/dataset/"
# Export the credentials as environment variables so awswrangler/boto3 can pick them up
%env AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
%env AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
%env AWS_DEFAULT_REGION=$AWS_DEFAULT_REGION
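As an optional sanity check, you can confirm which AWS identity the credentials resolve to. This is a minimal sketch using boto3 (installed as a dependency of awswrangler); the call only succeeds if the credentials above are valid.
import boto3

# Confirm which AWS account/identity the configured credentials belong to
boto3.client("sts").get_caller_identity()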
# Create a sample dataframe to send to S3
df = pd.DataFrame(
{
"id": [1, 2],
"value": ["foo", "boo"],
"date": [date(2020, 1, 1), date(2020, 1, 2)],
}
)
# Display dataframe
df
awswrangler provides three different write modes for storing Parquet datasets on Amazon S3:
- append (default): only adds new files, without deleting anything.
- overwrite: deletes everything in the target directory, then writes the new files.
- overwrite_partitions (partition upsert): only deletes the paths of the partitions that need to be updated, then writes the new partition files (see the sketch below).
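For example, a partition upsert could look like the following minimal sketch. It assumes the same df and BUCKET_PATH as above and partitions on the "date" column; adjust partition_cols to match your own schema.
# Partition the dataset by "date" and upsert only the affected partitions
wr.s3.to_parquet(
    df=df,
    path=BUCKET_PATH,
    dataset=True,
    mode="overwrite_partitions",
    partition_cols=["date"],
)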
# Write the dataframe to the bucket as a Parquet dataset, replacing any existing files
wr.s3.to_parquet(df=df, path=BUCKET_PATH, dataset=True, mode="overwrite")
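To check that the files landed in the bucket, you can read the dataset back into a dataframe; this is a minimal sketch assuming the same BUCKET_PATH and read access to the bucket.
# Read the Parquet dataset back from S3 into a dataframe
df_check = wr.s3.read_parquet(path=BUCKET_PATH, dataset=True)
df_check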