Usage of Glue Data Catalog with sagemaker_pyspark #109

@mattiamatrix

System Information

  • Spark or PySpark: PySpark
  • SDK Version: v1.2.8
  • Spark Version: v2.3.2
  • Algorithm (e.g. KMeans): n/a

Describe the problem

I'm following the instructions proposed HERE to connect a local Spark session, running in a SageMaker notebook, to the Glue Data Catalog of my account.

I know this is doable via EMR, but I'd like to do the same using a SageMaker notebook (or any other kind of separate Spark installation).

Minimal repo / logs

Below is the code I currently run in the notebook. It executes, but it doesn't actually work.

import sagemaker_pyspark
from pyspark.sql import SparkSession

classpath = ":".join(sagemaker_pyspark.classpath_jars())

spark = SparkSession.builder \
    .config("spark.driver.extraClassPath", classpath) \
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
    .config("hive.metastore.schema.verification", "false") \
    .enableHiveSupport() \
    .getOrCreate()
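
One check that might help narrow this down is whether any jar on the classpath built above actually contains the Glue client factory class named in the config. This is only a diagnostic sketch, not a fix: `jars_containing_class` is a hypothetical helper (not part of sagemaker_pyspark), and the idea that a missing class is the cause is an assumption, not something confirmed here.

```python
import zipfile
from typing import Iterable, List

# Class name taken from the hive.metastore.client.factory.class setting above.
GLUE_FACTORY = "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"


def jars_containing_class(jar_paths: Iterable[str], class_name: str) -> List[str]:
    """Hypothetical helper: return the jars that contain the given class."""
    # A class a.b.C is stored inside a jar as the zip entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for path in jar_paths:
        try:
            with zipfile.ZipFile(path) as jar:
                if entry in jar.namelist():
                    matches.append(path)
        except (OSError, zipfile.BadZipFile):
            # Skip missing or unreadable paths instead of aborting the scan.
            continue
    return matches


# In the notebook (requires sagemaker_pyspark to be installed):
#   import sagemaker_pyspark
#   print(jars_containing_class(sagemaker_pyspark.classpath_jars(), GLUE_FACTORY))
```

An empty list would suggest that the factory class is not available from the jars passed in `spark.driver.extraClassPath`, so it would have to come from somewhere else.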
