Skip to content

Latest commit

 

History

History
154 lines (123 loc) · 5.49 KB

YaspPlan.md

File metadata and controls

154 lines (123 loc) · 5.49 KB

YaspPlan

A YaspPlan is a model that define your ETL/EL jobs in terms of YaspAction.

Currently is defined as follow:

case class YaspPlan(
  actions: Seq[YaspAction] // A Sequence of YaspAction
)
  • actions [REQUIRED]: A List of YaspAction

YaspAction

A YaspAction is a list of actions that YaspService can execute.

Three main Actions is defined

YaspSource

A YaspSource is a model that define a data source.

case class YaspSource(
  id: String,                    // Unique ID to internally identify the action
  dataset: String,               // Dataset name, used to register the in memory dataset
  source: Source,                // Source Sum Type
  partitions: Option[Int],       // Optional number of partitions 
  cache: Option[CacheLayer]      // Optional CacheLayer that will be used to cache resulting data  
  dependsOn: Option[Seq[String]] // Optional dependency
) extends YaspAction
  • id [REQUIRED]: String ID to internally identify the data
  • source [REQUIRED]: A Source Sum Type. Valid values are Format and HiveTable.
  • partitions [OPTIONAL]: A number of partitions used to repartition the data.
  • cache [OPTIONAL]: An optional CacheLayer Sum Type. Valid values are:
    • Memory|memory|MEMORY,
    • Disk|disk|DISK
    • MemoryAndDisk|memodyanddisk|MEMORYANDDISK|memory_and_disk|MEMORY_AND_DISK
    • MemorySer|memoryser|MEMORYSER|memory_ser|MEMORY_SER
    • MemoryAndDiskSer|memoryanddiskser|MEMORYANDDISKSER|memory_and_disk_ser|MEMORY_AND_DISK_SER
    • Checkpoint|checkpoint|CHECKPOINT

Each YaspSource are loaded via a YaspLoader that read, repartition and cache the data to a specific CacheLayer and register it as a temporary table with the id provided.

An example of a full yaml configuration with Spark format:

source: 
  id: my_source                  # Source id
  source:                        # Source configuration
    dataset: mydata              # Source dataset name
    format: jdbc                 # Standard jdbc Spark format
    options:                     # Standard jdbc Spark format option
        url: my-jdbc-endpoint    # Standard jdbc Spark url config
        user: my-user            # Standard jdbc Spark user config
        password: my-pwd         # Standard jdbc Spark pwd config  
        dbTable: db.table        # Standard jdbc Spark dbTable config
    partitions: 500              # Partitions number
    cache: checkpoint            # Cache layer to checkpoint

An example of a full yaml configuration with Hive table

source: 
  id: my_source              # Source id
  source:                    # Source configuration
    dataset: mydata          # Source dataset name
    table: my_hive_table     # Hive table name  
  partitions: 500            # Partitions number
  cache: checkpoint          # Cache layer to checkpoint

YaspProcess

A YaspProcess is a model that define a data process operation.

Currently is defined as follow:

case class YaspProcess(
  id: String,                 // Unique ID to internally identify the action
  dataset: String,            // Dataset name used to register the outcome of the process
  process: Process,           // Process Sum Type
  partitions: Option[Int],    // Optional number of partitions
  cache: Option[CacheLayer]   // Optional CacheLayer that will be used to cache resulting dataframe
) extends YaspAction

Each YaspProcess are executed via a YaspProcessor that execute the Process, repartition and cache the data to a specific CacheLayer and register it as a temporary table with the id provided.

An example of a full yaml configuration with Hive table

process: 
  id: my_source         # Process id
  dataset: mydata       # Dataset name
  process:              # Process field, will contains all the Process configuration to transform the data
    query: >-           # A sql process configuration
      SELECT * 
      FROM my_csv  
  partitions: 500        # Partitions number
  cache: checkpoint      # Cache layer to checkpoint

YaspSink

A YaspSink define a data output operation.

case class YaspSink(
  id: String,       // Unique ID.
  dataset: String,  // Name of the dataset to sink.
  dest: Dest        // Dest Sum Type
) extends YaspAction

Each YaspSink are executed via a YaspWriter that retrieve the specific data using the provided id and write them to the specified destination.

An example of a full yaml configuration with Spark format:

dest: 
  id: my_source                  # Source id
  dataset: mydata                # Name of the dataset to sink
  dest:                          # Source configuration
    format: csv                  # Standard csv Spark format
    options:                     # Standard csv Spark format option
        header: 'true'           # Standard csv Spark url config
        sep: '|'                 # Standard csv Spark user config
        path: my/output/path     # Standard csv Spark dbTable config
    partitionBy:                 # PartitionBy columns configuration
        - nation                 
        - city
    mode: overwrite              # Standard Spark SaveMode

An example of a full yaml configuration with HiveFormat:

dest: 
  id: my_source                  # Source id
  dataset: mytable               # Name of the dataset to sink
  dest:                          # Source configuration
    table: my_hive_table         # Hive table name
    mode: overwrite              # Standard Spark SaveMode