[ENV-413] Support examples for C5 and C6 (#287)
Jeremy Beard authored and Ian Buss committed Apr 17, 2019
1 parent 2ec2df0 commit e7d2b95
Showing 17 changed files with 173 additions and 103 deletions.
20 changes: 14 additions & 6 deletions README.md
@@ -20,16 +20,21 @@ Envelope requires Apache Spark 2.1.0 or above.
Additionally, if using these components, Envelope requires:
- Apache Kafka 0.10 or above
- Apache Kudu 1.4.0 or above
- Apache HBase 1.2.0 or above (note: 2.x has not yet been tested)
- Apache HBase 1.2.0 or above
- Apache ZooKeeper 3.4.5 or above
- Apache Impala 2.7.0 or above

For Cloudera's distributions, Kafka requires Cloudera's Kafka 2.1.0 or above, and HBase and ZooKeeper require CDH5.7 or above. Note that CDH6.x has not yet been tested.
For Cloudera CDH 5, Kafka requires Cloudera's Kafka 2.1.0 or above, HBase and ZooKeeper require CDH 5.7 or above, and Impala requires CDH 5.9 or above. For Cloudera CDH 6, CDH 6.0 or above is required.

### Compiling Envelope
### Downloading Envelope

You can build the Envelope application from the top-level directory of the source code by running the Maven command:
Envelope and its dependencies can be downloaded as a single jar file from the GitHub repository [Releases page](https://github.com/cloudera-labs/envelope/releases).
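
For example, the download can be scripted roughly as follows (illustrative only; substitute the actual release tag and jar file name shown on the Releases page):

    wget https://github.com/cloudera-labs/envelope/releases/download/<release-tag>/envelope-<version>.jar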

mvn clean package
### Compiling Envelope

Alternatively, you can build the Envelope application from the top-level directory of the source code by running the Maven command:

mvn clean install

This will create `envelope-0.7.0-SNAPSHOT.jar` in the `build/envelope/target` directory.
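
To confirm the build output, a quick listing of the target directory (path as above) should show the jar:

    ls build/envelope/target/envelope-*.jar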

@@ -43,12 +48,15 @@ Envelope provides these example pipelines that you can run for yourself:
- [Traffic](examples/traffic/): simulates receiving traffic conditions and calculating an aggregate view of traffic congestion.
- [Filesystem](examples/filesystem/): demonstrates a batch job that reads a JSON file from HDFS and writes the data back to Avro files on HDFS.
- [Cloudera Navigator](examples/navigator/): implements a streaming job to ingest audit events from Cloudera Navigator into Kudu, HDFS and Solr.
- [Impala DDL](examples/impala_ddl): demonstrates updating Impala metadata, such as adding partitions and refreshing table metadata.

### Running Envelope

You can run Envelope by submitting it to Spark with the configuration file for your pipeline:

spark2-submit envelope-0.7.0-SNAPSHOT.jar yourpipeline.conf
spark-submit envelope-0.7.0-SNAPSHOT.jar your_pipeline.conf

Note: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

A helpful place to monitor your running pipeline is from the Spark UI for the job. You can find this via the YARN ResourceManager UI.
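
If you prefer the command line, a sketch like the following (assuming the YARN CLI is available on the host) lists running applications along with their tracking URLs, which link to the Spark UI:

    yarn application -list -appStates RUNNING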

6 changes: 4 additions & 2 deletions docs/decisions.adoc
@@ -12,7 +12,7 @@ A decision step can make a decision using one of three methods, which are outlin

=== Literal

The `literal` decision method takes the true or false result directly from the configuration of the decision step. This method would be useful if the result is provided by a parameter, which in turn can be populated by a `spark2-submit` argument or an environment variable.
The `literal` decision method takes the true or false result directly from the configuration of the decision step. This method would be useful if the result is provided by a parameter, which in turn can be populated by a `spark-submit` argument or an environment variable.

In this self-contained example the value of the `${result}` parameter will determine whether `run_if_true` and `run_after_run_if_true`, or `run_if_false` and `run_after_run_if_false`, are run:

@@ -62,7 +62,9 @@ steps {

This pipeline could be run with `${result}` populated by using an argument after the configuration file:

spark2-submit envelope-*.jar pipeline.conf result=true
spark-submit envelope-*.jar pipeline.conf result=true

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

=== Step by key

6 changes: 4 additions & 2 deletions docs/derivers.adoc
@@ -58,7 +58,7 @@ This shows both methods for populating parameters. In this example `${DECIMAL_PL

The `morphline` deriver is used to run Morphline transformations over the records of a single dependency of the step defined by the `step.name` parameter.

The Morphline transformation is provided to the Envelope pipeline as a local file on the Spark executors. The local file is retrieved from the location given by the `morphline.file` configuration. The local file can be provided to the Spark executors from `spark2-submit` using the `--files` option.
The Morphline transformation is provided to the Envelope pipeline as a local file on the Spark executors. The local file is retrieved from the location given by the `morphline.file` configuration. The local file can be provided to the Spark executors from `spark-submit` using the `--files` option.

The ID of the specific transformation within the Morphline file is specified with the `morphline.id` configuration.
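
As a rough sketch (file names are illustrative), shipping the Morphline file alongside the pipeline submission looks like:

    spark-submit --files your_morphline.conf envelope-*.jar your_pipeline.conf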

@@ -711,7 +711,9 @@ To use an alias in configuration files, Envelope needs to be able to find your c

With the project compiled into a jar file, the deriver can be submitted as part of the Envelope pipeline similarly to:

spark2-submit --jars customderiver.jar envelope-*.jar pipeline.conf
spark-submit --jars customderiver.jar envelope-*.jar pipeline.conf

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

The jar file can contain multiple derivers, and other pluggable classes such as custom inputs, outputs, etc.

8 changes: 5 additions & 3 deletions docs/inputs.adoc
@@ -67,15 +67,17 @@ The system environment variable can be provided as follows when executed in clie
----
export KUDU_MASTER_CONNECTION_STRING="master1:7051,master2:7051,master3:7051"
spark2-submit --master yarn --deploy-mode client envelope-<version>.jar application.conf
spark-submit --master yarn --deploy-mode client envelope-<version>.jar application.conf
----

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

===== yarn-cluster mode

The system environment variable can be provided as follows when executed in cluster deployment mode:

----
spark2-submit --master yarn --deploy-mode cluster --files application.conf --conf spark.yarn.appMasterEnv.KUDU_MASTER_CONNECTION_STRING="master1:7051,master2:7051,master3:7051" envelope-<version>.jar application.conf
spark-submit --master yarn --deploy-mode cluster --files application.conf --conf spark.yarn.appMasterEnv.KUDU_MASTER_CONNECTION_STRING="master1:7051,master2:7051,master3:7051" envelope-<version>.jar application.conf
----

=== Kafka
@@ -217,7 +219,7 @@ _Documentation for this translator has not yet been added._

In cases where Envelope does not provide an input or translator for a required data source, a custom class can be developed and referenced in the Envelope pipeline.

To create a batch input, implement the `BatchInput` interface; to create a stream input, implement the `StreamInput` interface. Translators must implement the `Translator` interface. With the implemented class compiled into its own jar file, the input or translator can be referenced in the pipeline by using the fully qualified class name (or alias -- see below) as the input `type`, and it can be provided to the Envelope application using the `--jars` argument when calling `spark2-submit`.
To create a batch input, implement the `BatchInput` interface; to create a stream input, implement the `StreamInput` interface. Translators must implement the `Translator` interface. With the implemented class compiled into its own jar file, the input or translator can be referenced in the pipeline by using the fully qualified class name (or alias -- see below) as the input `type`, and it can be provided to the Envelope application using the `--jars` argument when calling `spark-submit`.
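
For instance, a submission that provides a custom input or translator jar might look like the following sketch (jar and configuration file names are illustrative):

    spark-submit --jars custominput.jar envelope-*.jar your_pipeline.conf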

=== Using Aliases

4 changes: 3 additions & 1 deletion docs/looping.adoc
@@ -62,7 +62,9 @@ steps {

The pipeline could be run with the dates provided as follows:

spark2-submit envelope-*.jar process_dates.conf first_date=20170501,last_date=20170531
spark-submit envelope-*.jar process_dates.conf first_date=20170501,last_date=20170531

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

When Envelope reaches the loop step (after `read_month`) it would unroll the loop into:

13 changes: 8 additions & 5 deletions docs/security.adoc
@@ -32,12 +32,15 @@ achieve this is via something like the following:
----
export EXTRA_CP=/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar
spark2-submit \
spark-submit \
--driver-class-path ${EXTRA_CP} \
--conf spark.executor.extraClassPath=${EXTRA_CP} \
envelope-*.jar envelope_app.conf
----

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.
CDH5 clusters may also require the environment variable `SPARK_KAFKA_VERSION` to be set to `0.10`.

Kudu:: If using a bulk planner (`append`, `delete`, `overwrite`, `upsert`), there is no extra
configuration required to work with secure Kudu clusters--the underlying `KuduContext` automatically handles the security.
+
@@ -62,7 +65,7 @@ variable as follows:
+
----
HADOOP_CONF_DIR=/etc/hbase/conf:/etc/hive/conf:/etc/spark2/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar \
spark2-submit \
spark-submit \
--deploy-mode cluster \
--files jaas.conf,envelope_app.conf,kafka.kt \
--driver-java-options=-Djava.security.auth.login.config=jaas.conf \
Expand All @@ -77,7 +80,7 @@ Kudu:: Kudu requires a keytab to operate correctly in cluster mode. Additionally
`secure = true` is required in the step config as in client mode. An example submission would be:
+
----
spark2-submit \
spark-submit \
--deploy-mode cluster \
--files envelope_app.conf \
--keytab user.kt \ <1>
@@ -122,7 +125,7 @@ We can then submit the job with something like the following:

----
export SPARK_KAFKA_VERSION=0.10
spark2-submit \
spark-submit \
--files jaas.conf,kafka.kt \
--driver-java-options=-Djava.security.auth.login.config=jaas.conf \
--conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf \
@@ -153,7 +156,7 @@ ln -s user.kt kafka.kt
export SPARK_KAFKA_VERSION=0.10 <1>
export PRINCNAME=REPLACEME
HADOOP_CONF_DIR=/etc/hbase/conf:/etc/hive/conf:/etc/spark2/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar \
spark2-submit \
spark-submit \
--keytab user.kt \
--principal ${PRINCNAME} \
--files jaas.conf,envelope_app.conf,kafka.kt \
4 changes: 3 additions & 1 deletion docs/tasks.adoc
@@ -199,7 +199,9 @@ To develop a custom task:
. Add a task step to your pipeline by setting `type` to `task`, `class` to the fully qualified class name of your task, and any other configurations for your task.
. Run the pipeline with your task similarly to:

spark2-submit --jars yourtask.jar envelope-*.jar yourpipeline.conf
spark-submit --jars yourtask.jar envelope-*.jar yourpipeline.conf

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

=== Custom task example

4 changes: 3 additions & 1 deletion examples/filesystem/README.md
@@ -12,7 +12,9 @@ This example demonstrates a simple HDFS-based data processing pipeline.

**Run the Envelope job**

spark2-submit build/envelope/target/envelope-*.jar examples/filesystem/filesystem.conf
spark-submit build/envelope/target/envelope-*.jar examples/filesystem/filesystem.conf

Note: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

**Grab the results**

23 changes: 11 additions & 12 deletions examples/fix-hbase/README.adoc
@@ -1,10 +1,10 @@
# FIX HBase Example
= FIX HBase Example

The FIX example is an Envelope pipeline that receives https://en.wikipedia.org/wiki/Financial_Information_eXchange[FIX financial messages] of order fulfillment (new orders and execution reports) and updates an HBase history table. This example shows how historical order information can be maintained for each stock symbol.

The configuration for this example is found link:fix_hbase.conf[here]. The messages do not fully conform to the real FIX protocol but should be sufficient to demonstrate how a complete implementation could be developed.

## Pre-requisites
== Pre-requisites

. Create the required HBase table using the provided HBase shell script:

@@ -14,26 +14,27 @@ The configuration for this example is found link:fix_hbase.conf[here]. The messa

kafka-topics --zookeeper YOURZOOKEEPERHOSTNAME:2181 --create --topic fix --partitions 9 --replication-factor 3

## Running the example
== Running the example

Run the example in non-secure clusters by submitting the Envelope application with the example's configuration:

$ export SPARK_KAFKA_VERSION=0.10
$ export EXTRA_CP=/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar
$ spark2-submit \
$ spark-submit \
--driver-class-path ${EXTRA_CP} \
--conf spark.executor.extraClassPath=${EXTRA_CP} \
envelope-*.jar fix_hbase.conf

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.
CDH5 clusters may also require the environment variable `SPARK_KAFKA_VERSION` to be set to `0.10`.

For secure clusters, we need to supply a keytab and a JAAS file to the Spark job, as well as ensure that Spark obtains an HBase token on startup. For this we need a couple of supporting files. First, we need a keytab file `user.kt` containing entries for the user submitting the job. Second, we need a JAAS configuration file `jaas.conf` for the driver and executors to reference. A sample JAAS file is supplied link:jaas.conf[here].

A method that works on secure clusters in both client and cluster mode is as follows:

$ ln -s user.kt kafka.kt
$ export SPARK_KAFKA_VERSION=0.10
$ export PRINCNAME=REPLACEME
$ HADOOP_CONF_DIR=/etc/hbase/conf:/etc/hive/conf:/etc/spark2/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar \
spark2-submit \
spark-submit \
--keytab user.kt \
--principal ${PRINCNAME} \
--files jaas.conf,fix_hbase.conf,kafka.kt \
@@ -46,19 +47,17 @@ A method that works on secure clusters in both client and cluster mode is as fol

In another terminal session, start the data generator, which is itself an Envelope pipeline that writes out one million example orders to Kafka. For non-secure clusters, this can be launched via:

$ export SPARK_KAFKA_VERSION=0.10
$ spark2-submit envelope-*.jar fix_generator.conf
$ spark-submit envelope-*.jar fix_generator.conf

For secure clusters, use something like:

$ export SPARK_KAFKA_VERSION=0.10
$ spark2-submit \
$ spark-submit \
--files jaas.conf,kafka.kt \
--driver-java-options=-Djava.security.auth.login.config=jaas.conf \
--conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf \
envelope-*.jar fix_generator.conf

## Seeing the results
== Seeing the results

You can verify the data is being loaded into HBase by running a few sample scans from the HBase shell:
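
As a rough illustration (this assumes the history table created by the provided HBase shell script is named `fix`; check that script for the actual table name), a quick scan might look like:

    $ hbase shell
    scan 'fix', {LIMIT => 10}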
