[ENV-413] Support examples for C5 and C6 (#287)
Jeremy Beard authored and Ian Buss committed Apr 17, 2019
1 parent 2ec2df0 commit e7d2b95
Showing 17 changed files with 173 additions and 103 deletions.
20 changes: 14 additions & 6 deletions README.md
@@ -20,16 +20,21 @@ Envelope requires Apache Spark 2.1.0 or above.
Additionally, if using these components, Envelope requires:
- Apache Kafka 0.10 or above
- Apache Kudu 1.4.0 or above
- Apache HBase 1.2.0 or above (note: 2.x has not yet been tested)
- Apache HBase 1.2.0 or above
- Apache ZooKeeper 3.4.5 or above
- Apache Impala 2.7.0 or above

For Cloudera's distributions, Kafka requires Cloudera's Kafka 2.1.0 or above, and HBase and ZooKeeper require CDH5.7 or above. Note that CDH6.x has not yet been tested.
For Cloudera CDH 5, Kafka requires Cloudera's Kafka 2.1.0 or above, HBase and ZooKeeper require CDH 5.7 or above, and Impala requires CDH 5.9 or above. For Cloudera CDH 6, CDH 6.0 or above is required.

### Compiling Envelope
### Downloading Envelope

You can build the Envelope application from the top-level directory of the source code by running the Maven command:
Envelope and its dependencies can be downloaded as a single jar file from the GitHub repository [Releases page](https://github.com/cloudera-labs/envelope/releases).
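
For example, the download can be scripted roughly as follows (illustrative only; substitute the actual release tag and jar file name shown on the Releases page):

    wget https://github.com/cloudera-labs/envelope/releases/download/<release-tag>/envelope-<version>.jar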

mvn clean package
### Compiling Envelope

Alternatively, you can build the Envelope application from the top-level directory of the source code by running the Maven command:

mvn clean install

This will create `envelope-0.7.0-SNAPSHOT.jar` in the `build/envelope/target` directory.
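
To confirm the build output, a quick listing of the target directory (path as above) should show the jar:

    ls build/envelope/target/envelope-*.jar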

@@ -43,12 +48,15 @@ Envelope provides these example pipelines that you can run for yourself:
- [Traffic](examples/traffic/): simulates receiving traffic conditions and calculating an aggregate view of traffic congestion.
- [Filesystem](examples/filesystem/): demonstrates a batch job that reads a JSON file from HDFS and writes the data back to Avro files on HDFS.
- [Cloudera Navigator](examples/navigator/): implements a streaming job to ingest audit events from Cloudera Navigator into Kudu, HDFS and Solr.
- [Impala DDL](examples/impala_ddl): demonstrates updating Impala metadata, such as adding partitions and refreshing table metadata.

### Running Envelope

You can run Envelope by submitting it to Spark with the configuration file for your pipeline:

spark2-submit envelope-0.7.0-SNAPSHOT.jar yourpipeline.conf
spark-submit envelope-0.7.0-SNAPSHOT.jar your_pipeline.conf

Note: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

A helpful place to monitor your running pipeline is from the Spark UI for the job. You can find this via the YARN ResourceManager UI.
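
If you prefer the command line, a sketch like the following (assuming the YARN CLI is available on the host) lists running applications along with their tracking URLs, which link to the Spark UI:

    yarn application -list -appStates RUNNING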

6 changes: 4 additions & 2 deletions docs/decisions.adoc
@@ -12,7 +12,7 @@ A decision step can make a decision using one of three methods, which are outlin

=== Literal

The `literal` decision method takes the true or false result directly from the configuration of the decision step. This method would be useful if the result is provided by a parameter, which in turn can be populated by a `spark2-submit` argument or an environment variable.
The `literal` decision method takes the true or false result directly from the configuration of the decision step. This method would be useful if the result is provided by a parameter, which in turn can be populated by a `spark-submit` argument or an environment variable.

In this self-contained example the value of the `${result}` parameter will determine whether `run_if_true` and `run_after_run_if_true`, or `run_if_false` and `run_after_run_if_false`, are run:

@@ -62,7 +62,9 @@ steps {

This pipeline could be run with `${result}` populated by using an argument after the configuration file:

spark2-submit envelope-*.jar pipeline.conf result=true
spark-submit envelope-*.jar pipeline.conf result=true

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

=== Step by key

6 changes: 4 additions & 2 deletions docs/derivers.adoc
@@ -58,7 +58,7 @@ This shows both methods for populating parameters. In this example `${DECIMAL_PL

The `morphline` deriver is used to run Morphline transformations over the records of a single dependency of the step defined by the `step.name` parameter.

The Morphline transformation is provided to the Envelope pipeline as a local file on the Spark executors. The local file is retrieved from the location given by the `morphline.file` configuration. The local file can be provided to the Spark executors from `spark2-submit` using the `--files` option.
The Morphline transformation is provided to the Envelope pipeline as a local file on the Spark executors. The local file is retrieved from the location given by the `morphline.file` configuration. The local file can be provided to the Spark executors from `spark-submit` using the `--files` option.

The ID of the specific transformation within the Morphline file is specified with the `morphline.id` configuration.
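
As a rough sketch (file names are illustrative), shipping the Morphline file alongside the pipeline submission looks like:

    spark-submit --files your_morphline.conf envelope-*.jar your_pipeline.conf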

@@ -711,7 +711,9 @@ To use an alias in configuration files, Envelope needs to be able to find your c

With the project compiled into a jar file, the deriver can be submitted as part of the Envelope pipeline similarly to:

spark2-submit --jars customderiver.jar envelope-*.jar pipeline.conf
spark-submit --jars customderiver.jar envelope-*.jar pipeline.conf

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

The jar file can contain multiple derivers, and other pluggable classes such as custom inputs, outputs, etc.

8 changes: 5 additions & 3 deletions docs/inputs.adoc
@@ -67,15 +67,17 @@ The system environment variable can be provided as follows when executed in clie
----
export KUDU_MASTER_CONNECTION_STRING="master1:7051,master2:7051,master3:7051"
spark2-submit --master yarn --deploy-mode client envelope-<version>.jar application.conf
spark-submit --master yarn --deploy-mode client envelope-<version>.jar application.conf
----

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

===== yarn-cluster mode

The system environment variable can be provided as follows when executed in cluster deployment mode:

----
spark2-submit --master yarn --deploy-mode cluster --files application.conf --conf spark.yarn.appMasterEnv.KUDU_MASTER_CONNECTION_STRING="master1:7051,master2:7051,master3:7051" envelope-<version>.jar application.conf
spark-submit --master yarn --deploy-mode cluster --files application.conf --conf spark.yarn.appMasterEnv.KUDU_MASTER_CONNECTION_STRING="master1:7051,master2:7051,master3:7051" envelope-<version>.jar application.conf
----

=== Kafka
@@ -217,7 +219,7 @@ _Documentation for this translator has not yet been added._

In cases where Envelope does not provide an input or translator for a required data source, a custom class can be developed and referenced in the Envelope pipeline.

To create a batch input, implement the `BatchInput` interface; to create a stream input, implement the `StreamInput` interface. Translators must implement the `Translator` interface. With the implemented class compiled into its own jar file, the input or translator can be referenced in the pipeline by using the fully qualified class name (or alias -- see below) as the input `type`, and it can be provided to the Envelope application using the `--jars` argument when calling `spark2-submit`.
To create a batch input, implement the `BatchInput` interface; to create a stream input, implement the `StreamInput` interface. Translators must implement the `Translator` interface. With the implemented class compiled into its own jar file, the input or translator can be referenced in the pipeline by using the fully qualified class name (or alias -- see below) as the input `type`, and it can be provided to the Envelope application using the `--jars` argument when calling `spark-submit`.
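
For instance, a submission that provides a custom input or translator jar might look like the following sketch (jar and configuration file names are illustrative):

    spark-submit --jars custominput.jar envelope-*.jar your_pipeline.conf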

=== Using Aliases

4 changes: 3 additions & 1 deletion docs/looping.adoc
@@ -62,7 +62,9 @@ steps {

The pipeline could be run with the dates provided as follows:

spark2-submit envelope-*.jar process_dates.conf first_date=20170501,last_date=20170531
spark-submit envelope-*.jar process_dates.conf first_date=20170501,last_date=20170531

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

When Envelope reaches the loop step (after `read_month`) it would unroll the loop into:

13 changes: 8 additions & 5 deletions docs/security.adoc
@@ -32,12 +32,15 @@ achieve this is via something like the following:
----
export EXTRA_CP=/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar
spark2-submit \
spark-submit \
--driver-class-path ${EXTRA_CP} \
--conf spark.executor.extraClassPath=${EXTRA_CP} \
envelope-*.jar envelope_app.conf
----

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.
CDH5 clusters may also require the environment variable `SPARK_KAFKA_VERSION` to be set to `0.10`.

Kudu:: If using a bulk planner (`append`, `delete`, `overwrite`, `upsert`), there is no extra
configuration required to work with secure Kudu clusters--the underlying `KuduContext` automatically handles the security.
+
@@ -62,7 +65,7 @@ variable as follows:
+
----
HADOOP_CONF_DIR=/etc/hbase/conf:/etc/hive/conf:/etc/spark2/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar \
spark2-submit \
spark-submit \
--deploy-mode cluster \
--files jaas.conf,envelope_app.conf,kafka.kt \
--driver-java-options=-Djava.security.auth.login.config=jaas.conf \
Expand All @@ -77,7 +80,7 @@ Kudu:: Kudu requires a keytab to operate correctly in cluster mode. Additionally
`secure = true` is required in the step config as in client mode. An example submission would be:
+
----
spark2-submit \
spark-submit \
--deploy-mode cluster \
--files envelope_app.conf \
--keytab user.kt \ <1>
@@ -122,7 +125,7 @@ We can then submit the job with something like the following:

----
export SPARK_KAFKA_VERSION=0.10
spark2-submit \
spark-submit \
--files jaas.conf,kafka.kt \
--driver-java-options=-Djava.security.auth.login.config=jaas.conf \
--conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf \
@@ -153,7 +156,7 @@ ln -s user.kt kafka.kt
export SPARK_KAFKA_VERSION=0.10 <1>
export PRINCNAME=REPLACEME
HADOOP_CONF_DIR=/etc/hbase/conf:/etc/hive/conf:/etc/spark2/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar \
spark2-submit \
spark-submit \
--keytab user.kt \
--principal ${PRINCNAME} \
--files jaas.conf,envelope_app.conf,kafka.kt \
4 changes: 3 additions & 1 deletion docs/tasks.adoc
@@ -199,7 +199,9 @@ To develop a custom task:
. Add a task step to your pipeline by setting `type` to `task`, `class` to the fully qualified class name of your task, and any other configurations for your task.
. Run the pipeline with your task similarly to:

spark2-submit --jars yourtask.jar envelope-*.jar yourpipeline.conf
spark-submit --jars yourtask.jar envelope-*.jar yourpipeline.conf

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

=== Custom task example

4 changes: 3 additions & 1 deletion examples/filesystem/README.md
@@ -12,7 +12,9 @@ This example demonstrates a simple HDFS-based data processing pipeline.

**Run the Envelope job**

spark2-submit build/envelope/target/envelope-*.jar examples/filesystem/filesystem.conf
spark-submit build/envelope/target/envelope-*.jar examples/filesystem/filesystem.conf

Note: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.

**Grab the results**

23 changes: 11 additions & 12 deletions examples/fix-hbase/README.adoc
@@ -1,10 +1,10 @@
# FIX HBase Example
= FIX HBase Example

The FIX example is an Envelope pipeline that receives https://en.wikipedia.org/wiki/Financial_Information_eXchange[FIX financial messages] of order fulfillment (new orders and execution reports) and updates an HBase history table. This example shows how historical order information can be maintained for each stock symbol.

The configuration for this example is found link:fix_hbase.conf[here]. The messages do not fully conform to the real FIX protocol but should be sufficient to demonstrate how a complete implementation could be developed.

## Pre-requisites
== Pre-requisites

. Create the required HBase table using the provided HBase shell script:

@@ -14,26 +14,27 @@ The configuration for this example is found link:fix_hbase.conf[here]. The messa

kafka-topics --zookeeper YOURZOOKEEPERHOSTNAME:2181 --create --topic fix --partitions 9 --replication-factor 3

## Running the example
== Running the example

Run the example in non-secure clusters by submitting the Envelope application with the example's configuration:

$ export SPARK_KAFKA_VERSION=0.10
$ export EXTRA_CP=/etc/hbase/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar
$ spark2-submit \
$ spark-submit \
--driver-class-path ${EXTRA_CP} \
--conf spark.executor.extraClassPath=${EXTRA_CP} \
envelope-*.jar fix_hbase.conf

NOTE: CDH5 uses `spark2-submit` instead of `spark-submit` for Spark 2 applications such as Envelope.
CDH5 clusters may also require the environment variable `SPARK_KAFKA_VERSION` to be set to `0.10`.

For secure clusters, we need to supply a keytab and a JAAS file to the Spark job, as well as ensure that Spark obtains an HBase token on startup. For this we need a couple of supporting files. First, we need a keytab file `user.kt` containing entries for the user submitting the job. Second, we need a JAAS configuration file `jaas.conf` for the driver and executors to reference. A sample JAAS file is supplied link:jaas.conf[here].

A method that works on secure clusters in both client and cluster mode is as follows:

$ ln -s user.kt kafka.kt
$ export SPARK_KAFKA_VERSION=0.10
$ export PRINCNAME=REPLACEME
$ HADOOP_CONF_DIR=/etc/hbase/conf:/etc/hive/conf:/etc/spark2/conf:/opt/cloudera/parcels/CDH/lib/hbase/hbase-common.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-client.jar:/opt/cloudera/parcels/CDH/lib/hbase/hbase-protocol.jar:/opt/cloudera/parcels/CDH/lib/hbase/lib/htrace-core.jar \
spark2-submit \
spark-submit \
--keytab user.kt \
--principal ${PRINCNAME} \
--files jaas.conf,fix_hbase.conf,kafka.kt \
@@ -46,19 +47,17 @@ A method that works on secure clusters in both client and cluster mode is as fol

In another terminal session, start the data generator, which is itself an Envelope pipeline that writes out one million example orders to Kafka. For non-secure clusters, this can be launched via:

$ export SPARK_KAFKA_VERSION=0.10
$ spark2-submit envelope-*.jar fix_generator.conf
$ spark-submit envelope-*.jar fix_generator.conf

For secure clusters, use something like:

$ export SPARK_KAFKA_VERSION=0.10
$ spark2-submit \
$ spark-submit \
--files jaas.conf,kafka.kt \
--driver-java-options=-Djava.security.auth.login.config=jaas.conf \
--conf spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf \
envelope-*.jar fix_generator.conf

## Seeing the results
== Seeing the results

You can verify the data is being loaded into HBase by running a few sample scans from the HBase shell:
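
As a rough illustration (this assumes the history table created by the provided HBase shell script is named `fix`; check that script for the actual table name), a quick scan might look like:

    $ hbase shell
    scan 'fix', {LIMIT => 10}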
