This example shows how to compile a Java Map/Reduce program, programmatically submit it to a BigInsights cluster, and wait for the job response. An example use case for programmatically running Map/Reduce jobs on a BigInsights cluster is a client application that needs to process data in BigInsights and send the results of that processing to a third-party application. The client application could be an ETL tool job, a microservice, or some other custom application.
This example uses Apache Oozie to run and monitor the Map/Reduce job. Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
BigInsights for Apache Hadoop clusters are secured using Apache Knox. Apache Knox is a REST API Gateway for interacting with Apache Hadoop clusters.
This example uses cURL to interact with the REST API. This is useful for developers to understand the REST API so they can emulate the REST calls using their programming language features.
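As a taste of the kind of call MapReduce.sh makes, the sketch below builds a Knox WebHDFS request URL for cURL. The gateway URL, user name, and password are illustrative placeholders, not values from this example; substitute your cluster's connection details.

```shell
#!/bin/bash
# Hypothetical Knox gateway details - replace with your cluster's values.
GATEWAY_URL='https://mycluster.bi.services.bluemix.net:8443'
USERNAME='snowch'
PASSWORD='changeme'

# Knox conventionally proxies WebHDFS under gateway/default/webhdfs/v1.
# The LISTSTATUS operation lists the contents of the user's home directory.
WEBHDFS_URL="${GATEWAY_URL}/gateway/default/webhdfs/v1/user/${USERNAME}?op=LISTSTATUS"

# The request would be made like this:
#   -k skips certificate verification (test clusters only)
#   -u supplies HTTP basic auth credentials to Knox
# It is printed here rather than executed so it can be inspected first.
echo "curl -k -u ${USERNAME}:****** \"${WEBHDFS_URL}\""
```

The same pattern (base gateway URL plus a service-specific path) is used for the Oozie REST calls later in the example.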
See also:
- Oozie Map/Reduce Groovy Example
- Oozie Map/Reduce Cookbook
- BigInsights Apache Knox documentation
- BigInsights Oozie documentation
Developers will gain the most from these examples if they:
- Are comfortable using Windows, OS X or *nix command prompts
- Are familiar with compiling and running Java applications
- Can read code written in a high-level language such as Groovy
- Are familiar with the Gradle build tool
- Are familiar with Map/Reduce concepts
- Are familiar with Oozie concepts
- Are familiar with the WebHdfsGroovy Example
- Meet the pre-requisites in the top level README
- Have followed the setup instructions in the top level README
- Have the bash, cURL, grep, perl, and tr unix tools installed
This example consists of three parts:
- WordCount.java - Map/Reduce Java code
- MapReduce.sh - Bash script to submit the Map/Reduce job to Oozie
- build.gradle - Gradle script to compile and package the Map/Reduce code and execute the Example.groovy script
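For orientation, the sketch below generates a minimal Oozie workflow definition of the kind MapReduce.sh uploads. It uses a standard Oozie map-reduce action; the workflow name, class names, and parameter names are assumptions for illustration, not the exact contents of the real script.

```shell
#!/bin/bash
# Write a minimal Oozie workflow with a single map-reduce action.
# ${jobTracker}, ${nameNode}, ${inputDir} and ${outputDir} are Oozie
# parameters supplied at submission time via the job configuration.
cat > workflow-definition.xml <<'EOF'
<workflow-app name="WordCountWorkflow" xmlns="uri:oozie:workflow:0.5">
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>WordCount$Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>WordCount$Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```

The quoted heredoc delimiter ('EOF') stops the shell from expanding the ${...} expressions, leaving them for Oozie to resolve.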
To run the example, open a command prompt window, change into the directory containing this example, and run gradle to execute the example:

    ./gradlew Example      (OS X / *nix)
    gradlew.bat Example    (Windows)

Some example output from running the command is shown below:
biginsight-bluemix-docs $ cd examples/OozieWorkflowMapReduceCurl
biginsight-bluemix-docs/examples/OozieWorkflowMapReduceCurl $ ./gradlew Example
:compileJava UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:jar UP-TO-DATE
:Example
Successfully created directory /user/snowch/test-1469804930
Successfully created directory /user/snowch/test-1469804930/input
Successfully uploaded workflow-definition.xml
Successfully uploaded samples/hadoop-examples.jar
Successfully uploaded LICENSE
Submitted oozie job. jobid: 0000017-160728151859109-oozie-oozi-W
status: RUNNING
status: RUNNING
status: RUNNING
status: OK
"AS 2
"Contribution" 1
"Contributor" 1
"Derivative 1
"Legal 1
"License" 1
"License"); 1
"Licensor" 1
"NOTICE" 1
"Not 1
"Object" 1
...
your 4
{name 1
{yyyy} 1
HTTP/1.1 200 OK
>> MapReduce test was successful.
BUILD SUCCESSFUL
Total time: 47.996 secs
The output above shows:
- the steps performed by gradle (task names are prefixed by ':', such as :compileJava)
- the steps performed by the MapReduce.sh script
- the Map/Reduce output which is the count of the number of times each word is seen in an Apache 2.0 License file
The example uses a gradle build file, build.gradle, when you run ./gradlew or gradlew.bat. The build.gradle does the following:
- compiles the Map/Reduce code (controlled by the apply plugin: 'java' statement)
- creates a jar file with the compiled Map/Reduce code (controlled by the jar { ... } statement)
- executes the MapReduce.sh script (controlled by the task('Example' ...) statement)
The MapReduce.sh script performs the following steps:
- Create an Oozie workflow XML document
- Create an Oozie configuration XML document
- Create an HTTP session on the BigInsights Knox REST API
- Upload the workflow XML file over WebHDFS
- Upload an Apache 2.0 LICENSE file over WebHDFS
- Upload the jar file (containing the Map/Reduce code) over WebHDFS
- Submit the Oozie workflow job using the Knox REST API for Oozie
- Poll the status of the Oozie workflow job every second using the Knox REST API for Oozie
- When the Oozie workflow job finishes successfully, download the wordcount output from the Map/Reduce job
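The polling step above can be sketched as a simple loop. In the real script the status comes from a cURL call to the Knox Oozie endpoint; here that call is replaced by a stub function so the control flow can be read and run in isolation - the function name, the poll count, and the SUCCEEDED status value are assumptions for this sketch.

```shell
#!/bin/bash
POLL_COUNT=0
STATUS='RUNNING'

# Stub for the real status lookup. In MapReduce.sh this would be
# something like:
#   curl -k -u $USERNAME:$PASSWORD \
#        "$GATEWAY_URL/gateway/default/oozie/v1/job/$JOBID?show=status"
# The stub pretends the job runs for two polls, then succeeds.
get_job_status() {
    POLL_COUNT=$((POLL_COUNT + 1))
    if [ "$POLL_COUNT" -lt 3 ]; then
        STATUS='RUNNING'
    else
        STATUS='SUCCEEDED'
    fi
}

# Poll once per second until the job leaves the RUNNING state.
while [ "$STATUS" = 'RUNNING' ]; do
    get_job_status
    echo "status: $STATUS"
    sleep 1
done
```

The function is called directly (not via command substitution) so its variable assignments persist in the current shell; once the loop exits, a successful status means the wordcount output is ready to download.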
All the code is well commented, and it is suggested that you browse the source files to understand in more detail how they work.