Added more documentation and the Spark examples provided by Nicholas Chammas
nielsbasjes committed Oct 5, 2019
1 parent 831029e commit 169ff42
Showing 6 changed files with 108 additions and 0 deletions.
23 changes: 23 additions & 0 deletions README-JavaMapReduce.md
# Using the SplittableGZipCodec in Apache Hadoop MapReduce (Java)
To use this in a Hadoop MapReduce job written in Java you must make sure this library has been added as a dependency.

In Maven you would simply add this dependency:

```xml
<dependency>
    <groupId>nl.basjes.hadoop</groupId>
    <artifactId>splittablegzip</artifactId>
    <version>1.2</version>
</dependency>
```

Then in Java you would create an instance of the Job that you are going to run:

```java
Job job = ...
```

and then, before actually running the job, you set the configuration using something like this:

```java
job.getConfiguration().set("io.compression.codecs", "nl.basjes.hadoop.io.compress.SplittableGzipCodec");
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 5000000000L);
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", 5000000000L);
```


NOTE: The original GzipCodec may NOT be present in the list of compression codecs anymore!
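
As an illustration, the pieces above could be assembled into a complete driver along the lines of the following sketch. The class name, the TextInputFormat/TextOutputFormat choice, and the input/output arguments are assumptions made for this example, not something this library requires; only the three configuration calls are specific to the SplittableGzipCodec.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver class for a job reading large gzipped text files.
public class SplittableGzipDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "splittable gzip example");
        job.setJarByClass(SplittableGzipDriver.class);

        // Register ONLY the splittable codec so it is the one that handles the .gz input files.
        job.getConfiguration().set("io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec");
        // Ask for splits of roughly 5 GB (note the long literals).
        job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 5000000000L);
        job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", 5000000000L);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...); etc. as usual.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```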
25 changes: 25 additions & 0 deletions README-Pig.md
# Using the SplittableGZipCodec in Apache Pig
To use this in Apache Pig you must make sure this library has been added as a dependency.

Simply download the prebuilt library from Maven Central.
You can find the latest jar file here: [https://search.maven.org/search?q=a:splittablegzip](https://search.maven.org/search?q=a:splittablegzip)

Then in Pig you need to load this jar file into your job:

```
REGISTER splittablegzip-*.jar
```

and then, before actually running the job, you set the configuration using something like this:

```
-- Set the compression codecs so that the GZipCodec is removed and the splittable variant is added.
-- In this example we simply remove everything and only have the splittable codec.
SET io.compression.codecs nl.basjes.hadoop.io.compress.SplittableGzipCodec

-- Tune the settings for how big the splits should be.
SET mapreduce.input.fileinputformat.split.minsize $splitsize
SET mapreduce.input.fileinputformat.split.maxsize $splitsize

-- Avoid Pig merging multiple splits back into a single mapper.
-- http://stackoverflow.com/q/17054880
SET pig.noSplitCombination true
```

After this, the actual job is written and run as usual.
55 changes: 55 additions & 0 deletions README-Spark.md
# Using the SplittableGZipCodec in Apache Spark

# Thanks!
Thanks to [Nicholas Chammas](https://github.com/nchammas) for contributing this documentation.

# Common problem for Spark users
Apparently the fact that gzipped files are not splittable is a recurring problem in the Spark arena as well, as you can see
in this [Spark Jira ticket](https://issues.apache.org/jira/browse/SPARK-29102?focusedCommentId=16932921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16932921) and
these two questions on StackOverflow: [Question 1](https://stackoverflow.com/q/28127119/877069) and [Question 2](https://stackoverflow.com/q/27531816/877069).

It turns out that this library works with Apache Spark without modification.

# Using it
Here is an example, which was tested against Apache Spark 2.4.4 using the Python DataFrame API:

```python
# splittable-gzip.py
from pyspark.sql import SparkSession


if __name__ == '__main__':
    spark = (
        SparkSession.builder
        # If you want to change the split size, you need to use this config
        # instead of mapreduce.input.fileinputformat.split.maxsize.
        # I don't think Spark DataFrames offer an equivalent setting for
        # mapreduce.input.fileinputformat.split.minsize.
        .config('spark.sql.files.maxPartitionBytes', 1000 * (1024 ** 2))
        .getOrCreate()
    )

    print(
        spark.read
        # You can also specify this option against the SparkSession.
        .option('io.compression.codecs', 'nl.basjes.hadoop.io.compress.SplittableGzipCodec')
        .csv(...)
        .count()
    )
```

Run this script as follows:

```sh
spark-submit --packages "nl.basjes.hadoop:splittablegzip:1.2" splittable-gzip.py
```

Here's what the Spark UI looks like when running this script against a 20 GB gzip file on a laptop:

![Spark UI](README-SparkUI.png)

You can see in the task list the behavior described in the [README](README.md): each task reads from the start of the file up to its target split.

Also in the Executor UI you can see every available core running concurrently against this single file:

![Spark Executor UI](README-SparkExecutorUI.png)
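
If you are not using PySpark, the same configuration can also be expressed with the Java Dataset API. The sketch below is an untested Java equivalent of the Python example above; the class name and the input argument are placeholders for this illustration.

```java
import org.apache.spark.sql.SparkSession;

// Hypothetical equivalent of splittable-gzip.py using the Java Dataset API.
public class SplittableGzipSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            // Same role as spark.sql.files.maxPartitionBytes in the Python example.
            .config("spark.sql.files.maxPartitionBytes", 1000L * 1024 * 1024)
            .getOrCreate();

        long count = spark.read()
            // Same codec option as in the Python example.
            .option("io.compression.codecs", "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
            .csv(args[0])  // path to the gzipped CSV file
            .count();

        System.out.println(count);
    }
}
```

It would be submitted with the same `--packages "nl.basjes.hadoop:splittablegzip:1.2"` option shown above.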
Binary file added README-SparkExecutorUI.png
Binary file added README-SparkUI.png
5 changes: 5 additions & 0 deletions README.md
I.e. Do something like this before starting the build or loading your IDE.
**mapreduce.input.fileinputformat.split.minsize** and/or
**mapreduce.input.fileinputformat.split.maxsize**.

# Usage examples
- [Java Apache Hadoop MapReduce](README-JavaMapReduce.md)
- [Apache Pig](README-Pig.md)
- [Apache Spark](README-Spark.md)

# Choosing the configuration settings
## How it works
For each "split" the gzipped input file is read from the beginning of the file
