Added more documentation and the Spark examples provided by Nicholas Chammas
nielsbasjes committed Oct 5, 2019
1 parent 831029e commit 169ff42
Showing 6 changed files with 108 additions and 0 deletions.
23 changes: 23 additions & 0 deletions README-JavaMapReduce.md
# Using the SplittableGZipCodec in Apache Hadoop MapReduce (Java)
To use this in a Hadoop MapReduce job written in Java you must make sure this library has been added as a dependency.

In Maven you would simply add this dependency:

```xml
<dependency>
    <groupId>nl.basjes.hadoop</groupId>
    <artifactId>splittablegzip</artifactId>
    <version>1.2</version>
</dependency>
```

Then in Java you would create an instance of the Job that you are going to run:

```java
Job job = ...
```

and then, before actually running the job, you set the configuration using something like this:

```java
job.getConfiguration().set("io.compression.codecs", "nl.basjes.hadoop.io.compress.SplittableGzipCodec");
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 5000000000L);
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", 5000000000L);
```


NOTE: The original GzipCodec may NOT be present in the list of compression codecs anymore!
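
As an illustration, the pieces above could be assembled into a complete driver along the lines of the following sketch. The class name, the TextInputFormat/TextOutputFormat choice, and the input/output arguments are assumptions made for this example, not something this library requires; only the three configuration calls are specific to the SplittableGzipCodec.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Hypothetical driver class for a job reading large gzipped text files.
public class SplittableGzipDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "splittable gzip example");
        job.setJarByClass(SplittableGzipDriver.class);

        // Register ONLY the splittable codec so it is the one that handles the .gz input files.
        job.getConfiguration().set("io.compression.codecs",
            "nl.basjes.hadoop.io.compress.SplittableGzipCodec");
        // Ask for splits of roughly 5 GB (note the long literals).
        job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 5000000000L);
        job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", 5000000000L);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...); etc. as usual.

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```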
25 changes: 25 additions & 0 deletions README-Pig.md
# Using the SplittableGZipCodec in Apache Pig
To use this in Apache Pig you must make sure this library has been added as a dependency.

Simply download the prebuilt library from Maven Central.
You can find the latest jar file here: [https://search.maven.org/search?q=a:splittablegzip](https://search.maven.org/search?q=a:splittablegzip)

Then in Pig you need to load this jar file into your job:

```
REGISTER splittablegzip-*.jar
```

and then, before actually running the job, you set the configuration using something like this:

```
-- Set the compression codecs so that the GZipCodec is removed and the splittable variant is added.
-- In this example we simply remove everything and only have the splittable codec.
SET io.compression.codecs nl.basjes.hadoop.io.compress.SplittableGzipCodec

-- Tune the settings for how big the splits should be.
SET mapreduce.input.fileinputformat.split.minsize $splitsize
SET mapreduce.input.fileinputformat.split.maxsize $splitsize

-- Avoid Pig merging multiple splits back into a single mapper.
-- http://stackoverflow.com/q/17054880
SET pig.noSplitCombination true
```

After this, the actual job is written and run as usual.
55 changes: 55 additions & 0 deletions README-Spark.md
# Using the SplittableGZipCodec in Apache Spark

# Thanks!
Thanks to [Nicholas Chammas](https://github.com/nchammas) for contributing this documentation.

# Common problem for Spark users
Apparently the fact that gzipped files are not splittable is a recurring problem in the Spark arena as well, as you can see
in this [Spark Jira ticket](https://issues.apache.org/jira/browse/SPARK-29102?focusedCommentId=16932921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16932921) and
these two questions on StackOverflow: [Question 1](https://stackoverflow.com/q/28127119/877069) and [Question 2](https://stackoverflow.com/q/27531816/877069).

It turns out that this library works with Apache Spark without modification.

# Using it
Here is an example, which was tested against Apache Spark 2.4.4 using the Python DataFrame API:

```python
# splittable-gzip.py
from pyspark.sql import SparkSession


if __name__ == '__main__':
    spark = (
        SparkSession.builder
        # If you want to change the split size, you need to use this config
        # instead of mapreduce.input.fileinputformat.split.maxsize.
        # I don't think Spark DataFrames offer an equivalent setting for
        # mapreduce.input.fileinputformat.split.minsize.
        .config('spark.sql.files.maxPartitionBytes', 1000 * (1024 ** 2))
        .getOrCreate()
    )

    print(
        spark.read
        # You can also specify this option against the SparkSession.
        .option('io.compression.codecs', 'nl.basjes.hadoop.io.compress.SplittableGzipCodec')
        .csv(...)
        .count()
    )
```

Run this script as follows:

```sh
spark-submit --packages "nl.basjes.hadoop:splittablegzip:1.2" splittable-gzip.py
```

Here's what the Spark UI looks like when running this script against a 20 GB gzip file on a laptop:

![Spark UI](README-SparkUI.png)

You can see in the task list the behavior described in the [README](README.md): each task reads from the start of the file up to its target split.

Also in the Executor UI you can see every available core running concurrently against this single file:

![Spark Executor UI](README-SparkExecutorUI.png)
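
If you are not using PySpark, the same configuration can also be expressed with the Java Dataset API. The sketch below is an untested Java equivalent of the Python example above; the class name and the input argument are placeholders for this illustration.

```java
import org.apache.spark.sql.SparkSession;

// Hypothetical equivalent of splittable-gzip.py using the Java Dataset API.
public class SplittableGzipSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            // Same role as spark.sql.files.maxPartitionBytes in the Python example.
            .config("spark.sql.files.maxPartitionBytes", 1000L * 1024 * 1024)
            .getOrCreate();

        long count = spark.read()
            // Same codec option as in the Python example.
            .option("io.compression.codecs", "nl.basjes.hadoop.io.compress.SplittableGzipCodec")
            .csv(args[0])  // path to the gzipped CSV file
            .count();

        System.out.println(count);
    }
}
```

It would be submitted with the same `--packages "nl.basjes.hadoop:splittablegzip:1.2"` option shown above.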
Binary file added README-SparkExecutorUI.png
Binary file added README-SparkUI.png
5 changes: 5 additions & 0 deletions README.md
I.e. Do something like this before starting the build or loading your IDE.
**mapreduce.input.fileinputformat.split.minsize** and/or
**mapreduce.input.fileinputformat.split.maxsize**.

# Usage examples
- [Java Apache Hadoop MapReduce](README-JavaMapReduce.md)
- [Apache Pig](README-Pig.md)
- [Apache Spark](README-Spark.md)

# Choosing the configuration settings
## How it works
For each "split" the gzipped input file is read from the beginning of the file
