Commit 169ff42 (parent 831029e)

Added more documentation and the Spark examples provided by Nicholas Chammas

Showing 6 changed files with 108 additions and 0 deletions.
@@ -0,0 +1,23 @@
# Using the SplittableGZipCodec in Apache Hadoop MapReduce (Java)
To use this in a Hadoop MapReduce job written in Java you must make sure this library has been added as a dependency.

In Maven you would simply add this dependency:

```xml
<dependency>
  <groupId>nl.basjes.hadoop</groupId>
  <artifactId>splittablegzip</artifactId>
  <version>1.2</version>
</dependency>
```

Then in Java you would create an instance of the Job that you are going to run

```java
Job job = ...
```

and then, before actually running the job, you set the configuration using something like this:

```java
job.getConfiguration().set("io.compression.codecs", "nl.basjes.hadoop.io.compress.SplittableGzipCodec");
// Note the trailing L: these values exceed the int range, so they must be long literals.
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 5000000000L);
job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", 5000000000L);
```

NOTE: The original GzipCodec must NOT be in the list of compression codecs anymore!
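
For context, here is a minimal driver sketch showing where these calls fit. It is an illustration rather than part of this commit: the class name, job name, and argument handling are hypothetical, and no mapper or reducer is set, so Hadoop falls back to its identity classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SplittableGzipDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "splittable-gzip-demo");
        job.setJarByClass(SplittableGzipDriver.class);

        // Replace the stock GzipCodec with the splittable variant ...
        job.getConfiguration().set("io.compression.codecs",
                "nl.basjes.hadoop.io.compress.SplittableGzipCodec");
        // ... and pin the split size (5 GB here, matching the snippet above).
        job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 5000000000L);
        job.getConfiguration().setLong("mapreduce.input.fileinputformat.split.maxsize", 5000000000L);

        // No mapper/reducer set: the identity classes are enough to
        // observe the .gz input being split across multiple tasks.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```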
@@ -0,0 +1,25 @@
# Using the SplittableGZipCodec in Apache Pig
To use this in an Apache Pig script you must make sure this library has been added as a dependency.

Simply download the prebuilt library from Maven Central.
You can find the latest jar file here: [https://search.maven.org/search?q=a:splittablegzip](https://search.maven.org/search?q=a:splittablegzip)

Then in Pig you need to load this jar file into your job.

```pig
REGISTER splittablegzip-*.jar
```

and then, before actually running the job, you set the configuration using something like this:

```pig
-- Set the compression codecs so that the GzipCodec is removed and the splittable variant is added.
-- In this example we simply remove everything and only keep the splittable codec.
SET io.compression.codecs nl.basjes.hadoop.io.compress.SplittableGzipCodec

-- Tune how big the splits should be.
SET mapreduce.input.fileinputformat.split.minsize $splitsize
SET mapreduce.input.fileinputformat.split.maxsize $splitsize

-- Avoid Pig merging multiple splits back into a single mapper.
-- http://stackoverflow.com/q/17054880
SET pig.noSplitCombination true
```

After this you run the actual job as usual; supplying the `$splitsize` parameter is shown below.
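
The `$splitsize` above is a Pig parameter that must be supplied when the script is launched. A hypothetical invocation (the script name is an assumption, and 5000000000 matches the 5 GB split size used in the Java example) could look like this:

```sh
# -param substitutes $splitsize inside the script before it runs.
pig -param splitsize=5000000000 splittable-gzip.pig
```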
@@ -0,0 +1,55 @@
# Using the SplittableGZipCodec in Apache Spark

# Thanks!
Thanks to [Nicholas Chammas](https://github.com/nchammas) for contributing this documentation.

# Common problem for Spark users
Apparently the fact that gzipped files are not splittable is a recurring problem in the Spark arena as well, as you can see in this [Spark Jira ticket](https://issues.apache.org/jira/browse/SPARK-29102?focusedCommentId=16932921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16932921) and these two questions on StackOverflow: [Question 1](https://stackoverflow.com/q/28127119/877069) and [Question 2](https://stackoverflow.com/q/27531816/877069).

It turns out that this library works with Apache Spark without modification.

# Using it
Here is an example, which was tested against Apache Spark 2.4.4 using the Python DataFrame API:

```python
# splittable-gzip.py
from pyspark.sql import SparkSession


if __name__ == '__main__':
    spark = (
        SparkSession.builder
        # If you want to change the split size, you need to use this config
        # instead of mapreduce.input.fileinputformat.split.maxsize.
        # I don't think Spark DataFrames offer an equivalent setting for
        # mapreduce.input.fileinputformat.split.minsize.
        .config('spark.sql.files.maxPartitionBytes', 1000 * (1024 ** 2))
        .getOrCreate()
    )

    print(
        spark.read
        # You can also specify this option against the SparkSession.
        .option('io.compression.codecs', 'nl.basjes.hadoop.io.compress.SplittableGzipCodec')
        .csv(...)
        .count()
    )
```
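
As the inline comment notes, the codec can also be configured once on the SparkSession instead of per read. One way to do that, sketched here on the assumption that Spark's standard `spark.hadoop.*` prefix (which copies entries into the underlying Hadoop Configuration) applies to this codec setting, is:

```python
# Sketch: register the codec session-wide via the spark.hadoop.* prefix.
spark = (
    SparkSession.builder
    .config('spark.hadoop.io.compression.codecs',
            'nl.basjes.hadoop.io.compress.SplittableGzipCodec')
    .getOrCreate()
)
```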

Run this script as follows:

```sh
spark-submit --packages "nl.basjes.hadoop:splittablegzip:1.2" splittable-gzip.py
```

Here's what the Spark UI looks like when running this script against a 20 GB gzip file on a laptop:

![Spark UI]

You can see in the task list the behavior described in the [README](README.md): each task reads from the start of the file up to its target split.

Also, in the Executor UI you can see every available core running concurrently against this single file:

![Executor UI]