# Thanks!
Thanks to [Nicholas Chammas](https://github.com/nchammas) for contributing this documentation.

# Important
The current implementation of Spark (3.0.1) DOES NOT (yet) have an option to ensure a minimum split size when creating file splits.
An important effect of this is that you will sometimes get an error that the last split of your file is too small.
The error looks like this:

`java.lang.IllegalArgumentException: The provided InputSplit (562600;562687] is 87 bytes which is too small. (Minimum is 65536)`

The problem here is that the size of the last split is really `${fileSize} % ${maxSplitBytes}`.

By setting `spark.sql.files.maxPartitionBytes` to 1 byte below the size of a test file I was even able to get a split of only 1 byte.

If you run into such a situation I recommend experimenting with the `spark.sql.files.maxPartitionBytes` setting
to ensure that the remainder of the division mentioned above stays within the valid range.

This is by no means a real solution, but at this point in time it seems to be the only option available.
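The arithmetic behind this workaround can be sketched in plain Python. This is only an illustration of the division above, not Spark code: the 65536-byte minimum comes from the error message shown earlier, and the file size and candidate split sizes are made-up example values.

```python
# Sketch of how the size of the last Spark file split follows from
# fileSize % maxSplitBytes, per the explanation above.
MINIMUM_SPLIT_BYTES = 65536  # minimum from the IllegalArgumentException above


def last_split_size(file_size: int, max_split_bytes: int) -> int:
    """Size of the final split: the remainder of the division, or a full split."""
    remainder = file_size % max_split_bytes
    return remainder if remainder != 0 else max_split_bytes


def last_split_is_valid(file_size: int, max_split_bytes: int) -> bool:
    """True when the last split is not smaller than the required minimum."""
    return last_split_size(file_size, max_split_bytes) >= MINIMUM_SPLIT_BYTES


# Made-up file size, close to the one in the error message above.
file_size = 562_688

# With a large maxPartitionBytes the whole file is a single split.
print(last_split_size(file_size, 128 * 1024 * 1024))  # -> 562688

# Setting the split size 1 byte below the file size yields a 1-byte last split.
print(last_split_size(file_size, 562_687))            # -> 1
print(last_split_is_valid(file_size, 562_687))        # -> False
```

Tuning `spark.sql.files.maxPartitionBytes` amounts to picking a value for which `last_split_is_valid` would return `True` for your file sizes.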

I have submitted a request/proposal for a proper solution to the Spark project: [SPARK-33534](https://issues.apache.org/jira/browse/SPARK-33534)

# Common problem for Spark users
Apparently the fact that GZipped files are not splittable is a recurring problem in the Spark arena as well, as you can see
in this [Spark Jira ticket](https://issues.apache.org/jira/browse/SPARK-29102?focusedCommentId=16932921&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16932921) and
these two questions on StackOverflow: [Question 1](https://stackoverflow.com/q/28127119/877069) and [Question 2](https://stackoverflow.com/q/27531816/877069).

It turns out that this library works with Apache Spark without modification.

**NOTE: As described in the "Important" section above, the current implementation of Spark (3.0.1) DOES NOT (yet)
have an option to ensure a minimum split size, so the last split of your file may sometimes be too small.**
See: [SPARK-33534](https://issues.apache.org/jira/browse/SPARK-33534)

# Using it
Here is an example, which was tested against Apache Spark 2.4.4 using the Python DataFrame API:
