Add a new chapter for exporting data out of Spark #9

Merged · 12 commits · Dec 6, 2023

1 change: 0 additions & 1 deletion Chapters/02-python.qmd
@@ -3,7 +3,6 @@

# Key concepts of python

## Introduction

If you have experience with python, and understand how objects and classes work, you might want to skip this entire chapter. But if you are new to the language and do not have much experience with it, you might want to stick around a little and learn a few key concepts that will help you understand how the `pyspark` package is organized and how to work with it.

9 changes: 6 additions & 3 deletions Chapters/03-spark.qmd
@@ -16,7 +16,6 @@ sc.setLogLevel("OFF")
```


## Introduction

In essence, `pyspark` is an API to Apache Spark (or simply Spark). In other words, with `pyspark` we can build Spark applications using the python language. So, by learning a little more about Spark, you will understand a lot more about `pyspark`.

@@ -124,7 +123,7 @@ spark = SparkSession.builder.getOrCreate()

Every `pyspark` program is composed of a set of transformations and actions over a set of Spark DataFrames.

We will explain Spark DataFrames in more depth in @sec-dataframes-chapter. For now, just understand that they are the basic data structure that feeds all `pyspark` programs. In other words, in every `pyspark` program we are transforming multiple Spark DataFrames to get the result we want.
I will explain Spark DataFrames in more depth in @sec-dataframes-chapter. For now, just understand that they are the basic data structure that feeds all `pyspark` programs. In other words, in every `pyspark` program we are transforming multiple Spark DataFrames to get the result we want.

As an example, in the script below we begin with the Spark DataFrame stored in the object `students` and apply multiple transformations over it to build the `ar_department` DataFrame. Lastly, we apply the `.show()` action over the `ar_department` DataFrame:

@@ -211,8 +210,12 @@ Terminal$ cd ./../SparkExample
Terminal$ python3 spark-example.py
```

```
[Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4)]
```


You can see in the above result that this Spark application produces a sequence of numbers from 0 to 4 and returns this sequence as a set of `Row` objects inside a python list.
You can see in the above result that this Spark application produces a sequence of `Row` objects inside a Python list. Each `Row` object contains a number from 0 to 4.
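
For reference, here is a minimal sketch of a script that produces this kind of output. It assumes the sequence is built with the `range()` method of the Spark Session and collected with `collect()`; it is an illustration, not necessarily the exact `spark-example.py` from the book.

```python
# A hypothetical version of spark-example.py, for illustration only
from pyspark.sql import SparkSession

# Create (or reuse) a Spark Session, the entry point of a pyspark program
spark = SparkSession.builder.getOrCreate()

# Build a DataFrame with the numbers 0 to 4 and bring them back
# to the driver as a Python list of Row objects
rows = spark.range(5).collect()
print(rows)
# [Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4)]
```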

Congratulations! You have just run your first Spark application using `pyspark`!

7 changes: 3 additions & 4 deletions Chapters/04-dataframes.qmd
@@ -9,7 +9,6 @@ sc = spark.sparkContext
sc.setLogLevel("OFF")
```

## Introduction

In this chapter, you will understand how Spark represents and manages tables (or tabular data). Different programming languages and frameworks use different names to describe a table. But, in Apache Spark, they are referred to as Spark DataFrames.

@@ -39,7 +38,7 @@ If you are running Spark in a 4 nodes cluster (one is the driver node, and the o
![A Spark DataFrame is distributed across the cluster](../Figures/distributed-df.png){#fig-distributed-df fig-align="center"}


## Partitions of a Spark DataFrame
## Partitions of a Spark DataFrame {#sec-dataframe-partitions}

A Spark DataFrame is always broken into many small pieces, and these pieces are always spread across the cluster of machines. Each one of these small pieces of the total data is considered a DataFrame *partition*.
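
As a quick, hedged illustration of this idea, you can inspect how many partitions a DataFrame currently has, and ask Spark to redistribute it, through the DataFrame's underlying RDD (the exact numbers depend on your local configuration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A DataFrame with 1000 rows, created only for this example
df = spark.range(1000)

# How many partitions is this DataFrame currently split into?
print(df.rdd.getNumPartitions())

# Ask Spark to redistribute the rows into 8 partitions
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())  # 8
```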

@@ -137,9 +136,9 @@ students

You can also use a method that returns a `DataFrame` object by default. Examples are the `table()` and `range()` methods of your Spark Session, which we used in @sec-dataframe-class to create the `df5` object.

Other examples are the methods used to read data and import it into `pyspark`. These methods are available in the `spark.read` module, like `spark.read.csv()` and `spark.read.json()`. These methods will be described in more depth in @sec-import-export.
Other examples are the methods used to read data and import it into `pyspark`. These methods are available in the `spark.read` module, like `spark.read.csv()` and `spark.read.json()`. These methods will be described in more depth in @sec-import.
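
As a hedged sketch of how these `spark.read` methods are used (the file paths below are hypothetical and exist only for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical file paths, used only to illustrate the reading pattern
sales = spark.read.csv("data/sales.csv", header=True, inferSchema=True)
events = spark.read.json("data/events.json")

# Both calls return DataFrame objects
sales.printSchema()
events.printSchema()
```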

## Viewing a Spark DataFrame {#sec-viewing-a-dataframe}
## Visualizing a Spark DataFrame {#sec-viewing-a-dataframe}

A key aspect of Spark is its laziness. In other words, for most operations, Spark will only check if your code is correct and if it makes sense. Spark will not actually run or execute the operations you are describing in your code unless you explicitly ask for it with a trigger operation, which is called an "action" (this kind of operation is described in @sec-dataframe-actions).
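
A small sketch of this laziness (the DataFrame below is made up for the example): the `filter()` call only describes a transformation, and nothing runs until the `.show()` action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A tiny DataFrame created just for this illustration
df = spark.createDataFrame([(1, 15), (2, 23), (3, 41)], ["id", "age"])

# This line only describes a transformation; Spark does not execute it yet
adults = df.filter(df.age >= 18)

# The .show() action triggers the actual computation and prints the result
adults.show()
```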

5 changes: 2 additions & 3 deletions Chapters/05-transforming.qmd
@@ -1,7 +1,6 @@

# Transforming your Spark DataFrame - Part 1 {#sec-transforming-dataframes-part1}

## Introduction

```{python}
#| include: false
@@ -120,9 +119,9 @@ first_row = big_values.take(n)
print(first_row)
```

The last action would be the `write` method of a Spark DataFrame, but we will explain this method later in @sec-import-export.
The last action would be the `write` method of a Spark DataFrame, but we will explain this method later in @sec-import.

## Understanding narrow and wide transformations
## Understanding narrow and wide transformations {#sec-narrow-wide}

There are two kinds of transformations in Spark: narrow and wide transformations. Remember, a Spark DataFrame is divided into many small parts (called partitions), and these parts are spread across the cluster. The basic difference between narrow and wide transformations is whether the transformation forces Spark to read data from multiple partitions to generate a single part of the result of that transformation.
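
As a hedged illustration of this difference (the DataFrame below is made up for the example), a `filter()` is a narrow transformation, while a `groupBy()` aggregation is a wide one:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# A small DataFrame created only for this illustration
df = spark.createDataFrame(
    [("A", 10), ("B", 25), ("A", 7)],
    ["group", "value"],
)

# Narrow: each output partition can be built from a single input partition
filtered = df.filter(col("value") > 5)

# Wide: rows of the same group may live in different partitions,
# so Spark has to shuffle data across the cluster to aggregate them
totals = df.groupBy("group").sum("value")
```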

3 changes: 1 addition & 2 deletions Chapters/06-dataframes-sql.qmd
@@ -1,6 +1,5 @@
# Working with SQL in `pyspark` {#sec-dataframe-sql-chapter}

## Introduction

```{python}
#| include: false
@@ -398,7 +397,7 @@ spark\
+----+------------+
```

#### The different save "modes"
#### The different save "modes" {#sec-sql-save-modes}

There are other arguments that you might want to use in the `write.saveAsTable()` method, like the `mode` argument. This argument controls how Spark will save your data into the database. By default, `write.saveAsTable()` uses `mode = "error"`. In this mode, Spark will check whether the table you referenced already exists before it saves your data.
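
A hedged sketch of how the `mode` argument changes this behavior (the table name `sales_table` and the DataFrame below are made up for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Default behavior (mode = "error"): fail if the table already exists
df.write.saveAsTable("sales_table", mode="error")

# Other modes you might use instead:
# "overwrite" replaces the existing table with the new data
df.write.saveAsTable("sales_table", mode="overwrite")

# "append" adds the new rows to the existing table
df.write.saveAsTable("sales_table", mode="append")
```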
