Add a new chapter for exporting data out of Spark #9

Merged · 12 commits · Dec 6, 2023

1 change: 0 additions & 1 deletion Chapters/02-python.qmd
@@ -3,7 +3,6 @@

# Key concepts of python

## Introduction

If you have experience with python, and understand how objects and classes work, you might want to skip this entire chapter. But if you are new to the language and do not have much experience with it, you might want to stick around a little and learn a few key concepts that will help you understand how the `pyspark` package is organized and how to work with it.

9 changes: 6 additions & 3 deletions Chapters/03-spark.qmd
@@ -16,7 +16,6 @@ sc.setLogLevel("OFF")
```


## Introduction

In essence, `pyspark` is an API to Apache Spark (or simply Spark). In other words, with `pyspark` we can build Spark applications using the python language. So, by learning a little more about Spark, you will understand a lot more about `pyspark`.

@@ -124,7 +123,7 @@ spark = SparkSession.builder.getOrCreate()

Every `pyspark` program is composed of a set of transformations and actions over a set of Spark DataFrames.

We will explain Spark DataFrames in more depth in @sec-dataframes-chapter. For now, just understand that they are the basic data structure that feeds all `pyspark` programs. In other words, in every `pyspark` program we are transforming multiple Spark DataFrames to get the result we want.
I will explain Spark DataFrames in more depth in @sec-dataframes-chapter. For now, just understand that they are the basic data structure that feeds all `pyspark` programs. In other words, in every `pyspark` program we are transforming multiple Spark DataFrames to get the result we want.

As an example, in the script below we begin with the Spark DataFrame stored in the object `students` and apply multiple transformations over it to build the `ar_department` DataFrame. Lastly, we apply the `.show()` action over the `ar_department` DataFrame:

@@ -211,8 +210,12 @@ Terminal$ cd ./../SparkExample
Terminal$ python3 spark-example.py
```

```
[Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4)]
```


You can see in the above result that this Spark application produces a sequence of numbers from 0 to 4 and returns this sequence as a set of `Row` objects inside a python list.
You can see in the above result that this Spark application produces a sequence of `Row` objects inside a Python list. Each `Row` object contains a number from 0 to 4.
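
For reference, here is a minimal sketch of a script that produces this kind of output. It assumes the sequence is built with the `range()` method of the Spark Session and collected with `collect()`; it is an illustration, not necessarily the exact `spark-example.py` from the book.

```python
# A hypothetical version of spark-example.py, for illustration only
from pyspark.sql import SparkSession

# Create (or reuse) a Spark Session, the entry point of a pyspark program
spark = SparkSession.builder.getOrCreate()

# Build a DataFrame with the numbers 0 to 4 and bring them back
# to the driver as a Python list of Row objects
rows = spark.range(5).collect()
print(rows)
# [Row(id=0), Row(id=1), Row(id=2), Row(id=3), Row(id=4)]
```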

Congratulations! You have just run your first Spark application using `pyspark`!

7 changes: 3 additions & 4 deletions Chapters/04-dataframes.qmd
@@ -9,7 +9,6 @@ sc = spark.sparkContext
sc.setLogLevel("OFF")
```

## Introduction

In this chapter, you will understand how Spark represents and manages tables (or tabular data). Different programming languages and frameworks use different names to describe a table. But, in Apache Spark, they are referred to as Spark DataFrames.

@@ -39,7 +38,7 @@ If you are running Spark in a 4 nodes cluster (one is the driver node, and the o
![A Spark DataFrame is distributed across the cluster](../Figures/distributed-df.png){#fig-distributed-df fig-align="center"}


## Partitions of a Spark DataFrame
## Partitions of a Spark DataFrame {#sec-dataframe-partitions}

A Spark DataFrame is always broken into many small pieces, and these pieces are always spread across the cluster of machines. Each one of these small pieces of the total data is considered a DataFrame *partition*.
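
As a quick, hedged illustration of this idea, you can inspect how many partitions a DataFrame currently has, and ask Spark to redistribute it, through the DataFrame's underlying RDD (the exact numbers depend on your local configuration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A DataFrame with 1000 rows, created only for this example
df = spark.range(1000)

# How many partitions is this DataFrame currently split into?
print(df.rdd.getNumPartitions())

# Ask Spark to redistribute the rows into 8 partitions
df8 = df.repartition(8)
print(df8.rdd.getNumPartitions())  # 8
```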

@@ -137,9 +136,9 @@ students

You can also use a method that returns a `DataFrame` object by default. Examples are the `table()` and `range()` methods of your Spark Session, which we used in @sec-dataframe-class to create the `df5` object.

Other examples are the methods used to read data and import it into `pyspark`. These methods are available in the `spark.read` module, like `spark.read.csv()` and `spark.read.json()`. These methods will be described in more depth in @sec-import-export.
Other examples are the methods used to read data and import it into `pyspark`. These methods are available in the `spark.read` module, like `spark.read.csv()` and `spark.read.json()`. These methods will be described in more depth in @sec-import.
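
As a hedged sketch of how these `spark.read` methods are used (the file paths below are hypothetical and exist only for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical file paths, used only to illustrate the reading pattern
sales = spark.read.csv("data/sales.csv", header=True, inferSchema=True)
events = spark.read.json("data/events.json")

# Both calls return DataFrame objects
sales.printSchema()
events.printSchema()
```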

## Viewing a Spark DataFrame {#sec-viewing-a-dataframe}
## Visualizing a Spark DataFrame {#sec-viewing-a-dataframe}

A key aspect of Spark is its laziness. In other words, for most operations, Spark will only check if your code is correct and if it makes sense. Spark will not actually run or execute the operations you are describing in your code unless you explicitly ask for it with a trigger operation, which is called an "action" (this kind of operation is described in @sec-dataframe-actions).
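
A small sketch of this laziness (the DataFrame below is made up for the example): the `filter()` call only describes a transformation, and nothing runs until the `.show()` action is called.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A tiny DataFrame created just for this illustration
df = spark.createDataFrame([(1, 15), (2, 23), (3, 41)], ["id", "age"])

# This line only describes a transformation; Spark does not execute it yet
adults = df.filter(df.age >= 18)

# The .show() action triggers the actual computation and prints the result
adults.show()
```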

5 changes: 2 additions & 3 deletions Chapters/05-transforming.qmd
@@ -1,7 +1,6 @@

# Transforming your Spark DataFrame - Part 1 {#sec-transforming-dataframes-part1}

## Introduction

```{python}
#| include: false
@@ -120,9 +119,9 @@ first_row = big_values.take(n)
print(first_row)
```

The last action would be the `write` method of a Spark DataFrame, but we will explain this method later in @sec-import-export.
The last action would be the `write` method of a Spark DataFrame, but we will explain this method later in @sec-import.

## Understanding narrow and wide transformations
## Understanding narrow and wide transformations {#sec-narrow-wide}

There are two kinds of transformations in Spark: narrow and wide transformations. Remember, a Spark DataFrame is divided into many small parts (called partitions), and these parts are spread across the cluster. The basic difference between narrow and wide transformations is whether the transformation forces Spark to read data from multiple partitions to generate a single part of the result of that transformation.
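
As a hedged illustration of this difference (the DataFrame below is made up for the example), a `filter()` is a narrow transformation, while a `groupBy()` aggregation is a wide one:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# A small DataFrame created only for this illustration
df = spark.createDataFrame(
    [("A", 10), ("B", 25), ("A", 7)],
    ["group", "value"],
)

# Narrow: each output partition can be built from a single input partition
filtered = df.filter(col("value") > 5)

# Wide: rows of the same group may live in different partitions,
# so Spark has to shuffle data across the cluster to aggregate them
totals = df.groupBy("group").sum("value")
```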

3 changes: 1 addition & 2 deletions Chapters/06-dataframes-sql.qmd
@@ -1,6 +1,5 @@
# Working with SQL in `pyspark` {#sec-dataframe-sql-chapter}

## Introduction

```{python}
#| include: false
@@ -398,7 +397,7 @@ spark\
+----+------------+
```

#### The different save "modes"
#### The different save "modes" {#sec-sql-save-modes}

There are other arguments that you might want to use in the `write.saveAsTable()` method, like the `mode` argument. This argument controls how Spark will save your data into the database. By default, `write.saveAsTable()` uses `mode = "error"`. In this mode, Spark will check whether the table you referenced already exists before it saves your data.
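
A hedged sketch of how the `mode` argument changes this behavior (the table name `sales_table` and the DataFrame below are made up for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# Default behavior (mode = "error"): fail if the table already exists
df.write.saveAsTable("sales_table", mode="error")

# Other modes you might use instead:
# "overwrite" replaces the existing table with the new data
df.write.saveAsTable("sales_table", mode="overwrite")

# "append" adds the new rows to the existing table
df.write.saveAsTable("sales_table", mode="append")
```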
