Update 03-spark.qmd #13

Merged · 2 commits · Oct 12, 2024
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,9 +5,11 @@
 Introd-pyspark.pdf
 
 python-env/
+venv/
 
 /.quarto/
 ./site_libs/
+site_libs/
 
 derby.log
 Chapters/derby.log
4 changes: 2 additions & 2 deletions Chapters/03-spark.qmd
@@ -231,7 +231,7 @@ The main python modules that exists in `pyspark` are:
- `pyspark.sql.dataframe`: module that defines the `DataFrame` class;
- `pyspark.sql.column`: module that defines the `Column` class;
- `pyspark.sql.types`: module that contains all data types of Spark;
- `pyspark.sq.functions`: module that contains all of the main Spark functions that we use in transformations;
- `pyspark.sql.functions`: module that contains all of the main Spark functions that we use in transformations;
- `pyspark.sql.window`: module that defines the `Window` class, which is responsible for defining windows in a Spark DataFrame;
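
For reference, the modules in this list are usually brought in like so — a minimal sketch, not part of this diff, with the session setup assumed:

```python
# Minimal sketch: importing from the pyspark.sql modules listed above
from pyspark.sql import SparkSession          # entry point for Spark SQL
from pyspark.sql.types import StringType      # pyspark.sql.types: Spark data types
from pyspark.sql.functions import col, upper  # pyspark.sql.functions: transformation functions
from pyspark.sql.window import Window         # pyspark.sql.window: the Window class

# DataFrame and Column objects (pyspark.sql.dataframe, pyspark.sql.column)
# are normally produced by this session rather than imported directly
spark = SparkSession.builder.getOrCreate()
```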

### Main python classes
@@ -248,4 +248,4 @@ The main python classes that exists in `pyspark` are:

- `DataFrameReader` and `DataFrameWriter`: classes responsible for reading data from a data source into a Spark DataFrame, and writing data from a Spark DataFrame into a data source;

- `DataFrameNaFunctions`: class that stores all main methods for dealing with null values (i.e. missing data);
- `DataFrameNaFunctions`: class that stores all main methods for dealing with null values (i.e. missing data);
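
To see how these classes surface in everyday code, here is a hedged sketch (the file paths are hypothetical, not from this PR):

```python
# spark.read is a DataFrameReader; df.write is a DataFrameWriter;
# df.na is a DataFrameNaFunctions. Paths below are hypothetical.
df = spark.read.csv('people.csv', header=True)
df_clean = df.na.fill('unknown')                  # replace nulls in string columns
df_clean.write.mode('overwrite').parquet('people_out')
```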
54 changes: 27 additions & 27 deletions docs/Chapters/02-python.html

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/Chapters/03-spark.html
@@ -504,7 +504,7 @@ <h3 data-number="2.6.1" class="anchored" data-anchor-id="main-python-modules"><s
<li><code>pyspark.sql.dataframe</code>: module that defines the <code>DataFrame</code> class;</li>
<li><code>pyspark.sql.column</code>: module that defines the <code>Column</code> class;</li>
<li><code>pyspark.sql.types</code>: module that contains all data types of Spark;</li>
<li><code>pyspark.sq.functions</code>: module that contains all of the main Spark functions that we use in transformations;</li>
<li><code>pyspark.sql.functions</code>: module that contains all of the main Spark functions that we use in transformations;</li>
<li><code>pyspark.sql.window</code>: module that defines the <code>Window</code> class, which is responsible for defining windows in a Spark DataFrame;</li>
</ul>
</section>
14 changes: 7 additions & 7 deletions docs/Chapters/04-columns.html
@@ -294,7 +294,7 @@ <h1 class="title"><span class="chapter-number">4</span>&nbsp; <span class="chapt
<p>However, there is one more python class that provides some very useful methods that you will regularly use, which is the <code>Column</code> class, or more specifically, the <code>pyspark.sql.column.Column</code> class.</p>
<p>The <code>Column</code> class is used to represent a column in a Spark DataFrame. This means that, each column of your Spark DataFrame is a object of class <code>Column</code>.</p>
<p>We can confirm this statement, by taking the <code>df</code> DataFrame that we showed at <a href="04-dataframes.html#sec-building-a-dataframe" class="quarto-xref"><span>Section 3.4</span></a>, and look at the class of any column of it. Like the <code>id</code> column:</p>
<div id="f21c8a35" class="cell" data-execution_count="2">
<div id="51bd309a" class="cell" data-execution_count="2">
<div class="sourceCode cell-code" id="cb1"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="bu">type</span>(df.<span class="bu">id</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="2">
<pre><code>pyspark.sql.column.Column</code></pre>
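
As a side note, a minimal sketch of the equivalent ways to get such a `Column` object (assuming `df` has an `id` column):

```python
from pyspark.sql.functions import col

c1 = df.id      # attribute access on the DataFrame
c2 = df['id']   # item access on the DataFrame
c3 = col('id')  # free-standing column expression, bound to no DataFrame yet
# all three are pyspark.sql.column.Column objects
```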
@@ -304,7 +304,7 @@ <h1 class="title"><span class="chapter-number">4</span>&nbsp; <span class="chapt
<h2 data-number="4.1" class="anchored" data-anchor-id="building-a-column-object"><span class="header-section-number">4.1</span> Building a column object</h2>
<p>You can refer to or create a column, by using the <code>col()</code> and <code>column()</code> functions from <code>pyspark.sql.functions</code> module. These functions receive a string input with the name of the column you want to create/refer to.</p>
<p>Their result are always a object of class <code>Column</code>. For example, the code below creates a column called <code>ID</code>:</p>
<div id="d3479437" class="cell" data-execution_count="3">
<div id="3ae25568" class="cell" data-execution_count="3">
<div class="sourceCode cell-code" id="cb3"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> pyspark.sql.functions <span class="im">import</span> col</span>
<span id="cb3-2"><a href="#cb3-2" aria-hidden="true" tabindex="-1"></a>id_column <span class="op">=</span> col(<span class="st">'ID'</span>)</span>
<span id="cb3-3"><a href="#cb3-3" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(id_column)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -317,7 +317,7 @@ <h2 data-number="4.1" class="anchored" data-anchor-id="building-a-column-object"
<h2 data-number="4.2" class="anchored" data-anchor-id="sec-columns-related-expressions"><span class="header-section-number">4.2</span> Columns are strongly related to expressions</h2>
<p>Many kinds of transformations that we want to apply over a Spark DataFrame, are usually described through expressions, and, these expressions in Spark are mainly composed by <strong>column transformations</strong>. That is why the <code>Column</code> class, and its methods, are so important in Apache Spark.</p>
<p>Columns in Spark are so strongly related to expressions that the columns themselves are initially interpreted as expressions. If we look again at the column <code>id</code> from <code>df</code> DataFrame, Spark will bring a expression as a result, and not the values hold by this column.</p>
<div id="a7145289" class="cell" data-execution_count="4">
<div id="2d751745" class="cell" data-execution_count="4">
<div class="sourceCode cell-code" id="cb5"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>df.<span class="bu">id</span></span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-display" data-execution_count="4">
<pre><code>Column&lt;'id'&gt;</code></pre>
@@ -326,7 +326,7 @@ <h2 data-number="4.2" class="anchored" data-anchor-id="sec-columns-related-expre
<p>Having these ideas in mind, when I created the column <code>ID</code> on the previous section, I created a “column expression”. This means that <code>col("ID")</code> is just an expression, and as consequence, Spark does not know which are the values of column <code>ID</code>, or, where it lives (which is the DataFrame that this column belongs?). For now, Spark is not interested in this information, it just knows that we have an expression referring to a column called <code>ID</code>.</p>
<p>These ideas relates a lot to the <strong>lazy aspect</strong> of Spark that we talked about in <a href="04-dataframes.html#sec-viewing-a-dataframe" class="quarto-xref"><span>Section 3.5</span></a>. Spark will not perform any heavy calculation, or show you the actual results/values from you code, until you trigger the calculations with an action (we will talk more about these “actions” on <a href="05-transforming.html#sec-dataframe-actions" class="quarto-xref"><span>Section 5.2</span></a>). As a result, when you access a column, Spark will only deliver a expression that represents that column, and not the actual values of that column.</p>
<p>This is handy, because we can store our expressions in variables, and, reuse it latter, in multiple parts of our code. For example, I can keep building and merging a column with different kinds of operators, to build a more complex expression. In the example below, I create a expression that doubles the values of <code>ID</code> column:</p>
<div id="0b73c64e" class="cell" data-execution_count="5">
<div id="06a89287" class="cell" data-execution_count="5">
<div class="sourceCode cell-code" id="cb7"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a>expr1 <span class="op">=</span> id_column <span class="op">*</span> <span class="dv">2</span></span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(expr1)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
@@ -336,7 +336,7 @@ <h2 data-number="4.2" class="anchored" data-anchor-id="sec-columns-related-expre
<p>Remember, with this expression, Spark knows that we want to get a column called <code>ID</code> somewhere, and double its values. But Spark will not perform that action right now.</p>
<p>Logical expressions follow the same logic. In the example below, I am looking for rows where the value in column <code>Name</code> is equal to <code>'Anne'</code>, and, the value in column <code>Grade</code> is above 6.</p>
<p>Again, Spark just checks if this is a valid logical expression. For now, Spark does not want to know where are these <code>Name</code> and <code>Grade</code> columns. Spark does not evaluate the expression, until we ask for it with an action:</p>
<div id="9acb36a3" class="cell" data-execution_count="6">
<div id="2319c5a9" class="cell" data-execution_count="6">
<div class="sourceCode cell-code" id="cb9"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>expr2 <span class="op">=</span> (col(<span class="st">'Name'</span>) <span class="op">==</span> <span class="st">'Anne'</span>) <span class="op">&amp;</span> (col(<span class="st">'Grade'</span>) <span class="op">&gt;</span> <span class="dv">6</span>)</span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a><span class="bu">print</span>(expr2)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
<div class="cell-output cell-output-stdout">
@@ -348,7 +348,7 @@ <h2 data-number="4.2" class="anchored" data-anchor-id="sec-columns-related-expre
<h2 data-number="4.3" class="anchored" data-anchor-id="literal-values-versus-expressions"><span class="header-section-number">4.3</span> Literal values versus expressions</h2>
<p>We know now that columns of a Spark DataFrame have a deep connection with expressions. But, on the other hand, there are some situations that you write a value (it can be a string, a integer, a boolean, or anything) inside your <code>pyspark</code> code, and you might actually want Spark to intepret this value as a constant (or a literal) value, rather than a expression.</p>
<p>As an example, lets suppose you control the data generated by the sales of five different stores, scattered across different regions of Belo Horizonte city (in Brazil). Now, lets suppose you receive a batch of data generated by the 4th store in the city, which is located at Amazonas Avenue, 324. This batch of data is exposed below:</p>
<div id="6639e312" class="cell" data-execution_count="7">
<div id="88fb2057" class="cell" data-execution_count="7">
<div class="sourceCode cell-code" id="cb11"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a>path <span class="op">=</span> <span class="st">'./../Data/sales.json'</span></span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>sales <span class="op">=</span> spark.read.json(path)</span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a>sales.show(<span class="dv">5</span>)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
@@ -374,7 +374,7 @@ <h2 data-number="4.4" class="anchored" data-anchor-id="sec-literal-values"><span
<p>So how do we force Spark to interpret a value as a literal (or constant) value, rather than a expression? To do this, you must write this value inside the <code>lit()</code> (short for “literal”) function from the <code>pyspark.sql.functions</code> module.</p>
<p>In other words, when you write in your code the statement <code>lit(4)</code>, Spark understand that you want to create a new column which is filled with 4’s. In other words, this new column is filled with the constant integer 4.</p>
<p>With the code below, I am creating two new columns (called <code>store_number</code> and <code>store_address</code>), and adding them to the <code>sales</code> DataFrame.</p>
<div id="e59cb607" class="cell" data-execution_count="8">
<div id="b35c6665" class="cell" data-execution_count="8">
<div class="sourceCode cell-code" id="cb13"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="im">from</span> pyspark.sql.functions <span class="im">import</span> lit</span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a>store_number <span class="op">=</span> lit(<span class="dv">4</span>).alias(<span class="st">'store_number'</span>)</span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a>store_address <span class="op">=</span> lit(<span class="st">'Amazonas Avenue, 324'</span>).alias(<span class="st">'store_address'</span>)</span>