Merge pull request #7 from pedropark99/test-intro
Update README and restructure the intro of the book
pedropark99 authored Nov 26, 2023
2 parents 05e6053 + de3e01e commit d1e43a1
Showing 36 changed files with 1,245 additions and 3,258 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -17,4 +17,6 @@ Chapters/metastore_db
Chapters/*.html
Chapters/*/*

-Scripts/__pycache__/
+Scripts/__pycache__/
+
+index_files/
79 changes: 0 additions & 79 deletions Chapters/01-intro.qmd

This file was deleted.

11 changes: 5 additions & 6 deletions Chapters/02-python.qmd
@@ -34,9 +34,9 @@ As an example, lets create a simple "Hello world" program. First, open a new tex
print("Hello World!")
```

-It will be much easier to run this script, if you open the terminal inside the folder where you save the `hello.py` file. If you do not know how to do this, look at section @sec-open-terminal. After you opened the terminal inside the folder, just run the `python3 hello.py` command. As a result, python will execute `hello.py`, and, the text `Hello World!` should be printed to the terminal:
+It will be much easier to run this script if you open your OS's terminal inside the folder where you saved the `hello.py` file. Once you have opened the terminal inside the folder, just run the `python3 hello.py` command. As a result, Python will execute `hello.py`, and the text `Hello World!` should be printed to the terminal:

-```{terminal}
+```
Terminal$ python3 hello.py
```

@@ -48,8 +48,7 @@ But, if for some reason you could not open the terminal inside the folder, just

For example, if I saved `hello.py` inside my Documents folder, the path to this folder on Windows would be something like `"C:\Users\pedro\Documents"`. On the other hand, this path on Linux would be something like `"/home/pedro/Documents"`. So the command to change to this directory would be:

-```{terminal}
-#| eval: false
+```
# On Windows:
Terminal$ cd "C:\Users\pedro\Documents"
# On Linux:
@@ -190,7 +189,7 @@ This error occurs, because inside the print statement, we call the name `x`. But

A Python package (or a Python "library") is basically a set of functions and classes that provides important functionality to solve a specific problem. And `pyspark` is one of the many Python packages available.

-Python packages are usually published (that is, made available to the public) through the PyPI archive[^python-5]. If a python package is published in PyPI, then, you can easily install it through the `pip` tool, that we just used in @sec-install-spark.
+Python packages are usually published (that is, made available to the public) through the PyPI archive[^python-5]. If a Python package is published on PyPI, then you can easily install it through the `pip` tool.

[^python-5]: <https://pypi.org/>
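
Once a package is installed, you make its functions and classes available to your program with an `import` statement. Below is a minimal sketch (assuming the `pyspark` package is already installed on your machine):

```python
# A minimal sketch: load an installed package and inspect it.
# Assumes pyspark was installed beforehand, e.g. with `pip install pyspark`.
import pyspark

# Print the installed version of the package.
print(pyspark.__version__)
```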

@@ -209,7 +208,7 @@ ModuleNotFoundError: No module named 'pandas'

If your program produces this error, it is very likely that you are trying to use a package that is not currently installed on your machine. To install it, you may use the `pip install <name of the package>` command in the terminal of your OS.

-```{terminal, eval = FALSE}
+```
pip install pandas
```
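
After the installation finishes, you can quickly confirm that the package is now visible to Python. A hedged check, assuming `python3` is on your PATH:

```
Terminal$ python3 -c "import pandas; print(pandas.__version__)"
```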

8 changes: 4 additions & 4 deletions Chapters/03-spark.qmd
@@ -194,19 +194,19 @@ print(result)

### Executing the code

-Now that you have written your first Spark application with `pyspark`, you want to execute this application and see its results. Yet, to run a `pyspark` program, remember that you need to have the necessary software installed on your machine. I talk about how to install these software at @sec-install-spark.
+Now that you have written your first Spark application with `pyspark`, you want to execute it and see its results. Yet, to run a `pyspark` program, remember that you need to have the necessary software installed on your machine. In case you have not installed Apache Spark yet, I personally recommend reading the [articles from PhoenixNAP on how to install Apache Spark](https://phoenixnap.com/kb/install-spark-on-ubuntu)[^phoenix-nap].

-Anyway, to execute this `pyspark` that you wrote, you need send this script to the python interpreter, and to do this you need to: 1) open a terminal inside the folder where you python script is stored; and, 2) use the python command from the terminal with the name of your python script.
+[^phoenix-nap]: <https://phoenixnap.com/kb/install-spark-on-ubuntu>.
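
As a quick sanity check (a hedged suggestion, assuming the Spark binaries are on your PATH), you can ask Spark for its version before going any further:

```
Terminal$ spark-submit --version
```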

-If you do not know how to open a terminal from inside a folder of your computer, you can consult @sec-open-terminal of this book, where I teach you how to do it.
+Anyway, to execute the `pyspark` script that you wrote, you need to send it to the Python interpreter, and to do this you need to: 1) open a terminal inside the folder where your Python script is stored; and 2) use the `python3` command in the terminal with the name of your script.

In my current situation, I am running Spark on an Ubuntu distribution, and I saved the `spark-example.py` script inside a folder called `SparkExample`. This folder is located at the path `~/Documentos/Projetos/Livros/Introd-pyspark/SparkExample` on my computer. This means that I need to open a terminal that is rooted inside this `SparkExample` folder.

You have probably saved your `spark-example.py` file in a different folder on your computer, which means that you need to open the terminal from a different folder.

After I open a terminal rooted inside the `SparkExample` folder, I just use the `python3` command to access the Python interpreter and give it the name of the Python script that I want to execute, in this case, the `spark-example.py` file. As a result, our first `pyspark` program will be executed:

-```{terminal}
+```
Terminal$ cd ./../SparkExample
Terminal$ python3 spark-example.py
```
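
For reference, a script like `spark-example.py` boils down to creating a Spark session, describing a computation, and printing its result. The sketch below only illustrates that shape (it is not the exact script from the chapter):

```python
# spark-example.py -- a minimal sketch, not the exact script from the chapter.
from pyspark.sql import SparkSession

# Every pyspark program starts by creating (or reusing) a SparkSession.
spark = SparkSession.builder.getOrCreate()

# Describe a small computation: a DataFrame with the numbers 0 through 4.
df = spark.range(5)

# Trigger the computation and bring the rows back to the driver.
result = df.collect()
print(result)

spark.stop()
```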
Binary file added Figures/creative-commoms-88x31.png
40 changes: 39 additions & 1 deletion README.md
@@ -1 +1,39 @@
-# Introd-pyspark
+# Introd-pyspark

<a href="https://pedro-faria.netlify.app/publications/book/introd-pyspark/en/"><img src="Cover/cover1.png" width="250" height="366" class="cover" align="right"/></a> An open and introductory book about `pyspark`, the Python API of Apache Spark. The book "Introduction to pyspark" provides a quick introduction to this Python package.

With `pyspark`, you are able to use the Python language to write Spark applications and run them on a Spark cluster in a scalable and elegant way. This book focuses on teaching the fundamentals of `pyspark` and how to use it for big data analysis.

Some of the main subjects discussed in the book are:

- How does an Apache Spark application work?
- What are Spark DataFrames?
- How to transform and model your Spark DataFrame.
- How to import data into Apache Spark.
- How to work with SQL inside pyspark.
- Tools for manipulating specific data types (e.g. strings, dates and datetimes).
- How to use window functions.


## About the author

Pedro Duarte Faria has a bachelor's degree in Economics from the Federal University of Ouro Preto, Brazil. Currently, he is a Data Engineer at Blip, and an Associate Developer for Apache Spark 3.0 certified by Databricks.

The author has more than 3 years of experience in the data analysis market. He has developed data pipelines, reports and analyses for research institutions and some of the largest companies in the Brazilian financial sector, such as BMG Bank, Sodexo and Pan Bank, and has dealt with databases that go beyond a billion rows.

Furthermore, Pedro specializes in the R programming language, and has given several lectures and courses about it at graduate centers (such as PPEA-UFOP), as well as at federal and state organizations (such as FJP-MG). As a researcher, he has experience in the field of Science, Technology and Innovation Economics.

Personal Website: <https://pedro-faria.netlify.app/>

Twitter: [@PedroPark9](https://twitter.com/PedroPark9)

Mastodon: [@pedropark99@fosstodon.org](https://fosstodon.org/@pedropark99)


## License

Copyright © 2024 Pedro Duarte Faria. This book is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a>
4 changes: 0 additions & 4 deletions _quarto.yml
@@ -20,7 +20,6 @@ book:
  cover-image: "Cover/cover1.png"
  chapters:
    - index.qmd
-   - Chapters/01-intro.qmd
    - Chapters/02-python.qmd
    - Chapters/03-spark.qmd
    - Chapters/04-dataframes.qmd
@@ -32,9 +31,6 @@ book:
    - Chapters/09-strings.qmd
    - Chapters/10-datetime.qmd
    - Chapters/references.qmd
- appendices:
-   - Chapters/00-terminal.qmd
-   - Chapters/00-install-spark.qmd

bibliography: references.bib
number-sections: true
