Merge pull request #7 from pedropark99/test-intro
Update README and restructure the intro of the book
pedropark99 authored Nov 26, 2023
2 parents 05e6053 + de3e01e commit d1e43a1
Showing 36 changed files with 1,245 additions and 3,258 deletions.
4 changes: 3 additions & 1 deletion .gitignore
@@ -17,4 +17,6 @@ Chapters/metastore_db
Chapters/*.html
Chapters/*/*

-Scripts/__pycache__/
+Scripts/__pycache__/
+
+index_files/
79 changes: 0 additions & 79 deletions Chapters/01-intro.qmd

This file was deleted.

11 changes: 5 additions & 6 deletions Chapters/02-python.qmd
@@ -34,9 +34,9 @@ As an example, lets create a simple "Hello world" program. First, open a new tex
print("Hello World!")
```

-It will be much easier to run this script, if you open the terminal inside the folder where you save the `hello.py` file. If you do not know how to do this, look at section @sec-open-terminal. After you opened the terminal inside the folder, just run the `python3 hello.py` command. As a result, python will execute `hello.py`, and, the text `Hello World!` should be printed to the terminal:
+It will be much easier to run this script if you open your OS's terminal inside the folder where you saved the `hello.py` file. Once you have opened the terminal inside the folder, just run the `python3 hello.py` command. As a result, Python will execute `hello.py`, and the text `Hello World!` should be printed to the terminal:

-```{terminal}
+```
Terminal$ python3 hello.py
```

@@ -48,8 +48,7 @@ But, if for some reason you could not open the terminal inside the folder, just

For example, if I saved `hello.py` inside my Documents folder, the path to this folder on Windows would be something like `"C:\Users\pedro\Documents"`. On the other hand, this path on Linux would be something like `"/home/pedro/Documents"`. So the command to change to this directory would be:

-```{terminal}
-#| eval: false
+```
# On Windows:
Terminal$ cd "C:\Users\pedro\Documents"
# On Linux:
@@ -190,7 +189,7 @@ This error occurs, because inside the print statement, we call the name `x`. But

A Python package (or a Python "library") is basically a set of functions and classes that provides important functionality to solve a specific problem. And `pyspark` is one of the many Python packages available.

-Python packages are usually published (that is, made available to the public) through the PyPI archive[^python-5]. If a python package is published in PyPI, then, you can easily install it through the `pip` tool, that we just used in @sec-install-spark.
+Python packages are usually published (that is, made available to the public) through the PyPI archive[^python-5]. If a Python package is published on PyPI, then you can easily install it through the `pip` tool.

[^python-5]: <https://pypi.org/>
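
Once a package is installed, you make its functions and classes available to your program with an `import` statement. Below is a minimal sketch (assuming the `pyspark` package is already installed on your machine):

```python
# A minimal sketch: load an installed package and inspect it.
# Assumes pyspark was installed beforehand, e.g. with `pip install pyspark`.
import pyspark

# Print the installed version of the package.
print(pyspark.__version__)
```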

@@ -209,7 +208,7 @@ ModuleNotFoundError: No module named 'pandas'

If your program produces this error, it is very likely that you are trying to use a package that is not currently installed on your machine. To install it, you may use the `pip install <name of the package>` command in the terminal of your OS.

-```{terminal, eval = FALSE}
+```
pip install pandas
```
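
After the installation finishes, you can quickly confirm that the package is now visible to Python. A hedged check, assuming `python3` is on your PATH:

```
Terminal$ python3 -c "import pandas; print(pandas.__version__)"
```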

8 changes: 4 additions & 4 deletions Chapters/03-spark.qmd
@@ -194,19 +194,19 @@ print(result)

### Executing the code

-Now that you have written your first Spark application with `pyspark`, you want to execute this application and see its results. Yet, to run a `pyspark` program, remember that you need to have the necessary software installed on your machine. I talk about how to install these software at @sec-install-spark.
+Now that you have written your first Spark application with `pyspark`, you want to execute it and see its results. Yet, to run a `pyspark` program, remember that you need to have the necessary software installed on your machine. In case you have not installed Apache Spark yet, I personally recommend reading the [articles from PhoenixNAP on how to install Apache Spark](https://phoenixnap.com/kb/install-spark-on-ubuntu)[^phoenix-nap].

-Anyway, to execute this `pyspark` that you wrote, you need send this script to the python interpreter, and to do this you need to: 1) open a terminal inside the folder where you python script is stored; and, 2) use the python command from the terminal with the name of your python script.
+[^phoenix-nap]: <https://phoenixnap.com/kb/install-spark-on-ubuntu>.
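
As a quick sanity check (a hedged suggestion, assuming the Spark binaries are on your PATH), you can ask Spark for its version before going any further:

```
Terminal$ spark-submit --version
```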

-If you do not know how to open a terminal from inside a folder of your computer, you can consult @sec-open-terminal of this book, where I teach you how to do it.
+Anyway, to execute the `pyspark` script that you wrote, you need to send it to the Python interpreter, and to do this you need to: 1) open a terminal inside the folder where your Python script is stored; and 2) use the `python3` command in the terminal with the name of your script.

In my current situation, I am running Spark on an Ubuntu distribution, and I saved the `spark-example.py` script inside a folder called `SparkExample`. This folder is located at the path `~/Documentos/Projetos/Livros/Introd-pyspark/SparkExample` on my computer. This means that I need to open a terminal that is rooted inside this `SparkExample` folder.

You have probably saved your `spark-example.py` file in a different folder on your computer, which means that you need to open the terminal from a different folder.

After I open a terminal rooted inside the `SparkExample` folder, I just use the `python3` command to access the Python interpreter and give it the name of the Python script that I want to execute, in this case, the `spark-example.py` file. As a result, our first `pyspark` program will be executed:

-```{terminal}
+```
Terminal$ cd ./../SparkExample
Terminal$ python3 spark-example.py
```
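
For reference, a script like `spark-example.py` boils down to creating a Spark session, describing a computation, and printing its result. The sketch below only illustrates that shape (it is not the exact script from the chapter):

```python
# spark-example.py -- a minimal sketch, not the exact script from the chapter.
from pyspark.sql import SparkSession

# Every pyspark program starts by creating (or reusing) a SparkSession.
spark = SparkSession.builder.getOrCreate()

# Describe a small computation: a DataFrame with the numbers 0 through 4.
df = spark.range(5)

# Trigger the computation and bring the rows back to the driver.
result = df.collect()
print(result)

spark.stop()
```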
Binary file added Figures/creative-commoms-88x31.png
40 changes: 39 additions & 1 deletion README.md
@@ -1 +1,39 @@
-# Introd-pyspark
+# Introd-pyspark

<a href="https://pedro-faria.netlify.app/publications/book/introd-pyspark/en/"><img src="Cover/cover1.png" width="250" height="366" class="cover" align="right"/></a> An open and introductory book about `pyspark`, the Python API of Apache Spark. The book "Introduction to pyspark" provides a quick introduction to this Python package.

With `pyspark`, you are able to use the Python language to write Spark applications and run them on a Spark cluster in a scalable and elegant way. This book focuses on teaching the fundamentals of `pyspark` and how to use it for big data analysis.

Some of the main subjects discussed in the book are:

- How does an Apache Spark application work?
- What are Spark DataFrames?
- How to transform and model your Spark DataFrame.
- How to import data into Apache Spark.
- How to work with SQL inside pyspark.
- Tools for manipulating specific data types (e.g. strings, dates and datetimes).
- How to use window functions.


## About the author

Pedro Duarte Faria has a bachelor's degree in Economics from the Federal University of Ouro Preto, Brazil. Currently, he is a Data Engineer at Blip, and an Associate Developer for Apache Spark 3.0 certified by Databricks.

The author has more than 3 years of experience in the data analysis market. He has developed data pipelines, reports and analyses for research institutions and some of the largest companies in the Brazilian financial sector, such as BMG Bank, Sodexo and Pan Bank, and has dealt with databases that go beyond a billion rows.

Furthermore, Pedro specializes in the R programming language, and has given several lectures and courses about it at graduate centers (such as PPEA-UFOP), as well as at federal and state organizations (such as FJP-MG). As a researcher, he has experience in the field of Science, Technology and Innovation Economics.

Personal Website: <https://pedro-faria.netlify.app/>

Twitter: [@PedroPark9](https://twitter.com/PedroPark9)

Mastodon: [@pedropark99@fosstodon.org](https://fosstodon.org/@pedropark99)


## License

Copyright © 2024 Pedro Duarte Faria. This book is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).

<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a>
4 changes: 0 additions & 4 deletions _quarto.yml
@@ -20,7 +20,6 @@ book:
  cover-image: "Cover/cover1.png"
  chapters:
    - index.qmd
-   - Chapters/01-intro.qmd
    - Chapters/02-python.qmd
    - Chapters/03-spark.qmd
    - Chapters/04-dataframes.qmd
@@ -32,9 +31,6 @@ book:
    - Chapters/09-strings.qmd
    - Chapters/10-datetime.qmd
    - Chapters/references.qmd
- appendices:
-   - Chapters/00-terminal.qmd
-   - Chapters/00-install-spark.qmd

bibliography: references.bib
number-sections: true
