[SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation

### What changes were proposed in this pull request?

This PR proposes to:
- add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It is based on Docker images: once somebody builds an image for a specific commit, other people can reuse that image.
Therefore, if we run Binder with images built from the release tags in Spark, virtually all users can launch the Jupyter notebooks instantly.
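
As a rough illustration of the point above, a Binder launch link for a given Git ref can be assembled as follows. This is only a sketch: the helper name `binder_url` is hypothetical and not part of this PR, and the URL layout is simply the one used by the links in this description.

```python
from urllib.parse import quote

# Path of the quickstart notebook, as used in the links in this description.
NOTEBOOK = "python/docs/source/getting_started/quickstart.ipynb"


def binder_url(owner: str, repo: str, ref: str, notebook: str = NOTEBOOK) -> str:
    """Build a mybinder.org launch URL (hypothetical helper, not part of this PR)."""
    # safe="" makes quote() percent-encode "/" as %2F, matching the links above.
    return "https://mybinder.org/v2/gh/{}/{}/{}?filepath={}".format(
        owner, repo, ref, quote(notebook, safe="")
    )


print(binder_url("HyukjinKwon", "spark", "SPARK-32204"))
```

Passing a release tag as `ref` instead of a branch name would give the kind of per-release link described above.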

<br/>

I made a simple demo to make it easier to review. Please see:
- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the link ("Live Notebook") in the main page wouldn't work since this PR is not merged yet.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

<br/>

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address.
Another way might be:
- open the notebook [here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- edit / change / update the notebook. Please feel free to change whatever you want. I can apply your changes as they are, or update them slightly when I apply them to this PR.
- download it as a `.ipynb` file:
    ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)
- upload the `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file, crediting you correctly, of course.
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html - my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

To improve PySpark's usability. The current quickstart for Python users is not very friendly.

### Does this PR introduce _any_ user-facing change?

Yes, it will add a documentation page, and expose a live notebook to PySpark users.

### How was this patch tested?

Manually tested; the GitHub Actions builds will also test it.

Closes apache#29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <[email protected]>
Co-authored-by: Fokko Driesprong <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon and Fokko committed Aug 26, 2020
1 parent 1354cf0 commit b541030
Showing 11 changed files with 1,244 additions and 6 deletions.
5 changes: 3 additions & 2 deletions .github/workflows/build_and_test.yml
@@ -226,7 +226,7 @@ jobs:
run: |
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
-pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme
+pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme ipython nbsphinx
- name: Install R 4.0
run: |
sudo sh -c "echo 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> /etc/apt/sources.list"
@@ -245,10 +245,11 @@ jobs:
ruby-version: 2.7
- name: Install dependencies for documentation generation
run: |
# pandoc is required to generate PySpark APIs as well in nbsphinx.
sudo apt-get install -y libcurl4-openssl-dev pandoc
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
-pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme
+pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx
gem install jekyll jekyll-redirect-from rouge
sudo Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2'), repos='https://cloud.r-project.org/')"
- name: Scala linter
1 change: 1 addition & 0 deletions binder/apt.txt
@@ -0,0 +1 @@
openjdk-8-jre
24 changes: 24 additions & 0 deletions binder/postBuild
@@ -0,0 +1,24 @@
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is used for Binder integration to install PySpark available in
# Jupyter notebook.

VERSION=$(python -c "exec(open('python/pyspark/version.py').read()); print(__version__)")
pip install "pyspark[sql,ml,mllib]<=$VERSION"
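
The two commands above can be sketched in Python as follows. This is a minimal illustration, not part of the commit: the temporary `version.py` and the version string `3.1.0.dev0` are made-up stand-ins for `python/pyspark/version.py`, and the helper name `read_version` is hypothetical.

```python
import os
import tempfile


def read_version(path: str) -> str:
    """Execute a version.py-style file and return its __version__ value."""
    namespace = {}
    with open(path) as f:
        # Same idea as the `python -c "exec(open(...).read())"` one-liner,
        # but with an explicit namespace instead of the global one.
        exec(f.read(), namespace)
    return namespace["__version__"]


# Stand-in for python/pyspark/version.py; the version string is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    version_file = os.path.join(tmp, "version.py")
    with open(version_file, "w") as f:
        f.write('__version__ = "3.1.0.dev0"\n')
    version = read_version(version_file)

# Requirement string that postBuild hands to pip.
requirement = "pyspark[sql,ml,mllib]<={}".format(version)
print(requirement)
```

The `<=` specifier presumably exists so that a development checkout can still resolve to an already published PySpark release.
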
3 changes: 2 additions & 1 deletion dev/create-release/spark-rm/Dockerfile
@@ -36,7 +36,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y"
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
# We should use the latest Sphinx version once this is fixed.
-ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1"
+ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1 ipython==7.16.1 nbsphinx==0.7.1"
ARG GEM_PKGS="jekyll:4.0.0 jekyll-redirect-from:0.16.0 rouge:3.15.0"

# Install extra needed repos and refresh.
@@ -75,6 +75,7 @@ RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
pip3 install $PIP_PKGS && \
# Install R packages and dependencies used when building.
# R depends on pandoc*, libssl (which are installed above).
# Note that PySpark doc generation also needs pandoc due to nbsphinx
$APT_INSTALL r-base r-base-dev && \
$APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf && \
Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \
16 changes: 16 additions & 0 deletions dev/lint-python
@@ -196,6 +196,22 @@ function sphinx_test {
return
fi

# TODO(SPARK-32666): Install nbsphinx in Jenkins machines
PYTHON_HAS_NBSPHINX=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("nbsphinx") is not None)')
if [[ "$PYTHON_HAS_NBSPHINX" == "False" ]]; then
echo "$PYTHON_EXECUTABLE does not have nbsphinx installed. Skipping Sphinx build for now."
echo
return
fi

# TODO(SPARK-32666): Install ipython in Jenkins machines
PYTHON_HAS_IPYTHON=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("IPython") is not None)')
if [[ "$PYTHON_HAS_IPYTHON" == "False" ]]; then
echo "$PYTHON_EXECUTABLE does not have ipython installed. Skipping Sphinx build for now."
echo
return
fi

echo "starting $SPHINX_BUILD tests..."
pushd python/docs &> /dev/null
make clean &> /dev/null
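
The availability checks in this hunk rely on `importlib.util.find_spec`. A minimal standalone sketch of the same test is shown below; the helper name `has_module` and the missing module name are hypothetical and only serve the illustration.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be found as a top-level module, without importing it."""
    return importlib.util.find_spec(name) is not None


# `json` ships with Python; the second name is a made-up module that should not exist.
print(has_module("json"))
print(has_module("no_such_module_for_this_sketch"))
```

Module names are matched case-sensitively on case-sensitive file systems, and the IPython package is imported as `IPython`, so the name passed to `find_spec` matters.
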
2 changes: 2 additions & 0 deletions dev/requirements.txt
@@ -4,3 +4,5 @@ PyGithub==1.26.0
Unidecode==0.04.19
sphinx
pydata_sphinx_theme
ipython
nbsphinx
2 changes: 1 addition & 1 deletion docs/README.md
@@ -63,7 +63,7 @@ See also https://github.com/sphinx-doc/sphinx/issues/7551.
-->

```sh
-$ sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme
+$ sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx
```

## Generating the Documentation HTML
14 changes: 13 additions & 1 deletion python/docs/source/conf.py
@@ -45,8 +45,20 @@
'sphinx.ext.viewcode',
'sphinx.ext.mathjax',
'sphinx.ext.autosummary',
'nbsphinx', # Converts Jupyter Notebook to reStructuredText files for Sphinx.
# For ipython directive in reStructuredText files. It is generated by the notebook.
'IPython.sphinxext.ipython_console_highlighting'
]

# Links used globally in the RST files.
# These are defined here to allow link substitutions dynamically.
rst_epilog = """
.. |binder| replace:: Live Notebook
.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
.. |examples| replace:: Examples
.. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
""".format(os.environ.get("RELEASE_TAG", "master"))

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
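
The `rst_epilog` substitution in this hunk can be tried in isolation with a sketch like the one below. It is reduced to the Binder link only; the helper name `render_epilog` and the tag `v0.0.0-example` are hypothetical placeholders, not values from this commit.

```python
import os

# Same template shape as in conf.py above, reduced to the Binder link.
EPILOG_TEMPLATE = """
.. |binder| replace:: Live Notebook
.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
"""


def render_epilog(environ=os.environ) -> str:
    """Fill in the Git ref, falling back to "master" when RELEASE_TAG is unset."""
    return EPILOG_TEMPLATE.format(environ.get("RELEASE_TAG", "master"))


# "v0.0.0-example" is a placeholder tag, not a real Spark release.
print(render_epilog({}))
print(render_epilog({"RELEASE_TAG": "v0.0.0-example"}))
```

With `RELEASE_TAG` unset the link targets `master`; a release build would presumably export the tag so that the published docs point at a matching Binder image.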

@@ -84,7 +96,7 @@

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
-exclude_patterns = ['_build']
+exclude_patterns = ['_build', '.DS_Store', '**.ipynb_checkpoints']

# The reST default role (used for this markup: `text`) to use for all
# documents.
4 changes: 4 additions & 0 deletions python/docs/source/getting_started/index.rst
@@ -20,3 +20,7 @@
Getting Started
===============

.. toctree::
:maxdepth: 2

quickstart