Dec 25 2020 - Using Spark GraphFrames in Azure Databricks.md
|
||
<!-- README.md was written in beautiful MacDown -->
# Dec 25 2020 - Using Spark GraphFrames in Azure Databricks | ||
|
||
<img src="images/logo-databricks.png" align="right" width="300" /> | ||
|
||
<!-- badges: start --> | ||
 | ||
|
||
<!-- badges: end --> | ||
|
||
<span style="font-size: x-large; font-weight: normal;">The Azure Databricks repository is
a set of blog posts, an Advent of 2020 present to readers for easier onboarding
to Azure Databricks!</span>
|
||
|
||
<!-- wp:paragraph --> | ||
<p>Series of Azure Databricks posts:</p> | ||
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:list --> | ||
<ul><li>Dec 01: <a rel="noreferrer noopener" href="https://tomaztsql.wordpress.com/2020/12/01/advent-of-2020-day-1-what-is-azure-databricks/" target="_blank">What is Azure Databricks</a></li><li>Dec 02: <a rel="noreferrer noopener" href="https://tomaztsql.wordpress.com/2020/12/02/advent-of-2020-day-2-how-to-get-started-with-azure-databricks/" target="_blank">How to get started with Azure Databricks</a></li><li>Dec 03: <a href="https://tomaztsql.wordpress.com/2020/12/03/advent-of-2020-day-3-getting-to-know-the-workspace-and-azure-databricks-platform/" target="_blank" rel="noreferrer noopener">Getting to know the workspace and Azure Databricks platform</a></li> | ||
<li>Dec 04: <a href="https://tomaztsql.wordpress.com/2020/12/04/advent-of-2020-day-4-creating-your-first-azure-databricks-cluster/" target="_blank" rel="noreferrer noopener">Creating your first Azure Databricks cluster</a></li> | ||
<li>Dec 05: <a href="https://tomaztsql.wordpress.com/2020/12/05/advent-of-2020-day-5-understanding-azure-databricks-cluster-architecture-workers-drivers-and-jobs/" target="_blank" rel="noreferrer noopener">Understanding Azure Databricks cluster architecture, workers, drivers and jobs</a></li> | ||
<li>Dec 06: <a href="https://tomaztsql.wordpress.com/2020/12/06/advent-of-2020-day-6-importing-and-storing-data-to-azure-databricks/" target="_blank" rel="noreferrer noopener">Importing and storing data to Azure Databricks</a></li> | ||
<li>Dec 07: <a href="https://tomaztsql.wordpress.com/2020/12/07/advent-of-2020-day-7-starting-with-databricks-notebooks-and-loading-data-to-dbfs/" target="_blank" rel="noreferrer noopener">Starting with Databricks notebooks and loading data to DBFS</a></li> | ||
<li>Dec 08: <a href="https://tomaztsql.wordpress.com/2020/12/08/advent-of-2020-day-8-using-databricks-cli-and-dbfs-cli-for-file-upload/" target="_blank" rel="noreferrer noopener"> Using Databricks CLI and DBFS CLI for file upload</a></li> | ||
<li>Dec 09: <a href="https://tomaztsql.wordpress.com/2020/12/09/advent-of-2020-day-9-connect-to-azure-blob-storage-using-notebooks-in-azure-databricks/" target="_blank" rel="noreferrer noopener">Connect to Azure Blob storage using Notebooks in Azure Databricks</a></li> | ||
<li>Dec 10: <a href="https://tomaztsql.wordpress.com/2020/12/10/advent-of-2020-day-10-using-azure-databricks-notebooks-with-sql-for-data-engineering-tasks/" target="_blank" rel="noreferrer noopener">Using Azure Databricks Notebooks with SQL for Data engineering tasks</a></li> | ||
<li>Dec 11: <a href="https://tomaztsql.wordpress.com/2020/12/11/advent-of-2020-day-11-using-azure-databricks-notebooks-with-r-language-for-data-analytics/" target="_blank" rel="noreferrer noopener">Using Azure Databricks Notebooks with R Language for data analytics</a></li> | ||
<li>Dec 12: <a href="https://tomaztsql.wordpress.com/2020/12/12/advent-of-2020-day-12-using-azure-databricks-notebooks-with-python-language-for-data-analytics/" target="_blank" rel="noreferrer noopener">Using Azure Databricks Notebooks with Python Language for data analytics</a></li> | ||
<li>Dec 13: <a href="https://tomaztsql.wordpress.com/2020/12/13/adventof-2020-day-13-using-python-databricks-koalas-with-azure-databricks/" target="_blank" rel="noreferrer noopener">Using Python Databricks Koalas with Azure Databricks</a></li> | ||
<li>Dec 14: <a href="https://tomaztsql.wordpress.com/2020/12/14/advent-of-2020-day-14-from-configuration-to-execution-of-databricks-jobs/" target="_blank" rel="noreferrer noopener">From configuration to execution of Databricks jobs</a></li> | ||
<li>Dec 15: <a href="https://tomaztsql.wordpress.com/2020/12/15/advent-of-2020-day-15-databricks-spark-ui-event-logs-driver-logs-and-metrics/" target="_blank" rel="noreferrer noopener">Databricks Spark UI, Event Logs, Driver logs and Metrics</a></li> | ||
<li>Dec 16: <a href="https://tomaztsql.wordpress.com/2020/12/16/advent-of-2020-day-16-databricks-experiments-models-and-mlflow/" target="_blank" rel="noreferrer noopener">Databricks experiments, models and MLFlow</a></li> | ||
<li>Dec 17: <a href="https://tomaztsql.wordpress.com/2020/12/17/advent-of-2020-day-17-end-to-end-machine-learning-project-in-azure-databricks/" target="_blank" rel="noreferrer noopener">End-to-End Machine learning project in Azure Databricks</a></li> | ||
<li>Dec 18: <a href="https://tomaztsql.wordpress.com/2020/12/18/advent-of-2020-day-18-using-azure-data-factory-with-azure-databricks/" target="_blank" rel="noreferrer noopener">Using Azure Data Factory with Azure Databricks</a></li> | ||
<li>Dec 19: <a href="https://tomaztsql.wordpress.com/2020/12/19/advent-of-2020-day-19-using-azure-data-factory-with-azure-databricks-for-merging-csv-files/" target="_blank" rel="noreferrer noopener">Using Azure Data Factory with Azure Databricks for merging CSV files</a></li> | ||
<li>Dec 20: <a href="https://tomaztsql.wordpress.com/2020/12/20/advent-of-2020-day-20-orchestrating-multiple-notebooks-with-azure-databricks/" target="_blank" rel="noreferrer noopener">Orchestrating multiple notebooks with Azure Databricks</a></li> | ||
<li>Dec 21: <a href="https://tomaztsql.wordpress.com/2020/12/21/advent-of-2020-day-21-using-scala-with-spark-core-api-in-azure-databricks/" target="_blank" rel="noreferrer noopener">Using Scala with Spark Core API in Azure Databricks</a></li> | ||
<li>Dec 22: <a href="https://tomaztsql.wordpress.com/2020/12/22/advent-of-2020-day-22-using-spark-sql-and-dataframes-in-azure-databricks/" target="_blank" rel="noreferrer noopener">Using Spark SQL and DataFrames in Azure Databricks</a></li> | ||
<li>Dec 23: <a href="https://tomaztsql.wordpress.com/2020/12/23/advent-of-2020-day-23-using-spark-streaming-in-azure-databricks/" target="_blank" rel="noreferrer noopener">Using Spark Streaming in Azure Databricks</a></li> | ||
|
||
<li>Dec 24: <a href="https://tomaztsql.wordpress.com/2020/12/24/advent-of-2020-day-24-using-spark-mllib-for-machine-learning-in-azure-databricks/" target="_blank" rel="noreferrer noopener">Using Spark MLlib for Machine Learning in Azure Databricks</a></li> | ||
|
||
|
||
</ul> | ||
<!-- /wp:list --> | ||
|
||
|
||
<!-- wp:paragraph --> | ||
<p>Yesterday we looked into the MLlib package for machine learning. And oh, boy, there are so many topics left to cover. But moving forward: today we will look into GraphFrames in Spark for Azure Databricks.</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>The last of the high-level APIs on the Spark engine covers graphs: GraphX (legacy) and GraphFrames. GraphFrames is a computation engine built on top of the Spark Core API that lets end-users take advantage of Spark DataFrames in Python and Scala. It gives you the possibility to build and transform graph-structured data at a massive scale.</p>
<!-- /wp:paragraph --> | ||
|
||
|
||
<div> | ||
<p> | ||
<img src="images/img174_21_1.png" width="600" align="center"/> | ||
</p> | ||
</div> | ||
|
||
<!-- wp:paragraph --> | ||
<p>In your workspace, create a new notebook called <em>Day25_Graph</em> and select <em>Python</em> as the language. You will need either an ML Databricks cluster or a cluster with the additional Python package installed. I installed the <strong>graphframes</strong> package using the <em>PyPI</em> installer:</p>
<!-- /wp:paragraph --> | ||
|
||
<div> | ||
<p> | ||
<img src="images/img220_25_1.png" width="600" align="center"/> | ||
</p> | ||
</div> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Before we begin, here are a couple of terms that I would like to explain:</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Edge (edges) - a link or a line between two nodes (points) in the network.</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Vertex (vertices) - a node or a point that is related to another node through a link.</p>
<!-- /wp:paragraph --> | ||
|
||
|
||
|
||
<div> | ||
<p> | ||
<img src="images/graph.png" width="400" align="center"/> | ||
</p> | ||
</div> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Motif - a structural pattern in the graph that describes more complex relationships involving edges and vertices, for example pairs of vertices with edges in both directions between them. The result of a motif search is a DataFrame in which the column names are given by the motif keys.</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Stateful query - a query that combines GraphFrame motif finding with filters on the result, where the filters use sequence operations over DataFrame columns. It is called stateful (as opposed to stateless) because it carries a state forward from one element of the sequence to the next.</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Now you can start using the notebook. Import the packages that we will need.</p> | ||
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code">from functools import reduce | ||
from pyspark.sql.functions import col, lit, when | ||
from graphframes import *</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
<!-- wp:paragraph --> | ||
### 1. Create a sample dataset
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>We will create a sample dataset (taken from the Databricks website) and load it into DataFrames, one for the vertices and one for the edges.</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Vertices:</p> | ||
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code">vertices = sqlContext.createDataFrame([ | ||
("a", "Alice", 34, "F"), | ||
("b", "Bob", 36, "M"), | ||
("c", "Charlie", 30, "M"), | ||
("d", "David", 29, "M"), | ||
("e", "Esther", 32, "F"), | ||
("f", "Fanny", 36, "F"), | ||
("g", "Gabby", 60, "F"), | ||
("h", "Mark", 45, "M"), | ||
("i", "Eddie", 60, "M"), | ||
("j", "Mandy", 21, "F") | ||
], ["id", "name", "age", "gender"])</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Edges:</p> | ||
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code">edges = sqlContext.createDataFrame([ | ||
("a", "b", "friend"), | ||
("b", "c", "follow"), | ||
("c", "b", "follow"), | ||
("f", "c", "follow"), | ||
("e", "f", "follow"), | ||
("e", "d", "friend"), | ||
("d", "a", "friend"), | ||
("a", "e", "friend"), | ||
("a", "h", "follow"), | ||
("a", "i", "follow"), | ||
("a", "j", "follow"), | ||
("j", "h", "friend"), | ||
("i", "c", "follow"), | ||
("i", "c", "friend"), | ||
("b", "j", "follow"), | ||
("d", "h", "friend"), | ||
("e", "j", "friend"), | ||
("h", "a", "friend") | ||
], ["src", "dst", "relationship"])</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Let's create a graph using vertices and edges:</p> | ||
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code">graph_sample = GraphFrame(vertices, edges) | ||
print(graph_sample)</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
<!-- wp:paragraph --> | ||
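<p>The GraphFrames package also ships a similar example graph, on which this dataset is based. Note that it may not contain the extra vertices and edges added above:</p>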
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code"># This example graph also comes with the GraphFrames package. | ||
from graphframes.examples import Graphs | ||
same_graph = Graphs(sqlContext).friends() | ||
print(same_graph)</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
<!-- wp:paragraph --> | ||
### 2. Querying the graph
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>We can display the vertices, the edges, the incoming degrees and the total degrees (incoming plus outgoing):</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code">display(graph_sample.vertices) | ||
# | ||
display(graph_sample.edges) | ||
# | ||
display(graph_sample.inDegrees) | ||
# | ||
display(graph_sample.degrees)</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
|
||
<div> | ||
<p> | ||
<img src="images/img221_25_2.png" width="400" align="center"/> | ||
</p> | ||
</div> | ||
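
<p>To make the degree columns less of a black box, here is a minimal plain-Python sketch (not the GraphFrames API) that counts incoming and outgoing edges for the sample data. The names <code>edge_list</code>, <code>in_degree</code>, <code>out_degree</code> and <code>degree</code> are illustrative only.</p>

```python
from collections import Counter

# (src, dst, relationship) tuples copied from the edges DataFrame above.
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]

# In-degree: how many edges point at a vertex.
in_degree = Counter(dst for _, dst, _ in edge_list)
# Out-degree: how many edges leave a vertex.
out_degree = Counter(src for src, _, _ in edge_list)
# Total degree is simply the sum of the two.
degree = in_degree + out_degree

for vertex_id in sorted(degree):
    print(vertex_id, in_degree[vertex_id], out_degree[vertex_id], degree[vertex_id])
```

<p>Gabby ("g") has no edges at all, so she does not appear in these counts.</p>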
|
||
<!-- wp:paragraph --> | ||
<p>You can also run aggregation functions on the vertices, for example to find the youngest age in the graph:</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code">youngest = graph_sample.vertices.groupBy().min("age") | ||
display(youngest)</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
<div> | ||
<p> | ||
<img src="images/img222_25_3.png" width="500" align="center"/> | ||
</p> | ||
</div> | ||
|
||
<!-- wp:paragraph --> | ||
### 3. Using motifs
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Using motifs you can build more complex relationships involving edges and vertices. The following cell finds the pairs of vertices with edges in both directions between them. The result is a DataFrame, in which the column names are given by the motif keys.</p> | ||
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code"># Search for pairs of vertices with edges in both directions between them. | ||
motifs = graph_sample.find("(a)-[e]->(h); (h)-[e2]->(a)") | ||
display(motifs)</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
<div> | ||
<p> | ||
<img src="images/img223_25_4.png" width="700" align="center"/> | ||
</p> | ||
</div> | ||
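
<p>As a rough illustration of what this motif looks for, the following plain-Python sketch (an assumption-laden stand-in, not how GraphFrames evaluates motifs) finds every ordered pair of vertices that is connected in both directions. <code>edge_list</code> and <code>mutual_pairs</code> are hypothetical names.</p>

```python
# (src, dst, relationship) tuples copied from the edges DataFrame above.
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]

# Ignore the relationship label and keep only the direction of each edge.
directed = {(src, dst) for src, dst, _ in edge_list}

# A pair matches when the reverse edge exists as well.
mutual_pairs = sorted((s, d) for s, d in directed if (d, s) in directed)
print(mutual_pairs)
```

<p>Each connected pair shows up twice, once per orientation, because the pattern is matched from both ends.</p>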
|
||
<!-- wp:paragraph --> | ||
### 4. Using filters
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>You can filter the relationships found by a motif and combine multiple predicates in a single filter.</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code"># The motif above names its vertices "a" and "h", so the filter must use those column names.
filtered = motifs.filter("(h.age > 30 or a.age > 30) and (a.gender = 'M' and h.gender = 'F')")
display(filtered)
# I guess Mark has a crush on Alice, but she just wants to be a follower :)</pre>
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
<div> | ||
<p> | ||
<img src="images/img224_25_5.png" width="700" align="center"/> | ||
</p> | ||
</div> | ||
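
<p>The intent of the filter (first vertex male, second vertex female, at least one of them older than 30) can be sketched in plain Python as follows. The four bidirectional pairs of the sample data are hard-coded here, and <code>people</code>, <code>keep</code> and <code>matches</code> are illustrative names, not part of GraphFrames.</p>

```python
# id -> (name, age, gender) for the vertices that take part in a two-way link.
people = {
    "a": ("Alice", 34, "F"), "b": ("Bob", 36, "M"),
    "c": ("Charlie", 30, "M"), "h": ("Mark", 45, "M"),
}

# Ordered vertex pairs that have edges in both directions in the sample data.
mutual_pairs = [("a", "h"), ("h", "a"), ("b", "c"), ("c", "b")]

def keep(first, second):
    """Apply the same predicates as the filter expression."""
    _, age1, gender1 = people[first]
    _, age2, gender2 = people[second]
    return (age1 > 30 or age2 > 30) and gender1 == "M" and gender2 == "F"

matches = [(people[x][0], people[y][0]) for x, y in mutual_pairs if keep(x, y)]
print(matches)
```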
|
||
<!-- wp:paragraph --> | ||
### 5. Stateful queries
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Stateful queries are sets of filters applied along a sequence while a state is carried from one step to the next, hence the name. You can combine GraphFrame motif finding with filters on the result, where the filters use sequence operations over DataFrame columns. Here is an example:</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code"># Find chains of 4 vertices.
chain4 = graph_sample.find("(a)-[ab]->(b); (b)-[bc]->(c); (c)-[cd]->(d)")

# Query on sequence, with state (cnt)
# (a) Define method for updating state given the next element of the motif.
def cumFriends(cnt, edge):
    relationship = col(edge)["relationship"]
    return when(relationship == "friend", cnt + 1).otherwise(cnt)

# (b) Use sequence operation to apply method to sequence of elements in motif.
# In this case, the elements are the 3 edges.
# Named edge_names so that the edges DataFrame defined earlier is not overwritten.
edge_names = ["ab", "bc", "cd"]
numFriends = reduce(cumFriends, edge_names, lit(0))

# Keep only the chains in which at least 2 of the 3 edges are "friend" edges.
chainWith2Friends2 = chain4.withColumn("num_friends", numFriends).where(numFriends >= 2)
display(chainWith2Friends2)</pre>
<!-- /wp:syntaxhighlighter/code --> | ||
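
<p>The same idea can be sketched without Spark: enumerate every chain of three consecutive edges and fold a counter over it with <code>reduce</code>. This is only a plain-Python illustration under the assumption that vertices may repeat within a chain; <code>edge_list</code>, <code>cum_friends</code> and <code>chains_with_2_friends</code> are hypothetical names.</p>

```python
from functools import reduce

# (src, dst, relationship) tuples copied from the edges DataFrame above.
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]

# Chains of three consecutive edges: each edge starts where the previous one ends.
chains = [
    (ab, bc, cd)
    for ab in edge_list
    for bc in edge_list if bc[0] == ab[1]
    for cd in edge_list if cd[0] == bc[1]
]

def cum_friends(cnt, edge):
    """Update the running state: add 1 when the edge is a 'friend' edge."""
    return cnt + 1 if edge[2] == "friend" else cnt

# Fold the counter over the three edges and keep chains with at least 2 friends.
chains_with_2_friends = [
    chain for chain in chains if reduce(cum_friends, chain, 0) >= 2
]
print(len(chains), len(chains_with_2_friends))
```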
|
||
<!-- wp:paragraph --> | ||
### 6. Standard graph algorithms
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>GraphFrames comes with a number of standard graph algorithms built in:</p> | ||
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:list --> | ||
<ul><li>Breadth-first search (BFS)</li><li>Connected components</li><li>Strongly connected components</li><li>Label Propagation Algorithm (LPA)</li><li>PageRank (regular and personalised)</li><li>Shortest paths</li><li>Triangle count</li></ul> | ||
<!-- /wp:list --> | ||
|
||
<!-- wp:paragraph --> | ||
#### 6.1. BFS - Breadth-first search; applying expressions through edges
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>BFS searches through the graph from the vertices that match one expression to the vertices that match another. The following query looks for the shortest paths from A: the person named Esther, to B: anyone who is 30 or younger.</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code">paths = graph_sample.bfs("name = 'Esther'", "age < 31") | ||
display(paths)</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
<div> | ||
<p> | ||
<img src="images/img225_26_6.png" width="700" align="center"/> | ||
</p> | ||
</div> | ||
|
||
|
||
<!-- wp:paragraph --> | ||
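<p>The search can also be refined by restricting which edges may be followed and how long a path may be. With this dataset, excluding the friend edges changes the result, because Esther's direct links to David and Mandy are both friend edges:</p>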
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code">filteredPaths = graph_sample.bfs( | ||
fromExpr = "name = 'Esther'", | ||
toExpr = "age < 31", | ||
edgeFilter = "relationship != 'friend'", | ||
maxPathLength = 3) | ||
display(filteredPaths)</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
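
<p>For intuition, here is a small plain-Python breadth-first search over the same sample data. It is a sketch under simplifying assumptions (it does not handle a start vertex that already matches the target expression) and is not how GraphFrames implements BFS. <code>bfs_paths</code>, <code>build_adjacency</code> and <code>ages</code> are illustrative names.</p>

```python
# (src, dst, relationship) tuples copied from the edges DataFrame above.
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]
ages = {"a": 34, "b": 36, "c": 30, "d": 29, "e": 32,
        "f": 36, "g": 60, "h": 45, "i": 60, "j": 21}

def build_adjacency(edges, edge_ok=lambda rel: True):
    """Map each source id to the set of destinations reachable over allowed edges."""
    adj = {}
    for src, dst, rel in edges:
        if edge_ok(rel):
            adj.setdefault(src, set()).add(dst)
    return adj

def bfs_paths(adj, starts, is_target, max_len):
    """Return all paths of the smallest length that end at a target vertex."""
    frontier = [[s] for s in starts]
    for _ in range(max_len):
        # Extend every path in the current level by one edge.
        frontier = [path + [nxt]
                    for path in frontier
                    for nxt in sorted(adj.get(path[-1], ()))]
        hits = [path for path in frontier if is_target(path[-1])]
        if hits:
            return hits
    return []

young = lambda v: ages[v] < 31
all_paths = bfs_paths(build_adjacency(edge_list), ["e"], young, 10)
no_friend_paths = bfs_paths(
    build_adjacency(edge_list, lambda rel: rel != "friend"), ["e"], young, 3)
print(all_paths, no_friend_paths)
```

<p>Without a filter, Esther ("e") reaches David and Mandy in one hop; once friend edges are excluded, the shortest route to someone under 31 goes through Fanny to Charlie.</p>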
|
||
<!-- wp:paragraph --> | ||
#### 6.2. Shortest Path | ||
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Computes shortest paths to the given set of "landmark" vertices, where landmarks are specified by vertex ID.</p> | ||
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:syntaxhighlighter/code --> | ||
<pre class="wp-block-syntaxhighlighter-code">results = graph_sample.shortestPaths(landmarks=["a", "d"]) | ||
display(results) | ||
#or | ||
results = graph_sample.shortestPaths(landmarks=["a", "d", "h"]) | ||
display(results)</pre> | ||
<!-- /wp:syntaxhighlighter/code --> | ||
|
||
|
||
<div> | ||
<p> | ||
<img src="images/img226_27_7.png" width="600" align="center"/> | ||
</p> | ||
</div> | ||
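
<p>The distances in that result can be reproduced with a plain-Python sketch that counts hops along the edge direction from every vertex to a landmark. It assumes that each edge counts as one hop and walks the edges backwards from the landmark; <code>hops_to</code> and <code>distances</code> are hypothetical names, not GraphFrames functions.</p>

```python
from collections import deque

# (src, dst, relationship) tuples copied from the edges DataFrame above.
edge_list = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("a", "h", "follow"),
    ("a", "i", "follow"), ("a", "j", "follow"), ("j", "h", "friend"),
    ("i", "c", "follow"), ("i", "c", "friend"), ("b", "j", "follow"),
    ("d", "h", "friend"), ("e", "j", "friend"), ("h", "a", "friend"),
]

def hops_to(landmark, edges):
    """Number of hops from every vertex that can reach `landmark` to `landmark`."""
    # Reverse adjacency: for each vertex, which vertices have an edge into it.
    preds = {}
    for src, dst, _ in edges:
        preds.setdefault(dst, set()).add(src)
    dist = {landmark: 0}
    queue = deque([landmark])
    while queue:
        node = queue.popleft()
        for pred in preds.get(node, ()):
            if pred not in dist:  # first visit is the shortest in an unweighted graph
                dist[pred] = dist[node] + 1
                queue.append(pred)
    return dist

distances = {landmark: hops_to(landmark, edge_list) for landmark in ["a", "d"]}
print(distances)
```

<p>Vertices that cannot reach a landmark, such as the isolated vertex "g", are simply missing from that landmark's map.</p>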
|
||
<!-- wp:paragraph --> | ||
<p>Tomorrow we will explore how to connect an Azure Machine Learning Services workspace and Azure Databricks.</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>The complete set of code and the notebook is available at the <a rel="noreferrer noopener" href="https://github.com/tomaztk/Azure-Databricks" target="_blank">GitHub repository</a>.</p>
<!-- /wp:paragraph --> | ||
|
||
<!-- wp:paragraph --> | ||
<p>Happy Coding and Stay Healthy!</p> | ||
<!-- /wp:paragraph --> |