[Documentation] Take your application to production #349
Taking your .NET for Apache Spark Application to Production
===

# Table of Contents
This how-to provides general instructions on how to take your .NET for Apache Spark application to production. Here, taking an application to production means packaging the application together with all of its dependencies and submitting it to run on a cluster (for example, with spark-submit or Apache Livy) rather than on your local machine.
> **Review comment** from @bamurtaugh: "What does it mean to take an app to production? Perhaps add a couple words/sentence defining that (does it just mean running on-prem? Deploying to cloud? Building and running […])"
> Reply: Great point! @rapoth, could you please help with elaborating this a little more?
In this documentation, we summarize the most common scenarios you will encounter when running a .NET for Apache Spark application.
You will also learn how to package your application and submit it with [spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) and [Apache Livy](https://livy.incubator.apache.org/).
- [How to deploy your application when you have a single dependency](#how-to-deploy-your-application-when-you-have-a-single-dependency)
  - [Scenarios](#scenarios)
  - [Package your application](#package-your-application)
  - [Launch your application](#launch-your-application)
- [How to deploy your application when you have multiple dependencies](#how-to-deploy-your-application-when-you-have-multiple-dependencies)
  - [Scenarios](#scenarios-1)
  - [Package your application](#package-your-application-1)
  - [Launch your application](#launch-your-application-1)
## How to deploy your application when you have a single dependency
A single dependency here means that your whole application, including any UDFs, compiles into one assembly (e.g. mySparkApp.dll), so that assembly is the only file you need to ship besides the Microsoft.Spark jar.

> **Review comment** from @bamurtaugh: "What does single dependency mean? I think it could help users to include a short explanation here or at the top of the document of what a dependency means in the .NET for Spark context."
> Reply: Actually, I am not so sure if we should use single dependency and multiple dependency to define and separate these scenarios. @rapoth and @imback82, any suggestions? Thanks.
### Scenarios
#### 1. SparkSession code and business logic in the same Program.cs file

> **Review comment** reply to @suhsteve: I will keep the number here to make it more clear.

The `SparkSession` code and business logic (UDFs) are contained in the same `Program.cs` file.
#### 2. SparkSession code and business logic in the same project, but different .cs files
The `SparkSession` code and business logic (UDFs) are in different `.cs` files, both contained in the same project (e.g. the `SparkSession` code in Program.cs, the business logic in BusinessLogic.cs, and both in mySparkApp.csproj).
### Package your application

> **Review comment** from @suhsteve: "I don't know if this section is too useful."
> Reply: I think it will be very helpful if we can put more detailed instructions here. Any suggestions?
Please follow [Get Started](https://github.com/dotnet/spark/#get-started) to build your application.
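
For reference, a minimal sketch of the build and publish steps (the target framework and runtime identifier below are assumptions; substitute the values that match your project and cluster):

```shell
# Build the app; this produces mySparkApp.dll under bin/.
cd mySparkApp
dotnet build

# Optionally, publish the app with all of its dependencies into one folder.
# netcoreapp3.1 and win-x64 are assumed values; adjust to your environment.
dotnet publish -c Release -f netcoreapp3.1 -r win-x64
```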
### Launch your application
#### 1. Using spark-submit
Please make sure you have the [pre-requisites](https://github.com/dotnet/spark/blob/master/docs/getting-started/windows-instructions.md#pre-requisites) in place before running the following command.
```shell
%SPARK_HOME%\bin\spark-submit ^
--class org.apache.spark.deploy.dotnet.DotnetRunner ^
--master yarn ^
--deploy-mode cluster ^
--files <some dir>\<dotnet version>\mySparkApp.dll ^
<some dir>\<dotnet version>\microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar ^
dotnet <some dir>\<dotnet version>\mySparkApp.dll <app arg 1> <app arg 2> ... <app arg n>
```

> **Review comment** reply: Yes, I agree, but I have put the pre-requisites in this example. Or we can also put an example like […]
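
In that spirit, a hypothetical filled-in invocation (the version numbers and file names below are illustrative assumptions, not tested values) could look like:

```shell
REM Hypothetical example: run locally with concrete (assumed) versions.
REM microsoft-spark-2.4.x-0.12.1.jar and input.txt are placeholders.
%SPARK_HOME%\bin\spark-submit ^
--class org.apache.spark.deploy.dotnet.DotnetRunner ^
--master local ^
microsoft-spark-2.4.x-0.12.1.jar ^
dotnet mySparkApp.dll input.txt
```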
#### 2. Using Apache Livy
```json
{
    "file": "adl://<cluster name>.azuredatalakestore.net/<some dir>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar",
    "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
    "files": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.dll"],
    "args": ["dotnet", "adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.dll", "<app arg 1>", "<app arg 2>", "...", "<app arg n>"]
}
```
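
A minimal sketch of submitting this payload through Livy's REST batch API follows (`<livy endpoint>` is a placeholder for your cluster's Livy address, 8998 is Livy's default port, and saving the payload as submit.json is an assumption):

```shell
# POST the JSON payload above to Livy's batch endpoint.
curl -k -X POST \
  --user "<user>:<password>" \
  -H "Content-Type: application/json" \
  -d @submit.json \
  "https://<livy endpoint>:8998/batches"
```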
## How to deploy your application when you have multiple dependencies
Multiple dependencies here means that, in addition to the main application assembly, you also have to ship the assemblies that contain your business logic (project references, NuGet packages, or pre-built DLLs) to the cluster.
### Scenarios
#### 1. SparkSession code in one project that references another project containing the business logic
The `SparkSession` code is in one project (e.g. mySparkApp.csproj) and the business logic (UDFs) is in another project (e.g. businessLogic.csproj). A sketch of the project wiring for scenarios 1 and 2 follows after this list of scenarios.
#### 2. SparkSession code references a function from a NuGet package that has been installed in the csproj
The `SparkSession` code references a function from a NuGet package.
#### 3. SparkSession code references a function from a DLL on the user's machine
The `SparkSession` code references business logic (UDFs) compiled into a standalone DLL (e.g. the `SparkSession` code in mySparkApp.csproj and the business logic in businessLogic.dll, built on a different machine).
#### 4. SparkSession code references functions and business logic from multiple projects/solutions that themselves depend on multiple NuGet packages
This is a more complex use case where the `SparkSession` code references business logic (UDFs) and functions from NuGet packages across multiple projects and/or solutions.
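
As referenced above, a minimal sketch of wiring up scenarios 1 and 2 on the project side (the project paths and the Newtonsoft.Json package are hypothetical examples):

```shell
# Scenario 1: reference the business-logic project from the main app.
dotnet add mySparkApp/mySparkApp.csproj reference businessLogic/businessLogic.csproj

# Scenario 2: install a NuGet package that the UDFs call into.
# Newtonsoft.Json is only an illustrative package name.
dotnet add mySparkApp/mySparkApp.csproj package Newtonsoft.Json
```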
### Package your application
Please see the detailed steps [here](https://github.com/dotnet/spark/tree/master/deployment#preparing-your-spark-net-app) on how to build, publish, and zip your application. After packaging your .NET for Apache Spark application, you will have a zip file (e.g. mySparkApp.zip) that contains all of its dependencies.
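
A minimal sketch of those steps (the target framework, runtime identifier, and zip tooling below are assumptions; the linked guide is authoritative):

```shell
# Publish the app and all of its dependencies into a single folder.
# netcoreapp3.1 and ubuntu.18.04-x64 are assumed values; match your cluster.
cd mySparkApp
dotnet publish -c Release -f netcoreapp3.1 -r ubuntu.18.04-x64 -o publish

# Zip the contents of the publish folder into mySparkApp.zip.
cd publish
zip -r ../mySparkApp.zip .
```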
### Launch your application
#### 1. Using spark-submit
Note that `DOTNET_ASSEMBLY_SEARCH_PATHS` takes a comma-separated list of paths; passing `--conf` twice for the same key would simply overwrite the first value.
```shell
spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs,./myLibraries.zip \
--archives hdfs://<path to your files>/businessLogics.zip#udfs,hdfs://<path to your files>/myLibraries.zip \
hdfs://<path to jar file>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar \
hdfs://<path to your files>/mySparkApp.zip mySparkApp <app arg 1> <app arg 2> ... <app arg n>
```
#### 2. Using Apache Livy
```json
{
    "file": "adl://<cluster name>.azuredatalakestore.net/<some dir>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar",
    "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
    "conf": {"spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS": "./udfs,./myLibraries.zip"},
    "archives": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/businessLogics.zip#udfs", "adl://<cluster name>.azuredatalakestore.net/<some dir>/myLibraries.zip"],
    "args": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.zip", "mySparkApp", "<app arg 1>", "<app arg 2>", "...", "<app arg n>"]
}
```
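
Once submitted, you can check on the job through the same REST API; a sketch (the endpoint, credentials, and `<batch id>` are placeholders):

```shell
# List submitted batches, or query the state of one batch by id.
curl -k --user "<user>:<password>" "https://<livy endpoint>:8998/batches"
curl -k --user "<user>:<password>" "https://<livy endpoint>:8998/batches/<batch id>/state"
```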

> **Review comment**: You should say somewhere that scenarios 1 and 2 can also be submitted using these submission examples.
> Reply: Do you mean scenarios 1 and 2 in the single dependency section?
> Reply: If users are manually submitting, then yes; but in "production" use, users would most likely automate the packaging and submission. It would be good to know that they don't have to take different steps if they have single vs. multiple dependencies.
> Reply: If we would like to unify these two sections, should we remove the command example in the first section? I just feel like we need to add more context to this whole instruction in general; any suggestions would be really appreciated.
> **Review comment** from @suhsteve: "This title somehow makes me feel like after reading this I will know how to make my Spark application production-ready. Is this the purpose of this document? Or just to outline different spark-submit scenarios?"
> Reply: I believe the purpose of this doc should be to tell users how to move their application to production. I am thinking of laying out the different scenarios along with instructions on how to move to production in each of them. Any suggestions to make this doc more precise and explicit would be really appreciated.