
[Documentation] Take your application to production #349


Open · wants to merge 5 commits into main
87 changes: 87 additions & 0 deletions docs/take-to-prod.md
@@ -0,0 +1,87 @@
Taking your .NET for Apache Spark Application to Production
Contributor Author:

Comments from @suhsteve: "This title somehow makes me feel like after reading this I will know how to make my spark application production ready. Is this the purpose of this document? Or just to outline different spark-submit scenarios."

I believe the purpose of this doc is to tell users how to move their application to production. I am thinking of describing different scenarios, along with instructions on how to move the application to production in each of them. Any suggestions to make this doc more precise and explicit would be really appreciated.

===

# Table of Contents
This how-to provides general instructions on how to take your .NET for Apache Spark application to production.
Contributor Author:

Comments from @bamurtaugh: "What does it mean to take an app to production? Perhaps add a couple words/sentence defining that (does it just mean running on-prem? Deploying to cloud? Building and running spark-submit? CI/CD?)"

Great point! @rapoth, could you please help elaborate on this a little more?

In this documentation, we summarize the most common scenarios encountered when running a .NET for Apache Spark application.
You will also learn how to package your application and submit it with [spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) and [Apache Livy](https://livy.incubator.apache.org/).
- [How to deploy your application when you have a single dependency](#how-to-deploy-your-application-when-you-have-a-single-dependency)
- [Scenarios](#scenarios)
- [Package your application](#package-your-application)
- [Launch your application](#launch-your-application)
- [How to deploy your application when you have multiple dependencies](#how-to-deploy-your-application-when-you-have-multiple-dependencies)
- [Scenarios](#scenarios-1)
- [Package your application](#package-your-application-1)
- [Launch your application](#launch-your-application-1)

## How to deploy your application when you have a single dependency
Contributor Author:

Comments from @bamurtaugh: "What does single dependency mean? I think it could help users to include a short explanation here or at the top of the document of what a dependency means in the .NET for Spark context."

Actually, I am not so sure we should use single dependency and multiple dependencies to define and separate these scenarios. @rapoth and @imback82, any suggestions? Thanks.

### Scenarios
#### 1. SparkSession code and business logic in the same Program.cs file
Contributor Author:

@suhsteve I will keep the numbering here to make it clearer.

The `SparkSession` code and business logic (UDFs) are contained in the same `Program.cs` file.
#### 2. SparkSession code and business logic in the same project, but different .cs files
The `SparkSession` code and business logic (UDFs) are in different `.cs` files, both contained in the same project (e.g. the `SparkSession` code in Program.cs, the business logic in BusinessLogic.cs, and both in mySparkApp.csproj).

### Package your application
Contributor Author:

Comments from @suhsteve: "I don't know if this section is too useful."

I think it will be very helpful if we can put more detailed instructions here. Any suggestions?

Please follow [Get Started](https://github.com/dotnet/spark/#get-started) to build your application.
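
As a minimal sketch (assuming the .NET Core SDK is installed and your project is named mySparkApp.csproj, as in the Get Started guide), the build step might look like:
```shell
# Build the application; the output folder (e.g. bin/Debug/<dotnet version>/)
# contains the mySparkApp.dll that the spark-submit example below references.
cd mySparkApp
dotnet build
```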

### Launch your application
#### 1. Using spark-submit
Please make sure you have the [pre-requisites](https://github.com/dotnet/spark/blob/master/docs/getting-started/windows-instructions.md#pre-requisites) in place before running the following command.
```powershell
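# --files ships mySparkApp.dll to the driver and executors; the microsoft-spark jar contains the
# DotnetRunner entry point, and everything after the jar path is the command used to launch the
# .NET application followed by its own arguments.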
%SPARK_HOME%\bin\spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master yarn \
--deploy-mode cluster \
--files <some dir>\<dotnet version>\mySparkApp.dll \
<some dir>\<dotnet version>\microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar \
dotnet <some dir>\<dotnet version>\mySparkApp.dll <app arg 1> <app arg 2> ... <app arg n>
```
Member:

`dotnet` is used in this example, but won't this fail in user scenarios where `dotnet` may not be available on their cluster?

Contributor Author:

Yes, I agree, but I have put the pre-requisites in this example.

Or we could also put an example like `mySparkApp args` or `dotnet mySparkApp.dll args`, which gives both options depending on the cluster environment.

#### 2. Using Apache Livy
```json
{
  "file": "adl://<cluster name>.azuredatalakestore.net/<some dir>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar",
  "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
  "files": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.dll"],
  "args": ["dotnet", "adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.dll", "<app arg 1>", "<app arg 2>", "...", "<app arg n>"]
}
```
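
As an illustrative sketch (assuming the JSON payload above is saved as mySparkApp-batch.json and your Livy endpoint is reachable at http://<livy server>:8998), the batch can be submitted to Livy's REST API with curl:
```shell
# Hypothetical example: POST the JSON payload above to the Livy batch sessions endpoint.
# Replace <livy server> with your endpoint and add any authentication options your cluster requires.
curl -X POST \
  -H "Content-Type: application/json" \
  -d @mySparkApp-batch.json \
  http://<livy server>:8998/batches
```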

## How to deploy your application when you have multiple dependencies
### Scenarios
#### 1. SparkSession code in one project that references another project including the business logic
The `SparkSession` code is in one project (e.g. mySparkApp.csproj) and the business logic (UDFs) is in another project (e.g. businessLogic.csproj).
#### 2. SparkSession code references a function from a NuGet package that has been installed in the csproj
The `SparkSession` code references a function from a NuGet package.
#### 3. SparkSession code references a function from a DLL on the user's machine
The `SparkSession` code references business logic (UDFs) from a DLL on the user's machine (e.g. the `SparkSession` code is in mySparkApp.csproj and the business logic is a separately built businessLogic.dll).
#### 4. SparkSession code references functions and business logic from multiple projects/solutions that themselves depend on multiple NuGet packages
This is a more complex use case in which the `SparkSession` code references business logic (UDFs) and functions from NuGet packages across multiple projects and/or solutions.

### Package your application
Please see detailed steps [here](https://github.com/dotnet/spark/tree/master/deployment#preparing-your-spark-net-app) on how to build, publish and zip your application. After packaging your .NET for Apache Spark application, you will have a zip file (e.g. mySparkApp.zip) which has all the dependencies.
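
As a rough sketch of that flow (the target framework, runtime identifier, and paths below are assumptions; adjust them to your environment):
```shell
# Publish the app together with its project and NuGet dependencies, then zip the publish output
# so it can be shipped to the cluster as a single archive (mySparkApp.zip).
# netcoreapp3.1 and linux-x64 are example values; use your cluster's framework and runtime.
cd mySparkApp
dotnet publish -c Release -f netcoreapp3.1 -r linux-x64
cd bin/Release/netcoreapp3.1/linux-x64/publish
zip -r ../mySparkApp.zip .
```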

### Launch your application
#### 1. Using spark-submit
```shell
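# --archives unpacks each uploaded zip on every node; the '#udfs' suffix extracts businessLogics.zip
# into a directory named ./udfs, and DOTNET_ASSEMBLY_SEARCH_PATHS tells the .NET worker where to
# probe for the UDF and library assemblies at runtime.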
spark-submit \
--class org.apache.spark.deploy.dotnet.DotnetRunner \
--master yarn \
--deploy-mode cluster \
--conf spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS=./udfs,./myLibraries.zip \
--archives hdfs://<path to your files>/businessLogics.zip#udfs,hdfs://<path to your files>/myLibraries.zip \
hdfs://<path to jar file>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar \
hdfs://<path to your files>/mySparkApp.zip mySparkApp <app arg 1> <app arg 2> ... <app arg n>
```
#### 2. Using Apache Livy
```json
{
  "file": "adl://<cluster name>.azuredatalakestore.net/<some dir>/microsoft-spark-<spark_majorversion.spark_minorversion.x>-<spark_dotnet_version>.jar",
  "className": "org.apache.spark.deploy.dotnet.DotnetRunner",
  "conf": {"spark.yarn.appMasterEnv.DOTNET_ASSEMBLY_SEARCH_PATHS": "./udfs,./myLibraries.zip"},
  "archives": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/businessLogics.zip#udfs", "adl://<cluster name>.azuredatalakestore.net/<some dir>/myLibraries.zip"],
  "args": ["adl://<cluster name>.azuredatalakestore.net/<some dir>/mySparkApp.zip", "mySparkApp", "<app arg 1>", "<app arg 2>", "...", "<app arg n>"]
}
```
Member:

You should say somewhere that scenarios 1 and 2 can also be submitted using these submission examples.

Contributor Author:

Do you mean scenarios 1 and 2 in the single dependency section?
They can submit using the example here, but they would have to zip the single DLL first; wouldn't that be more work for the user in that case? How about we also specify the other example usage of `mySparkApp args`, as I mentioned earlier?

Member:

If users are manually submitting then yes, but in "production" use, users would most likely automate the packaging and submission. It would be good to know that they don't have to take different steps if they have single vs multiple dependencies.

Contributor Author:

If we would like to unify these two sections, should we remove the command example in the first section?

I just feel like we need to add more context to this whole instruction in general; any suggestions would be really appreciated.