Add IDataView notebook #17

4 changes: 4 additions & 0 deletions machine-learning/.gitignore
@@ -0,0 +1,4 @@
*.cs
*.csproj
bin/**/*
obj/**/*
375 changes: 375 additions & 0 deletions machine-learning/IDataView.ipynb
@@ -0,0 +1,375 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IDataView\n",
"\n",
"In this notebook, we'll cover:\n",
"\n",
"- What is an IDataView?\n",
"- What's the difference between a DataFrame and an IDataView?\n",
"- How to create an IDataView?\n",
"- How to inspect data in an IDataView?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What is an IDataView?\n",
"\n",
"The [IDataView](https://docs.microsoft.com/dotnet/api/microsoft.ml.idataview?view=ml-dotnet) system is a set of interfaces and components that provide efficient, compositional processing of schematized data for machine learning and advanced analytics applications. It is designed to gracefully and efficiently handle high dimensional data and large data sets. It does not directly address distributed data and computation, but is suitable for single node processing of data partitions belonging to larger distributed data sets.\n",
"\n",
"### Schema\n",
"\n",
"IDataView has general schema support, in that a view can have an arbitrary number of columns, each having an associated name, index, data type, and optional annotation.\n",
"\n",
"Column names are case sensitive. Multiple columns can share the same name, in which case, one of the columns hides the others, in the sense that the name will map to one of the column indices, the visible one. \n",
"\n",
"All user interaction with columns should be via name, not index, so the hidden columns are generally invisible to the user. However, hidden columns are often useful for diagnostic purposes.\n",
"\n",
"### Supported Data Types\n",
"\n",
"The set of supported column data types forms an open type system, in the sense\n",
"that additional types can be added at any time and in any assembly. However,\n",
"there is a precisely defined set of standard types including:\n",
"\n",
"- Text\n",
"- Boolean\n",
"- Single and Double precision floating point\n",
"- Signed integer values using 1, 2, 4, or 8 bytes\n",
"- Unsigned integer values using 1, 2, 4, or 8 bytes\n",
"- Values for ids and probabilistically unique hashes, using 16 bytes\n",
"- Date time, date time zone, and timespan\n",
"- Key types\n",
"- Vector types\n",
"- Image types\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What's the difference between a DataFrame and an IDataView?\n",
"\n",
"DataFrame and IDataView are similar in that both represent data in a tabular format and let you apply transformations to it. Some key differences:\n",
"\n",
"- DataFrame only supports loading delimited files.\n",
"- DataFrame holds its data in memory, so you're limited by the amount of memory on your machine. IDataView is evaluated lazily and streams rows through cursors, so it can work with datasets larger than memory.\n",
"\n",
"The DataFrame is recommended for tasks like exploratory data analysis on a sample of your data. \n",
"\n",
"IDataView is recommended for training on larger datasets. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How to create an IDataView\n",
"\n",
"You can create an IDataView by using any of the methods for loading data:\n",
"\n",
"- TextLoader\n",
"- LoadFromTextFile\n",
"- LoadFromEnumerable\n",
"- Load"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Defining Schema\n",
"\n",
"IDataViews are schematized, so you need to provide a schema. There are several ways to define it:\n",
"\n",
"- Manually\n",
"- Classes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Manually defining IDataView Schema\n",
"\n",
"To manually define the schema, you can use `DataViewSchema.Builder`. "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"#r \"nuget:Microsoft.ML,1.7.1\""
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": "<div><div></div><div></div><div><strong>Installed Packages</strong><ul><li><span>Microsoft.ML, 1.7.1</span></li></ul></div></div>"
},
"execution_count": 1,
"metadata": {}
}
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"using Microsoft.ML;\n",
"using Microsoft.ML.Data;"
],
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's say that we have data that looks like the following:\n",
"\n",
"| Student Name | Score | \n",
"| --- | --- |\n",
"| Jane | 80 |\n",
"| John | 75 | \n",
"| Jack | 90 |\n",
"| Sally | 100 |\n",
"\n",
"We can define the schema as follows:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"var schemaBuilder = new DataViewSchema.Builder();\n",
"schemaBuilder.AddColumn(\"StudentName\", TextDataViewType.Instance);\n",
"schemaBuilder.AddColumn(\"Score\", NumberDataViewType.Single);\n",
"var schema = schemaBuilder.ToSchema();"
],
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we inspect the schema we can see its different properties."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"schema"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": "<table><thead><tr><th><i>index</i></th><th>Name</th><th>Index</th><th>IsHidden</th><th>Type</th><th>Annotations</th></tr></thead><tbody><tr><td>0</td><td>StudentName</td><td><div class=\"dni-plaintext\">0</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.ReadOnlyMemory&lt;System.Char&gt;</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr><tr><td>1</td><td>Score</td><td><div class=\"dni-plaintext\">1</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.Single</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr></tbody></table>"
},
"execution_count": 1,
"metadata": {}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Defining schema with classes\n",
"\n",
"You also have the option of creating new classes or using existing classes to define your schema. Using the same student data above, you can define the schema as follows:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"public class TestScores\n",
"{\n",
"\tpublic string StudentName {get;set;}\n",
"\tpublic float Score {get;set;}\n",
"}"
],
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Loading data\n",
"\n",
"You can load data from a flat file using either the `TextLoader` or the `LoadFromTextFile` method.\n",
"\n",
"#### Loading data from a TextLoader"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"// Initialize MLContext\n",
"var mlContext = new MLContext();"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"// Define TextLoader\n",
"var textLoader =\n",
" mlContext.Data.CreateTextLoader(\n",
" columns: new TextLoader.Column[]\n",
" {\n",
" new TextLoader.Column(\"StudentName\",DataKind.String, 0),\n",
" new TextLoader.Column(\"Score\", DataKind.Single, 1)\n",
" },\n",
" separatorChar: ',',\n",
" hasHeader: true);"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"// Create IDataView from the CSV file in the data folder\n",
"var textLoaderDataView = textLoader.Load(\"data/student-scores.csv\");"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"textLoaderDataView.Schema"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": "<table><thead><tr><th><i>index</i></th><th>Name</th><th>Index</th><th>IsHidden</th><th>Type</th><th>Annotations</th></tr></thead><tbody><tr><td>0</td><td>StudentName</td><td><div class=\"dni-plaintext\">0</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.ReadOnlyMemory&lt;System.Char&gt;</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr><tr><td>1</td><td>Score</td><td><div class=\"dni-plaintext\">1</div></td><td><div class=\"dni-plaintext\">False</div></td><td><table><thead><tr><th>RawType</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">System.Single</div></td></tr></tbody></table></td><td><table><thead><tr><th>Schema</th></tr></thead><tbody><tr><td><div class=\"dni-plaintext\">[ ]</div></td></tr></tbody></table></td></tr></tbody></table>"
},
"execution_count": 1,
"metadata": {}
}
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"// Specify column index from file via LoadColumn attribute\n",
"public class TestScoresAttributes\n",
"{\n",
"\t[LoadColumn(0)]\n",
"\tpublic string StudentName {get;set;}\n",
"\t\n",
"\t[LoadColumn(1)]\n",
"\tpublic float Score {get;set;}\n",
"}"
],
"outputs": []
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"var textLoaderAttributes = \n",
"\tmlContext.Data.CreateTextLoader<TestScoresAttributes>(separatorChar: ',', hasHeader:true);"
],
"outputs": []
},
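{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Loading data with LoadFromTextFile\n",
"\n",
"As a minimal sketch of the second option, `LoadFromTextFile` creates the loader and loads the file in a single call. The cell below reuses the `mlContext` and the `TestScoresAttributes` class defined earlier. It assumes the CSV file sits in a `data` folder next to this notebook; the variable name `textFileDataView` is only illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"// Sketch: build the loader and load the file in one call.\n",
"// Assumption: the path is relative to the notebook's working directory.\n",
"var textFileDataView =\n",
"    mlContext.Data.LoadFromTextFile<TestScoresAttributes>(\n",
"        path: \"data/student-scores.csv\",\n",
"        separatorChar: ',',\n",
"        hasHeader: true);"
],
"outputs": []
},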
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inspecting data in IDataView\n",
"\n",
"There are several ways to inspect the data in an IDataView:\n",
"\n",
"- Use cursors\n",
"- Convert to IEnumerable\n",
"\n",
"### Use cursors"
]
}
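,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A cursor walks the rows of an IDataView one at a time and only reads the columns you request. The cell below is a minimal sketch that assumes the `textLoaderDataView` created earlier, with its `StudentName` (text) and `Score` (Single) columns. The local variable names are illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"using System;\n",
"\n",
"// Look up the columns by name rather than by index.\n",
"var nameColumn = textLoaderDataView.Schema[\"StudentName\"];\n",
"var scoreColumn = textLoaderDataView.Schema[\"Score\"];\n",
"\n",
"// Request only the columns that will be read.\n",
"using (var cursor = textLoaderDataView.GetRowCursor(new[] { nameColumn, scoreColumn }))\n",
"{\n",
"    // Getters are typed delegates that copy the current row's value into a variable.\n",
"    var nameGetter = cursor.GetGetter<ReadOnlyMemory<char>>(nameColumn);\n",
"    var scoreGetter = cursor.GetGetter<float>(scoreColumn);\n",
"\n",
"    ReadOnlyMemory<char> name = default;\n",
"    float score = default;\n",
"\n",
"    // Advance row by row and print each value pair.\n",
"    while (cursor.MoveNext())\n",
"    {\n",
"        nameGetter(ref name);\n",
"        scoreGetter(ref score);\n",
"        Console.WriteLine($\"{name}: {score}\");\n",
"    }\n",
"}"
],
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Convert to IEnumerable\n",
"\n",
"Alternatively, `CreateEnumerable` maps each row to a strongly typed object. The `StudentScoreRow` class below is a hypothetical helper introduced only for this sketch; its property names and types are assumed to match the columns of `textLoaderDataView`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"dotnet_interactive": {
"language": "csharp"
}
},
"source": [
"using System;\n",
"\n",
"// Hypothetical row class: property names and types mirror the IDataView columns.\n",
"public class StudentScoreRow\n",
"{\n",
"    public string StudentName { get; set; }\n",
"    public float Score { get; set; }\n",
"}\n",
"\n",
"// reuseRowObject: false allocates a new object for every row.\n",
"var rows = mlContext.Data.CreateEnumerable<StudentScoreRow>(textLoaderDataView, reuseRowObject: false);\n",
"\n",
"foreach (var row in rows)\n",
"{\n",
"    Console.WriteLine($\"{row.StudentName}: {row.Score}\");\n",
"}"
],
"outputs": []
}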
],
"metadata": {
"kernelspec": {
"display_name": ".NET (C#)",
"language": "C#",
"name": ".net-csharp"
},
"language_info": {
"file_extension": ".cs",
"mimetype": "text/x-csharp",
"name": "C#",
"pygments_lexer": "csharp",
"version": "8.0"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
5 changes: 5 additions & 0 deletions machine-learning/data/student-scores.csv
@@ -0,0 +1,5 @@
Student Name, Score
Jane, 80
John, 75
Jack, 90
Sally, 100