@@ -294,19 +294,33 @@
"id": "AUbv3YL59D8Q"
},
"source": [
- "- `inputCols`: The name of the columns containing the input annotations. It can read either a String column or an Array.\n",
+ "- **inputCols**: Input annotation columns, typically `[\"sentence\", \"chunk\"]`. The `chunk` column provides the text spans, and the `sentence` column provides contextual information.\n",
"\n",
- "- `outputCol`: The name of the column in Document type that is generated. We can specify only one column here.\n",
+ "- **outputCol**: Name of the output column that will contain the resulting sentence-chunk embeddings.\n",
"\n",
- "- `chunkWeight`: Relative weight of chunk embeddings in comparison to sentence embeddings. The value should between 0 and 1. The default is 0.5, which means the chunk and sentence embeddings are given equal weight.\n",
+ "- **chunkWeight**: Relative weight of chunk embeddings compared to sentence embeddings. The value should be between 0 and 1. A value of `0.5` (default) means both chunk and sentence embeddings are given equal weight.\n",
"\n",
- "- `setMaxSentenceLength`: Sets max sentence length to process, by default 128.\n",
+ "- **strategy**: Strategy for computing embeddings. Supported options:\n",
+ " - `\"sentence_average\"`: Averages sentence and chunk embeddings (default).\n",
+ " - `\"scope_average\"`: Averages scope and chunk embeddings, where the scope is defined by the `scopeWindow`.\n",
+ " - `\"chunk_only\"`: Uses only the chunk embeddings.\n",
+ " - `\"scope_only\"`: Uses only the scope embeddings, defined by `scopeWindow`.\n",
"\n",
- "- `caseSensitive`: Determines whether the definitions of the white listed entities are case sensitive.\n",
+ "- **scopeWindow**: A tuple `(left, right)` defining how many tokens before and after the chunk are included when calculating scope embeddings. Defaults to `(0, 0)`, meaning no additional context tokens are included.\n",
"\n",
- "- `strategy`: Strategy for computing embeddings. Supported strategies are: `sentence_average`, `scope_average`, `chunk_only`, `scope_only`. The default is `sentence_average`.\n",
+ "- **batchSize**: Number of sentences processed per batch during embedding computation. Affects performance and memory usage.\n",
"\n",
- "- `scopeWindow`: cope window to calculate scope embeddings. The scope window is defined by two non-negative integers. The default is [0, 0], which means only the chunk embeddings are used. The first integer defines the number of tokens before the chunk and the second integer defines the number of tokens after the chunk.\n",
+ "- **caseSensitive**: Whether to preserve case when matching tokens for embedding computation. Default: `True`.\n",
"\n",
+ "- **dimension**: The embedding vector dimension. This depends on the pretrained model (e.g., 768 for BERT base).\n",
+ "\n",
+ "- **storageRef**: Unique reference name identifying the embeddings source. Useful when sharing models across pipelines.\n",
+ "\n",
+ "- **lazyAnnotator**: Whether the annotator should load resources lazily in a `RecursivePipeline`. Default: `False`.\n",
+ "\n",
+ "- **isLong**: Whether to use `Long` type instead of `Int` for model inputs. Some BERT models require `Long` tensors. Default: `False`.\n",
+ "\n",
+ "- **configProtoBytes**: TensorFlow configuration serialized as a byte array. Intended for advanced users who want to fine-tune session settings.\n",
+ "\n",
"All the parameters can be set using the corresponding set method in camel case. For example, `.setInputCols()`.\n",
"\n",
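The averaging behind `chunkWeight` and the `strategy` options can be sketched in plain NumPy. This is an illustrative, hypothetical helper only, not the annotator's actual implementation (which computes the BERT embeddings internally):

```python
import numpy as np

def combine_embeddings(chunk_emb, context_emb, chunk_weight=0.5):
    """Weighted average of chunk and context embeddings.

    context_emb stands in for the sentence embedding ("sentence_average")
    or the scope embedding ("scope_average"). chunk_weight=0.5, the
    default, weights both equally; 1.0 reduces to "chunk_only" and 0.0
    to a context-only combination.
    """
    return chunk_weight * chunk_emb + (1.0 - chunk_weight) * context_emb

chunk = np.array([1.0, 0.0])     # stand-in chunk embedding
sentence = np.array([0.0, 1.0])  # stand-in sentence embedding

print(combine_embeddings(chunk, sentence))       # default 0.5 -> [0.5 0.5]
print(combine_embeddings(chunk, sentence, 1.0))  # chunk only  -> [1. 0.]
```

Under `scope_average`, the context embedding would be computed from the tokens selected by `scopeWindow` around the chunk rather than from the full sentence.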
1,101 changes: 1,100 additions & 1 deletion Spark_NLP_Udemy_MOOC/Healthcare_NLP/ChunkMapperFilterer.ipynb

Large diffs are not rendered by default.

152 changes: 78 additions & 74 deletions Spark_NLP_Udemy_MOOC/Healthcare_NLP/DeIdentification.ipynb


1,688 changes: 1,687 additions & 1 deletion Spark_NLP_Udemy_MOOC/Healthcare_NLP/DeIdentificationModel.ipynb


@@ -251,9 +251,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
- "- `batchSize` : Batch size for processing documents (default: 8).\n",
- "- `caseSensitive` : Whether the classifier is sensitive to text casing (default: false).\n",
- "- `maxSentenceLength` : Maximum input sentence length (text beyond this may be truncated)."
+ "- **inputCols**: Input columns containing `DOCUMENT` annotations.\n",
+ "\n",
+ "- **outputCol**: Output column name where classification results (`CATEGORY`) are stored.\n",
+ "\n",
+ "- **batchSize**: Batch size for processing documents. Default: `8`.\n",
+ "\n",
+ "- **caseSensitive**: Whether the classifier is sensitive to text casing. Default: `False`.\n",
+ "\n",
+ "- **maxSentenceLength**: Maximum input sentence length. Text beyond this limit may be truncated.\n"
]
},
{
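A minimal, framework-free sketch of what `batchSize` and `maxSentenceLength` control. The helper name is hypothetical; the real classifier handles truncation and batching internally:

```python
def make_batches(token_lists, batch_size=8, max_sentence_length=128):
    """Truncate each sentence to max_sentence_length tokens, then group
    the sentences into batches of batch_size for inference."""
    truncated = [tokens[:max_sentence_length] for tokens in token_lists]
    return [truncated[i:i + batch_size]
            for i in range(0, len(truncated), batch_size)]

docs = [["tok"] * 200, ["short", "text"], ["one"]]
batches = make_batches(docs, batch_size=2, max_sentence_length=128)
print(len(batches))        # 2: three sentences split into batches of 2
print(len(batches[0][0]))  # 128: first sentence truncated from 200 tokens
```

Larger `batchSize` values generally improve throughput at the cost of memory, while `maxSentenceLength` bounds the per-sentence input size the model sees.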