Commit bcd2f09

update
1 parent 674517d commit bcd2f09

2 files changed

+302
-150
lines changed

01-NLP-basics/04_fake_news_classifier.ipynb

+275-123
Large diffs are not rendered by default.

README.md

+27-27
@@ -300,9 +300,9 @@ Join us in unlocking the full potential of unstructured data using the power of
300300

301301
---
302302

303-
# `spaCy` for Natural Language Processing (NLP)
303+
## `spaCy` for Natural Language Processing (NLP)
304304

305-
## 1. Tokenization and Text Preprocessing
305+
### 1. Tokenization and Text Preprocessing
306306

307307
| Component | Description |
308308
|---------------------|-------------------------------------------------------------------------------------------------|
@@ -312,62 +312,62 @@ Join us in unlocking the full potential of unstructured data using the power of
312312
| Lemmatization | - Reduces words to their base or dictionary form (e.g., "better" becomes "good"). |
313313
| Dependency Parsing | - Analyzes grammatical relationships between words in a sentence. |
314314

315-
## 2. Word Vectors and Embeddings
315+
### 2. Word Vectors and Embeddings
316316

317317
| Component | Description |
318318
|---------------------|-------------------------------------------------------------------------------------------------|
319319
| Word Vectors | - Provides word vectors (word embeddings) for words in various languages. |
320320
| Pre-trained Models | - Offers pre-trained models with word embeddings for common NLP tasks. |
321321
| Similarity Analysis | - Measures word and document similarity based on word vectors. |
322322

323-
## 3. Text Classification
323+
### 3. Text Classification
324324

325325
| Component | Description |
326326
|---------------------|-------------------------------------------------------------------------------------------------|
327327
| Text Classification | - Supports text classification tasks using machine learning models. |
328328
| Custom Models | - Allows training custom text classification models with spaCy. |
329329

330-
## 4. Rule-Based Matching
330+
### 4. Rule-Based Matching
331331

332332
| Component | Description |
333333
|---------------------|-------------------------------------------------------------------------------------------------|
334334
| Rule-Based Matching | - Defines rules to identify and extract information based on patterns in text data. |
335335
| Phrase Matching | - Matches phrases and entities using custom rules. |
336336

337-
## 5. Entity Linking and Disambiguation
337+
### 5. Entity Linking and Disambiguation
338338

339339
| Component | Description |
340340
|---------------------|-------------------------------------------------------------------------------------------------|
341341
| Entity Linking | - Links named entities to external knowledge bases or databases (e.g., Wikipedia). |
342342
| Disambiguation | - Resolves entity mentions to the correct entity in a knowledge base. |
343343

344-
## 6. Text Summarization
344+
### 6. Text Summarization
345345

346346
| Component | Description |
347347
|---------------------|-------------------------------------------------------------------------------------------------|
348348
| Text Summarization | - Generates concise summaries of longer text documents. |
349349
| Extractive Summarization | - Summarizes text by selecting and extracting important sentences. |
350350
| Abstractive Summarization | - Summarizes text by generating new sentences that capture the essence of the content. |
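
The extractive row above can be illustrated with a minimal, standard-library sketch. This is not spaCy's API; it assumes a naive regex sentence splitter and scores each sentence by the average frequency of its words, which is only one of several possible scoring choices. The function name `extractive_summary` is hypothetical.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Pick the highest-scoring sentences and keep their original order."""
    # Naive split on ., ! or ? followed by whitespace (an assumption, not a real parser).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Lowercased word frequencies over the whole text.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Calling `extractive_summary(document, max_sentences=3)` returns up to three sentences copied verbatim from the input, which is what distinguishes extractive from abstractive summarization.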
351351

352-
## 7. Dependency Visualization
352+
### 7. Dependency Visualization
353353

354354
| Component | Description |
355355
|---------------------|-------------------------------------------------------------------------------------------------|
356356
| Dependency Visualization | - Creates visual representations of sentence grammatical structure and dependencies. |
357357

358-
## 8. Language Detection
358+
### 8. Language Detection
359359

360360
| Component | Description |
361361
|---------------------|-------------------------------------------------------------------------------------------------|
362362
| Language Detection  | - Detects the language of text data (typically through third-party extensions rather than the core pipeline). |
363363

364-
## 9. Named Entity Recognition (NER) Customization
364+
### 9. Named Entity Recognition (NER) Customization
365365

366366
| Component | Description |
367367
|---------------------|-------------------------------------------------------------------------------------------------|
368368
| NER Training | - Allows training custom named entity recognition models for specific entities or domains. |
369369

370-
## 10. Language Support
370+
### 10. Language Support
371371

372372
| Component | Description |
373373
|---------------------|-------------------------------------------------------------------------------------------------|
@@ -378,47 +378,47 @@ Join us in unlocking the full potential of unstructured data using the power of
378378

379379
---
380380

381-
# `Gensim` for Natural Language Processing (NLP)
381+
## `Gensim` for Natural Language Processing (NLP)
382382

383-
## 1. Word Embeddings and Word Vector Models
383+
### 1. Word Embeddings and Word Vector Models
384384

385385
| Component | Description |
386386
|---------------------|-------------------------------------------------------------------------------------------------|
387387
| Word2Vec | - Implements Word2Vec models for learning word embeddings from text data. |
388388
| FastText | - Provides FastText models for learning word embeddings, including subword information. |
389389
| Doc2Vec | - Learns document-level embeddings, allowing you to represent entire documents as vectors. |
390390

391-
## 2. Topic Modeling
391+
### 2. Topic Modeling
392392

393393
| Component | Description |
394394
|---------------------|-------------------------------------------------------------------------------------------------|
395395
| Latent Dirichlet Allocation (LDA) | - Implements LDA for discovering topics within a collection of documents. |
396396
| Latent Semantic Analysis (LSA) | - Performs LSA for extracting topics and concepts from large document corpora. |
397397
| Non-Negative Matrix Factorization (NMF) | - Applies NMF for topic modeling and feature extraction from text data. |
398398

399-
## 3. Similarity and Document Comparison
399+
### 3. Similarity and Document Comparison
400400

401401
| Component | Description |
402402
|---------------------|-------------------------------------------------------------------------------------------------|
403403
| Cosine Similarity | - Measures cosine similarity between vectors, useful for document and word similarity comparisons. |
404404
| Similarity Queries | - Supports similarity queries to find similar documents or words based on embeddings. |
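
As a rough illustration of the two rows above, the sketch below computes cosine similarity in plain Python and uses it for a small similarity query. It does not call Gensim; the toy vectors and the helper name `most_similar` are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query, vectors, top_n=2):
    """Return the top_n (name, score) pairs ranked by cosine similarity."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical 2-d "embeddings", for illustration only.
toy_vectors = {"cat": [1.0, 0.0], "kitten": [0.9, 0.1], "car": [0.0, 1.0]}
```

With real embeddings the vectors would have hundreds of dimensions, but the ranking logic stays the same.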
405405

406-
## 4. Text Preprocessing
406+
### 4. Text Preprocessing
407407

408408
| Component | Description |
409409
|---------------------|-------------------------------------------------------------------------------------------------|
410410
| Tokenization | - Provides text tokenization for splitting text into words or sentences. |
411411
| Stopwords Removal | - Removes common words from text data to improve the quality of topic modeling. |
412412
| Phrase Detection | - Detects common phrases or bigrams in text data. |
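
The three preprocessing steps above can be sketched with the standard library alone. This is a simplified stand-in, not Gensim's implementation: the stopword list is a hypothetical toy set, and bigrams are kept by a plain count threshold rather than a statistical score.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def frequent_bigrams(docs, min_count=2):
    """Return adjacent token pairs seen at least min_count times across docs."""
    counts = Counter()
    for tokens in docs:
        counts.update(zip(tokens, tokens[1:]))
    return {bigram for bigram, n in counts.items() if n >= min_count}


docs = [
    remove_stopwords(tokenize(t))
    for t in [
        "New York is a city in the state of New York.",
        "The New York subway is busy.",
    ]
]
```

Here `frequent_bigrams(docs)` keeps only the pair `("new", "york")`. Because stopwords are removed first, pairs may span a removed word; running phrase detection before stopword removal is an equally reasonable ordering.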
413413

414-
## 5. Model Training and Customization
414+
### 5. Model Training and Customization
415415

416416
| Component | Description |
417417
|---------------------|-------------------------------------------------------------------------------------------------|
418418
| Model Training | - Trains custom word embeddings models on your text data for specific applications. |
419419
| Model Serialization | - Allows you to save and load trained models for future use. |
420420

421-
## 6. Integration with Other Libraries
421+
### 6. Integration with Other Libraries
422422

423423
| Component | Description |
424424
|---------------------|-------------------------------------------------------------------------------------------------|
@@ -429,60 +429,60 @@ Join us in unlocking the full potential of unstructured data using the power of
429429

430430
---
431431

432-
# `Transformer` Based Models for Natural Language Processing (NLP)
432+
## `Transformer` Based Models for Natural Language Processing (NLP)
433433

434-
## 1. Hugging Face Transformers
434+
### 1. Hugging Face Transformers
435435

436436
| Component | Description |
437437
|---------------------|-------------------------------------------------------------------------------------------------|
438438
| Transformers Library | - Provides easy-to-use access to a wide range of pre-trained transformer models for NLP tasks. |
439439
| Pre-trained Models | - Includes models like BERT, GPT-2, RoBERTa, T5, and more, each specialized for specific NLP tasks. |
440440
| Fine-Tuning | - Supports fine-tuning pre-trained models on custom NLP datasets for various downstream applications. |
441441

442-
## 2. BERT (Bidirectional Encoder Representations from Transformers)
442+
### 2. BERT (Bidirectional Encoder Representations from Transformers)
443443

444444
| Component | Description |
445445
|---------------------|-------------------------------------------------------------------------------------------------|
446446
| BERT Models | - Pre-trained BERT models capture contextual information from both left and right context in text. |
447447
| Fine-Tuning | - Fine-tuning BERT for tasks like text classification, NER, and question-answering is widely adopted. |
448448
| Sentence Embeddings | - BERT embeddings can be used for sentence and document-level embeddings. |
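
The table does not say how token-level BERT outputs become a sentence embedding. One common choice, assumed here, is mean pooling over the non-padding tokens. The sketch below works on plain lists and is independent of any model library; `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions.

    token_vectors: list of equal-length float lists, one per token.
    attention_mask: list of 1/0 flags, 1 for real tokens and 0 for padding.
    """
    if not token_vectors:
        return []
    total = [0.0] * len(token_vectors[0])
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        return total  # all positions were padding; return the zero vector
    return [value / count for value in total]
```

Other pooling strategies, such as taking the vector of the first (`[CLS]`) token, are also used in practice.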
449449

450-
## 3. GPT (Generative Pre-trained Transformer)
450+
### 3. GPT (Generative Pre-trained Transformer)
451451

452452
| Component | Description |
453453
|---------------------|-------------------------------------------------------------------------------------------------|
454454
| GPT Models | - GPT-2 and GPT-3 models are popular for generating text and performing various NLP tasks. |
455455
| Text Generation | - GPT models are known for their text generation capabilities, making them useful for creative tasks. |
456456

457-
## 4. RoBERTa (A Robustly Optimized BERT Pretraining Approach)
457+
### 4. RoBERTa (A Robustly Optimized BERT Pretraining Approach)
458458

459459
| Component | Description |
460460
|---------------------|-------------------------------------------------------------------------------------------------|
461461
| RoBERTa Models | - RoBERTa builds upon BERT with optimization techniques, achieving better performance on many tasks. |
462462
| Fine-Tuning | - Fine-tuning RoBERTa for text classification and other tasks is common for improved accuracy. |
463463

464-
## 5. T5 (Text-to-Text Transfer Transformer)
464+
### 5. T5 (Text-to-Text Transfer Transformer)
465465

466466
| Component | Description |
467467
|---------------------|-------------------------------------------------------------------------------------------------|
468468
| T5 Models | - T5 models are designed for text-to-text tasks, allowing you to frame various NLP tasks in a unified manner. |
469469
| Task Agnostic | - T5 can handle a wide range of NLP tasks, from translation to summarization and question-answering. |
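
The "unified manner" above means that every task is expressed as an input string mapped to an output string, with a short task prefix on the input. The sketch below only builds such inputs. The two prefixes follow the convention used by the original T5 checkpoints, while the dictionary keys and the function name are hypothetical.

```python
# Task prefixes in the style of the original T5 checkpoints.
# The keys are hypothetical labels chosen for this example.
TASK_PREFIXES = {
    "translate_en_de": "translate English to German: ",
    "summarize": "summarize: ",
}


def to_text_to_text(task, text):
    """Frame an NLP task as a single prefixed input string."""
    try:
        prefix = TASK_PREFIXES[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None
    return prefix + text
```

The resulting string would be tokenized and passed to the model, which is expected to answer in plain text regardless of the task.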
470470

471-
## 6. XLNet (Generalized Autoregressive Pretraining for Language Understanding)
471+
### 6. XLNet (Generalized Autoregressive Pretraining for Language Understanding)
472472

473473
| Component | Description |
474474
|---------------------|-------------------------------------------------------------------------------------------------|
475475
| XLNet Models        | - XLNet improves upon BERT by training over permutations of the token factorization order, so each prediction can draw on context from both directions. |
476476
| Pre-training | - XLNet is pre-trained on vast text data and can be fine-tuned for various NLP applications. |
477477

478-
## 7. DistilBERT
478+
### 7. DistilBERT
479479

480480
| Component | Description |
481481
|---------------------|-------------------------------------------------------------------------------------------------|
482482
| DistilBERT Models | - DistilBERT is a distilled version of BERT, offering a smaller and faster alternative for NLP tasks. |
483483
| Efficiency          | - DistilBERT is reported to retain about 97% of BERT's language-understanding performance while being about 40% smaller and 60% faster. |
484484

485-
## 8. Transformers for Other Languages
485+
### 8. Transformers for Other Languages
486486

487487
| Component | Description |
488488
|---------------------|-------------------------------------------------------------------------------------------------|
