delete redundant files (PaddlePaddle#320)

moebius21 authored Apr 28, 2021
1 parent a79c9cb commit aca073a
Showing 17 changed files with 164 additions and 99 deletions.
7 changes: 0 additions & 7 deletions docs/advanced_guide.rst

This file was deleted.

16 changes: 0 additions & 16 deletions docs/api_reference.rst

This file was deleted.

11 changes: 0 additions & 11 deletions docs/community.rst

This file was deleted.

3 changes: 3 additions & 0 deletions docs/community/contribute_docs.rst
@@ -0,0 +1,3 @@
==================================
How to Contribute Q&A and Examples
==================================
File renamed without changes.
14 changes: 0 additions & 14 deletions docs/data_prepare.rst

This file was deleted.

3 changes: 0 additions & 3 deletions docs/faq.rst

This file was deleted.

13 changes: 0 additions & 13 deletions docs/get_started.rst

This file was deleted.

28 changes: 18 additions & 10 deletions docs/index.rst
@@ -45,19 +45,24 @@
Custom Datasets <data_prepare/dataset_self_defined>
Data Preprocessing <data_prepare/data_preprocess>

.. toctree::
:maxdepth: 2
:caption: Model Zoo

Pre-trained Models <model_zoo/transformers.md>
Basic Building Blocks <model_zoo/others>

.. toctree::
:maxdepth: 2
:caption: Evaluation Metrics

Evaluation Metrics <metrics/metrics.md>

.. toctree::
:maxdepth: 2
:caption: Hands-on Tutorials

Text Classification <tutorials/classify>
Word Embeddings <tutorials/embedding>
Semantic Matching <tutorials/semantic_matching>
Text Generation <tutorials/text_generation>
Machine Translation <tutorials/machine_translation>
Reading Comprehension <tutorials/reading_comprehension>
General Dialogue <tutorials/general_dialogue>
Sequence Labeling <tutorials/ner>
Lexical Analysis <tutorials/lexical_analysis>
AI Studio Notebook <tutorials/overview>

.. toctree::
:maxdepth: 2
Expand All @@ -69,9 +74,12 @@

.. toctree::
:maxdepth: 2
:caption: Community Contributions
:caption: Community Collaboration

How to Contribute Datasets <community/contribute_dataset>
How to Contribute Models <community/contribute_models>
How to Contribute Documentation and Examples <community/contribute_docs>
How to Join the PaddleNLP SIG <community/join_in_PaddleNLP-SIG>

.. toctree::
:maxdepth: 2
1 change: 0 additions & 1 deletion docs/installation.md

This file was deleted.

15 changes: 15 additions & 0 deletions docs/metrics/metrics.md
@@ -0,0 +1,15 @@
# PaddleNLP Metrics API

PaddleNLP currently provides the following model evaluation metrics:

| Metric | Description | API |
| ------ | ----------- | --- |
| [Perplexity](https://en.wikipedia.org/wiki/Perplexity) | Perplexity, commonly used to evaluate language models; also applicable to machine translation, text generation, and similar tasks. | `paddlenlp.metrics.Perplexity` |
| [BLEU (BiLingual Evaluation Understudy)](https://en.wikipedia.org/wiki/BLEU) | A standard evaluation metric for machine translation. | `paddlenlp.metrics.BLEU` |
| [ROUGE (Recall-Oriented Understudy for Gisting Evaluation)](https://en.wikipedia.org/wiki/ROUGE_(metric)) | A metric for evaluating automatic summarization and machine translation. | `paddlenlp.metrics.RougeL`, `paddlenlp.metrics.RougeN` |
| AccuracyAndF1 | Accuracy and F1-score, applicable to the MRPC and QQP tasks in GLUE. | `paddlenlp.metrics.AccuracyAndF1` |
| PearsonAndSpearman | Pearson and Spearman correlation coefficients, applicable to the STS-B task in GLUE. | `paddlenlp.metrics.PearsonAndSpearman` |
| MCC (Matthews correlation coefficient) | Matthews correlation coefficient, a measure of binary classification performance; applicable to the CoLA task in GLUE. | `paddlenlp.metrics.Mcc` |
| ChunkEvaluator | Computes precision, recall, and F1-score for chunk detection; commonly used in sequence labeling tasks such as named entity recognition (NER). | `paddlenlp.metrics.ChunkEvaluator` |
| SQuAD Evaluation | Evaluation metrics for SQuAD and DuReader-robust. | `paddlenlp.metrics.compute_predictions`, `paddlenlp.metrics.squad_evaluate` |
| [Distinct](https://arxiv.org/abs/1510.03055) | A diversity metric, commonly used to measure the surface-form diversity of sentences produced by text generation models. | `paddlenlp.metrics.Distinct` |
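
Several of these metrics follow the `compute()`/`update()`/`accumulate()` pattern of `paddle.metric.Metric`. Below is a minimal sketch using `Perplexity`, modeled on its docstring; the tensor shapes and call signatures are assumptions to verify against the API reference:

```python
import paddle
from paddlenlp.metrics import Perplexity

perplexity = Perplexity()
batch_size, seq_len, vocab_size = 2, 4, 8

# Random stand-in tensors; in practice these are language-model logits and gold labels.
logits = paddle.rand([batch_size, seq_len, vocab_size])
labels = paddle.randint(0, vocab_size, [batch_size, seq_len, 1])

# compute() returns the per-token cross entropy; update()/accumulate() aggregate it.
correct = perplexity.compute(logits, labels)
perplexity.update(correct.numpy())
print(perplexity.accumulate())  # perplexity over everything seen so far
```
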
11 changes: 0 additions & 11 deletions docs/model_zoo.rst

This file was deleted.

5 changes: 0 additions & 5 deletions docs/model_zoo/transformer.rst

This file was deleted.

85 changes: 85 additions & 0 deletions docs/model_zoo/transformers.md
@@ -0,0 +1,85 @@
# PaddleNLP Transformer API

With the rapid development of deep learning, a wave of high-quality Transformer pre-trained models has emerged in NLP, repeatedly setting new SOTA results across NLP tasks. PaddleNLP provides pre-trained models with the classic BERT, ERNIE, RoBERTa, and XLNet architectures, so developers can quickly and conveniently apply Transformer pre-trained models to their downstream tasks.


## Summary of Transformer Pre-trained Models

The table below summarizes the pre-trained models currently supported by PaddleNLP. With them, users can complete question answering, text classification, sequence labeling, text generation, and other tasks. We also provide 34 sets of pre-trained weights, 17 of which are for Chinese language models.

| Model | Tokenizer | Supported Model Classes | Pretrained Weights |
|---|---|---|---|
| [BERT](https://arxiv.org/abs/1810.04805) | BertTokenizer|BertModel<br> BertForQuestionAnswering<br> BertForSequenceClassification<br>BertForTokenClassification| `bert-base-uncased`<br> `bert-large-uncased` <br>`bert-base-multilingual-uncased` <br>`bert-base-cased`<br> `bert-base-chinese`<br> `bert-base-multilingual-cased`<br> `bert-large-cased`<br> `bert-wwm-chinese`<br> `bert-wwm-ext-chinese` |
|[ERNIE](https://arxiv.org/abs/1904.09223)|ErnieTokenizer<br>ErnieTinyTokenizer|ErnieModel<br> ErnieForQuestionAnswering<br> ErnieForSequenceClassification<br> ErnieForTokenClassification | `ernie-1.0`<br> `ernie-tiny`<br> `ernie-2.0-en`<br> `ernie-2.0-large-en`|
|[ERNIE-GEN](https://arxiv.org/abs/2001.11314)|ErnieTokenizer| ErnieForGeneration|`ernie-gen-base-en`<br>`ernie-gen-large-en`<br>`ernie-gen-large-en-430g`|
|[GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)| GPT2Tokenizer<br> GPT2ChineseTokenizer| GPT2ForGreedyGeneration| `gpt2-base-cn` <br> `gpt2-medium-en`|
|[RoBERTa](https://arxiv.org/abs/1907.11692)|RobertaTokenizer| RobertaModel<br>RobertaForQuestionAnswering<br>RobertaForSequenceClassification<br>RobertaForTokenClassification| `roberta-wwm-ext`<br> `roberta-wwm-ext-large`<br> `rbt3`<br> `rbtl3`|
|[ELECTRA](https://arxiv.org/abs/2003.10555) | ElectraTokenizer| ElectraModel<br>ElectraForSequenceClassification<br>ElectraForTokenClassification<br>|`electra-small`<br> `electra-base`<br> `electra-large`<br> `chinese-electra-small`<br> `chinese-electra-base`<br>|
|[XLNet](https://arxiv.org/abs/1906.08237)| XLNetTokenizer| XLNetModel<br> XLNetForSequenceClassification<br> XLNetForTokenClassification |`xlnet-base-cased`<br> `xlnet-large-cased`<br> `chinese-xlnet-base`<br> `chinese-xlnet-mid`<br> `chinese-xlnet-large`|
|[UnifiedTransformer](https://arxiv.org/abs/2006.16779)| UnifiedTransformerTokenizer| UnifiedTransformerModel<br> UnifiedTransformerLMHeadModel |`unified_transformer-12L-cn`<br> `unified_transformer-12L-cn-luge` |
|[Transformer](https://arxiv.org/abs/1706.03762) |- | TransformerModel | - |

**NOTE**: The Chinese pre-trained models among these are `bert-base-chinese, bert-wwm-chinese, bert-wwm-ext-chinese, ernie-1.0, ernie-tiny, gpt2-base-cn, roberta-wwm-ext, roberta-wwm-ext-large, rbt3, rbtl3, chinese-electra-base, chinese-electra-small, chinese-xlnet-base, chinese-xlnet-mid, chinese-xlnet-large, unified_transformer-12L-cn, unified_transformer-12L-cn-luge`.

## How to Use the Pre-trained Models

While offering this rich set of pre-trained models, the PaddleNLP Transformer API also keeps the barrier to entry low: a dozen or so lines of code are enough to load a model and fine-tune it on a downstream task.

```python
import paddle
from paddlenlp.datasets import load_dataset
from paddlenlp.transformers import BertForSequenceClassification, BertTokenizer

# Load the ChnSentiCorp sentiment classification dataset.
train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])

# Load the pre-trained model and its matching tokenizer by weight name.
model = BertForSequenceClassification.from_pretrained("bert-wwm-chinese", num_classes=len(train_ds.label_list))
tokenizer = BertTokenizer.from_pretrained("bert-wwm-chinese")

# Define train_dataloader from the dataset and tokenizer here
# (a sketch of this step follows the numbered list below).

optimizer = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters())

criterion = paddle.nn.CrossEntropyLoss()

for input_ids, token_type_ids, labels in train_dataloader:
    logits = model(input_ids, token_type_ids)
    loss = criterion(logits, labels)
    # Softmax gives class probabilities; handy for prediction, not needed for the loss.
    probs = paddle.nn.functional.softmax(logits, axis=1)
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()
```

The code above gives a brief example of using a pre-trained model:

1. Load a dataset: PaddleNLP has many datasets built in, and any of them can be imported with a single call.
2. Load a pre-trained model: PaddleNLP models are loaded with the `from_pretrained()` method. The first argument is a name from the **Pretrained Weights** column of the summary table and selects the corresponding pre-trained weights. Any additional arguments required by `BertForSequenceClassification.__init__`, such as `num_classes`, are also passed through `from_pretrained()`. The `Tokenizer` is loaded with the same `from_pretrained` method.
3. Use the tokenizer to turn the dataset into model inputs; a sketch of this step is given below.
4. Define the optimizer, loss function, and the rest of the training setup, and the model is ready for fine-tuning.
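
The dataloader step is elided in the example above. Here is a minimal sketch of one way to fill it in; the ChnSentiCorp fields `text` and `label`, the `max_seq_len` tokenizer argument, and the batch size are assumptions to adapt to your setup:

```python
from functools import partial

import paddle
from paddlenlp.data import Pad, Stack, Tuple

def convert_example(example, tokenizer, max_seq_len=128):
    # Tokenize raw text into input_ids and token_type_ids.
    encoded = tokenizer(text=example["text"], max_seq_len=max_seq_len)
    return encoded["input_ids"], encoded["token_type_ids"], example["label"]

# Apply the conversion to every example in the training set.
train_ds = train_ds.map(partial(convert_example, tokenizer=tokenizer))

# Pad each field to the longest sequence in the batch; stack the labels.
batchify_fn = Tuple(
    Pad(axis=0, pad_val=tokenizer.pad_token_id),       # input_ids
    Pad(axis=0, pad_val=tokenizer.pad_token_type_id),  # token_type_ids
    Stack(dtype="int64"),                              # labels
)

train_dataloader = paddle.io.DataLoader(
    train_ds, batch_size=32, shuffle=True, collate_fn=batchify_fn, return_list=True)
```
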


## Pre-trained Models by Task

This section groups the pre-trained models by the type of task they support: text classification, sequence labeling, question answering, text generation, machine translation, and so on.

| Task | Model | Pretrained Weights |
|---|---|---|
| Text Classification<br>SequenceClassification | BertForSequenceClassification <br> ErnieForSequenceClassification <br> RobertaForSequenceClassification <br> ElectraForSequenceClassification <br> XLNetForSequenceClassification | [see the table above](#summary-of-transformer-pre-trained-models) |
| Sequence Labeling<br>TokenClassification | BertForTokenClassification <br> ErnieForTokenClassification <br> RobertaForTokenClassification <br> ElectraForTokenClassification <br> XLNetForTokenClassification | [see the table above](#summary-of-transformer-pre-trained-models) |
| Question Answering<br>QuestionAnswering | BertForQuestionAnswering <br> ErnieForQuestionAnswering <br> RobertaForQuestionAnswering | [see the table above](#summary-of-transformer-pre-trained-models) |
| Text Generation<br>TextGeneration | ErnieForGeneration <br> GPT2ForGreedyGeneration | [see the table above](#summary-of-transformer-pre-trained-models) |
| Machine Translation<br>MachineTranslation | TransformerModel | [see the table above](#summary-of-transformer-pre-trained-models) |

You can swap in any other model from the table to handle the same type of task. For example, for the text classification task in [How to Use the Pre-trained Models](#how-to-use-the-pre-trained-models), you can replace `BertForSequenceClassification` with `ErnieForSequenceClassification` (as sketched below) to look for a better-suited pre-trained model.
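
A minimal sketch of such a swap, reusing `train_ds` from the earlier example (`ernie-1.0` is one of the Chinese weights listed above):

```python
from paddlenlp.transformers import ErnieForSequenceClassification, ErnieTokenizer

# Same task, different backbone: only the model and tokenizer lines change.
model = ErnieForSequenceClassification.from_pretrained(
    "ernie-1.0", num_classes=len(train_ds.label_list))
tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")
```
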


## References
- Some of the Chinese pre-trained models come from [ymcui/Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm), [ymcui/Chinese-XLNet](https://github.com/ymcui/Chinese-XLNet), [huggingface/xlnet_chinese_large](https://huggingface.co/clue/xlnet_chinese_large), and [Knover/luge-dialogue](https://github.com/PaddlePaddle/Knover/tree/luge-dialogue/luge-dialogue).
- Sun, Yu, et al. "Ernie: Enhanced representation through knowledge integration." arXiv preprint arXiv:1904.09223 (2019).
- Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
- Cui, Yiming, et al. "Pre-training with whole word masking for chinese bert." arXiv preprint arXiv:1906.08101 (2019).
- Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).
- Yang, Zhilin, et al. "Xlnet: Generalized autoregressive pretraining for language understanding." arXiv preprint arXiv:1906.08237 (2019).
- Clark, Kevin, et al. "Electra: Pre-training text encoders as discriminators rather than generators." arXiv preprint arXiv:2003.10555 (2020).
- Radford, Alec, et al. "Language models are unsupervised multitask learners." OpenAI blog 1.8 (2019): 9.
5 changes: 0 additions & 5 deletions docs/paddlenlp.rst

This file was deleted.

43 changes: 43 additions & 0 deletions docs/tutorials/overview.rst
@@ -0,0 +1,43 @@
========
Overview
========


Case Studies
------------

- Word Embeddings

  - `Improving Model Performance with Pre-trained Word Embeddings <https://aistudio.baidu.com/aistudio/projectdetail/1535355>`_

- Text Classification

  - `Text Classification with LSTM and Other RNNs <https://aistudio.baidu.com/aistudio/projectdetail/1283423>`_
  - `Text Classification with Pre-trained Models <https://aistudio.baidu.com/aistudio/projectdetail/1294333>`_
  - `Multi-class Text Classification on a Custom Dataset <https://aistudio.baidu.com/aistudio/projectdetail/1468469>`_

- Information Extraction

  - `Express Waybill Information Extraction with a BiGRU-CRF Model <https://aistudio.baidu.com/aistudio/projectdetail/1317771>`_
  - `Improving Waybill Information Extraction with the Pre-trained Model ERNIE <https://aistudio.baidu.com/aistudio/projectdetail/1329361>`_
  - `Relation Extraction <https://aistudio.baidu.com/aistudio/projectdetail/1639963>`_
  - `Event Extraction <https://aistudio.baidu.com/aistudio/projectdetail/1639964>`_

- Reading Comprehension QA

  - `Reading Comprehension with Pre-trained Models <https://aistudio.baidu.com/aistudio/projectdetail/1339612>`_

- Dialogue

  - `Multi-skill Dialogue <https://aistudio.baidu.com/aistudio/projectdetail/1640180>`_

- Text Generation

  - `Automatic Couplet Generation with a Seq2Seq Model <https://aistudio.baidu.com/aistudio/projectdetail/1321118>`_
  - `Writing Poetry with the Pre-trained Model ERNIE-GEN <https://aistudio.baidu.com/aistudio/projectdetail/1339888>`_

- Time Series Forecasting

  - `Predicting COVID-19 Case Counts with a TCN <https://aistudio.baidu.com/aistudio/projectdetail/1290873>`_

For more tutorials, see `PaddleNLP on AI Studio <https://aistudio.baidu.com/aistudio/personalcenter/thirdview/574995>`_
3 changes: 0 additions & 3 deletions docs/version.rst

This file was deleted.
