- We default to setting `--gradient_checkpointing true` during training to **save memory**, which may slightly reduce training speed.
- If you want to use the quantization parameter `--quantization_bit 4`, you first need to install [bnb](https://github.com/TimDettmers/bitsandbytes): `pip install bitsandbytes -U`. This reduces memory usage but usually slows down training.
- If you want to use quantization based on **auto_gptq**, you need to install the corresponding cuda version of [auto_gptq](https://github.com/PanQiWei/AutoGPTQ): `pip install auto_gptq -U`.
> Models that can use auto_gptq can be viewed in [LLM Supported Models](Supported-models-datasets.md#models). It is recommended to use auto_gptq instead of bnb.
- If you want to use deepspeed, you need `pip install deepspeed -U`. Using deepspeed can **save memory**, but may slightly reduce training speed.
- If your training involves **knowledge editing**, such as [Self-aware Fine-tuning](Self-cognition-best-practice.md), you need to add LoRA to the MLP layers as well, otherwise the results may be poor. You can simply pass the argument `--lora_target_modules ALL` to add LoRA to all linear layers (qkvo and MLP), **which usually gives the best results**.
- If you are using older GPUs like **V100**, you need to set `--dtype AUTO` or `--dtype fp16`, as they do not support bf16.
- If your machine has high-performance GPUs like the A100 and the model supports flash-attn, it is recommended to install [**flash-attn**](https://github.com/Dao-AILab/flash-attention), which speeds up training and inference and reduces memory usage (A10, 3090, V100, and similar GPUs do not support training with flash-attn). Models that support flash-attn can be viewed in [LLM Supported Models](Supported-models-datasets.md#models).
- If you are doing **continued pre-training** or **multi-turn dialogue** training, you can refer to [Customization and Extension](Customization.md#Registering-Datasets).
- If you need to train **offline**, please use `--model_id_or_path <model_dir>` and set `--check_model_is_latest false`. For specific parameter meanings, please check [Command-line Parameters](Command-line-parameters.md).
- If you want to push weights to the ModelScope Hub during training, you need to set `--push_to_hub true`.
- If you want to merge LoRA weights and save them during inference, you need to set `--merge_lora true`. **Merging is not recommended** for models trained with qlora, as this results in precision loss; for the same reason, **fine-tuning with qlora is not recommended**, since its deployment ecosystem is poor. A combined sketch of these training flags follows this list.
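The flags above can be combined in a single training command. Below is a minimal sketch, not a recipe from this document: `<your-dataset>` is a placeholder, and `--sft_type lora` is an assumption about the argument set of your installed swift version; the remaining flags are the ones discussed in the list.

```bash
# A hedged sketch combining the flags discussed above; adjust to your setup.
# <your-dataset> is a placeholder; --sft_type lora is an assumed argument name.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen1half-4b-chat \
    --dataset <your-dataset> \
    --sft_type lora \
    --lora_target_modules ALL \
    --gradient_checkpointing true \
    --dtype AUTO
```

For qlora, you would additionally pass `--quantization_bit 4` (with bitsandbytes installed); to push weights to the ModelScope Hub during training, add `--push_to_hub true`.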
## Quantization
For quantization of the fine-tuned model, you can check the [LLM Quantization Documentation](LLM-quantization.md#fine-tuned-model).
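As orientation before following that link: post-training quantization is driven by `swift export`. The sketch below is hedged; the checkpoint path is a hypothetical placeholder, and `--quant_bits`/`--quant_method` are assumptions about the parameter names in your installed version.

```bash
# A hedged sketch; see LLM-quantization.md for the authoritative steps.
# The checkpoint path is a placeholder; --quant_bits/--quant_method are assumed names.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx \
    --quant_bits 4 \
    --quant_method awq
```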
## Inference
If you want to use VLLM for accelerated inference, you can check [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md).
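For a first orientation, here is a hedged sketch of what vllm-backed inference can look like; `--infer_backend vllm` is an assumption about the argument set of your installed version, and the linked document remains the reference.

```bash
# A hedged sketch; see the VLLM document for the supported options.
# --infer_backend vllm is an assumed argument name.
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model_type qwen1half-4b-chat \
    --infer_backend vllm
```

For a LoRA-trained checkpoint, you would typically merge the weights first (`--merge_lora true`, discussed above), since vllm serves a single merged model.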
*From docs/source_en/LLM/LLM-inference.md:*
# LLM Inference Documentation
If you want to use vllm for inference acceleration, you can check out [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md#inference-acceleration).
*From docs/source_en/LLM/RLHF.md:*
- We default to setting `--gradient_checkpointing true` during training to **save memory**, which will slightly reduce training speed.
- If you are using older GPUs such as **V100**, you need to set `--dtype AUTO` or `--dtype fp16`, because they do not support bf16.
- If your machine has high-performance GPUs like the A100 and you are using the qwen series models, we recommend installing [**flash-attn**](https://github.com/Dao-AILab/flash-attention), which speeds up training and inference and reduces memory usage (A10, 3090, V100, and similar GPUs do not support training with flash-attn). Models that support flash-attn can be viewed in [LLM Supported Models](Supported-models-datasets.md#models).
- If you need to train offline, please use `--model_id_or_path <model_dir>` and set `--check_model_is_latest false`; a hedged sketch follows this list. For specific parameter meanings, please see [Command Line Arguments](Command-line-parameters.md).
- If you want to push weights to the ModelScope Hub during training, you need to set `--push_to_hub true`.
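As referenced in the offline item above, a hedged sketch of an offline run: `<model_dir>` is your local model path as in the text, `swift sft` stands in for your actual training entrypoint, and `--use_flash_attn true` is an assumed flag name for enabling flash-attn.

```bash
# A hedged offline-training sketch; <model_dir> is a local model path.
# swift sft stands in for the actual training entrypoint;
# --use_flash_attn true is an assumed flag name.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_id_or_path <model_dir> \
    --check_model_is_latest false \
    --use_flash_attn true
```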
*From docs/source_en/LLM/Self-cognition-best-practice.md:*
# Best Practices for Self-Cognition Fine-Tuning

Fine-tune your own large model in just 10 minutes!

## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference Before Fine-Tuning](#inference-before-fine-tuning)
- [Fine-Tuning](#fine-tuning)
- [Inference After Fine-Tuning](#inference-after-fine-tuning)
- [Web-UI](#web-ui)

## Environment Setup
```bash
# Set up the global pip mirror (for faster downloading)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
pip install 'ms-swift[llm]' -U

# Align the environment (usually not necessary; if you get an error, you can run the following, which has been tested with the latest environment in the repository)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```

## Inference Before Fine-Tuning

If you want to perform single-sample inference, you can refer to the [LLM Inference Documentation](LLM-inference.md#qwen-7b-chat).

Using CLI:
```bash
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-4b-chat
```
## Fine-Tuning
Note: Self-cognition training involves knowledge editing, so it is recommended to include the **MLP** layers in `lora_target_modules`. You can specify `--lora_target_modules ALL` to add LoRA to all linear layers (including qkvo and MLP), which **usually yields the best results**. A hedged command sketch follows.
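A minimal hedged sketch of such a run, reusing the qwen1half-4b-chat model type from the inference example above; the self-cognition options (`--self_cognition_sample`, `--model_name`, `--model_author`) are assumptions about the argument set of your installed swift version, and the placeholder values are yours to fill in.

```bash
# A hedged sketch; the self-cognition flags are assumed names, so check the
# command-line documentation of your installed version before relying on them.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen1half-4b-chat \
    --lora_target_modules ALL \
    --self_cognition_sample 500 \
    --model_name <bot-name-zh> <bot-name-en> \
    --model_author <author-zh> <author-en>
```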