- We default to setting `--gradient_checkpointing true` during training to **save memory**, which may slightly reduce training speed.
- If you want to use the quantization parameter `--quantization_bit 4`, you first need to install [bnb](https://github.com/TimDettmers/bitsandbytes): `pip install bitsandbytes -U`. This reduces memory usage but usually slows down training.
- If you want to use quantization based on **auto_gptq**, you need to install the corresponding cuda version of [auto_gptq](https://github.com/PanQiWei/AutoGPTQ): `pip install auto_gptq -U`.
> Models that can use auto_gptq can be viewed in [LLM Supported Models](Supported-models-datasets.md#models). It is recommended to use auto_gptq instead of bnb.
- If you want to use deepspeed, you need `pip install deepspeed -U`. Using deepspeed can **save memory**, but may slightly reduce training speed.
- If your training involves **knowledge editing**, such as [Self-aware Fine-tuning](Self-cognition-best-practice.md), you need to add LoRA to the MLP layers as well, otherwise the results may be poor. You can simply pass the argument `--lora_target_modules ALL` to add LoRA to all linear layers (qkvo and MLP), **which usually gives the best results**.
- If you are using older GPUs like **V100**, you need to set `--dtype AUTO` or `--dtype fp16`, as they do not support bf16.
- If your machine has high-performance GPUs like the A100 and the model supports flash-attn, it is recommended to install [**flash-attn**](https://github.com/Dao-AILab/flash-attention), which speeds up training and inference and reduces memory usage (A10, 3090, V100, and similar GPUs do not support training with flash-attn). Models that support flash-attn can be viewed in [LLM Supported Models](Supported-models-datasets.md#models).
- If you are doing **continued pre-training** or **multi-turn dialogue** training, you can refer to [Customization and Extension](Customization.md#Registering-Datasets).
- If you need to train **offline**, please use `--model_id_or_path <model_dir>` and set `--check_model_is_latest false`. For specific parameter meanings, please check [Command-line Parameters](Command-line-parameters.md).
- If you want to push weights to the ModelScope Hub during training, you need to set `--push_to_hub true`.
- If you want to merge LoRA weights and save them during inference, you need to set `--merge_lora true`. **Merging is not recommended** for models trained with qlora, as this results in precision loss; for the same reason, **fine-tuning with qlora is not recommended**, since its deployment ecosystem is poor. A combined sketch of these training flags follows this list.
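The flags above can be combined in a single training command. Below is a minimal sketch, not a recipe from this document: `<your-dataset>` is a placeholder, and `--sft_type lora` is an assumption about the argument set of your installed swift version; the remaining flags are the ones discussed in the list.

```bash
# A hedged sketch combining the flags discussed above; adjust to your setup.
# <your-dataset> is a placeholder; --sft_type lora is an assumed argument name.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen1half-4b-chat \
    --dataset <your-dataset> \
    --sft_type lora \
    --lora_target_modules ALL \
    --gradient_checkpointing true \
    --dtype AUTO
```

For qlora, you would additionally pass `--quantization_bit 4` (with bitsandbytes installed); to push weights to the ModelScope Hub during training, add `--push_to_hub true`.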
## Quantization
For quantization of the fine-tuned model, you can check the [LLM Quantization Documentation](LLM-quantization.md#fine-tuned-model).
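As orientation before following that link: post-training quantization is driven by `swift export`. The sketch below is hedged; the checkpoint path is a hypothetical placeholder, and `--quant_bits`/`--quant_method` are assumptions about the parameter names in your installed version.

```bash
# A hedged sketch; see LLM-quantization.md for the authoritative steps.
# The checkpoint path is a placeholder; --quant_bits/--quant_method are assumed names.
CUDA_VISIBLE_DEVICES=0 swift export \
    --ckpt_dir output/qwen1half-4b-chat/vx-xxx/checkpoint-xxx \
    --quant_bits 4 \
    --quant_method awq
```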
## Inference
If you want to use VLLM for accelerated inference, you can check [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md).
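For a first orientation, here is a hedged sketch of what vllm-backed inference can look like; `--infer_backend vllm` is an assumption about the argument set of your installed version, and the linked document remains the reference.

```bash
# A hedged sketch; see the VLLM document for the supported options.
# --infer_backend vllm is an assumed argument name.
CUDA_VISIBLE_DEVICES=0 swift infer \
    --model_type qwen1half-4b-chat \
    --infer_backend vllm
```

For a LoRA-trained checkpoint, you would typically merge the weights first (`--merge_lora true`, discussed above), since vllm serves a single merged model.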
*From docs/source_en/LLM/LLM-inference.md:*
# LLM Inference Documentation
If you want to use vllm for inference acceleration, you can check out [VLLM Inference Acceleration and Deployment](VLLM-inference-acceleration-and-deployment.md#inference-acceleration).
*From docs/source_en/LLM/RLHF.md:*
- We default to setting `--gradient_checkpointing true` during training to **save memory**, which will slightly reduce training speed.
- If you are using older GPUs such as **V100**, you need to set `--dtype AUTO` or `--dtype fp16`, because they do not support bf16.
- If your machine has high-performance GPUs like the A100 and you are using the qwen series models, we recommend installing [**flash-attn**](https://github.com/Dao-AILab/flash-attention), which speeds up training and inference and reduces memory usage (A10, 3090, V100, and similar GPUs do not support training with flash-attn). Models that support flash-attn can be viewed in [LLM Supported Models](Supported-models-datasets.md#models).
- If you need to train offline, please use `--model_id_or_path <model_dir>` and set `--check_model_is_latest false`; a hedged sketch follows this list. For specific parameter meanings, please see [Command Line Arguments](Command-line-parameters.md).
- If you want to push weights to the ModelScope Hub during training, you need to set `--push_to_hub true`.
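As referenced in the offline item above, a hedged sketch of an offline run: `<model_dir>` is your local model path as in the text, `swift sft` stands in for your actual training entrypoint, and `--use_flash_attn true` is an assumed flag name for enabling flash-attn.

```bash
# A hedged offline-training sketch; <model_dir> is a local model path.
# swift sft stands in for the actual training entrypoint;
# --use_flash_attn true is an assumed flag name.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_id_or_path <model_dir> \
    --check_model_is_latest false \
    --use_flash_attn true
```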
*From docs/source_en/LLM/Self-cognition-best-practice.md:*
# Best Practices for Self-Cognition Fine-Tuning

Fine-tune your own large model in just 10 minutes!

## Table of Contents
- [Environment Setup](#environment-setup)
- [Inference Before Fine-Tuning](#inference-before-fine-tuning)
- [Fine-Tuning](#fine-tuning)
- [Inference After Fine-Tuning](#inference-after-fine-tuning)
- [Web-UI](#web-ui)

## Environment Setup
```bash
# Set up the global pip mirror (for faster downloading)
pip config set global.index-url https://mirrors.aliyun.com/pypi/simple/
# Install ms-swift
pip install 'ms-swift[llm]' -U

# Align the environment (usually not necessary; if you get an error, you can run the following, which has been tested with the latest environment in the repository)
pip install -r requirements/framework.txt -U
pip install -r requirements/llm.txt -U
```

## Inference Before Fine-Tuning

If you want to perform single-sample inference, you can refer to the [LLM Inference Documentation](LLM-inference.md#qwen-7b-chat).

Using CLI:
```bash
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen1half-4b-chat
```
## Fine-Tuning
Note: Self-cognition training involves knowledge editing, so it is recommended to include the **MLP** layers in `lora_target_modules`. You can specify `--lora_target_modules ALL` to add LoRA to all linear layers (including qkvo and MLP), which **usually yields the best results**. A hedged command sketch follows.
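A minimal hedged sketch of such a run, reusing the qwen1half-4b-chat model type from the inference example above; the self-cognition options (`--self_cognition_sample`, `--model_name`, `--model_author`) are assumptions about the argument set of your installed swift version, and the placeholder values are yours to fill in.

```bash
# A hedged sketch; the self-cognition flags are assumed names, so check the
# command-line documentation of your installed version before relying on them.
CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen1half-4b-chat \
    --lora_target_modules ALL \
    --self_cognition_sample 500 \
    --model_name <bot-name-zh> <bot-name-en> \
    --model_author <author-zh> <author-en>
```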