Commit ade1c1b

Commit message: update
1 parent 22803c0 commit ade1c1b

2 files changed (+12 -34 lines)

Diff for: litgpt/data/tinyllama.py (+6 -6)

@@ -35,9 +35,9 @@ class TinyLlama(LitDataModule):
 
     def __post_init__(self):
         # Could be a remote path (s3://) or a local path
-        self.slimpajama_train = str(self.data_path).rstrip("/") + "/slimpajama/train"
-        self.slimpajama_val = str(self.data_path).rstrip("/") + "/slimpajama/val"
-        self.starcoder_train = str(self.data_path).rstrip("/") + "/starcoder"
+        self.slimpajama_train = os.path.join(str(self.data_path), "slimpajama", "train")
+        self.slimpajama_val = os.path.join(str(self.data_path), "slimpajama", "val")
+        self.starcoder_train = os.path.join(str(self.data_path), "starcoder")
 
     def connect(
         self,
@@ -60,17 +60,17 @@ def prepare_data(self) -> None:
         # )
 
         prepare_slimpajama(
-            input_dir=os.path.join(self.data_path, "SlimPajama-627B/train"),
+            input_dir=os.path.join(self.data_path, "slimpajama-raw/train"),
             output_dir=self.slimpajama_train,
             tokenizer=self.tokenizer,
         )
         prepare_slimpajama(
-            input_dir=os.path.join(self.data_path, "SlimPajama-627B/validation"),
+            input_dir=os.path.join(self.data_path, "slimpajama-raw/validation"),
             output_dir=self.slimpajama_val,
             tokenizer=self.tokenizer,
        )
         prepare_starcoder(
-            input_dir=os.path.join(self.data_path, "starcoderdata"),
+            input_dir=os.path.join(self.data_path, "starcoderdata-raw"),
             output_dir=self.starcoder_train,
             tokenizer=self.tokenizer,
         )
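
For context on the path change in `__post_init__`, here is a minimal standalone sketch (not litgpt code; the `data_path` value is a made-up example) of what the new `os.path.join`-based construction produces:

```python
import os

# Hypothetical root directory; per the module comment it could also be a
# remote prefix such as "s3://my-bucket/data".
data_path = "data"

# Same construction as in TinyLlama.__post_init__ above.
slimpajama_train = os.path.join(str(data_path), "slimpajama", "train")
slimpajama_val = os.path.join(str(data_path), "slimpajama", "val")
starcoder_train = os.path.join(str(data_path), "starcoder")

print(slimpajama_train)  # data/slimpajama/train (on POSIX systems)
print(slimpajama_val)    # data/slimpajama/val
print(starcoder_train)   # data/starcoder
```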

Diff for: tutorials/pretrain_tinyllama.md (+6 -28)

@@ -49,7 +49,7 @@ In order to start pretraining litgpt on it, you need to read, tokenize, and writ
 First, install additional dependencies for preprocessing:
 
 ```bash
-pip install '.[all]'
+pip install litgpt '.[all]'
 ```
 
 You will need to have the tokenizer config available:
@@ -61,38 +61,16 @@ litgpt download \
   --tokenizer_only true
 ```
 
-Then, run the preprocessing script for each dataset and split.
-You will require **1.1 TB** of disk space for Starcoder and **2.5** TB of space for the SlimPajama dataset.
-
-**Starcoder:**
-
-```bash
-python litgpt/data/prepare_starcoder.py \
-  --input_dir data/starcoderdata-raw \
-  --output_dir data/starcoder \
-  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
-```
-
-**SlimPajama:**
+Then, run the preprocessing command by pointing to the directory where the data was downloaded.
+You will require an additional **1.1 TB** of disk space for Starcoder and **2.5 TB** of space for the SlimPajama dataset.
 
 ```bash
-python litgpt/data/prepare_slimpajama.py \
-  --input_dir data/slimpajama-raw/validation \
-  --output_dir data/slimpajama/val \
-  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
-
-python litgpt/data/prepare_slimpajama.py \
-  --input_dir data/slimpajama-raw/test \
-  --output_dir data/slimpajama/test \
-  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
-
-python litgpt/data/prepare_slimpajama.py \
-  --input_dir data/slimpajama-raw/train \
-  --output_dir data/slimpajama/train \
+litgpt prepare \
+  --data TinyLlama \
+  --data.data_path data \
   --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
 ```
 
-If you want to run on a small slice of the datasets first, pass the flag `--fast_dev_run=true` to the commands above.
 In the above we are assuming that you will be using the same tokenizer as used in LlaMA/TinyLlama, but any trained [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a 32000 vocabulary size will do here.
 
 ## Pretraining
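
The consolidated `litgpt prepare` command above drives the `TinyLlama` data module shown in the first diff. As a rough sketch only (the import locations, `Tokenizer` class, and `connect` signature are assumptions, not confirmed by this commit), that invocation corresponds to something like:

```python
from pathlib import Path

from litgpt.data import TinyLlama       # assumed import location
from litgpt.tokenizer import Tokenizer  # assumed import location

# Point data_path at the directory holding the downloaded raw datasets
# (slimpajama-raw/ and starcoderdata-raw/, per prepare_data in the first diff).
data = TinyLlama(data_path=Path("data"))
data.connect(tokenizer=Tokenizer(Path("checkpoints/meta-llama/Llama-2-7b-hf")))

# Tokenizes the raw data into data/slimpajama/{train,val} and data/starcoder.
data.prepare_data()
```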
