
Commit f9099f3
Commit message: update
Parent: bf69b51

2 files changed (+12, -34 lines)

litgpt/data/tinyllama.py

Lines changed: 6 additions & 6 deletions
```diff
@@ -35,9 +35,9 @@ class TinyLlama(DataModule):
 
     def __post_init__(self):
         # Could be a remote path (s3://) or a local path
-        self.slimpajama_train = str(self.data_path).rstrip("/") + "/slimpajama/train"
-        self.slimpajama_val = str(self.data_path).rstrip("/") + "/slimpajama/val"
-        self.starcoder_train = str(self.data_path).rstrip("/") + "/starcoder"
+        self.slimpajama_train = os.path.join(str(self.data_path), "slimpajama", "train")
+        self.slimpajama_val = os.path.join(str(self.data_path), "slimpajama", "val")
+        self.starcoder_train = os.path.join(str(self.data_path), "starcoder")
 
     def connect(
         self, tokenizer: Optional[Tokenizer] = None, batch_size: int = 1, max_seq_length: Optional[int] = None
@@ -57,17 +57,17 @@ def prepare_data(self) -> None:
         # )
 
         prepare_slimpajama(
-            input_dir=os.path.join(self.data_path, "SlimPajama-627B/train"),
+            input_dir=os.path.join(self.data_path, "slimpajama-raw/train"),
             output_dir=self.slimpajama_train,
             tokenizer=self.tokenizer,
         )
         prepare_slimpajama(
-            input_dir=os.path.join(self.data_path, "SlimPajama-627B/validation"),
+            input_dir=os.path.join(self.data_path, "slimpajama-raw/validation"),
             output_dir=self.slimpajama_val,
             tokenizer=self.tokenizer,
         )
         prepare_starcoder(
-            input_dir=os.path.join(self.data_path, "starcoderdata"),
+            input_dir=os.path.join(self.data_path, "starcoderdata-raw"),
             output_dir=self.starcoder_train,
             tokenizer=self.tokenizer,
        )
```
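
As an aside, the switch from manual string concatenation to `os.path.join` can be illustrated with a small standalone sketch. The `data/` value below is only a hypothetical example, not taken from the commit, and the remote `s3://` case mentioned in the module's comment is not exercised here:

```python
import os

# Hypothetical local data_path with a trailing slash.
data_path = "data/"

# Old construction: strip any trailing slash, then concatenate manually.
old_train = str(data_path).rstrip("/") + "/slimpajama/train"

# New construction: let os.path.join insert separators as needed.
new_train = os.path.join(str(data_path), "slimpajama", "train")

print(old_train)  # data/slimpajama/train
print(new_train)  # data/slimpajama/train
```

Both forms yield the same local path here; the join-based form simply avoids hand-rolled separator handling.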

tutorials/pretrain_tinyllama.md

Lines changed: 6 additions & 28 deletions
````diff
@@ -52,7 +52,7 @@ In order to start pretraining litgpt on it, you need to read, tokenize, and writ
 First, install additional dependencies for preprocessing:
 
 ```bash
-pip install '.[all]'
+pip install litgpt '.[all]'
 ```
 
 You will need to have the tokenizer config available:
@@ -64,38 +64,16 @@ litgpt download \
   --tokenizer_only true
 ```
 
-Then, run the preprocessing script for each dataset and split.
-You will require **1.1 TB** of disk space for Starcoder and **2.5** TB of space for the SlimPajama dataset.
-
-**Starcoder:**
-
-```bash
-python litgpt/data/prepare_starcoder.py \
-  --input_dir data/starcoderdata-raw \
-  --output_dir data/starcoder \
-  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
-```
-
-**SlimPajama:**
+Then, run the preprocessing command by pointing to the directory where the data was downloaded.
+You will require and additional **1.1 TB** of disk space for Starcoder and **2.5** TB of space for the SlimPajama dataset.
 
 ```bash
-python litgpt/data/prepare_slimpajama.py \
-  --input_dir data/slimpajama-raw/validation \
-  --output_dir data/slimpajama/val \
-  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
-
-python litgpt/data/prepare_slimpajama.py \
-  --input_dir data/slimpajama-raw/test \
-  --output_dir data/slimpajama/test \
-  --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
-
-python litgpt/data/prepare_slimpajama.py \
-  --input_dir data/slimpajama-raw/train \
-  --output_dir data/slimpajama/train \
+litgpt prepare \
+  --data TinyLlama \
+  --data.data_path data \
   --tokenizer_path checkpoints/meta-llama/Llama-2-7b-hf
 ```
 
-If you want to run on a small slice of the datasets first, pass the flag `--fast_dev_run=true` to the commands above.
 In the above we are assuming that you will be using the same tokenizer as used in LlaMA/TinyLlama, but any trained [SentencePiece](https://github.com/google/sentencepiece) tokenizer with a 32000 vocabulary size will do here.
 
````

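For readers who prefer to drive the preprocessing from Python rather than the `litgpt prepare` command shown in the updated tutorial, a minimal sketch built around the data module touched by this commit could look as follows. It assumes `TinyLlama` is importable from `litgpt.data` and `Tokenizer` from `litgpt.tokenizer`, and that the raw dumps sit under `data/slimpajama-raw` and `data/starcoderdata-raw` as the tutorial describes; treat it as an illustration, not the documented interface:

```python
# Illustrative sketch only: drives the same preprocessing as `litgpt prepare`
# via the data module changed in this commit. Import paths and constructor
# arguments are assumptions, not taken from the diff.
from litgpt.data import TinyLlama
from litgpt.tokenizer import Tokenizer

# Raw downloads expected under data/slimpajama-raw/{train,validation} and
# data/starcoderdata-raw; tokenized shards are written to
# data/slimpajama/{train,val} and data/starcoder, per the paths built in
# __post_init__ and prepare_data above.
data = TinyLlama(data_path="data")

# The tokenizer checkpoint matches the one downloaded in the tutorial.
tokenizer = Tokenizer("checkpoints/meta-llama/Llama-2-7b-hf")

# connect() stores the tokenizer on the module; prepare_data() then tokenizes
# the raw text and writes the binary output directories.
data.connect(tokenizer=tokenizer)
data.prepare_data()
```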