A Comprehensive Overview of Large Language Models
Humza Naveed (a), Asad Ullah Khan (a,∗), Shi Qiu (b,∗), Muhammad Saqib (c,d,∗), Saeed Anwar (e,f), Muhammad Usman (e,f), Naveed Akhtar (g,i),
Nick Barnes (h), Ajmal Mian (i)
(a) University of Engineering and Technology (UET), Lahore, Pakistan
(b) The Chinese University of Hong Kong (CUHK), HKSAR, China
(c) University of Technology Sydney (UTS), Sydney, Australia
(d) Commonwealth Scientific and Industrial Research Organisation (CSIRO), Sydney, Australia
(e) King Fahd University of Petroleum and Minerals (KFUPM), Dhahran, Saudi Arabia
(f) SDAIA-KFUPM Joint Research Center for Artificial Intelligence (JRCAI), Dhahran, Saudi Arabia
(g) The University of Melbourne (UoM), Melbourne, Australia
(h) Australian National University (ANU), Canberra, Australia
(i) The University of Western Australia (UWA), Perth, Australia
Abstract
Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and
beyond. This success of LLMs has led to a large influx of research contributions in this direction. These works encompass diverse
topics such as architectural innovations, better training strategies, context length improvements, fine-tuning, multi-modal LLMs,
robotics, datasets, benchmarking, efficiency, and more. With the rapid development of techniques and regular breakthroughs in
LLM research, it has become considerably challenging to perceive the bigger picture of the advances in this direction. Considering
the rapidly emerging plethora of literature on LLMs, it is imperative that the research community is able to benefit from a concise
yet comprehensive overview of the recent developments in this field. This article provides an overview of the existing literature
on a broad range of LLM-related concepts. Our self-contained comprehensive overview of LLMs discusses relevant background
concepts along with covering the advanced topics at the frontier of research in LLMs. This review article is intended to not only
provide a systematic survey but also a quick comprehensive reference for the researchers and practitioners to draw insights from
extensive informative summaries of the existing works to advance the LLM research.
Keywords:
Large Language Models, LLMs, chatGPT, Augmented LLMs, Multimodal LLMs, LLM training, LLM Benchmarking
1. Introduction
Language plays a fundamental role in facilitating commu-
nication and self-expression for humans, and their interaction
with machines. The need for generalized models stems from
the growing demand for machines to handle complex language
tasks, including translation, summarization, information re-
trieval, conversational interactions, etc. Recently, significant
breakthroughs have been witnessed in language models, pri-
marily attributed to transformers [1], increased computational
capabilities, and the availability of large-scale training data.
These developments have brought about a revolutionary trans-
formation by enabling the creation of LLMs that can approxi-
mate human-level performance on various tasks [2, 3]. Large
∗Equal contribution
Email addresses: [email protected] (Humza Naveed),
[email protected] (Asad Ullah Khan), [email protected] (Shi
Qiu), [email protected] (Muhammad Saqib),
[email protected] (Saeed Anwar),
[email protected] (Muhammad Usman),
[email protected] (Naveed Akhtar),
[email protected] (Nick Barnes), [email protected]
(Ajmal Mian)
Figure 1: The trend of papers released over years containing keywords "Large
Language Model", "Large Language Model + Fine-Tuning", and "Large Lan-
guage Model + Alignment".
[Figure 2: a year-by-year timeline (2019–2024) of LLM releases, including T5, GPT-3, mT5, PanGu-α, Gopher, ChatGPT, LLaMA, GPT-4, PaLM-2, Claude, Code Llama, and Gemini, among others.]
Figure 2: Chronological display of LLM releases: blue cards represent ‘pre-trained’ models, while orange cards correspond to ‘instruction-tuned’ models. Models
on the upper half signify open-source availability, whereas those on the bottom half are closed-source. The chart illustrates the increasing trend towards instruction-
tuned models and open-source models, highlighting the evolving landscape and trends in natural language processing research.
Language Models (LLMs) have emerged as cutting-edge arti-
ficial intelligence systems that can process and generate text
with coherent communication [4], and generalize to multiple
tasks [5, 6].
The historical progress in natural language processing (NLP)
evolved from statistical to neural language modeling and then
from pre-trained language models (PLMs) to LLMs.
While
conventional language modeling (LM) trains task-specific mod-
els in supervised settings, PLMs are trained in a self-supervised
setting on a large corpus of text [7, 8, 9] with the aim of learning
a generic representation that is shareable among various NLP
tasks. After fine-tuning for downstream tasks, PLMs surpass
the performance gains of traditional language modeling (LM).
The larger PLMs bring more performance gains, which has led
to the transitioning of PLMs to LLMs by significantly increas-
ing model parameters (tens to hundreds of billions) [10] and
training dataset (many GBs and TBs) [10, 11]. Following this
development, numerous LLMs have been proposed in the lit-
erature [10, 11, 12, 6, 13, 14, 15]. An increasing trend in the
number of released LLMs and names of a few significant LLMs
proposed over the years are shown in Fig 1 and Fig 2, respec-
tively.
The early work on LLMs, such as T5 [10] and mT5 [11] em-
ployed transfer learning until GPT-3 [6] showed LLMs are
zero-shot transferable to downstream tasks without fine-tuning.
LLMs accurately respond to task queries when prompted with
task descriptions and examples. However, pre-trained LLMs
fail to follow user intent and perform worse in zero-shot set-
tings than in few-shot.
Fine-tuning them with task instruc-
tions data [16, 17, 18, 19] and aligning with human prefer-
ences [20, 21] enhances generalization to unseen tasks, im-
proving zero-shot performance significantly and reducing mis-
aligned behavior.
In addition to better generalization and domain adaptation,
LLMs appear to have emergent abilities, such as reasoning,
planning, decision-making, in-context learning, answering in
zero-shot settings, etc.
These abilities are known to be ac-
quired by them due to their gigantic scale even when the pre-
trained LLMs are not trained specifically to possess these at-
tributes [22, 23, 24]. Such abilities have led LLMs to be widely
adopted in diverse settings including, multi-modal, robotics,
tool manipulation, question answering, autonomous agents, etc.
Various improvements have also been suggested in these areas
either by task-specific training [25, 26, 27, 28, 29, 30, 31] or
better prompting [32].
The LLMs' abilities to solve diverse tasks with human-level
performance come at a cost of slow training and inference,
extensive hardware requirements, and higher running costs.
Such requirements have limited their adoption and opened up
opportunities to devise better architectures [15, 33, 34, 35]
and training strategies [36, 37, 21, 38, 39, 40, 41].
Param-
eter efficient tuning [38, 41, 40], pruning [42, 43], quantiza-
tion [44, 45], knowledge distillation, and context length inter-
polation [46, 47, 48, 49] among others are some of the methods
widely studied for efficient LLM utilization.
Due to the success of LLMs on a wide variety of tasks, the
research literature has recently experienced a large influx of
LLM-related contributions.
Researchers have organized the
LLMs literature in surveys [50, 51, 52, 53], and topic-specific
surveys in [54, 55, 56, 57, 58]. In contrast to these surveys, our
contribution focuses on providing a comprehensive yet concise
overview of the general direction of LLM research. This arti-
cle summarizes architectural and training details of pre-trained
LLMs and delves deeper into the details of concepts like fine-
tuning, multi-modal LLMs, augmented LLMs, datasets, eval-
uation, applications, challenges, and others to provide a self-
contained comprehensive overview. Our key contributions are
summarized as follows.
• We present a survey on the developments in LLM research
providing a concise comprehensive overview of the direc-
tion.
• We present extensive summaries of pre-trained models that
include fine-grained details of architecture and training de-
tails.
• We summarize major findings of the popular contributions
and provide a detailed discussion on the key design and
development aspects of LLMs to help practitioners effec-
tively leverage this technology.
• In this self-contained article, we cover a range of con-
cepts to present the general direction of LLMs compre-
hensively, including background, pre-training, fine-tuning,
multi-modal LLMs, augmented LLMs, LLMs-powered
agents, datasets, evaluation, etc.
Figure 3: A broader overview of LLMs, dividing LLMs into seven branches: 1. Pre-Training 2. Fine-Tuning 3. Efficient 4. Inference 5. Evaluation 6. Applications 7. Challenges
We loosely follow the existing terminology to ensure a stan-
dardized outlook of this research direction. For instance, fol-
lowing [50], our survey discusses pre-trained LLMs with 10B
parameters or more. We refer the readers interested in smaller
pre-trained models to [51, 52, 53].
The organization of this paper is as follows. Section 2 discusses
the background of LLMs. Section 3 focuses on LLMs overview,
architectures, training pipelines and strategies, fine-tuning, and
utilization in different domains. Section 4 highlights the config-
uration and parameters that play a crucial role in the function-
ing of these models. Summary and discussions are presented
in section 3.8. The LLM training and evaluation, datasets, and
benchmarks are discussed in section 5, followed by challenges
and future directions, and conclusion in sections 7 and 8, re-
spectively.
2. Background
We provide the relevant background to understand the fun-
damentals related to LLMs in this section. We briefly discuss
necessary components in LLMs and refer the readers interested
in details to the original works.
2.1. Tokenization
Tokenization [59] is an essential pre-processing step in
LLM training that parses the text into non-decomposing units
called tokens. Tokens can be characters, subwords [60], sym-
bols [61], or words, depending on the tokenization process.
Some of the commonly used tokenization schemes in LLMs
include wordpiece [62], byte pair encoding (BPE) [61], and un-
igramLM [60]. Readers are encouraged to refer to [63] for a
detailed survey.
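As a concrete illustration of subword tokenization (not tied to any specific tokenizer implementation), the following minimal Python sketch performs BPE-style merges: starting from characters, the most frequent adjacent symbol pair is merged repeatedly to grow a subword vocabulary. The toy corpus and merge count are illustrative only.

from collections import Counter

def bpe_merges(corpus_words, num_merges=10):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Represent each word as a tuple of symbols (characters to start with).
    vocab = Counter(tuple(word) for word in corpus_words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(bpe_merges(["low", "lower", "lowest", "newer", "wider"], num_merges=5))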
2.2. Encoding Positions
The transformer processes the tokens of an input sequence in
parallel and independently of each other.
Moreover, the attention mod-
ule in the transformer does not capture positional information.
As a result, positional encodings were introduced in trans-
former [64], where a positional embedding vector is added to
the token embedding. Variants of positional embedding include
absolute, relative, or learned positional encodings. Within rel-
ative encoding, ALiBi and RoPE are two widely used positional
embeddings in LLMs.
ALiBi [65]: It subtracts a scalar bias from the attention score
that increases with the distance between token positions. This
favors using recent tokens for attention.
RoPE [66]: It rotates query and key representations at an an-
gle proportional to the token absolute position in the input
sequence, resulting in a relative positional encoding scheme
which decays with the distance between the tokens.
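For illustration, a minimal NumPy sketch of an ALiBi-style bias: a per-head slope scales the distance between query and key positions and is subtracted from the raw attention scores; the slope value used here is an arbitrary placeholder rather than the head-specific slopes of [65].

import numpy as np

def alibi_bias(seq_len, slope=0.5):
    """Linear bias that penalizes attention to distant (older) tokens."""
    positions = np.arange(seq_len)
    # distance[i, j] = how far key j lies behind query i (0 on the diagonal).
    distance = positions[:, None] - positions[None, :]
    bias = -slope * np.maximum(distance, 0)           # no bonus for future tokens
    causal_mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
    return bias + causal_mask                         # added to raw attention scores

print(alibi_bias(5, slope=0.5))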
2.3. Attention in LLMs
Attention assigns weights to input tokens based on impor-
tance so that the model gives more emphasis to relevant tokens.
Attention in transformers [64] calculates query, key, and value
mappings for input sequences, where the attention score is
obtained by multiplying the query and key, and later used to
weight values. We discuss different attention strategies used in
LLMs below.
Self-Attention [64]: Calculates attention using queries, keys,
and values from the same block (encoder or decoder).
Cross Attention: It is used in encoder-decoder architectures,
where the decoder states provide the queries, and key-value
pairs come from the encoder outputs.
Sparse Attention [67]: Self-attention has O(n²) time complex-
ity, which becomes infeasible for long sequences. To speed
up the computation, sparse attention [67] iteratively calculates
attention in sliding windows.
Flash Attention [68]: Memory access is the major bottleneck
in calculating attention using GPUs.
To speed up, flash
attention employs input tiling to minimize the memory reads
and writes between the GPU high bandwidth memory (HBM)
and the on-chip SRAM.
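The following minimal sketch shows the basic single-head scaled dot-product self-attention computation described above; the matrix sizes and random inputs are purely illustrative.

import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # query, key, value projections
    scores = q @ k.T / np.sqrt(k.shape[-1])         # similarity between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v                              # weighted sum of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                         # 4 tokens, 8-dim embeddings
w = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(x, *w).shape)                  # (4, 8)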
2.4. Activation Functions
The activation functions serve a crucial role in the curve-
fitting abilities of neural networks [69]. We discuss activation
functions used in LLMs in this section.
ReLU [70]: The Rectified linear unit (ReLU) is defined as:
ReLU(x) = max(0, x)
(1)
GeLU [71]: The Gaussian Error Linear Unit (GeLU) is the
combination of ReLU, dropout [72] and zoneout [73].
GLU variants [74]: The Gated Linear Unit [75] is a neural
network layer that is an element-wise product (⊗) of a linear
transformation and a sigmoid transformed (σ) linear projection
of the input given as:
GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c),
(2)
where x is the input of the layer and W, V, b, and c are learned
parameters. Other GLU variants [74] used in LLMs are:
ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c),
GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c),
SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c).
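As a minimal sketch of the SwiGLU variant defined above (dimensions and inputs are arbitrary):

import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))            # equals x * sigmoid(beta * x)

def swiglu(x, W, b, V, c, beta=1.0):
    """SwiGLU(x) = Swish_beta(xW + b) ⊗ (xV + c), an element-wise gated projection."""
    return swish(x @ W + b, beta) * (x @ V + c)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))                        # batch of 2, hidden size 16
W, V = rng.normal(size=(16, 64)), rng.normal(size=(16, 64))
b, c = np.zeros(64), np.zeros(64)
print(swiglu(x, W, b, V, c).shape)                  # (2, 64)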
2.5. Layer Normalization
Layer normalization leads to faster convergence and is an in-
tegrated component of transformers [64]. In addition to Layer-
Norm [76] and RMSNorm [77], LLMs use pre-layer normal-
ization [78], applying it before multi-head attention (MHA).
Pre-norm is shown to provide training stability in LLMs. An-
other normalization variant, DeepNorm [79] fixes the issue with
larger gradients in pre-norm.
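For illustration, a minimal sketch of RMSNorm and its pre-norm placement inside a residual sublayer; the sublayer function below is a stand-in for an attention or feed-forward block, not a full implementation.

import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale by the root-mean-square of activations (no mean centering)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def pre_norm_sublayer(x, sublayer, gain):
    """Pre-norm residual block: x + f(norm(x)), the form noted for training stability."""
    return x + sublayer(rms_norm(x, gain))

x = np.random.default_rng(0).normal(size=(4, 8))
out = pre_norm_sublayer(x, sublayer=lambda h: 0.1 * h, gain=np.ones(8))
print(out.shape)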
2.6. Distributed LLM Training
This section describes distributed LLM training approaches
briefly. More details are available in [13, 37, 80, 81].
Data Parallelism: Data parallelism replicates the model on
multiple devices where data in a batch gets divided across de-
vices. At the end of each training iteration weights are synchro-
nized across all devices.
Tensor Parallelism: Tensor parallelism shards a tensor compu-
tation across devices. It is also known as horizontal parallelism
or intra-layer model parallelism.
Pipeline Parallelism: Pipeline parallelism shards model layers
across different devices. This is also known as vertical paral-
lelism.
Model Parallelism: A combination of tensor and pipeline par-
allelism is known as model parallelism.
3D Parallelism: A combination of data, tensor, and model par-
allelism is known as 3D parallelism.
Optimizer Parallelism: Optimizer parallelism also known as
zero redundancy optimizer [37] implements optimizer state
partitioning, gradient partitioning, and parameter partitioning
across devices to reduce memory consumption while keeping
the communication costs as low as possible.
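As a minimal, framework-free illustration of data parallelism, the sketch below splits a batch across simulated devices, computes per-device gradients, and averages them before a shared weight update; real systems perform the averaging with an all-reduce collective, which is omitted here.

import numpy as np

def data_parallel_step(w, batch_x, batch_y, num_devices=2, lr=0.1):
    """One linear-regression SGD step with gradients averaged across replicas."""
    x_shards = np.array_split(batch_x, num_devices)
    y_shards = np.array_split(batch_y, num_devices)
    grads = []
    for xs, ys in zip(x_shards, y_shards):          # each "device" holds a model copy
        pred = xs @ w
        grads.append(2 * xs.T @ (pred - ys) / len(xs))
    avg_grad = np.mean(grads, axis=0)               # stands in for an all-reduce
    return w - lr * avg_grad                        # every replica applies the same update

rng = np.random.default_rng(0)
x, y = rng.normal(size=(8, 3)), rng.normal(size=8)
print(data_parallel_step(np.zeros(3), x, y))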
2.7. Libraries
Some commonly used libraries for LLMs training are:
Transformers [82]: The library provides access to various pre-
trained transformer models with APIs to train, fine-tune, infer,
and develop custom models.
DeepSpeed [36]: A library for scalable distributed training and
inference of deep learning models.
Megatron-LM [80]: It provides GPU-optimized techniques for
large-scale training of LLMs.
JAX [83]: A Python library for high-performance numerical
computing and scalable machine learning. It can differenti-
ate native Python and NumPy functions and execute them on
GPUs.
Colossal-AI [84]: A collection of components to write dis-
tributed deep learning models.
BMTrain [81]: A library to write efficient stand-alone LLMs
training code.
FastMoE [85]: Provides an API to build mixture-of-experts
(MoE) models in PyTorch.
MindSpore [86]: A deep learning training and inference frame-
work extendable to mobile, edge, and cloud computing.
PyTorch [87]: A framework developed by Facebook AI Re-
search lab (FAIR) to build deep learning models. The main
features of PyTorch include a dynamic computation graph and
a pythonic coding style.
TensorFlow [88]: A deep learning framework developed by
Google. The key features of TensorFlow are graph-based com-
putation, eager execution, scalability, etc.
MXNet [89]: Apache MXNet is a deep learning framework
with support to write programs in multiple languages, includ-
ing, Python, C++, Scala, R, etc. It also provides support for
dynamic and static computation graphs.
2.8. Data PreProcessing
This section briefly summarizes data preprocessing tech-
niques used in LLMs training.
Quality Filtering: For better results, training data quality is
essential. Some approaches to filtering data are: 1) classifier-
based and 2) heuristics-based.
Classifier-based approaches
train a classifier on high-quality data and predict the quality of
text for filtering, whereas heuristics-based employ some rules
for filtering like language, metrics, statistics, and keywords.
Data Deduplication: Duplicated data can affect model per-
formance and increase data memorization; therefore, to train
LLMs, data deduplication is one of the preprocessing steps.
This can be performed at multiple levels, like sentences,
documents, and datasets.
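A minimal sketch of document-level deduplication via exact hashing of normalized text; production pipelines often rely on fuzzy matching (e.g., MinHash), which is not shown here.

import hashlib

def deduplicate(documents):
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen, unique = set(), []
    for doc in documents:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["The cat sat.", "the  cat sat.", "A different document."]
print(deduplicate(docs))   # the near-identical second document is dropped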
Privacy Reduction: Most of the training data for LLMs is
collected through web sources.
This data contains private
information; therefore, many LLMs employ heuristics-based
methods to filter information such as names, addresses, and
phone numbers to avoid learning personal information.
2.9. Architectures
Here we discuss the variants of the transformer architectures
used in LLMs. The difference arises due to the application of
Figure 4: An example of attention patterns in language models, image is taken
from [93].
Figure 5: An example of language model training objectives, image from [93].
the attention and the connection of transformer blocks. An il-
lustration of attention patterns of these architectures is shown
in Figure 4.
Encoder Decoder: This architecture processes inputs through
the encoder and passes the intermediate representation to the
decoder to generate the output.
Here, the encoder sees the
complete sequence using self-attention, whereas the decoder
generates the sequence one token at a time, attending to the
encoder outputs through cross-attention.
Causal Decoder: A type of architecture that does not have an
encoder and processes and generates output using a decoder,
where the predicted token depends only on the previous time
steps.
Prefix Decoder: It is also known as a non-causal decoder,
where the attention calculation is not strictly dependent on the
past information and the attention is bidirectional. An example
of a non-causal attention mask is shown in Figure 4.
Mixture-of-Experts: It is a variant of transformer architecture
with parallel independent experts and a router to route tokens
to experts. These experts are feed-forward layers after the at-
tention block [90]. Mixture-of-Experts (MoE) is an efficient
sparse architecture that offers comparable performance to dense
models and allows increasing the model size without increas-
ing the computational cost by activating only a few experts at a
time [91, 92].
2.10. Pre-Training Objectives
This section describes LLMs pre-training objectives.
For
more details see the paper [93].
Full Language Modeling: An autoregressive language model-
ing objective where the model is asked to predict future tokens
given the previous tokens, an example is shown in Figure 5.
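To make the full (causal) language modeling objective concrete, the sketch below computes the average negative log-likelihood of each token given its predecessors; the probability table is a toy stand-in for a model's softmax outputs.

import numpy as np

def causal_lm_loss(token_ids, probs):
    """probs[t] is the model's predicted distribution for position t, conditioned
    only on tokens before t; the loss is the mean -log p(correct token)."""
    nll = [-np.log(probs[t][tok]) for t, tok in enumerate(token_ids)]
    return float(np.mean(nll))

vocab_size, seq = 5, [2, 0, 3]
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(vocab_size), size=len(seq))   # toy next-token distributions
print(causal_lm_loss(seq, probs))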
Prefix Language Modeling: A non-causal training objective,
where a prefix is chosen randomly and only remaining target
tokens are used to calculate the loss. An example is shown in
Figure 5.
Figure 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting/utilization. Prompting LLMs to generate responses is possible at
different training stages like pre-training, instruction-tuning, or alignment tuning. “RL” stands for reinforcement learning, “RM” represents reward-modeling, and
“RLHF” represents reinforcement learning with human feedback.
Masked Language Modeling: In this training objective, tokens
or spans (a sequence of tokens) are masked randomly and the
model is asked to predict masked tokens given the past and
future context. An example is shown in Figure 5.
Unified Language Modeling: Unified language modeling [94]
is a combination of causal, non-causal, and masked language
training objectives. Here in masked language modeling, the
attention is not bidirectional but unidirectional, attending either
left-to-right or right-to-left context.
2.11. LLMs Scaling Laws
Scaling laws study the optimal combination of model param-
eters, dataset size, and computational resources that predict the
improvement in the model performance. It has been shown
that the loss scales according to the power-law with model size,
dataset size, and compute resources [95]. This study suggests
larger models are more important than big data for better perfor-
mance. Another variant of scaling law [96] suggests the model
size and the number of training tokens should be scaled equally.
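As a rough illustration of the second recipe, the sketch below splits a compute budget so that parameters and tokens grow in proportion, using the common approximation C ≈ 6ND together with a tokens-per-parameter ratio of about 20; both constants are rule-of-thumb assumptions rather than values taken from this survey.

import math

def compute_optimal_split(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget so parameters and tokens grow in proportion (both ~ C^0.5).
    Uses the rough approximation C ≈ 6 * N * D; the ratio D/N ≈ 20 is a commonly
    quoted Chinchilla-style rule of thumb, used here only for illustration."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for c in (1e21, 1e23, 1e25):
    n, d = compute_optimal_split(c)
    print(f"C={c:.0e}  params≈{n:.2e}  tokens≈{d:.2e}")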
2.12. LLMs Adaptation Stages
This section discusses the fundamentals of LLMs adaptation
stages, from pre-training to fine-tuning for downstream tasks
and utilization. An example of different training stages and in-
ference in LLMs is shown in Figure 6. In this paper, we refer
to alignment-tuning as aligning with human preferences, while
occasionally the literature uses the term alignment for different
purposes.
2.12.1. Pre-Training
In the very first stage, the model is trained in a self-
supervised manner on a large corpus to predict the next to-
kens given the input. The design choices of LLMs vary from
encoder-decoder to decoder-only architectures with different
building blocks and loss functions in sections 2.5, 2.4, 2.10.
2.12.2. Fine-Tuning
There are different styles to fine-tune an LLM. This section
briefly discusses fine-tuning approaches.
Transfer Learning: The pre-trained LLMs perform well for
various tasks [6, 15]. However, to improve the performance for
a downstream task, pre-trained models are fine-tuned with the
task-specific data [10, 11], known as transfer learning.
Instruction-tuning: To enable a model to respond to user
queries effectively, the pre-trained model is fine-tuned on in-
struction formatted data i.e., instruction and an input-output
pair. Instructions generally comprise multi-task data in plain
natural language, guiding the model to respond according to the
prompt and the input. This type of fine-tuning improves zero-
shot generalization and downstream task performance. Details
on formatting instruction data and its various styles are avail-
able in [16, 50, 97].
Alignment-tuning: LLMs are prone to generating false, biased,
and harmful text. To make them helpful, honest, and harmless,
models are aligned using human feedback. Alignment involves
asking LLMs to generate unexpected responses and then updat-
ing their parameters to avoid such responses [20, 21, 98].
It ensures LLMs operate according to human intentions and
values. A model is defined to be an “aligned” model if the
model fulfills three criteria of helpful, honest, and harmless or
“HHH” [99].
Researchers employ reinforcement learning with human feed-
back (RLHF) [100] for model alignment. In RLHF, a fine-tuned
model on demonstrations is further trained with reward model-
ing (RM) and reinforcement learning (RL), shown in Figure 6.
Below we briefly discuss RM and RL pipelines in RLHF.
Reward modeling: trains a model to rank generated responses
according to human preferences using a classification objec-
tive. To train the classifier humans annotate LLMs generated
responses based on the HHH criteria.
Reinforcement learning: in combination with the reward model
is used for alignment in the next stage. The previously trained
reward model ranks LLM-generated responses into preferred
vs. non-preferred, which is used to align the model with proxi-
mal policy optimization (PPO). This process repeats iteratively
until convergence.
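To make the reward-modeling step concrete, a minimal sketch of the pairwise ranking loss commonly used for training reward models: the preferred response should receive a higher score than the rejected one. The scalar scores below are placeholders for reward-model outputs.

import math

def pairwise_rm_loss(score_preferred, score_rejected):
    """Negative log-sigmoid of the score margin: small when the preferred
    response is scored clearly higher than the rejected one."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(pairwise_rm_loss(2.0, 0.5))   # confident correct ranking -> small loss
print(pairwise_rm_loss(0.0, 1.0))   # wrong ranking -> larger loss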
2.12.3. Prompting/Utilization
Prompting is a method to query trained LLMs for generating
responses, as illustrated in Figure 6. LLMs can be prompted in
various prompt setups, where they can be adapted to the instruc-
tions without fine-tuning and in other cases with fine-tuning on
data containing different prompt styles [16, 101, 102]. A good
guide on prompt engineering is available at [32]. Below, we
will discuss various widely used prompt setups.
Zero-Shot Prompting: LLMs are zero-shot learners and ca-
pable of answering queries never seen before. This style of
prompting requires LLMs to answer user questions without see-
ing any examples in the prompt.
In-context Learning: Also known as few-shot learning, here,
multiple input-output demonstration pairs are shown to the
model to generate the desired response. This adaptation style
is also called few-shot learning. A discussion on formatting in-
context learning (ICL) templates is available in [54, 50, 18, 16].
Reasoning in LLMs: LLMs are zero-shot reasoners and can
be provoked to generate answers to logical problems, task
planning, critical thinking, etc. with reasoning.
Generating
reasons is possible only by using different prompting styles,
whereas to improve LLMs further on reasoning tasks many
methods [16, 97] train them on reasoning datasets. We discuss
various prompting techniques for reasoning below.
Chain-of-Thought (CoT): A special case of prompting where
demonstrations contain reasoning information aggregated with
inputs and outputs so that the model generates outcomes with
step-by-step reasoning. More details on CoT prompts are avail-
able in [55, 103, 101].
Self-Consistency:
Improves CoT performance by generat-
ing multiple responses and selecting the most frequent an-
swer [104].
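A minimal sketch of the self-consistency selection step: sample several reasoning paths, extract a final answer from each, and return the most frequent one. The sample_answer callable is a hypothetical stand-in for an LLM call.

from collections import Counter
import random

def self_consistency(sample_answer, num_samples=5):
    """Majority vote over final answers from independently sampled reasoning paths."""
    answers = [sample_answer() for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in for sampling an LLM answer at a nonzero temperature.
random.seed(0)
print(self_consistency(lambda: random.choice(["42", "42", "41"])))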
Tree-of-Thought (ToT): Explores multiple reasoning paths
with possibilities to look ahead and backtrack for problem-
solving [105].
Single-Turn Instructions: In this prompting setup, LLMs are
queried only once with all the relevant information in the
prompt. LLMs generate responses by understanding the con-
text either in a zero-shot or few-shot setting.
Multi-Turn Instructions: Solving a complex task requires mul-
tiple interactions with LLMs, where feedback and responses
from the other tools are given as input to the LLM for the next
rounds. This style of using LLMs in the loop is common in
autonomous agents.
3. Large Language Models
This section reviews LLMs, briefly describing their architec-
tures, training objectives, pipelines, datasets, and fine-tuning
details.
3.1. Pre-Trained LLMs
Here, we provide summaries of various well-known pre-
trained LLMs with significant discoveries, changing the course
of research and development in NLP. These LLMs have consid-
erably improved the performance in NLU and NLG domains,
and are widely fine-tuned for downstream tasks. Moreover, we
identify key findings and insights of pre-trained LLMs in
Tables 1 and 2 that improve their performance.
3.1.1. General Purpose
T5 [10]: An encoder-decoder model employing a unified text-
to-text training for all NLP problems is shown in Figure 7. T5
places layer normalization outside the residual path in a conven-
tional transformer model [64]. It uses masked language mod-
eling as a pre-training objective where spans (consecutive to-
kens) are replaced with a single mask instead of separate masks
for each token. This type of masking speeds up the training as
it produces shorter sequences. After pre-training, the model is
fine-tuned using adapter layers [106] for downstream tasks.
GPT-3 [6]: The GPT-3 architecture is the same as the GPT-
2 [5] but with dense and sparse attention in transformer layers
similar to the Sparse Transformer [67]. It shows that large mod-
els can train on larger batch sizes with a lower learning rate; to
decide the batch size during training, GPT-3 uses the gradient
noise scale as in [107]. Overall, GPT-3 increases model param-
eters to 175B, showing that the performance of large language
models improves with scale and is competitive with fine-tuned
models.
Figure 7: Unified text-to-text training example, source image from [10].
Figure 8: An example of the PanGu-α architecture, image sourced from [108].
mT5 [11]: A multilingual T5 model [10] trained on the mC4
dataset with 101 languages. The dataset is extracted from the
public common crawl scrape. The model uses a larger vocab-
ulary size of 250,000 to cover multiple languages. To avoid
over-fitting or under-fitting for a language, mT5 employs a data
sampling procedure to select samples from all languages. The
paper suggests using a small amount of pre-training datasets,
including all languages when fine-tuning for a task using En-
glish language data. This allows the model to generate correct
non-English outputs.
PanGu-α [108]: An autoregressive model that has a query
layer at the end of standard transformer layers, example shown
in Figure 8, to predict the next token. Its structure is similar to
the transformer layer but with an additional embedding for the
next position in the attention mechanism, given in Eq. 3.
a = p_n W_q^h (W_k^h)^T H_L^T
(3)
CPM-2 [12]: Cost-efficient Pre-trained language Models
(CPM-2) pre-trains bilingual (English and Chinese) 11B and
198B mixture-of-experts (MoE) models on the WuDaoCor-
pus [109] dataset. The tokenization process removes “_” white
space tokens in the sentencepiece tokenizer. The models are
trained with knowledge inheritance, starting with only the Chi-
nese language in the first stage and then adding English and
Chinese data. This trained model gets duplicated multiple times
to initialize the 198B MoE model. Moreover, to use the model
for downstream tasks, CPM-2 experimented with both com-
plete fine-tuning and prompt fine-tuning as in [40] where only
prompt-related parameters are updated by inserting prompts at
various positions, front, middle, and back. CPM-2 also pro-
poses the INFMOE, a memory-efficient framework with a strat-
egy to dynamically offload parameters to the CPU for inference
at a 100B scale. It overlaps data movement with inference com-
putation for lower inference time.
ERNIE 3.0 [110]: ERNIE 3.0 takes inspiration from multi-
task learning to build a modular architecture using Transformer-
XL [111] as the backbone. The universal representation mod-
ule is shared by all the tasks, which serve as the basic block
for task-specific representation modules, which are all trained
jointly for natural language understanding, natural language
generation, and knowledge extraction. This LLM is primar-
ily focused on the Chinese language. It claims to train on the
largest Chinese text corpora for LLM training, and achieved
state-of-the-art in 54 Chinese NLP tasks.
Jurassic-1 [112]: A pair of auto-regressive language mod-
els, including a 7B-parameter J1-Large model and a 178B-
parameter J1-Jumbo model.
The training vocabulary of
Jurassic-1 comprises word pieces, complete words, and multi-
word expressions without any word boundaries, where possible
out-of-vocabulary instances are interpreted as Unicode bytes.
Compared to the GPT-3 counterparts, the Jurassic-1 models
apply a more balanced depth-to-width self-attention architec-
ture [113] and an improved tokenizer for a faster prediction
based on broader resources, achieving a comparable perfor-
mance in zero-shot learning tasks and a superior performance in
few-shot learning tasks given the ability to feed more examples
as a prompt.
HyperCLOVA [114]: A Korean language model with GPT-3
architecture.
Yuan 1.0 [115]: Trained on a Chinese corpus with 5TB of
high-quality text collected from the Internet. A Massive Data
Filtering System (MDFS) built on Spark is developed to pro-
cess the raw data via coarse and fine filtering techniques. To
speed up the training of Yuan 1.0 to save energy expenses and
carbon emissions, various factors that improve the performance
of distributed training are incorporated into the architecture and
training: increasing the hidden state size improves pipeline and
tensor parallelism performance, larger micro-batches improve
pipeline parallelism performance, and a larger global batch size
improves data parallelism performance. In practice, the Yuan 1.0
model performs well on text classification, Winograd Schema,
natural language inference, and reading comprehension tasks.
Gopher [116]: The Gopher family of models ranges from
44M to 280B parameters in size to study the effect of scale
on the LLMs performance. The 280B model beats GPT-3 [6],
Jurassic-1 [112], MT-NLG [117], and others on 81% of the
evaluated tasks.
ERNIE 3.0 TITAN [35]: ERNIE 3.0 Titan extends ERNIE 3.0
by training a larger model with 26x the number of parameters
of the latter. This bigger model outperformed other state-of-the-
art models in 68 NLP tasks. LLMs produce text with incorrect
facts. In order to have control of the generated text with fac-
tual consistency, ERNIE 3.0 Titan adds another task, Credible
and Controllable Generations, to its multi-task learning setup.
It introduces additional self-supervised adversarial and control-
lable language modeling losses to the pre-training step, which
enables ERNIE 3.0 Titan to beat other LLMs in their manually
selected Factual QA task set evaluations.
GPT-NeoX-20B [118]: An auto-regressive model that largely
follows GPT-3 with a few deviations in architecture design,
trained on the Pile dataset without any data deduplication. GPT-
NeoX has parallel attention and feed-forward layers in a trans-
former block, given in Eq. 4, that increases throughput by 15%.
It uses rotary positional embedding [66], applying it to only
25% of embedding vector dimension as in [119]. This reduces
the computation without performance degradation. As opposed
to GPT-3, which uses dense and sparse layers, GPT-NeoX-20B
uses only dense layers. The hyperparameter tuning at this scale
is difficult; therefore, the model chooses hyperparameters from
the method [6] and interpolates values between 13B and 175B
models for the 20B model. The model training is distributed
among GPUs using both tensor and pipeline parallelism.
x + Attn(LN1(x)) + FF(LN2(x))
(4)
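A minimal sketch of the parallel formulation in Eq. 4, where the attention and feed-forward branches both read normalized copies of the same input and their outputs are summed into the residual stream; the branch functions below are stand-ins for real attention and MLP layers, and the learned normalization parameters are omitted.

import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def parallel_block(x, attn, ff):
    """x + Attn(LN1(x)) + FF(LN2(x)): both branches run on x rather than sequentially."""
    return x + attn(layer_norm(x)) + ff(layer_norm(x))

x = np.random.default_rng(0).normal(size=(4, 8))
out = parallel_block(x, attn=lambda h: 0.1 * h, ff=lambda h: 0.2 * h)
print(out.shape)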
OPT [14]: It is a clone of GPT-3, developed to open-source
a model that replicates GPT-3 performance. Training of OPT
employs dynamic loss scaling [120] and restarts from an earlier
checkpoint with a lower learning rate whenever loss divergence
is observed. Overall, the performance of OPT-175B models is
comparable to the GPT3-175B model.
BLOOM [13]: A causal decoder model trained on the ROOTS
corpus to open-source an LLM. The architecture of BLOOM is
shown in Figure 9, with differences like ALiBi positional em-
bedding, an additional normalization layer after the embedding
layer as suggested by the bitsandbytes1 library. These changes
stabilize training with improved downstream performance.
GLaM [91]: Generalist Language Model (GLaM) represents a
family of language models using a sparsely activated decoder-
only mixture-of-experts (MoE) structure [121, 90].
To gain
more model capacity while reducing computation, the experts
are sparsely activated where only the best two experts are used
to process each input token. The largest GLaM model, GLaM
(64B/64E), is about 7× larger than GPT-3 [6], while only part of
the parameters are activated per input token. The largest GLaM
(64B/64E) model achieves better overall results as compared
to GPT-3 while consuming only one-third of GPT-3’s training
energy.
MT-NLG [117]: A 530B causal decoder based on the GPT-
2 architecture that has roughly 3× GPT-3 model parameters.
MT-NLG is trained on filtered high-quality data collected from
various public datasets and blends various types of datasets in a
single batch, which beats GPT-3 on several evaluations.
Chinchilla [96]: A causal decoder trained on the same dataset
as Gopher [116] but with a slightly different data sampling
distribution (sampled from MassiveText). The model architec-
ture is similar to the one used for Gopher, with the exception of
AdamW optimizer instead of Adam.
1 https://github.com/TimDettmers/bitsandbytes
Figure 9: The BLOOM architecture example sourced from [13].
Chinchilla identifies the relationship that model size should be
doubled for every doubling of training tokens. Over 400 language models ranging
from 70 million to over 16 billion parameters on 5 to 500 bil-
lion tokens are trained to get the estimates for compute-optimal
training under a given budget. The authors train a 70B model
with the same compute budget as Gopher (280B) but with 4
times more data. It outperforms Gopher [116], GPT-3 [6], and
others on various downstream tasks, after fine-tuning.
AlexaTM [122]: An encoder-decoder model, where encoder
weights and decoder embeddings are initialized with a pre-
trained encoder to speed up training. The encoder stays frozen
for the initial 100k steps and is later unfrozen for end-to-end
training. The model is trained on a combination of denoising
and causal language modeling (CLM) objectives, concatenat-
ing a [CLM] token at the beginning for mode switching. Dur-
ing training, the CLM task is applied for 20% of the time, which
improves the in-context learning performance.
PaLM [15]: A causal decoder with parallel attention and
feed-forward layers similar to Eq. 4, speeding up training by
roughly 15%. Additional changes to the conventional trans-
former model include SwiGLU activation, RoPE embeddings,
multi-query attention that saves computation cost during decod-
ing, and shared input-output embeddings. During training, loss
spiking was observed, and to fix it, model training was restarted
from a 100-step earlier checkpoint by skipping 200-500 batches
around the spike. Moreover, the model was found to memo-
rize around 2.4% of the training data at the 540B model scale,
whereas this number was lower for smaller models.
PaLM-2 [123]: A smaller multi-lingual variant of PaLM,
trained for larger iterations on a better quality dataset. PaLM-
2 shows significant improvements over PaLM, while reducing
training and inference costs due to its smaller size. To lessen
toxicity and memorization, it appends special tokens with a
fraction of pre-training data, which shows a reduction in gener-
ating harmful responses.
U-PaLM [124]: This method trains PaLM for 0.1% addi-
tional compute with the UL2 (also named UL2Restore) objec-
tive [125] on the same dataset; it outperforms the baseline
significantly on various NLP tasks, including zero-shot, few-
shot, commonsense reasoning, CoT, etc. Training with UL2R
involves converting a causal decoder PaLM to a non-causal de-
coder PaLM and employing 50% sequential denoising, 25%
regular denoising, and 25% extreme denoising loss functions.
UL2 [125]: An encoder-decoder architecture trained using a
mixture of denoisers (MoD) objective. Denoisers include 1)
R-Denoiser: a regular span masking, 2) S-Denoiser: which cor-
rupts consecutive tokens of a large sequence and 3) X-Denoiser:
which corrupts a large number of tokens randomly. During pre-
training, UL2 includes a denoiser token from R, S, X to rep-
resent a denoising setup. It helps improve fine-tuning perfor-
mance for downstream tasks that bind the task to one of the up-
stream training modes. This MoD style of training outperforms
the T5 model on many benchmarks.
GLM-130B [33]: GLM-130B is a bilingual (English and Chi-
nese) model trained using an auto-regressive mask infilling pre-
training objective similar to the GLM [126]. This training style
makes the model bidirectional as compared to GPT-3, which is
unidirectional. As opposed to GLM, the training of GLM-130B
includes a small amount of multi-task instruction pre-training
data (5% of the total data) along with self-supervised mask in-
filling. To stabilize the training, it applies embedding layer gra-
dient shrink.
LLaMA [127, 21]: A set of decoder-only language models
varying from 7B to 70B parameters. The LLaMA model series
is among the most widely adopted in the community for its
parameter efficiency and instruction tuning.
LLaMA-1 [127]: Implements efficient causal attention [128]
by not storing and computing masked attention weights and
key/query scores. Another optimization is reducing the number
of activations recomputed in the backward pass, as in [129].
LLaMA-2 [21]: This work is more focused on fine-tuning a
safer and better LLaMA-2-Chat model for dialogue generation.
The pre-trained model has 40% more training data with a larger
context length and grouped-query attention.
PanGu-Σ [92]: An autoregressive model with parameters
copied from PanGu-α and extended to a trillion scale with Ran-
dom Routed Experts (RRE), the architectural diagram is shown
in Figure 10. RRE is similar to the MoE architecture, with
distinctions at the second level, where tokens are randomly
routed to experts in a domain instead of using a learnable gat-
ing method. The model has bottom layers densely activated
and shared across all domains, whereas top layers are sparsely
activated according to the domain. This training style allows
extracting task-specific models and reduces catastrophic forget-
ting effects in the case of continual learning.
3.1.2. Coding
CodeGen [130]: CodeGen has a similar architecture to
PaLM [15], i.e., parallel attention, MLP layers, and RoPE em-
beddings. The model is trained on both natural language and
programming language data sequentially (trained on the first
dataset, then the second and so on) on the following datasets
1) PILE, 2) BIGQUERY and 3) BIGPYTHON. CodeGen pro-
posed a multi-step approach to synthesizing code. The purpose
is to simplify the generation of long sequences where the previ-
ous prompt and generated code are given as input with the next
prompt to generate the next code sequence. CodeGen open-
sources a Multi-Turn Programming Benchmark (MTPB) to eval-
uate multi-step program synthesis.
Codex [131]: This LLM is trained on a subset of public Python
Github repositories to generate code from docstrings. Com-
puter programming is an iterative process where the programs
are often debugged and updated before fulfilling the require-
ments. Similarly to this, Codex generates 100 versions of a
program by repetitive sampling for a given description, which
produces a working solution for 77.5% of the problems passing
unit tests. Its powerful version powers Github Copilot2.
AlphaCode [132]: A set of large language models, ranging
from 300M to 41B parameters, designed for competition-level
code generation tasks. It uses the multi-query attention [133] to
reduce memory and cache costs. Since competitive program-
ming problems highly require deep reasoning and an under-
standing of complex natural language algorithms, the Alpha-
Code models are pre-trained on filtered GitHub code in popular
languages and then fine-tuned on a new competitive program-
ming dataset named CodeContests. The CodeContests dataset
mainly contains problems, solutions, and test cases collected
from the Codeforces platform3. The pre-training employs stan-
dard language modeling objectives, while GOLD [134] with
tempering [135] serves as the training objective for the fine-
tuning on CodeContests data. To evaluate the performance of
AlphaCode, simulated programming competitions are hosted
on the Codeforces platform: overall, AlphaCode ranks at the
top 54.3% among over 5000 competitors, where its Codeforces
rating is within the top 28% of recently participated users.
CodeT5+ [34]: CodeT5+ is based on CodeT5 [136], with
shallow encoder and deep decoder, trained in multiple stages,
initially on unimodal data (code) and later on bimodal data
(text-code pairs). Each training stage has different training
objectives and activates different model blocks (encoder, de-
coder, or both) according to the task. The unimodal pre-training includes span
denoising and CLM objectives, whereas bimodal pre-training
objectives contain contrastive learning, matching, and CLM for
text-code pairs. CodeT5+ adds special tokens with the text to
enable task modes, for example, [CLS] for contrastive loss,
[Match] for text-code matching, etc.
StarCoder [137]: A decoder-only model with the SantaCoder
architecture, employing Flash attention to scale up the context
length to 8k. The StarCoder trains an encoder to filter names,
emails, and other personal data from the training data. Its fine-
tuned variant outperforms PaLM, LLaMA, and LaMDA on
HumanEval and MBPP benchmarks.
3.1.3. Scientific Knowledge
Galactica [138]: A model trained on a large curated corpus of
human scientific knowledge comprising 48 million papers, text-
books, lecture notes, millions of compounds and proteins, sci-
entific websites, encyclopedias, and more, using the metaseq
library, which is built on PyTorch and fairscale [139]. The model
wraps reasoning datasets with the <work> token to provide step-by-
step reasoning context to the model, which has been shown to
improve the performance on reasoning tasks.
2https://github.com/features/copilot
3https://codeforces.com/
Figure 10: This example illustrates the PanGu-Σ architecture, as depicted in
the image sourced from [92].
3.1.4. Dialog
LaMDA [140]: A decoder-only model pre-trained on pub-
lic dialog data, public dialog utterances, and public web doc-
uments, where more than 90% of the pre-training data is in
English. LaMDA is trained with the objective of producing re-
sponses that exhibit high levels of quality, safety, and grounded-
ness. To achieve this, discriminative and generative fine-tuning
techniques are incorporated to enhance the model’s safety and
quality aspects. As a result, the LaMDA models can be utilized
as a general language model performing various tasks.
3.1.5. Finance
BloombergGPT [141]: A causal decoder model trained
using both financial ("FINPILE" from the Bloomberg archive)
and general-purpose datasets. The model's architecture is sim-
ilar to BLOOM [13] and OPT [14]. It allocates 50B param-
eters to different blocks of the model using the approach in [113].
For effective training, BloombergGPT packs documents to-
gether with <|endoftext|> to use the maximum sequence
length, warms up the batch size from 1024 to 2048, and
manually reduces the learning rate multiple times during
training.
Xuan Yuan 2.0 [142]: A Chinese financial chat model with
BLOOM’s [13] architecture trained on a combination of general
purpose, financial, general purpose instructions, and financial
institutions datasets. Xuan Yuan 2.0 combined the pre-training
and fine-tuning stages to avoid catastrophic forgetting.
3.2. Fine-Tuned LLMs
Pre-trained LLMs have excellent generalization abilities to
unseen tasks. However, because they are generally trained with
the objective of next token prediction, LLMs have limited ca-
pacity to follow user intent and are prone to generate unethical,
toxic or inaccurate responses [20]. For their effective utiliza-
tion, LLMs are fine-tuned to follow instructions [16, 17, 97] and
generate safe responses [20], which also results in increasing
zero-shot, few-shot, and cross-task generalization [97, 16, 18],
with minimal compute increment, e.g., 0.2% of the total pre-
training for PaLM 540B [16].
Figure 11: An example image shows an instance of the Flan training paradigm,
taken from [16].
We review various fine-tuned LLMs and strategies for effective
fine-tuning in this section.
3.2.1. Instruction-Tuning with Manually Created Datasets
Numerous hand-crafted instruction-tuning datasets with
different design choices are proposed in the literature to
instruction-tune LLMs. The performance of fine-tuned LLMs
depends on multiple factors, such as dataset, instruction diver-
sity, prompting templates, model size, and training objectives.
Keeping this in view, diverse fine-tuned models have emerged
in the literature using manually created datasets.
The models T0 [17] and mT0 (multi-lingual) [144] employ
templates to convert existing datasets into prompt datasets.
They have shown improvements in generalization to zero-shot
and held-out tasks. Tk-Instruct [18] fine-tuned the T5 model
with in-context instructions to study generalization on unseen
tasks when given in-context instructions during test time. The
model outperformed Instruct-GPT, despite being smaller in
size, i.e., 11B parameters as compared to 175B of GPT-3.
Increasing Tasks and Prompt Setups: Zero-shot and few-shot
performance improves significantly by expanding task collec-
tion and prompt styles. OPT-IML [97] and Flan [16] curated
larger 2k and 1.8k task datasets, respectively. Since increasing
task size alone is not enough, OPT-IML and Flan also add more
prompting setups to their datasets: zero-shot, few-shot, and
CoT. In continuation, CoT Collection [101] fine-tunes Flan-T5
further on 1.88M CoT samples. Another method [102] uses
symbolic tasks with tasks in T0, Flan, etc.
3.2.2. Instruction-Tuning with LLMs Generated Datasets
Generating an instruction-tuning dataset requires carefully
writing instructions and input-output pairs, which are often
written by humans, smaller in size, and less diverse.
To
overcome this, self-instruct [19] proposed an approach to
prompt available LLMs to generate instruction-tuning datasets.
Self-instruct outperformed models trained on manually created
dataset SUPER-NATURALINSTRUCTIONS (a dataset with
1600+ tasks) [18] by 33%. It starts with a seed of 175 tasks,
1 instruction, and 1 sample per task and iteratively generates
Table 1: Noteworthy findings and insights of pre-trained Large Language Models.
Models
Findings & Insights
T5
• Sharing parameters between the encoder and decoder performs comparably to not sharing them
• Fine-tuning model layers (adapter layers) works better than the conventional way of training only the
classification layers
GPT-3
• Few-shot performance of LLMs is better than the zero-shot, suggesting that LLMs are meta-
learners
mT5
• Large multi-lingual models perform equivalently to single language models on downstream tasks.
However, smaller multi-lingual models perform worse
PanGu-α
• LLMs have good few-shot capabilities
CPM-2
• Prompt fine-tuning requires updating very few parameters while achieving performance compara-
ble to full model fine-tuning
• Prompt fine-tuning takes more time to converge as compared to full model fine-tuning
• Inserting prompt tokens in-between sentences can allow the model to understand relations between
sentences and long sequences
• In an analysis, CPM-2 finds that prompts work as a provider (additional context) and aggregator
(aggregate information with the input text) for the model
ERNIE 3.0
• A modular LLM architecture with a universal representation module and task-specific representa-
tion module helps in the finetuning phase
• Optimizing the parameters of a task-specific representation network during the fine-tuning phase is
an efficient way to take advantage of the powerful pre-trained model
Jurassic-1
• The performance of LLM is highly related to the network size
• To improve runtime performance, more operations can be performed in parallel (width) rather than
sequentially (depth)