
Efficient Transformers for Language Modelling

Scaling Transformer architectures has been critical for pushing the frontiers of Language Modelling (LM), a problem central to Natural Language Processing (NLP) and Language Understanding. Although there is a direct positive relationship between a Transformer's capacity and its LM performance, practical limitations make training massive models infeasible. These limitations take the form of computation and memory costs that cannot be addressed solely by training on parallel devices. In this thesis, we investigate two approaches that make Transformers more computationally and memory efficient. First, we introduce the Mixture-of-Experts (MoE) Transformer, which can scale its capacity at a sub-linear computational cost. Second, we present a novel content-based sparse attention mechanism called Hierarchical Self Attention (HSA). We demonstrate that the MoE Transformer achieves lower test perplexity than a vanilla Transformer with higher computational demands. Language Modelling experiments with a Transformer that uses HSA in place of conventional attention show that HSA can speed up attention computation by up to 330% at a negligible cost in model performance.
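To illustrate how an MoE layer scales capacity at sub-linear compute, here is a minimal sketch (not taken from this repository; the class name, parameters, and the use of PyTorch are assumptions): a learned router sends each token to its top-k experts, so the parameter count grows with the number of experts while per-token compute stays roughly constant.

```python
# Illustrative top-k gated Mixture-of-Experts feed-forward layer (sketch only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, d_model, d_hidden, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert scores

    def forward(self, x):                       # x: (batch, seq, d_model)
        tokens = x.reshape(-1, x.size(-1))      # flatten to (num_tokens, d_model)
        scores = self.gate(tokens)              # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)

        out = torch.zeros_like(tokens)
        for slot in range(self.k):              # only k experts run per token
            idx = topk_idx[:, slot]
            w = weights[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * expert(tokens[mask])
        return out.reshape_as(x)


# Usage: a drop-in replacement for the position-wise feed-forward block.
layer = MoEFeedForward(d_model=64, d_hidden=256, num_experts=8, k=2)
y = layer(torch.randn(2, 10, 64))
print(y.shape)  # torch.Size([2, 10, 64])
```

Doubling the number of experts roughly doubles the layer's parameters, but each token still passes through only k expert networks, which is the sub-linear compute scaling referred to above.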

Hierarchical Self Attention

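As a rough illustration of the general idea behind content-based sparse attention (this sketch is not the thesis's HSA implementation; the clustering scheme, function names, and PyTorch usage are assumptions), the snippet below buckets tokens by their nearest centroid and restricts attention to within-bucket pairs.

```python
# Illustrative content-based sparse attention via clustering (sketch only).
import torch
import torch.nn.functional as F


def clustered_self_attention(x, centroids):
    """x: (seq, d) token representations; centroids: (num_clusters, d).

    Each position is assigned to its nearest centroid; attention scores are
    masked so a token only attends to tokens in the same cluster. For clarity
    the mask is applied to full (seq x seq) scores; an efficient implementation
    would compute only the within-cluster blocks to obtain the speed-up.
    """
    assign = torch.cdist(x, centroids).argmin(dim=-1)          # (seq,) cluster ids
    same_cluster = assign.unsqueeze(1) == assign.unsqueeze(0)  # (seq, seq) mask
    scores = (x @ x.transpose(-2, -1)) / x.size(-1) ** 0.5
    scores = scores.masked_fill(~same_cluster, float("-inf"))
    return F.softmax(scores, dim=-1) @ x


seq, d = 16, 32
x = torch.randn(seq, d)
centroids = torch.randn(4, d)  # hypothetical learned cluster centres
out = clustered_self_attention(x, centroids)
print(out.shape)  # torch.Size([16, 32])
```

Restricting attention to content-based buckets is what turns the quadratic attention cost into something closer to the sum of per-cluster block sizes, which is where the reported speed-up comes from.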

Hierarchical Self Attention Results

