Parallel decoding

Since this is being built from the ground up, I was wondering if it's possible to implement something similar to https://github.com/ggerganov/llama.cpp/pull/3228, so that parallel inference is natively supported.