Language models are incredibly powerful tools that can understand and generate human-like text by learning patterns from massive datasets. However, the traditional method of training these models, called “next-token prediction,” has its limitations. It essentially teaches the model to predict the next word in a sequence, but this approach can lead to suboptimal performance, especially for more complex tasks.
The researchers behind this study propose a new technique called multi-token prediction. Instead of predicting one token (word) at a time, this method trains the model to predict multiple future tokens simultaneously. Imagine it like this: While learning a language, instead of guessing one word at a time, you’re challenged to predict entire phrases or even sentences. Sounds intriguing, right?
So, how does this multi-token prediction work? The researchers designed a model architecture with a shared trunk that produces a latent representation of the input context. This shared trunk is then connected to multiple independent output heads, each responsible for predicting one of the future tokens. For example, if the model is set to predict four future tokens, it will have four output heads working in parallel.
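To make the architecture concrete, here is a minimal PyTorch sketch of the idea. This is our own toy code, not the authors’ implementation: the class name MultiTokenLM and the hyperparameters are invented, and where the paper attaches a transformer layer plus a shared unembedding matrix to each head, we use a plain linear projection to keep things short.

```python
import torch
import torch.nn as nn

class MultiTokenLM(nn.Module):
    """Toy multi-token predictor: one shared trunk feeding n independent output heads."""

    def __init__(self, vocab_size=32000, d_model=512, n_layers=6, n_attn_heads=8, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Shared trunk: any causal sequence model works; a Transformer encoder
        # with a causal mask stands in for the paper's decoder-only trunk here.
        layer = nn.TransformerEncoderLayer(d_model, n_attn_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        # One independent head per future offset (1 .. n_future).
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(n_future))

    def forward(self, tokens):
        x = self.embed(tokens)                                # (batch, seq, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        z = self.trunk(x, mask=mask)                          # shared latent representation
        return [head(z) for head in self.heads]               # head i: logits for the token i+1 steps ahead
```

Every head reads the same latent representation, so the extra predictions add only the cost of the heads themselves.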
During training, the model is fed a text corpus, and at each position, it is tasked with predicting the next n tokens simultaneously. This approach encourages the model to learn longer-term patterns and dependencies in the data, potentially leading to better performance, especially for tasks that require understanding the broader context.
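Assuming the MultiTokenLM sketch above, the objective can be written as one cross-entropy term per head, where head i at position t is trained to predict the token i+1 steps ahead, and the per-head losses are summed (again a sketch, not the authors’ code):

```python
import torch.nn.functional as F

def multi_token_loss(model, tokens):
    """Sum of per-head cross-entropy losses for a batch of token ids (batch, seq)."""
    logits_per_head = model(tokens)                    # from the MultiTokenLM sketch above
    loss = 0.0
    for i, logits in enumerate(logits_per_head):
        offset = i + 1                                 # head i predicts the token at t + offset
        pred = logits[:, :-offset, :]                  # positions that still have a target
        target = tokens[:, offset:]                    # the token `offset` steps ahead
        loss = loss + F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
    return loss
```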
The researchers also tackled a critical challenge: reducing the GPU memory usage of these multi-token predictors. They implemented a clever technique that computes the forward and backward passes for each output head sequentially, accumulating gradients at the shared trunk. This reduces peak GPU memory utilization, making it feasible to train larger models efficiently.
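The paper describes this trick in prose; below is one way it could look in PyTorch, reusing the hypothetical MultiTokenLM from above. The point is that only one head’s large (batch, seq, vocab) logits tensor is alive at any moment, while the gradient with respect to the shared representation accumulates and is pushed through the trunk once at the end.

```python
def memory_efficient_step(model, tokens):
    """Sketch: run the heads one at a time, accumulating the gradient at the trunk output."""
    x = model.embed(tokens)
    mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
    z = model.trunk(x, mask=mask)                 # shared latent, kept in the autograd graph

    z_detached = z.detach().requires_grad_(True)  # gradients from each head accumulate here
    for i, head in enumerate(model.heads):
        offset = i + 1
        logits = head(z_detached)
        loss = F.cross_entropy(
            logits[:, :-offset, :].reshape(-1, logits.size(-1)),
            tokens[:, offset:].reshape(-1),
        )
        loss.backward()   # head gradients are computed; logits can be freed before the next head
    z.backward(z_detached.grad)                   # a single backward pass through the trunk
```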
The researchers conducted extensive experiments, and the results are quite promising. They found that multi-token prediction becomes increasingly useful as the model size grows. For instance, on coding evaluation benchmarks like MBPP and HumanEval, models trained with multi-token prediction outperformed their next-token prediction counterparts, sometimes by a significant margin. The 13B parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models.
The additional output heads can also be leveraged to speed up inference using techniques like speculative decoding. The researchers observed up to a 3x speedup in decoding for their best 4-token prediction model on code and natural language tasks.
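For intuition, here is a greedy, batch-of-one sketch of that kind of self-speculative (blockwise-parallel) decoding step using the toy model from above; the authors’ actual decoding scheme differs in its details. The extra heads draft a few future tokens essentially for free, and the ordinary next-token head then verifies them, so one step can emit several tokens.

```python
@torch.no_grad()
def speculative_step(model, tokens):
    """One greedy self-speculative step for a (1, seq) tensor of token ids."""
    logits = model(tokens)                            # head i predicts offset i + 1
    draft = torch.stack([h[:, -1, :].argmax(-1) for h in logits], dim=1)   # (1, n_future)

    # Verify the lookahead tokens with the ordinary next-token head (head 0).
    verify = model(torch.cat([tokens, draft], dim=1))[0].argmax(-1)
    accepted = [draft[0, 0]]                          # head 0's draft is exact greedy decoding
    for j in range(1, draft.size(1)):
        pos = tokens.size(1) + j - 1                  # prefix = tokens + draft[:j]
        if verify[0, pos] != draft[0, j]:             # keep draft[j] only if head 0 agrees
            break
        accepted.append(draft[0, j])
    return torch.cat([tokens, torch.stack(accepted).unsqueeze(0)], dim=1)
```

In a real implementation the drafting and verification passes are shared across steps, which is where the measured speedup comes from; the sketch separates them only for clarity.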
But it’s not just about coding; multi-token prediction also showed promising results in natural language tasks. When evaluated on summarization benchmarks, models trained with multi-token prediction achieved higher ROUGE scores compared to the next-token baseline, indicating better text generation capabilities.
The next interesting question is: why does multi-token prediction work?
The researchers offer some insightful explanations for why multi-token prediction works so well. One key idea is that it mitigates the distributional discrepancy between training-time teacher forcing (where the model is always conditioned on the ground-truth prefix) and inference-time autoregressive generation (where the model must condition on its own previously generated tokens).
Additionally, multi-token prediction implicitly assigns higher weights to tokens that represent “choice points” – decisions that significantly impact the remainder of the text. By reinforcing these critical decision points during training, the model learns to make better choices, leading to more coherent and useful text generations. Furthermore, an information-theoretic analysis suggests that multi-token prediction encourages the model to focus on predicting highly relevant tokens for the subsequent text, potentially capturing longer-term dependencies more effectively.
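The identity behind that information-theoretic argument is simple to state. Writing X for the next token and Y for the token after it, the chain rule of entropy gives:

```latex
H(X) = H(X \mid Y) + I(X;Y), \qquad H(Y) = H(Y \mid X) + I(X;Y)
\;\Rightarrow\; H(X) + H(Y) = H(X \mid Y) + 2\, I(X;Y) + H(Y \mid X)
```

A two-token loss at a given position targets H(X) + H(Y), which places double weight on the mutual information I(X;Y), i.e., on how strongly the immediate next token constrains what follows; this is our paraphrase of the paper’s argument.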
While the results are promising, the researchers acknowledge that there is still room for improvement. One area for future exploration is automatically determining the optimal value of n (the number of future tokens to predict) based on the task and data distribution. Additionally, they suggest that adjusting the vocabulary size and exploring alternative auxiliary prediction losses could lead to even better trade-offs between compressed sequence length and computational efficiency. Overall, this research opens up exciting avenues for enhancing language models’ capabilities, paving the way for more powerful and efficient natural language processing systems.
Check out the Paper. All credit for this research goes to the researchers of this project.