Training language models to predict multiple tokens at once results in better sample efficiency, say researchers at Meta.
Large language models like Llama and ChatGPT are usually trained with next-token prediction, but this new approach can achieve better performance.
What is the single-token prediction technique?
The multi-token prediction technique provides a significant edge in some scenarios, running generative tasks up to three times faster, but it is not a one-size-fits-all solution for every type of model. The technique still has considerable room for improvement, yet for some LLM applications it could become a robust tool.
For a clearer understanding: the traditional LLM training process uses an approach called “next-token prediction,” in which the model predicts only the single next token in a given sequence.
In an automated process, the predicted token is appended to the input and the procedure is repeated over the entire text, so that the model learns common patterns and develops the ability to produce logical, consistent output.
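As a rough illustration (hypothetical code, not from the paper), next-token training can be expressed as a cross-entropy loss between each position's prediction and the token that actually follows it:

```python
import torch.nn.functional as F

def next_token_loss(model, tokens):
    """tokens: LongTensor (batch, seq_len); model returns (batch, seq, vocab) logits."""
    inputs = tokens[:, :-1]    # everything except the last token
    targets = tokens[:, 1:]    # each position's label is the token that follows it
    logits = model(inputs)     # (batch, seq_len - 1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```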
This technique has drawbacks: by processing only one token at a time, the model becomes overly focused on local patterns in the text and misses predictions that can only be made through longer-range reasoning.
Another problem is that it requires huge amounts of training data for the model to reach the fluent command of language that humans acquire from far less text.
Multi-token prediction enables 3x speed
In the new multi-token approach proposed by Meta, the LLM is trained to predict several future tokens from each position at the same time. The researchers used a simple prediction architecture for multi-token prediction that adds no extra training time or memory overhead.
The researchers used the same Transformer architecture already employed by most LLMs, but modified it to accommodate multi-token prediction: the single output head is replaced by multiple heads, one allocated to each future token.
In this way, the model still relies on the same basic next-token prediction strategy when drawing conclusions and making predictions, but by utilizing multiple heads it can speed up the process. As the research study puts it,
“While cost-free and simple, multi-token prediction is an effective modification to train stronger and faster transformer models.”
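To make the architecture concrete, here is a minimal sketch of the multi-head idea, with hypothetical names; it uses plain linear heads on a shared trunk for brevity, where the paper describes richer per-head layers, so it illustrates the general scheme rather than Meta's actual implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenModel(nn.Module):
    """Hypothetical sketch: a shared trunk with n independent output heads."""
    def __init__(self, trunk, d_model, vocab_size, n_future=4):
        super().__init__()
        self.trunk = trunk  # shared causal transformer body
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future))

    def forward(self, tokens):
        hidden = self.trunk(tokens)  # (batch, seq, d_model)
        # Head i predicts the token (i + 1) steps ahead of each position.
        return [head(hidden) for head in self.heads]

def multi_token_loss(model, tokens):
    """tokens: LongTensor (batch, seq_len); sums the per-head losses."""
    losses = []
    for i, logits in enumerate(model(tokens)):
        shift = i + 1                  # head i targets token t + i + 1
        pred = logits[:, :-shift]      # keep positions that have a valid target
        target = tokens[:, shift:]
        losses.append(F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1)))
    return sum(losses)
```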
During the study, the researchers found that the technique produced subpar results on smaller models, but performance surpassed strong baselines when they applied the same process to larger models, and the gains kept growing with model size. As the study writes,
“The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points.”
The researchers also said that the multi-token prediction technique makes models up to three times faster at generating output, a benefit that comes at little or no extra cost.
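The speedup comes from using the extra heads at generation time to draft several tokens in one forward pass and then verify them, a form of speculative decoding. The following is a simplified greedy sketch of that idea, reusing the hypothetical MultiTokenModel interface from the earlier snippet; it is an assumption-laden illustration of the mechanism, not the paper's exact procedure:

```python
import torch

@torch.no_grad()
def speculative_step(model, tokens):
    """tokens: LongTensor (1, seq_len); appends 1..n_future verified tokens."""
    device = tokens.device
    # Draft pass: every head predicts greedily from the last position.
    draft = [lg[0, -1].argmax().item() for lg in model(tokens)]

    # Verify pass: run the model over the drafted continuation and keep
    # draft tokens while they match the next-token head's greedy choice.
    candidate = torch.cat(
        [tokens, torch.tensor([draft[:-1]], dtype=torch.long, device=device)],
        dim=1)
    verify = model(candidate)[0]   # next-token head only
    accepted = [draft[0]]          # head 0's draft is exact greedy decoding
    last = tokens.size(1) - 1
    for i in range(1, len(draft)):
        if verify[0, last + i].argmax().item() == draft[i]:
            accepted.append(draft[i])
        else:
            break                  # first mismatch invalidates the rest
    return torch.cat(
        [tokens, torch.tensor([accepted], dtype=torch.long, device=device)],
        dim=1)
```

Because the first drafted token is always exactly what sequential greedy decoding would produce, each step emits at least one token; every additional accepted draft is a forward pass saved.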