Training generative AI models, in particular frontier large language models (LLMs) and multimodal models, requires vast amounts of computation, typically carried out on massive clusters of Graphics Processing Units (GPUs) or other specialized AI chips. We have also seen the emergence of certain scaling laws showing how much compute and data are required for optimal training at a given model size, with the general pattern being up and to the right for data, number of parameters, and compute. Regulators have expressed interest in applying additional measures to these especially capable foundation models, and in an attempt to demarcate such models, they have focused on the amount of compute used to train them. Recent AI regulation and legislation reflect a trend toward compute thresholds for particularly capable AI systems: for instance, the EU AI Act presumes that a general-purpose AI model has high-impact capabilities (and thus systemic risk) if it was trained using more than 10^25 FLOPs of compute, while Biden's Executive Order sets reporting requirements around models trained using more than 10^26 FLOPs.
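As a rough illustration of how these scaling relationships translate into training compute, the short sketch below uses the widely cited approximation that training compute is about 6 × parameters × tokens, together with a Chinchilla-style heuristic of roughly 20 training tokens per parameter. The model sizes and constants are illustrative assumptions, not figures drawn from any regulation or specific model.

```python
# Rough, illustrative scaling-law arithmetic; not figures from any regulation.
# Assumes the common approximation C ~= 6 * N * D training FLOPs and a
# Chinchilla-style heuristic of roughly 20 training tokens per parameter.

def training_flops(num_params: float, num_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6 * num_params * num_tokens

for n_params in (7e9, 70e9, 400e9):      # 7B, 70B, and 400B parameter models
    n_tokens = 20 * n_params             # "compute-optimal" token count (heuristic)
    flops = training_flops(n_params, n_tokens)
    print(f"{n_params / 1e9:.0f}B params, {n_tokens / 1e12:.1f}T tokens -> ~{flops:.1e} FLOPs")
```

Under these assumptions, a roughly 400-billion-parameter model trained "compute-optimally" lands in the neighborhood of the 10^25 FLOPs figure discussed below, which gives some intuition for why regulators chose thresholds in this range.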
More Compute, More Money
But what do these compute thresholds mean in practice? Jack Clark, co-founder and director of policy at Anthropic, provided some back-of-the-napkin math suggesting that a model trained with 10^25 FLOPs of compute would cost $7–10M to train, whereas a model trained with 10^26 FLOPs would cost $70–100M. This comports with California’s proposed SB 1047, which provides that a covered model is either one trained with 10^26 FLOPs of compute or one that cost more than $100M to train. The regulatory and legislative trend thus appears to be that larger models trained with sizable compute are more likely to be subject to regulation and additional compliance requirements; however, there is some misalignment on exactly how much compute should trigger those obligations. Both the EU AI Act and Biden’s Executive Order leave room for updating these figures as the state of the art for foundation models advances.
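To see where estimates in this ballpark come from, the sketch below converts total training FLOPs into GPU-hours and dollars. The per-GPU throughput, utilization rate, and hourly price are assumptions chosen only to land in roughly the same range as Clark's figures; they are not numbers from his estimate, the Executive Order, or SB 1047.

```python
# Back-of-the-napkin training-cost estimate. The hardware and pricing figures
# below are illustrative assumptions, not numbers from Jack Clark or SB 1047.

PEAK_FLOPS_PER_GPU = 1e15    # ~1 PFLOP/s peak for a modern AI accelerator (assumed)
UTILIZATION = 0.5            # fraction of peak throughput actually achieved (assumed)
PRICE_PER_GPU_HOUR = 1.50    # assumed cost per GPU-hour in USD

def training_cost_usd(total_flops: float) -> float:
    """Convert total training FLOPs into a rough dollar cost."""
    flops_per_gpu_hour = PEAK_FLOPS_PER_GPU * UTILIZATION * 3600
    gpu_hours = total_flops / flops_per_gpu_hour
    return gpu_hours * PRICE_PER_GPU_HOUR

for threshold in (1e25, 1e26):
    print(f"{threshold:.0e} FLOPs -> roughly ${training_cost_usd(threshold) / 1e6:.0f}M")
```

With these assumed inputs, 10^25 FLOPs comes out to roughly $8M and 10^26 FLOPs to roughly $83M; different hardware, utilization, and pricing assumptions would shift these figures considerably.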
Technological Developments May Complicate These Thresholds
However, there have also been some interesting technological developments that might cause trouble for these regulatory thresholds. For example, researchers from Stanford, Duke, the University of Chicago, and Together.ai recently released their work on a mixture-of-agents (MoA) approach, which takes advantage of the property of collaborativeness that has been observed across LLMs. Collaborativeness is essentially the phenomenon whereby LLMs perform better when they are provided not just with the initial prompt, but also with the output of another LLM, even if the model providing that output is less capable. The researchers pieced together six openly accessible LLMs via MoA and achieved superior performance to GPT-4o on AlpacaEval 2.0,1 at a lower inference cost. This is noteworthy because merely by stitching together various smaller, non-frontier, openly accessible LLMs, they were able to create an MoA model that outperforms frontier LLMs, despite each constituent model having been trained with less compute than the 10^26 FLOPs threshold. It is unclear how the compute thresholds of current AI regulations would handle the MoA architecture. Determining a model’s training compute may be made even more challenging by the technique of model merging, which can use evolutionary algorithms to splice together certain aspects of different LLMs, resulting in a model that is greater (viz. higher performing) than the sum of its parts.
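The MoA pattern itself is simple to sketch in code. The outline below layers several "proposer" models and a final "aggregator"; the chat() helper and the model names are placeholders standing in for whichever openly accessible LLM APIs one might use, and this is a simplified illustration of the general idea rather than the researchers' implementation.

```python
# Simplified mixture-of-agents (MoA) sketch. The model names are hypothetical,
# and chat() is a stand-in for a call to any LLM API.

PROPOSERS = ["open-llm-a", "open-llm-b", "open-llm-c"]
AGGREGATOR = "open-llm-aggregator"

def chat(model: str, prompt: str) -> str:
    # Placeholder: swap in a real LLM client call here.
    return f"[{model}] draft answer to: {prompt[:40]}..."

def mixture_of_agents(user_prompt: str, num_layers: int = 2) -> str:
    responses: list[str] = []
    for _ in range(num_layers):
        # Each layer sees the user prompt plus the previous layer's outputs,
        # exploiting the "collaborativeness" effect described above.
        context = "\n\n".join(responses)
        layer_prompt = f"{user_prompt}\n\nPrevious responses:\n{context}" if context else user_prompt
        responses = [chat(model, layer_prompt) for model in PROPOSERS]
    # A final aggregator model synthesizes the last layer's responses.
    synthesis_prompt = f"{user_prompt}\n\nSynthesize a final answer from:\n" + "\n\n".join(responses)
    return chat(AGGREGATOR, synthesis_prompt)

print(mixture_of_agents("Explain compute thresholds in AI regulation."))
```

The point for regulators is that no single component in such a pipeline needs to cross a training-compute threshold, even though the combined system may rival models that do.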
As another example of a recent technological development that could confound the regulatory compute thresholds, researchers recently published a paper detailing a novel Transformer architecture that replaces the extremely compute-hungry operation of matrix multiplication with cheaper addition-based operations. This resulted in up to a 10x reduction in memory usage on the GPUs used for training, while still maintaining comparable model performance. The researchers successfully scaled this technique up to a ~3 billion parameter model, comparable in size to Microsoft’s Phi-3 and Apple’s recently revealed on-device model. Such a technique could dramatically reduce training costs for frontier LLMs if it can be scaled up a few more orders of magnitude, which remains an important open question. It also appears more readily applicable than alternative low-power chip designs, although we may see hardware breakthroughs as well.
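To give a flavor of how a dense layer can sidestep multiplications, the sketch below uses ternary weights, with each weight constrained to -1, 0, or +1 plus a single scaling factor, so that the usual matrix multiplication collapses into additions and subtractions. This is a simplified illustration of the general concept, not the architecture from the paper.

```python
import numpy as np

# Illustration of ternary weights: each weight is constrained to {-1, 0, +1},
# so a dense layer reduces to additions/subtractions plus one per-layer scale.
# This sketches the general idea only; it is not the paper's architecture.

def ternarize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a weight matrix to {-1, 0, +1} with one scaling factor."""
    scale = np.mean(np.abs(w)) + 1e-8
    w_ternary = np.clip(np.round(w / scale), -1, 1)
    return w_ternary, scale

def ternary_linear(x: np.ndarray, w_ternary: np.ndarray, scale: float) -> np.ndarray:
    # Because the entries are only -1, 0, or +1, this product could be computed
    # purely with additions and subtractions; NumPy's @ is used here for clarity.
    return scale * (x @ w_ternary)

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 4))
x = rng.normal(size=(2, 16))
w_t, s = ternarize(w)
print(ternary_linear(x, w_t, s).shape)   # (2, 4)
```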
Finally, there are other technological methods that are not necessarily new, but might still provide an end run around these regulatory compute thresholds. For instance, model distillation is a process in which a larger, more sophisticated “teacher” model is used to train a smaller “student” model: the teacher is essentially distilled down into a smaller model while sacrificing only a small amount of performance. This could potentially be used to avoid the EU AI Act’s compute threshold for models with high-impact capabilities, because the EU AI Act only regulates AI systems placed on the market or put into service in the EU. A company might, for example, use more than 10^25 FLOPs to train a teacher model that is never marketed or used in the EU. That teacher model could then be used to train a smaller student model that is nearly as capable as the teacher but, owing to its smaller size, is trained using less than 10^25 FLOPs. However, such a strategy may not be effective, as the EU AI Act notes that in addition to compute, there may be other means of classifying AI systems as possessing high-impact capabilities, such as “on the basis of appropriate technical tools and methodologies, including indicators and benchmarks” (Article 51(1)(a)).
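For readers unfamiliar with the mechanics, distillation typically trains the student to match the teacher's temperature-softened output distribution in addition to the ordinary training labels. The sketch below shows that standard loss; the temperature, mixing weight, and array shapes are illustrative assumptions.

```python
import numpy as np

# Standard knowledge-distillation loss sketch: the student is trained to match
# the teacher's temperature-softened output distribution as well as the true
# labels. Temperature and mixing weight are illustrative choices.

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5) -> float:
    # Soft targets: KL divergence between teacher and student at temperature T.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = np.sum(p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12)), axis=-1)
    # Hard targets: ordinary cross-entropy against the true labels.
    p = softmax(student_logits)
    ce = -np.log(p[np.arange(len(labels)), labels] + 1e-12)
    # Blend the two objectives (T**2 rescales the soft-target contribution).
    return float(np.mean(alpha * (temperature ** 2) * kl + (1 - alpha) * ce))

rng = np.random.default_rng(0)
student = rng.normal(size=(4, 10))
teacher = rng.normal(size=(4, 10))
labels = np.array([1, 3, 5, 7])
print(distillation_loss(student, teacher, labels))
```

The key regulatory point is that only the teacher's training run crosses the compute threshold; the student's own training compute can be far smaller.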
Not Just Models, But Clusters Too
It should also be noted that current AI regulations impose obligations not only on models that trigger certain compute thresholds, but also on certain computing clusters themselves. Biden’s Executive Order imposes reporting obligations with respect to single data centers with networking of over 100 gigabits per second and a theoretical maximum computing capacity of 10^20 operations per second. In fact, the 10^26 FLOPs threshold for AI models mentioned above is only triggered if that amount of compute is used within such a single data center (Section 4.2(c)(iii)). However, we are seeing various advances in training AI models using distributed compute, such as Google’s DiPaCo or Together.ai’s work on decentralized, heterogeneous, and lower-bandwidth interconnected compute. Such developments might permit the training of frontier AI models without triggering certain regulatory requirements regarding computing clusters.
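For a sense of the scale involved, the quick calculation below estimates how many accelerators a cluster would need before hitting the Executive Order's capacity threshold. The per-accelerator throughput is an assumption loosely based on current datacenter GPUs, not a figure from the Order itself.

```python
# Rough scale check for the Executive Order's computing-cluster threshold.
# The per-accelerator peak throughput is an illustrative assumption.

CLUSTER_THRESHOLD_OPS_PER_SEC = 1e20   # theoretical maximum capacity in the EO
ASSUMED_PEAK_PER_ACCELERATOR = 1e15    # ~1 PFLOP/s per modern datacenter GPU (assumed)

accelerators_needed = CLUSTER_THRESHOLD_OPS_PER_SEC / ASSUMED_PEAK_PER_ACCELERATOR
print(f"~{accelerators_needed:,.0f} accelerators to reach the threshold")  # ~100,000
```

Distributed training techniques matter precisely because they could spread that hardware across many facilities, none of which individually meets the single-data-center criterion.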
However, such thresholds may change in future AI legislative efforts. For example, California’s proposed SB 1047 uses the same gigabit-per-second and FLOP thresholds as the Executive Order, but does not require that those thresholds be met by a single data center. And as mentioned above, even if certain regulatory acts become outdated, they often leave themselves the ability to update as needed; for instance, Article 51(3) of the EU AI Act states:
“The Commission shall adopt delegated acts in accordance with Article 97 to amend the [compute] thresholds … as well as to supplement benchmarks and indicators in light of evolving technological developments, such as algorithmic improvements or increased hardware efficiency, when necessary, for these thresholds to reflect the state of the art.”
Hence careful attention will need to be paid both to (1) the rapidly emerging technological innovations around computing clusters, training techniques, and model architectures that affect the compute required to train frontier AI models, and (2) the evolving regulatory efforts attempting to set computing cluster and training compute thresholds for such models.
1 For those curious about the AlpacaEval 2.0 evaluation mentioned above, it is an automated evaluator that closely approximates (viz., with a correlation of 0.98) the ChatBot Arena. ChatBot Arena is a leaderboard in which a human provides a prompt, is shown responses from two unidentified LLMs, and selects which response they prefer. These preferences are used to calculate an Elo score to rank the models. Elo scores have perhaps most famously been used to rank chess players, but can theoretically be used to rank anything, as the website Elo Everything demonstrates (currently topping its leaderboard: the universe, water, knowledge, information, and love). Tough competition.
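For the curious, the basic Elo update behind such leaderboards is compact enough to show here; the K-factor is an arbitrary illustrative choice, and real leaderboards use more refined rating schemes built on the same idea.

```python
# Minimal Elo update, the building block behind preference leaderboards like
# ChatBot Arena (which uses a more refined scheme). K is an arbitrary choice.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player/model A beats B, given their current ratings."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return rating_a + k * (s_a - e_a), rating_b + k * ((1 - s_a) - (1 - e_a))

print(update_elo(1200, 1000, a_won=False))  # an upset win moves both ratings sharply
```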