Intelligence Bandwidth
A metric to track the exponential growth of AI
August 2025
Abstract
Traditional benchmarks for artificial intelligence (AI) have predominantly focused on the quality and accuracy of outputs, largely overlooking the critical dimension of processing speed. This omission is particularly problematic, as many real-world AI applications, from autonomous driving to customer service, are time-sensitive. Existing speed metrics such as Time to First Token (TTFT) and Tokens Per Second (TPS) are insufficient: they count tokens regardless of their usefulness and are ill-suited to a multi-modal future. This paper introduces "Intelligence Bandwidth" as a new metric to measure the processing speed of AI, defined as the amount of useful information an AI can produce per unit of time. Several methods for its approximation are proposed, with a focus on measuring raw output bits per second for its simplicity and modality-agnostic nature. By analyzing historically significant generative AI models, a clear trend of exponential growth is observed. From this data, "Jin's law" is formulated, positing that the intelligence bandwidth of the best publicly available AI model doubles approximately every year. This law provides a predictive framework for the evolution of human-AI interaction, forecasting the near-term integration of real-time image generation into text-based conversations and the advent of real-time video interaction within the next three years.
Introduction
Processing speed has long been recognized as an important metric in human cognitive ability1 and intelligence scale measurement2,3,4. It reflects the fundamental capacity to perform daily tasks as a human, including reading comprehension, communication, and driving. Both the quality of work and the time required are essential to its evaluation.
To measure the intelligence scale of artificial intelligence (AI), various benchmark datasets have been designed, analogous to tests for humans. For example, the MMLU dataset5 consists of multiple-choice questions testing multitask language understanding, the GPQA dataset6 consists of graduate-level questions on a variety of subjects, and the MATH dataset7 comprises competition math problems.
However, when experiments are conducted to test AI on these datasets, unlike tests for humans, only the quality of work is emphasized, while the time aspect of the test is typically ignored. Such an approach does not satisfy the requirements of AI in real-world applications.
From a pragmatic perspective, many tasks are time-sensitive, such as autonomous driving and customer service. In these applications, the output of intelligence is required within a limited time window. Even in the early days of deep learning, there was an implicit assumption of a time limit. When AlphaGo8 defeated the best human Go player in 2016, it adhered to the same rules, including time constraints, as the human player.
Implications of test-time scaling
With the advent of the test-time scaling law9, the scores achieved on these AI benchmark datasets alone can no longer fully measure intelligence.
According to the test-time scaling law, a small model can solve harder problems by spending more tokens "thinking," while a larger model can solve the same problem with fewer tokens. By analogy to a human IQ test, it is as if one person spent more time and used a long sheet of scratch paper to finish the test, while another person used very little time and no scratch paper at all. If both achieve the same number of correct answers, it is more rational to conclude that the person who completed the test faster possesses higher intelligence, likely employing a more efficient method.
That smaller and larger models can answer the same question correctly does not imply they have the same intelligence level; rather, they approached the problem differently. Therefore, without any constraint on the output—whether in terms of the number of tokens or output speed—it is impossible to measure the actual intelligence level of AI solely based on correctness.
From a user's perspective, estimating the time required for AI to complete tasks has become increasingly challenging. Prior to the advent of test-time scaling, models typically used a similar number of tokens to solve each problem. With test-time scaling, models are incentivized to generate more tokens for certain problems, leading to greater variability. Users now have access to metrics such as tokens per second for the APIs they utilize, but lack information about the total number of tokens needed for a given task. As a result, they can no longer reliably estimate the time required for task completion.
Why speed was ignored
If processing speed is so important to intelligence, why was it historically ignored? The reasons are twofold. First, AI was not sufficiently advanced. Early systems could not pass the Turing test10, solve math or coding problems, or perform in-depth reading or writing. Given that AI was in a primitive stage, the focus was on enabling AI to perform new tasks rather than on the speed of task completion. Second, AI applications were all task-specific. The AI for X-ray image processing11 was entirely different from the AI used for recommendation systems12. The applicability of AI was measured case by case by application builders, without the need for a unified processing speed metric.
However, with the advent of large language models (LLMs)13, AI has become substantially more capable and generalizable, with applications spanning from medical diagnosis14 to recommendation systems15. Consequently, it is now both rational and timely to establish a unified metric for AI processing speed.
A mental shift for evaluating AI
Additionally, there has been a significant shift in how application builders interface with models as summarized in Table 1. Initially, developers focused on training application-specific deep learning models, beginning with the advent of AlexNet16. With the emergence of ChatGPT13, the paradigm shifted toward pretraining large language models (LLMs). The public release of Llama17 enabled post-training and fine-tuning of open models. As the capabilities of open models advanced, such as the Qwen series18, it became increasingly feasible to serve an open-weight model "as is" without additional fine-tuning. More recently, the introduction of DeepSeek19, which substantially reduced token costs, has made it more cost-effective to utilize hosted APIs from major AI service providers rather than maintaining proprietary infrastructure.
Table 1. Paradigm shifts of AI usages
| Paradigms | Defining Moments |
| --- | --- |
| Deep Learning | AlexNet |
| Pretraining LLMs | ChatGPT |
| Open Model Fine-tuning | Llama |
| Open Model Serving | Qwen |
| Hosted APIs | DeepSeek |
Therefore, when evaluating the intelligence of AI, a shift is required from evaluating static models, pure mathematical constructs defined by neural architectures and parameters, to evaluating hosted AI services or APIs. It is necessary to assess their processing speed for applicability to time-sensitive tasks.
Related work
There are established metrics to evaluate the speed of LLMs, such as time to first token (TTFT) and tokens per second (TPS)20. These are suitable metrics for serving language models, but there are two major limitations when considering them as general metrics for AI processing speed.
First, unlike the time limit of an IQ test, which bounds the production of useful answers, these metrics count tokens rather than the actual useful information produced by intelligence. For example, a simple program without any intelligence can output random tokens at a very high speed.
Second, they are not designed for a multi-modal future of AI. Although the term multimedia may seem outdated, it is highly relevant to contemporary AI development. LLMs have largely pushed human-computer interaction back to the pre-multimedia era. On modern social media, users read articles with images and watch long, short, or livestream videos. In contrast, current AI interactions are predominantly text-based, reminiscent of the early 1990s internet.
With the emergence of multi-modal AI technologies, such as image generation21,22,23 and video generation24,25,26, it is anticipated that the future of human-AI interaction27 and even AI-AI interaction28 will be multi-modal.
What is needed is a metric that quantifies the rate at which useful information is produced, rather than merely counting tokens, and that remains applicable across diverse modalities to ensure future relevance.
Intelligence bandwidth
This paper introduces the concept of intelligence bandwidth as a metric for the processing speed of AI.
Definition 1. Intelligence bandwidth is a measurement indicating the maximum amount of useful information that a served artificial intelligence model can produce in a given amount of time, formally expressed as:

$$B = \frac{U}{T},$$

where $U$ denotes the amount of useful information produced and $T$ denotes the time taken to produce it.
It is important to note that intelligence bandwidth primarily measures the output speed of AI, rather than input, for two reasons. First, the main impact of intelligence bandwidth is in human-AI interaction, which will be discussed further in the next section. Second, AI's processing speed is mainly bounded by output, not input. Thus, it is more meaningful to track AI progress with a metric that measures its bottleneck.
Approximated measurements
Despite the formal definition of intelligence bandwidth, certain ambiguities remain in practical measurement. First, "useful information" is not an unambiguous term. Defining the usefulness of an AI model remains a challenging task. Second, time is also subject to multiple interpretations in computer science.
Regarding time measurement, the two popular choices are CPU time and wall time. Wall time is preferred, as intelligence bandwidth is primarily relevant for human-AI interactions and applications. End-to-end latency is more informative than technical details such as CPU time.
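As a minimal illustration of the distinction, the sketch below times a model call with both clocks; `query_model` is a hypothetical callable standing in for a hosted-API request, not part of any specific SDK.

```python
import time

def measure_latency(query_model, prompt):
    """Time one model call with wall time (what the user experiences)
    and CPU time (local compute only, excluding network and server-side waits)."""
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # CPU time of this process
    output = query_model(prompt)       # hypothetical hosted-API call
    wall_elapsed = time.perf_counter() - wall_start
    cpu_elapsed = time.process_time() - cpu_start
    return output, wall_elapsed, cpu_elapsed
```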
Regarding useful information, the evaluation is based solely on the interaction between the user and AI. The downstream use of the output is not considered. For example, if a user asks "what is the square of 3?" and uses the answer in a math competition to win a $10k bonus, the usefulness is measured by the quality of the AI's answer, not the subsequent economic gain.
To address the ambiguity in "useful information" in Definition 1, three approximate methods for calculating intelligence bandwidth are proposed. These methods are all approximations, as there is no universally agreed-upon way to measure useful information.
Method 1: Benchmark score divided by time
The first method is to divide the score achieved by AI on benchmark datasets by the time required to produce the answer. In this approximation, usefulness is measured by the score on benchmarking datasets, formally expressed as follows.
Let $s$ denote the score achieved by the AI on a benchmark dataset and $t$ denote the total wall time used to produce the answers. The intelligence bandwidth is then approximated as $B \approx s / t$.
For an AI application developer with a time-bounded task, the tokens/second metric provides limited insight, as the number of tokens required for the task is unknown. However, this approximation of intelligence bandwidth offers a more informative measure of how much useful work can be accomplished per second.
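A minimal sketch of this approximation, assuming a hypothetical `query_model` callable, a list of (question, reference) pairs from some benchmark, and a `score_fn` grading function; none of these names correspond to a specific dataset or API.

```python
import time

def benchmark_bandwidth(query_model, questions, score_fn):
    """Approximate intelligence bandwidth as benchmark score per second (Method 1)."""
    total_score = 0.0
    total_wall_time = 0.0
    for question, reference in questions:
        start = time.perf_counter()
        answer = query_model(question)               # hypothetical hosted-API call
        total_wall_time += time.perf_counter() - start
        total_score += score_fn(answer, reference)   # e.g., 1.0 if correct else 0.0
    return total_score / total_wall_time             # score points per second
```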
Method 2: The information theory approach
The second method is based on information theory. This approach explores whether the foundational framework of information theory29 can be used to measure the amount of information output by an AI model.
Under this approach, the self-information of an output token $x_i$ generated with probability $p(x_i)$ is $-\log_2 p(x_i)$ bits, and the intelligence bandwidth is approximated as $B \approx \frac{\sum_i -\log_2 p(x_i)}{t}$, where $t$ is the wall time used to generate the output.
There are two limitations to this approximation:
- It requires access to the probabilities output by the large language model. The value of $p(x_i)$ is contained in the output probability vector.
- The amount of information does not necessarily equate to the amount of useful information. For example, a simple GPT-230 model can output text rapidly but with limited usefulness.
Therefore, this method is less widely applicable or accurate than the first method.
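For concreteness, a minimal sketch of the information-theoretic approximation, assuming per-token probabilities are available (for example, derived from log-probabilities returned by a serving API); the function and variable names are illustrative.

```python
import math

def information_bandwidth(token_probs, wall_time_seconds):
    """Approximate intelligence bandwidth as Shannon information per second (Method 2).

    token_probs: probability the model assigned to each token it actually emitted.
    The self-information of a token with probability p is -log2(p) bits.
    """
    total_bits = sum(-math.log2(p) for p in token_probs)
    return total_bits / wall_time_seconds   # bits per second
```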
Method 3: Raw output bits
The third approximate method is to measure the number of bits in the raw outputs of the AI model. With this method, intelligence bandwidth is measured in bits per second.
The advantage of this method is its simplicity. The number of bits in any text, image, or video output by an AI model can be easily computed without using any benchmark dataset or accessing the output probability vector.
Another advantage is the absence of ambiguity, such as the selection of benchmark datasets in the first method or the computation of probabilities in the second method.
The primary limitation of this approach is that it diverges from measuring the actual usefulness of the information output by the AI model. However, if the models measured are constrained to the best available in the market, the useful information per bit should be similar among them. Therefore, this approximation is valid in a constrained environment.
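A minimal sketch of this approximation, again assuming a hypothetical `query_model` callable that returns either text or an encoded image/video file; UTF-8 byte length is used for text.

```python
import time

def raw_output_bandwidth(query_model, prompt):
    """Approximate intelligence bandwidth as raw output kilobytes per second (Method 3)."""
    start = time.perf_counter()
    output = query_model(prompt)                    # hypothetical call; returns str or bytes
    elapsed = time.perf_counter() - start
    if isinstance(output, str):
        num_bytes = len(output.encode("utf-8"))     # text: UTF-8 byte length
    else:
        num_bytes = len(output)                     # image/video: encoded file size in bytes
    return num_bytes / 1024 / elapsed               # KB per second
```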
Impact on human-AI interaction
Effective metrics can help reveal new laws to predict the future, analogous to feature width and Moore's law31, or internet bandwidth and Nielsen's law32. Examining the history of the internet through the lens of network bandwidth, as bandwidth increased, richer content formats emerged33, shifting from text-based websites like early Twitter34 to video-based platforms like YouTube35. With a simple metric such as network bandwidth, it was possible to predict the popularity of certain applications. Similarly, a robust metric for AI processing speed can facilitate the discovery of macro-level trends in AI development.
Intelligence bandwidth, as a metric, tracks the progress of human-AI interaction. Current human-AI interactions are still predominantly text-based. This is possible because the output speed of AI has exceeded human reading speed, which is approximately 238 words per minute36: state-of-the-art serving technologies can generate 14,000 words per minute. Similarly, speech generation speed is far beyond human listening speed37.
It is helpful to clarify the conditions under which real-time human-AI interaction becomes possible in a given modality. For self-paced media formats such as text and images, individuals typically consume content at their maximum perceptual speed. Once the intelligence bandwidth exceeds the human perceptual threshold, real-time interaction in that modality becomes feasible. In contrast, for fixed-speed media formats such as audio and video, users generally adhere to the inherent playback speed. Thus, as long as the AI generation speed surpasses the fixed playback rate of these media formats, real-time interaction in those modalities is achievable.
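As a rough back-of-the-envelope check of these thresholds, the sketch below converts the reading and generation rates quoted above into KB/s, assuming roughly five characters per word plus a space at about one byte per character; the word-length assumption is illustrative rather than taken from the cited studies.

```python
BYTES_PER_WORD = 6   # assumed: ~5 characters per word plus a space, ~1 byte per character

def words_per_minute_to_kbps(wpm):
    """Convert a words-per-minute rate into kilobytes per second."""
    return wpm * BYTES_PER_WORD / 60 / 1024

print(words_per_minute_to_kbps(238))     # human reading speed -> ~0.02 KB/s
print(words_per_minute_to_kbps(14_000))  # fast LLM serving    -> ~1.4 KB/s
```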
Research in multi-modal AI continues to address the bottlenecks of other modalities. With the increase of the intelligence bandwidth of image generation, visual illustrations will become integrated into AI responses. AI may also be able to perform visual reasoning akin to humans on a whiteboard, and iteratively refine graphical designs as designers do on paper. As the intelligence bandwidth of leading AI models continues to increase, it is expected that video illustrations and real-time generated environment interactions, such as those demonstrated by Genie 338, will become feasible. In a speculative future, AI could generate entire worlds that users can interact with in real time—an idealistic scenario enabled by the cognitive capabilities of AI.
Achieving such increases in intelligence bandwidth requires not only advances in AI models as static collections of neural architectures and parameters, but also significant improvements in AI hardware and machine learning systems. In this vision, hardware, software, and models are no longer orthogonal, but are deeply integrated to enable superintelligence.
Experiments
In this section, all historically significant generative AI models are measured for their intelligence bandwidth and plotted in a single figure. The models covered include large language models, image generators, and video generators. The raw output bits method is used due to its ease of measurement and minimal ambiguity. Most of the data presented in this section is collected from Artificial Analysis39.
The experimental results are shown in Figure 1, where the X-axis is the release date of the models, and the Y-axis is the intelligence bandwidth of the models measured in kilobytes per second. The modality of the models is indicated by different colors as shown in the legend.
Figure 1: Intelligence bandwidth (KB/s) over time.
Key observations from the experimental results include:
- Most language models are between 0 KB/s and 3 KB/s.
- Image generators exhibit an exponential growth rate.
- The video generator, Veo325, currently exhibits an even lower intelligence bandwidth than the state-of-the-art image generators. This is primarily attributable to less mature serving technologies for video models compared to those for large language models and image generators. As serving efficiency for video generators improves, substantial growth in their intelligence bandwidth is anticipated in the near future.
- The Gemini 2.5 Flash40 image generator is an outlier, primarily because it is optimized for low latency and usability rather than best quality and fidelity.
Jin's law
As noted above, a robust metric for AI processing speed can facilitate the discovery of macro-level trends in AI development. The validity of intelligence bandwidth as such a metric is now assessed by examining whether it supports the formulation of a predictive law for future AI growth.
The dotted curve in Figure 1 represents the estimated growth of intelligence bandwidth. The prediction of the growth rate is based primarily on Imagen 4, the state-of-the-art high-quality image generator, rather than a model balanced between speed and quality, such as Gemini 2.5 Flash40. The growth rate of intelligence bandwidth is summarized in a simple law named after the author's surname, presented as follows.
Jin's law: The intelligence bandwidth (KB/s) of the best hosted AI model available to the public doubles every year.
The formal definition of this law is as follows. Let $B(t)$ denote the intelligence bandwidth, in KB/s, of the best hosted AI model available to the public at time $t$, measured in years. Then

$$B(t) = B(t_0) \cdot 2^{(t - t_0)/T},$$

where $t_0$ is a reference point in time and $T \approx 1$ year is the doubling period.
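A minimal sketch of applying the law for projection; the doubling period is taken as exactly one year, and the starting bandwidth in the example is a placeholder to be read from Figure 1 rather than a measured value.

```python
def projected_bandwidth(b0_kbps, years_elapsed, doubling_period_years=1.0):
    """Project intelligence bandwidth under Jin's law: doubling every T years."""
    return b0_kbps * 2 ** (years_elapsed / doubling_period_years)

# Example with a placeholder starting point: a frontier model at 3 KB/s today
# would be projected to roughly 24 KB/s three years later.
print(projected_bandwidth(3.0, years_elapsed=3))   # -> 24.0
```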
In Jin's law, intelligence bandwidth is defined by the modality exhibiting the highest KB/s measurement. Currently, image generators are at the forefront of this growth. As advancements in models and serving technologies for image generation reach a plateau, it is anticipated that video generators will become the primary drivers of further increases in intelligence bandwidth.
Based on Jin's law, two predictions about near-term human-AI interaction are made as examples of how the law can be used to anticipate the exponential growth of AI:
- Images will soon be used in AI interactions. The latency of the Gemini 2.5 Flash40 image generator is lower than that of large language model responses. Consequently, large language models may soon incorporate images to provide enhanced illustrations in their outputs. Currently, it takes only 4.6 seconds to generate an image. If this speed doubles within a year, it is likely that applications will emerge in which images become a primary mode of interaction and illustration.
- Real-time video interaction will be widely available in three years. Given the intelligence bandwidth of models in 2025, it is currently possible to generate 8 seconds of video in 50 to 60 seconds. If generation speed increases by a factor of 7 to 8, real-time video generation will become feasible. Under Jin's law, achieving an 8-fold increase corresponds to approximately $\log_2 8 = 3$ years, as worked through in the sketch following this list. While some uncertainty remains in this prediction, video generators are presently below the projected growth curve and possess significant potential for accelerated improvement as serving technologies advance.
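The sketch below works through the arithmetic behind both predictions, using the 4.6-second image latency and the 8-seconds-of-video-in-50-to-60-seconds figures from the text above and assuming a doubling period of exactly one year.

```python
import math

DOUBLING_PERIOD_YEARS = 1.0   # assumed exact, per Jin's law

# Prediction 1: image generation latency after one more doubling.
image_latency_now = 4.6                        # seconds per image (from the text)
image_latency_next_year = image_latency_now / 2
print(image_latency_next_year)                 # -> 2.3 seconds

# Prediction 2: years until video generation reaches real time.
video_seconds_generated = 8                    # seconds of video produced
wall_time_needed = 55                          # midpoint of 50-60 seconds of wall time
speedup_required = wall_time_needed / video_seconds_generated     # ~6.9x
years_to_real_time = math.log2(speedup_required) * DOUBLING_PERIOD_YEARS
print(round(years_to_real_time, 1))            # -> ~2.8 years, i.e. roughly 3
```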
There are many other implications that can be derived from Jin's law. It is hoped that this law will guide AI application developers in identifying optimal time windows to bring products to market, and policymakers in enacting regulations at appropriate times to maximize development while minimizing harm.
Limitations
There are several limitations to this work.
First, the measurement of usefulness remains an open challenge. This paper proposes three straightforward approaches to approximate usefulness in AI outputs; however, more rigorous and comprehensive methods are needed.
Second, the accuracy of the estimated doubling period is limited, as it is extrapolated from relatively few data points; the rate may require revision as more models are released.
Third, the exponential growth described by Jin's law represents an idealized scenario. In practice, growth may be constrained by factors such as energy supply limitations or economic pressures, particularly if the AI sector experiences a market correction.
Conclusions
The paper introduces the concept of "Intelligence Bandwidth," a metric that measures the processing speed of served AI models by quantifying the rate at which they produce useful information. The paper observes an exponential growth trend in this metric across various modalities and formulates "Jin's law," which states that the intelligence bandwidth of the best publicly available AI doubles annually. This law provides a predictive framework, forecasting the imminent integration of real-time images into human-AI text interactions and the widespread availability of real-time video generation within three years. The paper advocates for a new evaluation paradigm that focuses on dynamic, served AI systems, emphasizing the crucial interplay between hardware, software, and models.
Ultimately, the most important takeaway is that when evaluating AI for roles traditionally performed by humans, the evaluation criteria should mirror those used for humans. The distinction between assessing human and AI performance for a given task will become increasingly blurred. Innovative approaches are required to integrate processing speed into comprehensive intelligence assessments for AI.
Acknowledgments
Completing this work required stepping outside established research areas, a challenge that would not have been possible without the encouragement and support of many individuals. Sincere gratitude is extended to everyone who provided guidance and encouragement during this process. Your belief in this project and in the ability to see it through was invaluable.
References
- Woodcock, R. W., & others. (1989). Woodcock-Johnson tests of cognitive ability. DLM Teaching Resources.
- Fry, A. F., & others. (1996). Processing speed, working memory, and fluid intelligence: Evidence for a developmental cascade. Psychological Science.
- Lichtenberger, E. O., & others. (2012). Essentials of WAIS-IV assessment. John Wiley & Sons.
- Flanagan, D. P., & others. (2017). Essentials of WISC-V assessment. John Wiley & Sons.
- Hendrycks, D., & others. (2021). Measuring Massive Multitask Language Understanding. ICLR.
- Rein, D., & others. (2024). GPQA: A Graduate-Level Google-Proof Q&A Benchmark. COLM.
- Hendrycks, D., & others. (2021). Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS.
- Silver, D., & others. (2016). Mastering the game of Go with deep neural networks and tree search. Nature.
- Snell, C., & others. (2025). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. ICLR.
- Turing, A. M. (2009). Computing machinery and intelligence. Springer.
- Çallı, E., & others. (2021). Deep learning for chest X-ray analysis: A survey. Medical Image Analysis.
- He, X., & others. (2017). Neural collaborative filtering. The Web Conference.
- Ullah, E., & others. (2024). Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology–a recent scoping review. Diagnostic Pathology.
- Wu, X., & others. (2024). Could small language models serve as recommenders? Towards data-centric cold-start recommendation. The Web Conference.
- Krizhevsky, A., & others. (2012). Imagenet classification with deep convolutional neural networks. NeurIPS.
- Touvron, H., & others. (2023). Llama: Open and efficient foundation language models. arXiv Preprint arXiv:2302.13971.
- Bai, J., & others. (2023). Qwen technical report. arXiv Preprint arXiv:2309.16609.
- Guo, D., & others. (2025). DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv Preprint arXiv:2501.12948.
- Zhong, Y., & others. (2024). DistServe: disaggregating prefill and decoding for goodput-optimized large language model serving. OSDI.
- Kingma, D. P., & others. (2014). Auto-encoding variational Bayes. ICLR.
- Goodfellow, I. J., & others. (2014). Generative adversarial nets. NeurIPS.
- Rombach, R., & others. (2022). High-resolution image synthesis with latent diffusion models. CVPR.
- Amershi, S., & others. (2019). Guidelines for human-AI interaction. CHI.
- Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal.
- Radford, A., & others. (2019). Language models are unsupervised multitask learners. OpenAI Blog.
- Moore, G. E., & others. (1965). Cramming more components onto integrated circuits.
- Coffman, K. G., & others. (2002). Internet growth: Is there a “Moore’s Law” for data traffic? Handbook of Massive Data Sets.
- Murthy, D. (2018). Twitter. Polity Press Cambridge.
- Snickars, P., & Vonderau, P. (2009). The YouTube reader. Kungliga biblioteket.
- Brysbaert, M. (2019). How many words do we read per minute? A review and meta-analysis of reading rate. Journal of Memory and Language.
- Kuperman, V., & others. (2021). A lingering question addressed: Reading rate and most efficient listening rate are highly similar. Journal of Experimental Psychology: Human Perception and Performance.
@misc{jin2025intelligence,
title={Intelligence Bandwidth},
author={Jin, Haifeng},
journal={Haifeng Jin's Blog},
url={https://haifengjin.com/intelligence-bandwidth/},
year={2025},
note={}
}