
As AI models evolve and adoption grows, enterprises must perform a delicate balancing act to achieve maximum value.
That's because inference - the process of running data through a model to get an output - offers a different computational challenge than training a model.
Pretraining a model - the process of ingesting data, breaking it down into tokens and finding patterns - is essentially a one-time cost. But in inference, every prompt to a model generates tokens, each of which incurs a cost.
That means that as AI model performance and use increase, so do the number of tokens generated and their associated computational costs. For companies looking to build AI capabilities, the key is generating as many tokens as possible - with maximum speed, accuracy and quality of service - without sending computational costs skyrocketing.
As such, the AI ecosystem has been working to make inference cheaper and more efficient. Inference costs have been trending down for the past year thanks to major leaps in model optimization, leading to increasingly advanced, energy-efficient accelerated computing infrastructure and full-stack solutions.
According to the Stanford University Institute for Human-Centered AI's 2025 AI Index Report, the inference cost for a system performing at the level of GPT-3.5 dropped over 280-fold between November 2022 and October 2024. At the hardware level, costs have declined by 30% annually, while energy efficiency has improved by 40% each year. Open-weight models are also closing the gap with closed models, reducing the performance difference from 8% to just 1.7% on some benchmarks in a single year. Together, these trends are rapidly lowering the barriers to advanced AI.
As models evolve, generate more demand and create more tokens, enterprises need to scale their accelerated computing resources to deliver the next generation of AI reasoning tools, or risk rising costs and energy consumption.
What follows is a primer on the economics of inference. By understanding its core concepts, enterprises can position themselves to achieve efficient, cost-effective and profitable AI solutions at scale.
Key Terminology for the Economics of AI Inference
Knowing key terms of the economics of inference helps set the foundation for understanding its importance.
Tokens are the fundamental unit of data in an AI model. They're derived from data during training as text, images, audio clips and videos. Through a process called tokenization, each piece of data is broken down into smaller constituent units. During training, the model learns the relationships between tokens so it can perform inference and generate an accurate, relevant output.
Throughput refers to the amount of data - typically measured in tokens - that the model can output in a specific amount of time, which itself is a function of the infrastructure running the model. Throughput is often measured in tokens per second, with higher throughput meaning greater return on infrastructure.
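As a rough illustration of why higher throughput means greater return on infrastructure, the sketch below computes tokens per second and then the infrastructure cost of producing one million tokens. The function names, the hourly cost and the token counts are hypothetical values chosen for illustration, not figures from any real deployment.

```python
def throughput_tokens_per_second(output_tokens: int, elapsed_seconds: float) -> float:
    """Tokens generated per second over a measurement window."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return output_tokens / elapsed_seconds


def cost_per_million_tokens(tokens_per_second: float, hourly_cost_usd: float) -> float:
    """Infrastructure cost of producing one million tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


# Hypothetical numbers: 12,000 tokens in 60 seconds on a system costing $3.60 per hour.
tps = throughput_tokens_per_second(12_000, 60.0)
usd_per_million = cost_per_million_tokens(tps, 3.60)
print(f"{tps:.0f} tokens/s, ${usd_per_million:.2f} per million tokens")
```

Under this simple model, doubling throughput at the same hourly cost halves the cost per million tokens.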
Latency is a measure of the amount of time between inputting a prompt and the start of the model's response. Lower latency means faster responses. The two main ways of measuring latency are:
Time to First Token: A measurement of the initial processing time required by the model to generate its first output token after a user prompt.
Time per Output Token: The average time between consecutive tokens - or the time it takes to generate a completion token for each user querying the model at the same time. It's also known as inter-token latency or token-to-token latency.
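The two latency measurements above can be sketched from per-token timestamps for a single request. This is a minimal sketch that assumes the serving system records when the prompt was sent and when each output token arrived; the function names and sample timestamps are hypothetical.

```python
from typing import Sequence


def time_to_first_token(request_time: float, token_times: Sequence[float]) -> float:
    """Seconds between sending the prompt and receiving the first output token."""
    if not token_times:
        raise ValueError("at least one output token is required")
    return token_times[0] - request_time


def time_per_output_token(token_times: Sequence[float]) -> float:
    """Average gap in seconds between consecutive output tokens."""
    if len(token_times) < 2:
        raise ValueError("at least two output tokens are required")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Hypothetical request: prompt sent at t=0.0 s, four tokens arrive afterwards.
arrivals = [0.5, 0.75, 1.0, 1.25]
ttft = time_to_first_token(0.0, arrivals)
tpot = time_per_output_token(arrivals)
print(f"TTFT={ttft:.2f}s TPOT={tpot:.2f}s")
```

Note that time per output token is computed only over the gaps after the first token, so the initial processing time does not inflate it.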
Time to first token and time per output token are helpful benchmarks, but they're just two pieces of a larger equation. Focusing solely on them can still lead to a deterioration of performance or cost.
To account for other interdependencies, IT leaders are starting to measure goodput, which is defined as the throughput achieved by a system while maintaining target time to first token and time per output token levels. This metric allows organizations to evaluate performance in a more holistic manner, ensuring that throughput, latency and cost are aligned to support both operational efficiency and an exceptional user experience.
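One way to make goodput concrete is to count only the tokens from requests that met both latency targets. The sketch below takes that interpretation, which is an assumption: real benchmarking tools may define goodput per request rather than per token, and the class name, thresholds and sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestStats:
    ttft_s: float        # time to first token, in seconds
    tpot_s: float        # time per output token, in seconds
    output_tokens: int   # tokens generated for this request


def goodput_tokens_per_second(
    requests: Iterable[RequestStats],
    window_s: float,
    max_ttft_s: float,
    max_tpot_s: float,
) -> float:
    """Throughput counting only requests that met both latency targets."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    good_tokens = sum(
        r.output_tokens
        for r in requests
        if r.ttft_s <= max_ttft_s and r.tpot_s <= max_tpot_s
    )
    return good_tokens / window_s


# Hypothetical 10-second window with four completed requests.
window = [
    RequestStats(0.2, 0.03, 100),  # meets both targets
    RequestStats(0.9, 0.03, 200),  # first token too slow
    RequestStats(0.3, 0.08, 300),  # tokens stream too slowly
    RequestStats(0.4, 0.05, 100),  # meets both targets
]
raw_throughput = sum(r.output_tokens for r in window) / 10.0
goodput = goodput_tokens_per_second(window, 10.0, max_ttft_s=0.5, max_tpot_s=0.05)
print(raw_throughput, goodput)
```

In this made-up window, raw throughput is 70 tokens per second, but goodput is only 20, because most of the tokens came from requests that missed a latency target.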
Energy efficiency is the measure of how effectively an AI system converts power into computational output, expressed as performance per watt. By using accelerated computing platforms, organizations can maximize tokens per watt while minimizing energy consumption.
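Performance per watt can be expressed for inference as token throughput divided by average power draw. The sketch below assumes both values have already been measured over the same window; the function names and numbers are hypothetical.

```python
def tokens_per_second_per_watt(tokens_per_second: float, average_power_watts: float) -> float:
    """Token throughput delivered per watt of average power draw."""
    if average_power_watts <= 0:
        raise ValueError("average_power_watts must be positive")
    return tokens_per_second / average_power_watts


def joules_per_token(tokens_per_second: float, average_power_watts: float) -> float:
    """Energy spent per generated token (1 watt = 1 joule per second)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return average_power_watts / tokens_per_second


# Hypothetical system: 2,000 tokens per second at an average draw of 500 watts.
efficiency = tokens_per_second_per_watt(2000.0, 500.0)
energy = joules_per_token(2000.0, 500.0)
print(efficiency, energy)
```

The two figures are reciprocals of each other: raising tokens per watt lowers the energy spent on each token.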
How the Scaling Laws Apply to Inference Cost
The three AI scaling laws are also core to understanding the economics of inference:
Pretraining scaling: The original scaling law that demonstrated that by increasing training dataset size, model parameter count and computational resources, models can achieve predictable improvements in intelligence and accuracy.
Post-training: A process where models are fine-tuned for accuracy and specificity so they can be applied to application development. Techniques like retrieval-augmented generation can be used to return more relevant answers from an enterprise database.
Test-time scaling (aka long thinking or reasoning): A technique by which models allocate additional computational resources during inference to evaluate multiple possible outcomes before arriving at the best answer.
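To see why test-time scaling affects inference cost, consider best-of-N sampling, one simple form of evaluating multiple possible outcomes. The sketch below is illustrative only: the `generate` and `score` callables are hypothetical stand-ins for a real model and a real answer scorer, and production reasoning models use more sophisticated approaches.

```python
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str, int], List[str]],
    score: Callable[[List[str]], float],
    n: int,
) -> Tuple[List[str], int]:
    """Generate n candidate answers, keep the best one and count all tokens produced."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt, seed) for seed in range(n)]
    total_tokens = sum(len(candidate) for candidate in candidates)
    best = max(candidates, key=score)
    return best, total_tokens


# Toy stand-ins (assumptions): each candidate is a list of tokens and longer
# candidates score higher. A real system would call a model and a verifier here.
def toy_generate(prompt: str, seed: int) -> List[str]:
    return ["step"] * (seed + 1) + ["answer"]


def toy_score(tokens: List[str]) -> float:
    return float(len(tokens))


_, single_tokens = best_of_n("2 + 2?", toy_generate, toy_score, n=1)
best, scaled_tokens = best_of_n("2 + 2?", toy_generate, toy_score, n=4)
print(single_tokens, scaled_tokens)
```

Only one answer is returned to the user, but every token in every candidate had to be generated, which is why this kind of reasoning raises the compute needed per query.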
As AI evolves and post-training and test-time scaling techniques become more sophisticated, pretraining isn't disappearing; it remains an important way to scale models. Pretraining will still be needed to support post-training and test-time scaling.
Profitable AI Takes a Full-Stack Approach
In comparison to inference from a model that's only gone through pretraining and post-training, models that harness test-time scaling generate multiple tokens to solve a complex problem. This results in more accurate and relevant model outputs - but