
What's the ROI? Getting the Most Out of LLM Inference

09/10/2024

Large language models and the applications they power enable unprecedented opportunities for organizations to get deeper insights from their data reservoirs and to build entirely new classes of applications.

But with opportunities often come challenges.

Both on premises and in the cloud, applications that are expected to run in real time place significant demands on data center infrastructure to simultaneously deliver high throughput and low latency with one platform investment.

To drive continuous performance improvements and improve the return on infrastructure investments, NVIDIA regularly optimizes the state-of-the-art community models, including Meta's Llama, Google's Gemma, Microsoft's Phi and our own NVLM-D-72B, released just a few weeks ago.

Relentless Improvements

Performance improvements let our customers and partners serve more complex models and reduce the needed infrastructure to host them. NVIDIA optimizes performance at every layer of the technology stack, including TensorRT-LLM, a purpose-built library to deliver state-of-the-art performance on the latest LLMs. With improvements to the open-source Llama 70B model, which delivers very high accuracy, we've already improved minimum latency performance by 3.5x in less than a year.

We're constantly improving our platform performance and regularly publish performance updates. Each week, improvements to NVIDIA software libraries are published, allowing customers to get more from the very same GPUs. For example, in just a few months' time, we've improved our low-latency Llama 70B performance by 3.5x.

NVIDIA has increased performance on the Llama 70B model by 3.5x.

In the most recent round of MLPerf Inference 4.1, we made our first-ever submission with the Blackwell platform. It delivered 4x more performance than the previous generation.

This submission was also the first-ever MLPerf submission to use FP4 precision. Narrower precision formats, like FP4, reduce memory footprint and memory traffic, and also boost computational throughput. The process takes advantage of Blackwell's second-generation Transformer Engine, and with advanced quantization techniques that are part of TensorRT Model Optimizer, the Blackwell submission met the strict accuracy targets of the MLPerf benchmark.
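The memory-footprint point can be made concrete with back-of-the-envelope arithmetic. The sketch below is illustrative only: it assumes a nominal 70 billion parameters, counts weight storage alone, and ignores the KV cache, activations, and the scaling factors that real quantized formats carry. The function name is hypothetical and not part of any NVIDIA library.

```python
# Illustrative arithmetic only: weight storage for a dense model at several precisions.
# Assumption: weights dominate; KV cache, activations, and quantization
# scaling-factor overhead are deliberately left out.

BITS_PER_PARAM = {"FP16": 16, "FP8": 8, "FP4": 4}


def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    params_70b = 70e9  # nominal parameter count for a "70B" model (assumption)
    for name, bits in BITS_PER_PARAM.items():
        print(f"{name}: {weight_footprint_gb(params_70b, bits):.0f} GB")
```

Under these assumptions, halving the bits per parameter halves the bytes that must be stored and streamed from memory for each generated token, which is where the reduction in memory traffic comes from.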

Blackwell B200 delivers up to 4x more performance versus previous generation on MLPerf Inference v4.1's Llama 2 70B workload.

Improvements in Blackwell haven't stopped the continued acceleration of Hopper. In the last year, Hopper performance has increased 3.4x in MLPerf on H100 thanks to regular software advancements. This means that NVIDIA's peak performance today, on Blackwell, is 10x faster than it was just one year ago on Hopper.

These results track progress on the MLPerf Inference Llama 2 70B Offline scenario over the past year.

Our ongoing work is incorporated into TensorRT-LLM, a purpose-built library for accelerating LLMs that contains state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM is built on top of the TensorRT Deep Learning Inference library and leverages many of TensorRT's deep learning optimizations, adding LLM-specific improvements.

Improving Llama in Leaps and Bounds

More recently, we've continued optimizing variants of Meta's Llama models, including versions 3.1 and 3.2, as well as the 70B model size and the largest model, 405B. These optimizations include custom quantization recipes and parallelization techniques that split the model more efficiently across multiple GPUs, leveraging NVIDIA NVLink and NVSwitch interconnect technologies. Cutting-edge LLMs like Llama 3.1 405B are very demanding and require the combined performance of multiple state-of-the-art GPUs for fast responses.

Parallelism techniques require a hardware platform with a robust GPU-to-GPU interconnect fabric to get maximum performance and avoid communication bottlenecks. Each NVIDIA H200 Tensor Core GPU features fourth-generation NVLink, which provides a whopping 900GB/s of GPU-to-GPU bandwidth. Every eight-GPU HGX H200 platform also ships with four NVLink Switches, enabling every H200 GPU to communicate with any other H200 GPU at 900GB/s, simultaneously.

Many LLM deployments use parallelism rather than keeping the workload on a single GPU, which can become a compute bottleneck. These deployments seek to balance low latency and high throughput, with the optimal parallelization technique depending on application requirements.

For instance, if lowest latency is the priority, tensor parallelism is critical, as the combined compute performance of multiple GPUs can be used to serve tokens to users more quickly. However, for use cases where peak throughput across all users is prioritized, pipeline parallelism can efficiently boost overall server throughput.
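To make the distinction concrete, here is a toy sketch in pure Python in which plain lists stand in for GPU shards. It is not TensorRT-LLM code, and every function name is hypothetical. Tensor parallelism splits one layer's weight matrix so that all devices work on the same token at once; pipeline parallelism assigns whole groups of layers to different devices.

```python
# Toy illustration in pure Python: lists stand in for GPU shards.
# Not TensorRT-LLM code; all function names here are hypothetical.

def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split the columns of w into `parts` equal shards, one per simulated GPU."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must divide evenly across shards"
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def tensor_parallel_matmul(x, w, parts):
    """Each shard computes a slice of the output columns; slices are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


def pipeline_parallel_forward(x, layers, stages):
    """Assign contiguous groups of layers to stages and run the stages in order."""
    per_stage = len(layers) // stages
    assert per_stage * stages == len(layers), "layers must divide evenly across stages"
    for s in range(stages):
        for w in layers[s * per_stage:(s + 1) * per_stage]:
            x = matmul(x, w)  # on real hardware, the activation moves to the next GPU between stages
    return x


if __name__ == "__main__":
    x = [[1, 2], [3, 4]]
    w = [[1, 0, 2, 1], [0, 1, 1, 3]]
    assert tensor_parallel_matmul(x, w, parts=2) == matmul(x, w)
    print(tensor_parallel_matmul(x, w, parts=2))
```

In the tensor-parallel case every shard contributes to each output, so per-token latency can drop but the shards must exchange results at every layer. In the pipeline case each stage only hands off activations at stage boundaries; a real system would keep all stages busy by feeding several micro-batches through at once, which this sequential toy does not model.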

The table below shows that tensor parallelism can deliver over 5x more throughput in minimum latency scenarios, whereas pipeline parallelism brings 50% more performance for maximum throughput use cases.

For production deployments that seek to maximize throughput within a given latency budget, a platform needs to provide the ability to effectively combine both techniques, as TensorRT-LLM does.
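As a rough illustration of what combining the two techniques means, the sketch below enumerates the ways a fixed GPU count can be factored into a tensor-parallel degree and a pipeline-parallel degree. The function name is hypothetical, and the sketch says nothing about which layout is fastest; that depends on the model, the interconnect, and the latency budget.

```python
# Hypothetical helper: enumerate (tensor_parallel, pipeline_parallel) layouts
# whose product equals the number of GPUs available.

def parallel_layouts(num_gpus: int):
    """List every (tensor_parallel, pipeline_parallel) pair whose product is num_gpus."""
    return [(tp, num_gpus // tp) for tp in range(1, num_gpus + 1) if num_gpus % tp == 0]


if __name__ == "__main__":
    # Eight GPUs, matching the eight-GPU HGX platform described earlier.
    print(parallel_layouts(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```

The two extremes, (8, 1) and (1, 8), correspond to pure tensor parallelism and pure pipeline parallelism; the layouts in between trade some latency for throughput, or the reverse.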

Read the technical blog on boosting Llama 3.1 405B throughput to learn more about these techniques.

Different scenarios have different requirements, and parallelism techniques bring optimal performance for each of these scenarios.

The Virtuous Cycle

Over the lifecycle of our architectures, we deliver significant performance gains from ongoing software tuning and optimization. These improvements translate into additional value for customers who train and deploy on our platforms. They're able to create more capable models and applications and deploy their existing models using less infrastructure, enhancing th
LINK: https://blogs.nvidia.com/blog/llm-inference-roi/...