Together AI Achieves Breakthrough Inference Speed with NVIDIA’s Blackwell GPUs

Lawrence Jengar
Jul 18, 2025 08:45

Together AI unveils the world’s fastest inference for the DeepSeek-R1-0528 model using NVIDIA HGX B200, enhancing AI capabilities for real-world applications.

Together AI has announced a significant advancement in AI performance by offering the fastest inference for the DeepSeek-R1-0528 model, utilizing an inference engine designed for the NVIDIA HGX B200 platform. This development positions Together AI as a leading platform for running open-source reasoning models at scale, according to together.ai.

NVIDIA Blackwell Integration

Earlier this year, Together AI invited select customers, including major corporations like Zoom and Salesforce, to test NVIDIA Blackwell GPUs on its GPU Clusters. The results have led to a broader rollout of NVIDIA Blackwell support, unlocking enhanced performance for AI applications. As of July 17, 2025, the company claims to have achieved the fastest serverless inference performance for DeepSeek-R1 using this technology.

Technological Advancements

The new inference engine optimizes every layer of the stack, combining bespoke GPU kernels with proprietary runtime optimizations. These innovations aim to boost speed and efficiency without compromising model quality. The stack includes state-of-the-art speculative decoding methods and advanced model-optimization techniques.

Performance Metrics

Together AI’s inference stack achieves up to 334 tokens per second, outperforming previous benchmarks. This performance is facilitated by the integration of NVIDIA’s fifth-generation Tensor Cores and the ThunderKittens framework, which Together AI uses to develop optimized GPU kernels.
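To put the headline number in perspective, a decode rate is easy to translate into end-to-end streaming latency. The 334 tokens-per-second figure comes from the article; the response lengths below are illustrative assumptions, not published benchmarks.

```python
# Rough latency implied by a fixed decode rate of ~334 tokens/s.
# (The rate is the article's figure; token counts are illustrative.)

def generation_time(num_tokens, tokens_per_second=334):
    """Seconds to stream num_tokens at a fixed decode rate."""
    return num_tokens / tokens_per_second

# A 1,000-token reasoning trace streams in roughly 3 seconds.
print(round(generation_time(1000), 2))  # ~2.99 s
```

In practice, time-to-first-token and variable batch load also affect perceived latency, so this is a lower-bound estimate rather than a benchmark.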

Speculative Decoding and Quantization

Speculative decoding accelerates large language models by using a smaller, faster speculator model to propose multiple tokens ahead, which the full target model then verifies in a single parallel pass. Together AI’s Turbo Speculator outperforms existing speculators by maintaining high target-speculator alignment across a wide range of scenarios. Additionally, Together AI has developed a lossless quantization technique that preserves model accuracy while reducing computational overhead.

Real-World Application

The enhancements are designed to support a range of AI workloads, offering flexible infrastructure options for both inference and training. Dedicated Endpoints provide additional optimization, delivering substantial speed improvements while maintaining quality and performance standards.

As the AI landscape continues to evolve, Together AI’s collaboration with NVIDIA and its innovative approach to inference engine development positions it as a formidable player in the race for AI supremacy.

Image source: Shutterstock


