TECH NEWS – Jensen Huang’s company is already pushing as much as 3,500 tokens per second per GPU out of the Chinese AI lab’s new 1.6-trillion-parameter model.
DeepSeek V4 has arrived, bringing major optimizations and model sizes of up to 1.6 trillion parameters, and Nvidia is already offering Day-0 support for it on Blackwell GPUs using NVFP4. The updated AI model uses only 27% of the inference FLOPs per token and just 10% of the KV cache when operating with a one-million-token context window. Two variants have been introduced: a Pro model with 1.6 trillion parameters and a Flash version with 284 billion parameters. Nvidia says Blackwell GPUs provide both the scale and the low-latency performance required for the long-context, one-million-token inference and trillion-parameter models that V4 enables.
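To get a feel for why a 10% KV-cache footprint matters at a one-million-token context, here is a back-of-the-envelope sizing sketch. All architectural numbers below (layer count, KV heads, head dimension, bytes per element) are hypothetical placeholders for illustration, not published DeepSeek V4 specifications:

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Per token, each layer stores one key and one value vector per KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token

# Hypothetical example dimensions (NOT DeepSeek V4's real architecture).
baseline = kv_cache_bytes(
    context_len=1_000_000, n_layers=61, n_kv_heads=128, head_dim=128, bytes_per_elem=2
)
reduced = baseline * 0.10  # the "just 10% of the KV cache" figure from the article

print(f"baseline KV cache: {baseline / 2**30:.1f} GiB")
print(f"at 10%:            {reduced / 2**30:.1f} GiB")
```

Even with modest made-up dimensions, a full-precision, full-size KV cache at a million tokens runs to terabytes; cutting it to a tenth is the difference between a deployment fitting on a rack of GPUs or not.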
“From Nvidia Blackwell data center deployments to managed NIM microservices and fine-tuning workflows, Nvidia offers multiple ways to integrate DeepSeek and other open models across different stages of development and deployment. Nvidia is an active contributor to the open source ecosystem and has released hundreds of projects under open source licenses. Nvidia remains committed to optimizing community software, and open models allow users to share their work on AI safety and resilience far more broadly,” Nvidia wrote.
Nvidia is showing throughput of nearly 3,500 tokens per second per GPU, specifically on the GB300 (Blackwell Ultra), and these are preliminary figures that are expected to rise further as the software layer receives additional optimization. The Nvidia Blackwell stack includes a wide range of technologies built specifically for models like V4, including NVFP4, Dynamo, optimized CUDA kernels, advanced parallelization methods, and more. One of the key elements of DeepSeek V4 is its use of FP4 (MXFP4) quantization to accelerate rollouts and inference runs. With FP4 in play, V4 models reduce both memory traffic and sampling latency.
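The idea behind MX-style FP4 quantization can be sketched in a few lines: values are grouped into blocks (commonly 32), each block shares one power-of-two scale, and each value is rounded to the nearest entry of the 4-bit E2M1 grid. This is an illustrative toy in NumPy, not Nvidia's NVFP4 implementation or DeepSeek's actual kernels:

```python
import numpy as np

# Non-negative magnitudes representable by the 4-bit E2M1 format.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
# Full signed grid (negatives mirrored, zero not duplicated).
GRID = np.concatenate([-E2M1[:0:-1], E2M1])

def quantize_mxfp4(x, block=32):
    """Toy MX-style block quantizer: shared power-of-two scale per block."""
    xb = x.reshape(-1, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    # Choose the scale so the block max lands within the grid's ±6 range.
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-12) / 6.0))
    # Round each scaled value to the nearest representable FP4 grid point.
    idx = np.abs(xb[:, :, None] / scale[:, :, None] - GRID).argmin(axis=-1)
    return (GRID[idx] * scale).reshape(x.shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
wq = quantize_mxfp4(w)

print("mean |error|:", np.abs(w - wq).mean())
print("FP16 payload bytes:", w.size * 2)
print("MXFP4 payload bytes (4-bit values + one scale byte per block):",
      w.size // 2 + w.size // 32)
```

The payload shrinks roughly 4x versus FP16, which is where the reduced memory traffic comes from: weights and activations moved per token are a quarter the size, and Blackwell's tensor cores can consume the 4-bit operands directly.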
It is also worth noting that Huawei’s latest Ascend chips, the Ascend 950PR and Ascend 950DT, both planned for 2026, support MXFP4 instructions as well. That strongly suggests DeepSeek V4 will also be fully compatible with China’s domestic AI chips. Thanks to Nvidia’s ongoing optimizations, future models may end up enjoying a robust ecosystem of support from the very first day.