TECH NEWS – The GeForce RTX 5090 for gamers and the RTX Pro 6000 for workstations are affected by a bug that appears when the cards are passed through to virtual machines.
CloudRift, a GPU cloud for developers, was the first to report crash issues with Nvidia’s high-end graphics cards. After a few days of VM use, the cards become completely unresponsive, and they can only be accessed again after the host node is rebooted. The issue reportedly affects only the RTX 5090 and RTX Pro 6000, not the RTX 4090, the Hopper-based H100, or the Blackwell-based B200.
The bug occurs when a GPU is assigned to a VM using the VFIO device driver. After a Function Level Reset (FLR), the GPU stops responding, which results in a kernel “soft lockup” that brings both the host and the guest environments to a standstill. Recovering from this state requires rebooting the host machine, which is painful for CloudRift given the number of guest machines running on each host.
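For readers who run GPU passthrough themselves, the following is a minimal sketch of how one might inspect, on a Linux host, which driver a PCI device is bound to and which reset methods the kernel advertises for it. It only reads the standard sysfs attributes `driver`, `reset`, and `reset_method` (the latter exists on newer kernels only) and never triggers a reset. The function name `describe_reset_support` and the PCI address `0000:01:00.0` are illustrative assumptions, not something taken from the CloudRift or Proxmox reports.

```python
from pathlib import Path

# Standard location of PCI devices in Linux sysfs.
SYSFS_PCI_DEVICES = Path("/sys/bus/pci/devices")


def describe_reset_support(bdf: str, sysfs_root: Path = SYSFS_PCI_DEVICES) -> dict:
    """Read-only summary of a PCI device's driver binding and reset support.

    `bdf` is the PCI address (domain:bus:device.function), e.g. "0000:01:00.0".
    Nothing is written to sysfs, so no reset is ever triggered.
    """
    dev = sysfs_root / bdf
    driver_link = dev / "driver"        # symlink to the bound driver, if any
    reset_method = dev / "reset_method"  # space-separated list, newer kernels only

    return {
        "device": bdf,
        "present": dev.is_dir(),
        # For a passthrough GPU this is typically "vfio-pci".
        "driver": driver_link.resolve().name if driver_link.exists() else None,
        # The "reset" attribute exists only if the kernel knows a way to reset
        # this function.
        "can_reset": (dev / "reset").exists(),
        # Example contents: "flr bus". Empty list if the file is absent.
        "reset_methods": reset_method.read_text().split() if reset_method.exists() else [],
    }


if __name__ == "__main__":
    # Hypothetical address; replace it with the GPU's address from `lspci -D`.
    for key, value in describe_reset_support("0000:01:00.0").items():
        print(f"{key}: {value}")
```

If `flr` appears among the reset methods and the driver is `vfio-pci`, the device is in the kind of configuration described above. This check says nothing about whether a given card will actually hang after a reset; it only shows how the host is set up.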
This problem is not limited to CloudRift. A Proxmox user reported a similar issue in which the host machine crashed completely after a Windows guest was shut down. Interestingly, he claims that Nvidia responded to his report, reproduced the problem, and is working on a fix. For now, the problem appears to be specific to the Blackwell-based RTX 5090 and RTX Pro 6000.
CloudRift has offered a $1,000 bug bounty to anyone who can fix or mitigate the problem. We expect Nvidia to release a fix soon because the problem affects critical AI workloads. Nvidia can hardly be blamed for the bug itself, since issues like this occasionally occur. What matters is that it is corrected as quickly as possible, because those who purchase high-end GPUs expect top quality and stability.
However, it should be noted that Nvidia has struggled with driver stability over the past year (and we have repeatedly reported on the company's issues).
Source: WCCFTech, CloudRift, Proxmox




