TECH NEWS – xAI’s Grok 4 appears to be tuned for success on AI performance tests, but it struggles with dynamic, strategic challenges. Grok 4 recently placed fifth in the multi-agent Step Race benchmark, which evaluates how AI models handle collaboration and deception under pressure. Even Gemini 2.5 Flash performed better than Grok 4!
Given Grok 4’s high scores on various standardized benchmarks, one might suspect that the model was overfitted to perform well on those benchmarks. Overfitting occurs when a model memorizes its training data instead of learning the underlying patterns in the dataset, so it scores well on familiar tests but generalizes poorly to new tasks.
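To make the overfitting idea concrete, here is a minimal toy sketch in Python. It says nothing about how Grok 4 was actually trained or evaluated; the data, the noise values, and the helper names (fit_memorizer, fit_line, mse) are all hypothetical and chosen only for illustration. One model memorizes its training points, the other fits a simple trend, and both are then scored on inputs they have not seen.

```python
from statistics import mean

def fit_memorizer(xs, ys):
    """Hypothetical 'overfit' model: a lookup table of the training data."""
    table = dict(zip(xs, ys))
    fallback = mean(ys)  # the best it can do on anything it has not seen
    return lambda x: table.get(x, fallback)

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (captures the underlying trend)."""
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Made-up data: the true pattern is y = 2x + 1, with small fixed noise on the training set.
noise = [0.3, -0.2, 0.1, -0.3, 0.2, -0.1]
train_x = [0, 1, 2, 3, 4, 5]
train_y = [2 * x + 1 + n for x, n in zip(train_x, noise)]
test_x = [6, 7, 8, 9]                    # inputs neither model has seen
test_y = [2 * x + 1 for x in test_x]

memorizer = fit_memorizer(train_x, train_y)
line = fit_line(train_x, train_y)

print("train error:", mse(memorizer, train_x, train_y), mse(line, train_x, train_y))
print("test error: ", mse(memorizer, test_x, test_y), mse(line, test_x, test_y))
```

In this sketch the memorizer is perfect on the data it has already seen and far worse than the fitted line on new inputs, which is the gap between benchmark results and real-world performance that the overfitting suspicion refers to.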
This does not mean, however, that Grok 4 is not a useful model. After all, its reasoning capabilities have improved markedly. According to some users, it outperforms almost all other models at identifying coding errors and bugs. People are also using the large language model (LLM) to generate game code and then continuing development in Cursor. However, the model is still not as capable as Elon Musk would have us believe. It’s worth looking at the betting platform Kalshi, where Grok 4 has attracted only moderate stakes so far.
Grok 4 takes fifth place on the Multi-Agent Step Race Benchmark: Collaboration and Deception Under Pressure (TrueSkill score: 7.9). o3 remains in first place with 9.4. pic.twitter.com/mmGmWM23h1
— Lech Mazur (@LechMazur) July 12, 2025
More info about this benchmark: https://t.co/fMT0EYLHu0 https://t.co/T0VrBzLwIc
My benchmarks so far show very solid improvements in reasoning (see the NYT Connections results) but little improvement in other areas (see the Creative Writing results). More are in progress. pic.twitter.com/rHRnqmAzsX
— Lech Mazur (@LechMazur) July 13, 2025
Meanwhile, the Financial Times recently reported that xAI, the parent company of X (formerly Twitter), is targeting a $200 billion valuation in an upcoming funding round. Notably, xAI raised $300 million in June through a secondary equity offering and an additional $10 billion in early July. SpaceX is reportedly investing $2 billion in xAI as part of a recent $5 billion funding round. (How is it legal for Musk to invest in himself anyway?)
Finally, it seems that Elon Musk is paving the way for Tesla to take a stake in xAI, putting an end to the “hot potato” game of funding between the various Musk-linked entities.
Grok 4 Heavy is better than any model available at identifying issues in your codebase. Here’s the JS prompt I use with my game code to have Grok 4 Heavy find the bugs.
Python prompt in Comments👇 pic.twitter.com/HFpW1hGvMM
— Tetsuo (@tetsuoai) July 13, 2025
I took Grok 4 for a spin this weekend to build this game prototype.
I used SuperGrok Chat to generate the initial game prototype and then brought it over to Cursor to continue coding with Grok 4 MAX.
Grok 4 in Cursor is like a no-nonsense agent. Doesn’t speak much, but… pic.twitter.com/wyib2vRvsd
— Danny Limanseta (@DannyLimanseta) July 13, 2025



