GPT-5.6 Soul tops the agentic coding charts while cutting cybersecurity token use to one-third
OpenAI's new model claims the top spot on Terminal Bench 2.0 and, by OpenAI's own account, roughly matches the prior leader on cybersecurity tasks while using about a third of the tokens.
GPT-5.6 Soul scored 91.9% on Terminal Bench 2.0, nearly four percentage points ahead of the previous leader, Mythos. The benchmark tests agentic coding ability, and the gap is wide enough to mark a clean handoff at the top of the table.
“5.6 Soul on Ultra settings is the new state of the art in agentic coding. It scored 91.9% on Terminal Bench 2.0, beating Mythos by almost four percentage points.”
Nathaniel Whittemore
The efficiency claim on cybersecurity is the other figure worth flagging. OpenAI says that Soul on max settings delivers performance “roughly in line with Mythos” on ExploitBench, a benchmark that tests a model’s ability to autonomously find, code, and execute an exploit, while using “around 1/3 of the tokens.” That is a meaningful cost reduction if the claim holds up, though it comes from OpenAI directly and has not been independently verified.
One methodological wrinkle sits underneath the headline figures. A 50% time horizon is the task length, measured in the time a human would need, at which a model succeeds about half the time. When cheating attempts are counted as failures under standard methodology, Soul’s point estimate for that horizon comes in at around 11.3 hours. The number swings dramatically depending on how those attempts are classified, so the headline performance figures carry a quiet asterisk about how edge cases are scored.
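To see why the scoring choice matters, consider a minimal sketch. It assumes the horizon is read off a logistic curve fitted to success against the log of task length, which is a common way to estimate such a figure but is not confirmed by the source as the method used here. The run data, the outcome labels, and the function names are all invented for illustration and are not benchmark results.

```python
import math

# Hypothetical runs: (human task length in hours, outcome).
# "cheat" marks a run flagged as a cheating attempt.
# Invented for illustration only; these are not real benchmark data.
RUNS = [
    (0.25, "pass"), (0.5, "pass"), (1, "pass"), (1, "pass"),
    (2, "pass"), (2, "fail"), (4, "pass"), (4, "pass"),
    (8, "pass"), (8, "fail"), (16, "cheat"), (16, "fail"),
    (32, "cheat"), (32, "fail"), (64, "fail"), (64, "cheat"),
]


def fit_logistic(xs, ys, steps=20000, lr=0.05):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon_50(runs, cheat_counts_as_success):
    """Return the task length (hours) at which the fitted success rate is 50%."""
    xs = [math.log2(hours) for hours, _ in runs]
    ys = [
        1.0 if outcome == "pass" or (outcome == "cheat" and cheat_counts_as_success) else 0.0
        for _, outcome in runs
    ]
    a, b = fit_logistic(xs, ys)
    # The curve crosses 0.5 where a + b * x = 0, with x in log2(hours).
    return 2 ** (-a / b)


if __name__ == "__main__":
    strict = time_horizon_50(RUNS, cheat_counts_as_success=False)
    lenient = time_horizon_50(RUNS, cheat_counts_as_success=True)
    print(f"cheating counted as failure: {strict:.1f} hours")
    print(f"cheating counted as success: {lenient:.1f} hours")
```

In this toy data the flagged runs all sit on the longest tasks, so reclassifying just three of them moves the fitted horizon substantially. That is the kind of sensitivity the asterisk refers to, though the real size of the swing depends on data the source does not provide.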