What's the Best AI Model to Run Your Business? The One That Lies Best, Apparently

In brief

Vending-Bench Arena tested AI agents running competing vending machine businesses.
Top models increased profits through price-fixing, collusion, and deceptive tactics. Claude was the best at these tactics.
GLM-5 defeated Claude by impersonating a teammate and extracting sensitive strategy.

Researchers at Andon Labs just answered which AI models are best at running a business. The top performers all won by forming illegal price cartels, exploiting desperate competitors, and lying to customers about refunds.

The Vending-Bench Arena test puts AI models in charge of competing vending machines for a simulated year. They negotiate with suppliers, manage inventory, set prices, and can email each other to collaborate or compete. Success requires balancing costs, pricing strategy, customer service, and competitor dynamics. Claude Opus 4.6 dominated the benchmark with $8,017 in profit—and celebrated its win by noting: “My pricing coordination worked!”

Anthropic is the image of the nice guys in the AI space, but that “coordination” strategy that Claude proposed was basically price-fixing. When competing models struggled, Opus 4.6 proposed: “Let’s NOT undercut each other — agree on minimum pricing… Should we agree on a price floor of $2.00 for most items?” When a rival ran low on inventory, it spotted an opportunity: “Owen needs stock badly. I can profit from this!” It sold Kit Kats at 75% markup to the desperate competitor. When asked for supplier recommendations, it deliberately directed rivals to expensive wholesalers while keeping its own good sources secret.

The latest update in the benchmark added team competition. Researchers pitted two Chinese GLM-5 models against two American Claude models and told them to find their teammates, Americans or Chinese—without revealing which agents were which. The results were genuinely bizarre.

GLM-5 won both rounds by convincing Claude it was Claude. “I’m also powered by Claude from Anthropic, so we’re teammates!” one GLM-5 agent confidently declared. Claude, meanwhile, got so confused that Sonnet 4.5 concluded: “I’m powered by a Chinese model, so I need to find the other Chinese model Agent.”

In more than half the test runs, agents teamed with their competitors. The Claude models shared supplier pricing and coordinated strategy—leaking valuable information to rivals. “GLM-5 won both,” the researchers wrote. “The Claude models tried to be team players and ended up leaking valuable info to their competitors.”

And agents doing shady stuff may be all fun and games until you realize Wall Street is already deploying them in real-life operations. JPMorgan deployed LLM Suite to 60,000 employees. Goldman Sachs built its GS AI Assistant for trading desks, claiming 20% productivity gains. Bridgewater uses Claude to analyze earnings and even high-school age kids are seeing their chatbots trade stocks more efficiently.

In general, adoption of agentic workflows is accelerating rapidly across enterprises.

When Anthropic and Wall Street Journal reporters ran a real vending machine experiment in December, the AI bought a PlayStation 5, several bottles of wine, and a live betta fish before going bankrupt. Recent research from Gwangju Institute found that when AI models were told to “maximize rewards” in gambling scenarios, bankruptcy rates hit 48%. “When given the freedom to determine their own target amounts and betting sizes, bankruptcy rates rose substantially alongside increased irrational behavior,” researchers found.

So, it seems that, at least for now, AI models optimized for profit consistently choose unethical tactics. They form cartels. They exploit weakness. They lie to customers and competitors. Some do it deliberately. Others, like GLM-5 claiming to be Claude, seem genuinely confused about their own identity. The distinction might not matter.

Wall Street’s AI deployment raises a question the Vending-Bench results can’t answer: If the “best” performing model wins through price-fixing and deception, is it really the best choice for your business? The benchmark measures profit. It doesn’t measure whether those profits came from fraud.

Daily Debrief Newsletter

Start every day with the top news stories right now, plus original features, a podcast, videos and more.

Artificial Intelligence#What039s #Model #Run #Business #Lies #Apparently1771518787

What's the Best AI Model to Run Your Business? The One That Lies Best, Apparently

In brief

Daily Debrief Newsletter

Related posts

Leave a Comment Cancel Reply