PokerBench: Training Large Language Models to become Professional Poker Players
Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli
2025-01-15

Summary
This paper introduces PokerBench, a benchmark for testing how well AI language models can play poker. The researchers created a large set of poker scenarios to see whether these models can handle the kind of complex reasoning that poker demands.
What's the problem?
AI is very good at tasks like understanding and writing text, but it struggles with complex games like poker. Poker is hard because players cannot see all the information, and playing well requires math, planning, and an understanding of how opponents think. Current AI models are not good at poker, even though they are capable in many other areas.
What's the solution?
The researchers built PokerBench, which works like a large poker exam for AI. With help from trained poker players, they assembled 11,000 key poker situations, covering both pre-flop and post-flop decisions. They then tested well-known AI models such as GPT-4 and ChatGPT on these situations; at first, none of them played well. After the researchers fine-tuned the models on poker data, however, their performance improved markedly. They also had the models play against each other and showed that the ones with higher PokerBench scores won more actual poker games.
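To make the evaluation idea concrete, here is a minimal sketch of how scenario-based scoring on PokerBench-style data could work. The scenario fields (game_state, optimal_action), the prompt wording, and the always-fold baseline are illustrative assumptions, not the authors' actual data format or code (their dataset and code are in the repository linked in the abstract below).

```python
def always_fold(prompt: str) -> str:
    """Trivial baseline 'model' that folds every hand; stands in for an LLM call."""
    return "fold"

def evaluate(model, scenarios) -> float:
    """Fraction of scenarios where the model's chosen action matches the optimal one."""
    correct = 0
    for s in scenarios:
        prompt = (
            "You are playing no-limit Texas hold'em.\n"
            f"{s['game_state']}\n"
            "Reply with one action: fold, check, call, bet, or raise."
        )
        action = model(prompt).strip().lower()
        correct += action == s["optimal_action"]
    return correct / len(scenarios)

# Two made-up example scenarios, just to show the shape of the data.
scenarios = [
    {"game_state": "Pre-flop: you hold 7-2 offsuit under the gun.",
     "optimal_action": "fold"},
    {"game_state": "Pre-flop: you hold A-A on the button, folded to you.",
     "optimal_action": "raise"},
]

print(f"Accuracy: {evaluate(always_fold, scenarios):.0%}")  # 50% for this toy baseline
```

In practice the model function would wrap a call to an LLM and the scenarios would come from the released dataset; the scoring loop itself stays the same.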
Why does it matter?
This matters because it helps us understand how to make AI better at complex tasks that involve strategy and incomplete information. If AI can learn to play poker well, the same ideas might carry over to other tricky real-world problems. The results also show where current AI training methods fall short, which could lead to new ways of teaching AI to reason in complex situations. And by making the benchmark and data public, the researchers make it easier for other scientists to work on this problem too.
Abstract
We introduce PokerBench - a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes Poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for a quick and reliable evaluation of the poker-playing ability of LLMs as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: https://github.com/pokerllm/pokerbench.
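The abstract notes that higher PokerBench scores translated into higher win rates when models played each other. A standard way to express a poker win rate is big blinds won per 100 hands (bb/100); the short sketch below computes that statistic from per-hand chip results. Whether the paper reports exactly this metric is an assumption, and the numbers here are made up for illustration.

```python
def bb_per_100(chip_results: list[float], big_blind: float) -> float:
    """Average big blinds won per 100 hands, given per-hand chip results."""
    total_bb = sum(r / big_blind for r in chip_results)
    return 100 * total_bb / len(chip_results)

# Per-hand chip results for one model over five example hands; big blind = 2 chips.
results = [+6.0, -2.0, 0.0, +10.0, -4.0]
print(f"Win rate: {bb_per_100(results, big_blind=2.0):+.1f} bb/100")  # +100.0 bb/100 here
```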