Each LLM is given the same 1000 chess puzzles to solve. See puzzles.csv.
Benchmarked on Mar 25, 2024.
Model | Solved | Solved % | Illegal Moves | Illegal Moves % | Adjusted Elo |
---|---|---|---|---|---|
gpt-4-turbo-preview | 229 | 22.9% | 163 | 16.3% | 1144 |
gpt-4 | 195 | 19.5% | 183 | 18.3% | 1047 |
claude-3-opus-20240229 | 72 | 7.2% | 464 | 46.4% | 521 |
claude-3-haiku-20240307 | 38 | 3.8% | 590 | 59.0% | 363 |
claude-3-sonnet-20240229 | 23 | 2.3% | 663 | 66.3% | 286 |
gpt-3.5-turbo | 23 | 2.3% | 683 | 68.3% | 269 |
claude-instant-1.2 | 10 | 1.0% | 707 | 70.7% | 245 |
mistral-large-latest | 4 | 0.4% | 813 | 81.3% | 149 |
mixtral-8x7b | 9 | 0.9% | 832 | 83.2% | 136 |
gemini-1.5-pro-latest* | FAIL | - | - | - | - |
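
The two percentage columns follow directly from the counts, since every model is given the same 1000 puzzles. A minimal sketch of that arithmetic is below; the helper name `pct` and the `rows` dictionary are hypothetical and not taken from the benchmark's own code. How the Adjusted Elo column is computed is not stated here, so it is left out.

```python
TOTAL_PUZZLES = 1000  # puzzle count stated in the benchmark description


def pct(count: int, total: int = TOTAL_PUZZLES) -> str:
    """Format a count as a percentage of the puzzle set, to one decimal place."""
    return f"{100 * count / total:.1f}%"


# (solved, illegal moves) copied from three rows of the table above
rows = {
    "gpt-4-turbo-preview": (229, 163),
    "gpt-4": (195, 183),
    "mixtral-8x7b": (9, 832),
}

for model, (solved, illegal) in rows.items():
    print(f"{model}: solved {pct(solved)}, illegal moves {pct(illegal)}")
```

Running the same check over every row is a quick way to spot a percentage that does not match its count.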
Published by the CEO of Kagi!
This is obvious to people who understand the basics of LLMs. However, people are fooled by how intelligent these LLMs sound, so they mistake them for actually being intelligent. So, even if this is an open door, I still think it's good that someone is kicking it in to make it clear that LLMs are not generally intelligent.
Agreed, it’s good to have these kinds of articles so people get a better feel for what tools like this are and aren’t.