Creative AI Testing Through Minecraft: Benchmarking Innovation

As conventional methods of AI benchmarking prove inadequate, developers are turning to more inventive ways of assessing generative models' capabilities. For one group of developers, that means Minecraft.

Minecraft Benchmark (or MC-Bench) is a collaboratively built website that pits AI models against each other in head-to-head challenges to produce Minecraft creations. Users vote on which model did the better job, and only after casting their votes do they learn which AI produced each build.
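MC-Bench doesn't spell out how those votes become rankings, but blind pairwise comparisons like these are commonly aggregated into an Elo-style leaderboard. A minimal sketch in Python, purely illustrative and not MC-Bench's actual code:

```python
# Illustrative only: MC-Bench's real ranking method isn't documented here.
# Pairwise blind votes are often aggregated with Elo-style ratings.

K = 32  # how strongly a single vote moves a rating

def expected(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def record_vote(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return both models' updated ratings after one head-to-head vote."""
    score_a = 1.0 if a_won else 0.0
    delta = K * (score_a - expected(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# One blind matchup: a user prefers model A's build over model B's.
a, b = record_vote(1500.0, 1500.0, a_won=True)
print(round(a), round(b))  # 1516 1484
```

Over thousands of votes, a scheme like this converges on a ranking of which models users consistently prefer.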

For Adi Singh, a high school senior who launched MC-Bench, the value of Minecraft lies not just in the game itself, but in how familiar people are with it—after all, it is the best-selling video game of all time. Even those who haven’t played can still appreciate which blocky representation of a pineapple is more appealing.

“Minecraft allows people to easily track the progress of AI development,” Singh remarked in an interview with TechCrunch. “People are accustomed to Minecraft’s look and feel.”

Currently, MC-Bench is supported by a team of eight volunteer developers. According to the MC-Bench website, companies like Anthropic, Google, OpenAI, and Alibaba have provided resources to help launch the benchmarks, but they are otherwise not involved in the project.

Other games, such as Pokémon Red, Street Fighter, and Pictionary, have also been used as experimental testing grounds for AI, in part because benchmarking AI models is notoriously difficult.

Researchers frequently assess AI models with standardized tests, but many of these assessments play to AIs' strengths: given how they are trained, models tend to excel at narrow problem-solving tasks, particularly those involving rote memorization or basic extrapolation.

Put simply, it’s hard to contextualize the fact that OpenAI’s GPT-4 can score in the 88th percentile on the LSAT yet struggles to determine how many ‘r’s are in the word ‘strawberry’, or that Anthropic’s Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark yet plays Pokémon worse than most five-year-olds.

While MC-Bench serves as a programming benchmark by prompting models to generate code for suggested builds, like “Frosty the Snowman” or “a charming tropical beach hut on a pristine sand shore,” most users find it easier to evaluate which snowman looks better rather than digging through code. This approach makes the project more engaging and helps gather more data on which models consistently deliver superior results.
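To give a sense of what the models are asked to produce (the article doesn't specify the exact API MC-Bench exposes), generated build code might look something like the following sketch, which assumes a hypothetical set_block(x, y, z, block) helper for placing blocks in the world:

```python
# Hypothetical sketch of model-generated build code for a prompt like
# "Frosty the Snowman"; set_block is an assumed helper, not MC-Bench's API.

def set_block(x: int, y: int, z: int, block: str) -> None:
    print(f"placing {block} at ({x}, {y}, {z})")  # stand-in for a real placement call

def sphere(cx: int, cy: int, cz: int, r: int, block: str) -> None:
    """Fill a rough voxel sphere of the given block type."""
    for x in range(cx - r, cx + r + 1):
        for y in range(cy - r, cy + r + 1):
            for z in range(cz - r, cz + r + 1):
                if (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= r * r:
                    set_block(x, y, z, block)

# Three stacked snow spheres, plus facial features placed afterwards.
sphere(0, 4, 0, 4, "snow_block")    # body
sphere(0, 10, 0, 3, "snow_block")   # torso
sphere(0, 14, 0, 2, "snow_block")   # head
set_block(0, 14, 3, "carrot")       # nose, protruding one block from the head
set_block(-1, 15, 1, "coal_block")  # left eye, overwriting surface snow
set_block(1, 15, 1, "coal_block")   # right eye
```

Voters never need to read any of this; they simply judge which finished snowman looks better.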

How well these rankings reflect AI's real-world utility remains an open question. Nevertheless, Singh believes they are a meaningful signal.

“The current leaderboard closely matches my personal experience using these models, unlike many purely text-based tests,” Singh stated. “Perhaps MC-Bench can help companies gauge whether they are on the right track.”
