Modern LLMs possess impressive capabilities: encyclopedic knowledge, rapid reasoning, and ubiquitous availability. However, they also exhibit limitations, sometimes referred to as jagged intelligence.
Even state-of-the-art LLMs can fail at seemingly simple tasks, like deciding which number is bigger, 3.9 or 3.11, or answering the infamous question, “How many R’s are in the word ‘strawberry’?”
To explore multimodal capabilities, I’ve created a playful benchmark. Using ChatGPT’s image generation tool, I produced strawberry images where some seeds resemble the letter “R.”
See two examples below:
I ran the benchmark against several multimodal models via OpenRouter to see how they perform.
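OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so each query boils down to sending the image as a base64 data URL alongside a counting prompt. Here is a minimal sketch of what one such query could look like; the model slug, prompt wording, and helper names are my own illustrative choices, not the exact harness I used:

```python
import base64
import json
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
PROMPT = "How many R's do you see on this strawberry? Answer with a single number."


def build_payload(model: str, image_bytes: bytes) -> dict:
    """Build an OpenAI-style chat payload with the image inlined as a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }


def count_rs(model: str, image_path: str, api_key: str) -> str:
    """Send one benchmark image to a model via OpenRouter and return its answer."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, f.read())
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Keeping payload construction separate from the network call makes it easy to loop the same image over every model slug and compare the returned counts.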
Here are the results:
Model | 6 R’s image | 5 R’s image |
---|---|---|
Anthropic Claude Opus-4 | 6 🟢 | 5 🟢 |
Anthropic Claude Sonnet-4 | 5 🔴 | 5 🟢 |
Google Gemini 2.5 Flash | 7 🔴 | 6 🔴 |
Google Gemini 2.5 Pro | 6 🟢 | 6 🔴 |
Meta-Llama 4 Maverick | 7 🔴 | 7 🔴 |
Meta-Llama 4 Scout | 8 🔴 | 7 🔴 |
Mistral Medium-3 | 9 🔴 | 7 🔴 |
OpenAI GPT-4.1 | 7 🔴 | 6 🔴 |
OpenAI GPT-4.1 mini | 7 🔴 | 6 🔴 |
OpenAI GPT-4.1 nano | 9 🔴 | 7 🔴 |
OpenAI o3 | 7 🔴 | 6 🔴 |
Modern LLMs easily count the R’s in the word “strawberry,” yet still struggle to visually identify and count them in images. Only the most advanced models appear capable of handling this visual task reliably.
As you’ve likely noticed, I intentionally avoided putting exactly 3 R’s on the strawberries, to prevent models from merely recalling the answer from training data. Additionally, I had stickers made featuring this particular strawberry for other AI geeks who appreciate the joke. If you’d like some, send me an email and I’ll gladly send them out for free while my supply lasts, provided you cover the postage.