Apple's Claims About Large Reasoning Models Scrutinized: New Study Questions Core Conclusions

A re-evaluation of Apple's controversial paper "The Illusion of Thinking" has confirmed several of its key observations while calling its central conclusion into question.

Researchers from the Spanish Center for Automation and Robotics (CSIC-UPM) conducted replication experiments based on Apple's original paper, published in June 2025, which had sparked considerable debate in the AI developer community.

Apple claimed that even the newest large reasoning models (LRMs) struggled with tasks requiring basic symbolic planning. Apple's study found that model performance declined sharply once task complexity exceeded a moderate level, and that the models sometimes overthought even simple problems.

While the new research largely confirms Apple's findings, it challenges their interpretation. The Spanish team argues that the models' limitations stem not only from a lack of "cognitive abilities" but also from how the tasks are framed, how the prompts are structured, and the stochastic optimization methods employed.

To assess long-term planning capabilities, the researchers used the classic "Tower of Hanoi" puzzle with models such as Gemini 2.5 Pro. They decomposed the problem into smaller sub-tasks so the models did not have to generate the entire solution at once.

This step-by-step approach worked well for up to seven disks. With eight or more disks, however, performance dropped sharply, mirroring the steep decline Apple observed as task complexity increased.
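For orientation, the classical recursive decomposition of the puzzle shows what such sub-tasks can look like. This is a minimal sketch of the idea, not the researchers' actual prompting protocol: each recursive call is a smaller, self-contained instance, and the total number of required moves (2^n - 1) doubles with every extra disk, which is part of why eight or more disks strain a token-bounded generator.

```python
# Classical recursive decomposition of the Tower of Hanoi puzzle.
# Each recursive call is a smaller, self-contained sub-task of the kind
# a step-by-step prompting scheme could hand to a model one at a time.

def hanoi(n, source, target, auxiliary, moves):
    """Append the moves that transfer n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, auxiliary, target, moves)  # sub-task: clear the top n-1 disks
    moves.append((source, target))                  # sub-task: move the largest disk
    hanoi(n - 1, auxiliary, target, source, moves)  # sub-task: finish the transfer

moves = []
hanoi(7, "A", "C", "B", moves)
print(len(moves))  # 2**7 - 1 = 127 moves for seven disks; eight disks need 255
```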

A new interpretation points to token usage as a crucial factor: **the number of tokens a model consumes tracks its implicit estimate of whether a solution is feasible.** As long as the model judges the problem solvable, it keeps increasing its resource consumption; once it concludes the problem is unsolvable, it stops quickly, which amounts to a form of implicit uncertainty management.

The researchers also experimented with a multi-agent approach, where two language models took turns proposing steps to solve the task. This resulted in lengthy discussions and high token expenditures but seldom yielded correct solutions.

Although the models followed all the rules, they frequently became trapped in endless cycles of permissible yet unproductive moves. The authors concluded that the models cannot identify and apply higher-level strategies, even when each individual move they make is formally correct.
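A toy illustration of that failure mode: in the sketch below the two "agents" are just random legal-move pickers, a deliberate stand-in for the language-model calls in the actual experiment. Every move they make is rule-abiding, yet without a global strategy the game usually revisits an earlier state within a few turns.

```python
import random

# Toy stand-in for the two-agent setup: each "agent" picks a random legal
# Tower of Hanoi move (a placeholder for a model call). Every move obeys the
# rules, yet without a global strategy the game quickly cycles.

def legal_moves(pegs):
    """All (src, dst) moves allowed by the Tower of Hanoi rules."""
    moves = []
    for src in range(3):
        if pegs[src]:
            disk = pegs[src][-1]
            for dst in range(3):
                if dst != src and (not pegs[dst] or pegs[dst][-1] > disk):
                    moves.append((src, dst))
    return moves

random.seed(0)
pegs = [[3, 2, 1], [], []]                 # three disks, all on the first peg
seen = {tuple(tuple(p) for p in pegs)}
for turn in range(200):
    agent = turn % 2                       # the two agents alternate turns
    src, dst = random.choice(legal_moves(pegs))
    pegs[dst].append(pegs[src].pop())
    state = tuple(tuple(p) for p in pegs)
    if pegs[2] == [3, 2, 1]:
        print(f"solved after {turn + 1} moves")
        break
    if state in seen:
        print(f"agent {agent} revisited a known state at turn {turn}: legal, but no progress")
        break
    seen.add(state)
```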

Unlike Apple, which interpreted these failures as evidence of insufficient cognitive capabilities, the Spanish team also attributed them to the framing of prompts and the absence of global search mechanisms.

The harshest criticism was aimed at the river-crossing test used in Apple's article. While Apple reported especially poor model performance on this test, the replication revealed that many of Apple's test cases were mathematically unsolvable, a detail not disclosed in the original publication.

When the researchers tested only valid configurations, the models handled even large instances involving more than 100 pairs of agents.
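The solvability claim is easy to verify mechanically for small instances. The sketch below assumes the counts-only missionaries-and-cannibals formulation of the puzzle (n "agents" and n "actors", a boat with a fixed capacity, actors never outnumbering agents); Apple's variant pairs each actor with a specific agent, but the method carries over: the state space is finite, so a breadth-first search settles whether any legal crossing exists at all.

```python
from collections import deque
from itertools import product

# Exhaustive solvability check for a counts-only river-crossing puzzle:
# n "agents" and n "actors" must cross with a boat of the given capacity,
# and actors may never outnumber agents wherever agents are present.
# The state space is finite, so breadth-first search gives a definite answer.

def solvable(n, boat_capacity):
    start = (n, n, 0)                  # (agents on left, actors on left, boat: 0 = left bank)
    goal = (0, 0, 1)
    def safe(a, c):                    # both banks must respect the outnumbering rule
        left_ok = a == 0 or a >= c
        right_ok = (n - a) == 0 or (n - a) >= (n - c)
        return left_ok and right_ok
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return True
        a, c, side = state
        sign = -1 if side == 0 else 1  # the boat leaves its current bank
        for da, dc in product(range(boat_capacity + 1), repeat=2):
            if not 1 <= da + dc <= boat_capacity:
                continue               # someone must row, and seats are limited
            if da > 0 and dc > da:
                continue               # actors may not outnumber agents in the boat
            na, nc = a + sign * da, c + sign * dc
            nxt = (na, nc, 1 - side)
            if 0 <= na <= n and 0 <= nc <= n and safe(na, nc) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

print(solvable(3, 2))   # the classic three-pair puzzle is solvable
print(solvable(6, 3))   # six pairs with a three-seat boat: no legal crossing exists
```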

Interestingly, the most challenging tasks were not the largest ones, but those "in the middle." Such instances have very few valid solutions and require extremely precise planning, which places a heavy burden on the models.

This supports one of Apple's key conclusions: the sharpest drop in language-model performance does not depend solely on the size or complexity of the problem. Rather, models struggle most with moderately difficult tasks, such as the river-crossing puzzle with five pairs of agents, which admits only a few correct solutions. On simpler or more complex tasks they often perform better, **either because many possible solutions exist or because the model finds the task easier to analyze.**

Ultimately, the Spanish team rejected Apple's core assertion that language models are fundamentally incapable of generalization. Instead, they described these models as **"stochastic search engines, fine-tuned through reinforcement learning within a discrete state space that we barely understand."**

From this perspective, language models are not rational planners but systems that explore local solution paths based on learned patterns, with only a limited capacity for long-term planning.

Additionally, the researchers hypothesized that token usage serves as an internal signal of the model's own estimate of a problem's solvability: when a model judges a task solvable, it allocates more resources to it; when it sees no path to a solution, it stops working.


[**Source**](https://the-decoder.com/apples-claims-about-large-reasoning-models-face-fresh-scrutiny-from-a-new-study/)