Google Introduces Implicit Caching for Affordable Access to Latest AI Models

Google is introducing a feature in its Gemini API that the company claims will reduce costs for third-party developers using its latest AI models. The feature, called "implicit caching," can reportedly save up to 75% on "repetitive context" passed to the models through the Gemini API.

The feature supports Google's Gemini 2.5 Pro and 2.5 Flash models. That could be welcome news for developers, as the costs of using AI models continue to rise.

Caching is a common practice in the AI industry, allowing the reuse of frequently accessed or precomputed data within models to lower computational demands and expenses. For instance, caches can hold answers to commonly posed questions, sparing the model from generating responses anew for the same query.
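The general technique described above can be sketched in a few lines. This is a minimal illustration, not Google's implementation; `run_model` is a hypothetical stand-in for an expensive inference call.

```python
# Minimal sketch of response caching: identical queries are served from
# the cache instead of triggering a fresh model call.

calls = 0  # counts how many times the "model" actually runs

def run_model(query: str) -> str:
    """Placeholder for an expensive inference call."""
    global calls
    calls += 1
    return f"answer to: {query}"

cache: dict[str, str] = {}

def answer(query: str) -> str:
    # Serve a previously computed response when the same query recurs.
    if query not in cache:
        cache[query] = run_model(query)
    return cache[query]

answer("What is caching?")  # computed by the "model"
answer("What is caching?")  # served from the cache; no second model call
print(calls)  # 1
```

The second call returns instantly from the cache, which is the cost-saving effect caching aims for at much larger scale.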

Previously, Google offered explicit caching of model prompts, which required developers to identify their most frequently used prompts themselves. While it promised savings, this approach often demanded significant manual work.

Some developers expressed dissatisfaction with how Google’s explicit caching worked for the Gemini 2.5 Pro, noting it could lead to unexpectedly high API bills. Complaints escalated last week, prompting the Gemini team to apologize and promise improvements.

Unlike explicit caching, implicit caching operates automatically. Enabled by default for the Gemini 2.5 models, it passes on cost savings when an API request matches previous requests stored in the cache.

“When you send a request to one of the Gemini 2.5 models, if the request shares the same prefix as prior requests, it qualifies for caching,” Google explained in its blog. “We’ll dynamically transfer the savings to you.”

The minimum token count for requests to qualify for implicit caching is 1,024 for the 2.5 Flash model and 2,048 for the 2.5 Pro model, according to Google's developer documentation. These are not especially high thresholds, so automatic savings should be within easy reach for many requests. Tokens are the small chunks of text that models operate on, with one thousand tokens equating to roughly 750 words.
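The 1,000-tokens-to-750-words rule of thumb gives a quick way to estimate whether a prompt is likely to clear those minimums. The sketch below is only a heuristic (real tokenizers vary by language and content); the model names and function names are illustrative.

```python
# Rough estimate of token count from word count, using the
# ~750 words per 1,000 tokens rule of thumb cited above.

MIN_TOKENS = {"gemini-2.5-flash": 1024, "gemini-2.5-pro": 2048}

def estimate_tokens(text: str) -> int:
    # 750 words per 1,000 tokens  =>  tokens ~= words / 0.75
    words = len(text.split())
    return int(words / 0.75)

def likely_qualifies(text: str, model: str) -> bool:
    """Heuristic check against the documented implicit-caching minimums."""
    return estimate_tokens(text) >= MIN_TOKENS[model]

prompt = "word " * 800  # about 800 words, i.e. roughly 1,066 tokens
print(likely_qualifies(prompt, "gemini-2.5-flash"))  # True
print(likely_qualifies(prompt, "gemini-2.5-pro"))    # False
```

An 800-word prompt would likely qualify on 2.5 Flash but fall short of the higher 2.5 Pro threshold.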

Given that Google’s previous claims about cost savings via caching did not hold up, there are aspects of this new feature that users should monitor closely. For instance, Google advises developers to place repetitive context at the beginning of requests to increase the likelihood of implicit caching. Any context that may vary between requests should be appended at the end, according to the company.
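Google's advice above amounts to structuring prompts so that consecutive requests share a long common prefix. A minimal sketch, with hypothetical context and names:

```python
# Sketch of prefix-stable prompt construction: the large, repetitive
# context goes first, and the per-request portion is appended at the end,
# so successive requests share an identical prefix for the cache to match.

STABLE_CONTEXT = (
    "You are a support assistant for ExampleCorp. "  # hypothetical stable context
    "Here is the full product manual: ..."
)

def build_prompt(user_question: str) -> str:
    # Stable prefix first, variable part last.
    return f"{STABLE_CONTEXT}\n\nQuestion: {user_question}"

a = build_prompt("How do I reset my password?")
b = build_prompt("What is the refund policy?")

# Both requests begin with the identical context, which is what
# prefix-based caching can match on.
print(a.startswith(STABLE_CONTEXT) and b.startswith(STABLE_CONTEXT))  # True
```

Had the variable question been placed first, the two prompts would diverge at the very start and no common prefix would be available to cache.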

Furthermore, Google has not provided any third-party validation that this new implicit caching system will deliver the promised automatic savings. It remains to be seen what early adopters will report.