For your AI feature, you might pass the same input tokens (content) to a model over and over. For these use cases, you can instead cache this content: you pass the content to the model once, store it, and reference the stored content in subsequent requests.
Context caching can significantly reduce latency and cost for repetitive tasks that involve a large amount of content, like a long text document, an audio file, or a video file. Common use cases for cached content include detailed persona documents, codebases, and manuals.
Gemini models offer two different caching mechanisms:
- Implicit caching: automatically enabled on most models; no guaranteed cost savings
- Explicit caching: can be optionally and manually enabled on most models; usually results in cost savings
Explicit caching is useful when you want a stronger guarantee of cost savings, at the price of some additional developer work.
For both implicit and explicit caching, the cachedContentTokenCount field in
your response's metadata indicates the number of tokens in the cached part of
your input. For explicit caching, make sure to review pricing
information at the bottom of this page.
Supported models
Caching is supported when using the following models:
- gemini-3.1-pro-preview
- gemini-3-flash-preview
- gemini-3.1-flash-lite-preview
- gemini-2.5-pro
- gemini-2.5-flash
- gemini-2.5-flash-lite
Media-generating models (for example, the Nano Banana models like
gemini-3.1-flash-image-preview) do not support context caching.
Cached content size limits
Each model has a minimum token count requirement for cached content. The maximum is dictated by the model's context window.
- Gemini Pro models: 4096 tokens minimum
- Gemini Flash models: 1024 tokens minimum
Additionally, the maximum size of content you can cache using a blob or text is 10 MB.
Implicit caching
Implicit caching is enabled by default and available for most Gemini models.
Google automatically passes on cost savings if your request hits the cached content. Here are some ways to increase the chance that your request uses implicit caching:
- Try putting large and common content at the beginning of your prompt.
- Try to send requests with a similar prefix in a short amount of time.
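To illustrate the first tip, here's a sketch of a request body that puts the large, reused content first and the per-request question last. Requests that share this long prefix (and are sent close together in time) are more likely to hit the implicit cache. The variable names and text are placeholders, not part of any real API.

```shell
# Illustrative only: structure the parts so the large, common content comes
# first and the small, per-request question comes last. COMMON_DOC and
# QUESTION are placeholders for your own content.
COMMON_DOC="Large manual or persona text reused across requests goes here"
QUESTION="How do I reset my password?"
BODY='{
  "contents": [{
    "role": "user",
    "parts": [
      { "text": "'"${COMMON_DOC}"'" },
      { "text": "'"${QUESTION}"'" }
    ]
  }]
}'
printf '%s\n' "$BODY"
```

Because every request begins with the same `COMMON_DOC` prefix, only the trailing question varies between requests.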
The number of tokens in the cached part of your input is provided in the
cachedContentTokenCount field in the metadata of a response.
Explicit caching
Explicit caching is not enabled by default; it's an optional capability of Gemini models that you set up yourself.
Here's how you can set up and work with explicit content caches:
- Create and use an explicit cache
- Manage explicit caches, including listing them, getting their metadata, updating their TTL or expiration time, and deleting them
Note that explicit content caches interact with implicit caching, potentially leading to additional caching beyond the explicit cached content. You can prevent cache data retention by disabling implicit caching and not creating explicit caches. For more information, see Enable and disable caching.
Create and use an explicit cache
Creating and using an explicit content cache involves three steps: create the cache, reference it in a server prompt template, and reference that template in the request from your app.
Important information about creating and using an explicit cache
Your cache must be aligned with your app's prompt requests and your server prompt template:
- The cache is specific to a Gemini API provider. Your app's prompt request must use the same provider.
  For Firebase AI Logic, we strongly recommend using explicit content caches only with the Vertex AI Gemini API. All the information and examples on this page are specific to that Gemini API provider.
- The cache is specific to a Gemini model. Your app's prompt request must use the same model.
- The cache is specific to a location when using the Vertex AI Gemini API.
  The location for the explicit cache must match the location of the server prompt template and the location where you access the model in your app's prompt request.
Also, be aware of the following limitations and requirements for explicit caching:
- Once an explicit cache is created, you can't change anything about the cache except its TTL or expiration time.
- You can cache any supported input file MIME type, or even just text provided within the cache creation request.
- If you want to include a file in the cache, you must provide the file as a Cloud Storage URI. It can't be a browser URL or a YouTube URL.
  Additionally, access restrictions on the file are checked at cache creation time; they are not checked again at request time. For this reason, make sure that any data included in the explicit cache is suitable for any user making a request that includes that cache.
- If you want to use system instructions or tools (like code execution, URL context, or grounding with Google Search), then the cache itself must contain their configurations. They cannot be configured in the server prompt template or in your app's prompt request. Note that server prompt templates do not yet support function calling (or chat). For details about how to configure system instructions and tools in your cache, see the REST API of the Vertex AI Gemini API.
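As a sketch of that last requirement, here's what a cache-creation body with a baked-in system instruction could look like. Field names follow the Vertex AI `cachedContents` REST resource; `PROJECT_ID`, `LOCATION`, `MODEL_ID`, and all text values are placeholders, and this is an unverified illustration rather than a complete request.

```shell
# Sketch: a cache-creation body that includes the system instruction in the
# cache itself (it can't go in the template or the app's prompt request).
# All values are placeholders.
cat > cache_body.json <<'EOF'
{
  "model": "projects/PROJECT_ID/locations/LOCATION/publishers/google/models/MODEL_ID",
  "systemInstruction": {
    "role": "system",
    "parts": [{ "text": "You are a support agent. Answer using the cached manual only." }]
  },
  "contents": [
    { "role": "user", "parts": [{ "text": "LARGE_COMMON_CONTENT" }] }
  ],
  "ttl": "3600s"
}
EOF
# This body would be POSTed to the same cachedContents endpoint shown in Step 1
grep -c '"systemInstruction"' cache_body.json
```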
Step 1: Create the cache
Create the cache by directly using the REST API of the Vertex AI Gemini API.
The following is an example that creates an explicit cache with a PDF file as its content.
Syntax:
PROJECT_ID="PROJECT_ID"
MODEL_ID="GEMINI_MODEL" # for example, gemini-3-flash-preview
LOCATION="LOCATION" # location for both the cache and the model
MIME_TYPE="MIME_TYPE"
CACHED_CONTENT_URI="CLOUD_STORAGE_FILE_URI" # must be a Cloud Storage URI
CACHE_DISPLAY_NAME="CACHE_DISPLAY_NAME" # optional
TTL="CACHE_TIME_TO_LIVE" # optional (if not specified, defaults to 3600s)
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents \
-d @- <<EOF
{
"model":"projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}",
"contents": [
{
"role": "user",
"parts": [
{
"fileData": {
"mimeType": "${MIME_TYPE}",
"fileUri": "${CACHED_CONTENT_URI}"
}
}
]
}
],
"displayName": "${CACHE_DISPLAY_NAME}",
"ttl": "${TTL}"
}
EOF
Example request:
PROJECT_ID="my-amazing-app"
MODEL_ID="gemini-3-flash-preview"
LOCATION="global"
MIME_TYPE="application/pdf"
CACHED_CONTENT_URI="gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf"
CACHE_DISPLAY_NAME="Gemini - A Family of Highly Capable Multimodal Models (PDF)"
TTL="7200s"
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents \
-d @- <<EOF
{
"model":"projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}",
"contents": [
{
"role": "user",
"parts": [
{
"fileData": {
"mimeType": "${MIME_TYPE}",
"fileUri": "${CACHED_CONTENT_URI}"
}
}
]
}
],
"displayName": "${CACHE_DISPLAY_NAME}",
"ttl": "${TTL}"
}
EOF
Example Response:
The response includes a fully-qualified resource name that is globally
unique to the cache (note that the last segment is the cache ID). You'll use
this entire name value in the next step of the workflow.
{
"name": "projects/861083271981/locations/global/cachedContents/4545031458888089601",
"model": "projects/my-amazing-app/locations/global/publishers/google/models/gemini-3-flash-preview",
"createTime": "2024-06-04T01:11:50.808236Z",
"updateTime": "2024-06-04T01:11:50.808236Z",
"expireTime": "2024-06-04T02:11:50.794542Z"
}
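Since the cache ID (the last segment of the `name`) is needed later for the management calls, you can split it off with plain shell parameter expansion. The example value below is the `name` from the response above.

```shell
# The fully-qualified resource name returned at cache creation
CACHE_NAME="projects/861083271981/locations/global/cachedContents/4545031458888089601"

# The cache ID is the final path segment; you'll need it later to get,
# update, or delete the cache. ${var##*/} strips everything up to the last "/"
CACHE_ID="${CACHE_NAME##*/}"
echo "$CACHE_ID"  # 4545031458888089601
```

Note that Step 2 below uses the entire `CACHE_NAME` value, not just the ID.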
Step 2: Reference the cache in a server prompt template
After creating the cache, reference it by name within the cachedContent
property of a
server prompt template.
Make sure you follow these requirements when creating your server prompt template:
- Use the fully-qualified resource name from the response when you created the cache. This is not the optional display name that you specified in the request.
- The location for the server prompt template must match the location of the cache.
- To use system instructions or tools, they must be configured as part of the cache, not as part of the server prompt template.
Syntax:
{{cachedContent name="YOUR_CACHE_RESOURCE_NAME"}}
{{role "user"}}
{{userPrompt}}
Example:
{{cachedContent name="projects/861083271981/locations/global/cachedContents/4545031458888089601"}}
{{role "user"}}
{{userPrompt}}
Alternatively, the value of the name parameter in the server prompt template
can be a dynamic input variable, for example,
{{cachedContent name=someVariable}}. You then provide the name of the cache
as an input for the request from your app.
Step 3: Reference the server prompt template in the request from your app
Be very careful about the following when writing your request:
Use the Vertex AI Gemini API since the cache was created with that Gemini API provider.
The location where you access the model in your app's prompt request must match the location of the server prompt template and the cache.
Swift
// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
let model = FirebaseAI.firebaseAI(backend: .vertexAI(location: "LOCATION"))
.templateGenerativeModel()
do {
let response = try await model.generateContent(
// Specify your template ID
templateID: "TEMPLATE_ID"
)
if let text = response.text {
print("Response Text: \(text)")
}
} catch {
print("An error occurred: \(error)")
}
print("\n")
Kotlin
// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
val model = Firebase.ai(backend = GenerativeBackend.vertexAI(location = "LOCATION"))
.templateGenerativeModel()
val response = model.generateContent(
// Specify your template ID
"TEMPLATE_ID",
)
val text = response.text
println(text)
Java
// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
TemplateGenerativeModel generativeModel = FirebaseAI.getInstance().templateGenerativeModel();
TemplateGenerativeModelFutures model = TemplateGenerativeModelFutures.from(generativeModel);
Future<GenerateContentResponse> response = model.generateContent(
// Specify your template ID
"TEMPLATE_ID"
);
Futures.addCallback(
    response,
    new FutureCallback<GenerateContentResponse>() {
      @Override
      public void onSuccess(GenerateContentResponse result) {
        System.out.println(result.getText());
      }

      @Override
      public void onFailure(Throwable t) {
        reportError(t);
      }
    },
    executor);
Web
// ...
// Initialize the Vertex AI Gemini API backend service
// Make sure to specify the same location as the server prompt template and the cache
const ai = getAI(app, { backend: new VertexAIBackend('LOCATION') });
// Create a `TemplateGenerativeModel` instance
const model = getTemplateGenerativeModel(ai);
const result = await model.generateContent(
// Specify your template ID
'TEMPLATE_ID'
);
const response = result.response;
const text = response.text();
Dart
// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
var _model = FirebaseAI.vertexAI(location: 'LOCATION').templateGenerativeModel();
var response = await _model.generateContent(
  // Specify your template ID
  'TEMPLATE_ID',
);
var text = response.text;
print(text);
Unity
// ...
// Initialize the Vertex AI Gemini API backend service
// Make sure to specify the same location as the server prompt template and the cache
var firebaseAI = FirebaseAI.GetInstance(FirebaseAI.Backend.VertexAI(location: "LOCATION"));
// Create a `TemplateGenerativeModel` instance
var model = firebaseAI.GetTemplateGenerativeModel();
try
{
var response = await model.GenerateContentAsync(
// Specify your template ID
"TEMPLATE_ID"
);
Debug.Log($"Response Text: {response.Text}");
}
catch (Exception e) {
Debug.LogError($"An error occurred: {e.Message}");
}
Manage explicit caches
This section describes managing explicit content caches, including how to list all caches, get metadata about a cache, update the TTL or expiration time of a cache, and delete a cache.
You manage explicit caches using the REST API of the Vertex AI Gemini API.
Once an explicit content cache is created, you can't change anything about the cache except the TTL or expiration time.
List all caches
You can list all the explicit caches available for your project. This command will only return the caches in the specified location.
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
curl \
-X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents
Get metadata about a cache
It's not possible to retrieve or view the actual cached content. However, you
can retrieve metadata about an explicit cache, including name, model,
display_name, usage_metadata, create_time, update_time, and
expire_time.
You need to provide the CACHE_ID, which is the final segment in the
fully-qualified resource name of the cache.
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
curl \
-X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID}
Update the TTL or expiration time for a cache
When you create an explicit cache, you can optionally set the ttl or the
expire_time.
- ttl: The TTL (time-to-live) for the cache, specifically the number of seconds and nanoseconds that the cache lives after it's created or after the ttl is updated. When you set the ttl, the expireTime of the cache is automatically updated.
- expire_time: A Timestamp (like 2024-06-30T09:00:00.000000Z) that specifies the absolute date and time when the cache expires.
If you don't set either of these values, the default TTL is 1 hour. There are no minimum or maximum bounds on the TTL.
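The two fields express the same deadline in different forms. As a rough sketch, you can derive either one in shell; the snippet below uses GNU `date` syntax (on macOS, `date -u -v+2H` would be the equivalent), and the 2-hour window is just an example.

```shell
# TTL as a duration: seconds with the required "s" suffix
TTL="$((2 * 3600))s"
echo "$TTL"  # 7200s

# expire_time as an absolute RFC 3339 timestamp two hours from now
# (GNU date; on macOS use: date -u -v+2H '+%Y-%m-%dT%H:%M:%SZ')
EXPIRE_TIME="$(date -u -d '+2 hours' '+%Y-%m-%dT%H:%M:%SZ')"
echo "$EXPIRE_TIME"
```

Setting `ttl: "7200s"` at creation time yields the same expiration as setting `expire_time` to that computed timestamp.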
For existing explicit caches, you can add or update the ttl or expire_time.
You need to provide the CACHE_ID, which is the final segment in the
fully-qualified resource name of the cache.
Update ttl
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
TTL="CACHE_TIME_TO_LIVE"
curl \
-X PATCH \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID} -d \
'{
"ttl": "'$TTL'"
}'
Update expire_time
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
EXPIRE_TIME="ABSOLUTE_TIME_CACHE_EXPIRES"
curl \
-X PATCH \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID} -d \
'{
"expire_time": "'$EXPIRE_TIME'"
}'
Delete a cache
When an explicit cache is no longer needed, you can delete it.
You need to provide the CACHE_ID, which is the final segment in the
fully-qualified resource name of the cache.
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
curl \
-X DELETE \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID}
Pricing for explicit caching
Explicit caching is a paid feature designed to reduce cost. Pricing is based on the following factors:
- Input tokens for cache creation: For both implicit and explicit caching, you're billed for the input tokens used to create the cache at the standard input token price.
- Storage of the cache: For explicit caching, there are also storage costs based on how long caches are stored. There are no storage costs for implicit caching. For more information, see the pricing for the Vertex AI Gemini API.
- Usage of cached content: Explicit caching guarantees a discount on input tokens that reference an existing cache. For Gemini 2.5 and later models, this discount is 90%.
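To make the discount concrete, here's a back-of-the-envelope calculation. The token count and the $0.30-per-million input rate are made up for illustration; check the current Vertex AI pricing page for real rates, and remember this ignores cache-creation and storage costs.

```shell
# Hypothetical: 100,000 cached input tokens per request, illustrative rate of
# $0.30 per 1M input tokens, 90% discount on cached tokens (Gemini 2.5+).
awk 'BEGIN {
  rate_per_token = 0.30 / 1000000      # illustrative standard input rate
  cached = 100000                      # cachedContentTokenCount
  full = cached * rate_per_token       # cost without caching
  disc = full * (1 - 0.90)             # cost with the 90% cache discount
  printf "without cache: $%.6f  with cache: $%.6f\n", full, disc
}'
# prints: without cache: $0.030000  with cache: $0.003000
```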