For your AI feature, you might pass the same input tokens (content) to a model over and over. For these use cases, you can instead cache this content: you pass the content to the model once, store it, and reference the stored content in subsequent requests.
Context caching can significantly reduce latency and cost for repetitive tasks that involve a large amount of content, like a long text document, an audio file, or a video file. Common use cases for cached content include detailed persona documents, codebases, and manuals.
Gemini models offer two different caching mechanisms:
- Implicit caching: automatically enabled on most models; no guaranteed cost savings
- Explicit caching: can be optionally and manually enabled on most models; usually results in cost savings
Explicit caching is useful when you want a stronger guarantee of cost savings, at the price of some additional developer work.
For both implicit and explicit caching, the cachedContentTokenCount field in
your response's metadata indicates the number of tokens in the cached part of
your input. For explicit caching, make sure to review pricing
information at the bottom of this page.
Supported models
Caching is supported when using the following models:
- gemini-3.1-pro-preview
- gemini-3-flash-preview
- gemini-3.1-flash-lite-preview
- gemini-2.5-pro
- gemini-2.5-flash
- gemini-2.5-flash-lite
Media-generating models (for example, the Nano Banana models like
gemini-3.1-flash-image-preview) do not support context caching.
Cached content size limits
Each model has a minimum token count requirement for cached content. The maximum is dictated by the model's context window.
- Gemini Pro models: 4096 tokens minimum
- Gemini Flash models: 1024 tokens minimum
Additionally, the maximum size of content you can cache using a blob or text is 10 MB.
Implicit caching
Implicit caching is enabled by default and available for most Gemini models.
Google automatically passes on cost savings if your request hits the cached content. Here are some ways to increase the chance that your request uses implicit caching:
- Try putting large and common content at the beginning of your prompt.
- Try to send requests with a similar prefix in a short amount of time.
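To illustrate the first tip, here's a sketch of a request body that puts the large, reused content first and the per-request question last. Requests that share this long prefix (and are sent close together in time) are more likely to hit the implicit cache. The variable names and text are placeholders, not part of any real API.

```shell
# Illustrative only: structure the parts so the large, common content comes
# first and the small, per-request question comes last. COMMON_DOC and
# QUESTION are placeholders for your own content.
COMMON_DOC="Large manual or persona text reused across requests goes here"
QUESTION="How do I reset my password?"
BODY='{
  "contents": [{
    "role": "user",
    "parts": [
      { "text": "'"${COMMON_DOC}"'" },
      { "text": "'"${QUESTION}"'" }
    ]
  }]
}'
printf '%s\n' "$BODY"
```

Because every request begins with the same `COMMON_DOC` prefix, only the trailing question varies between requests.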
The number of tokens in the cached part of your input is provided in the
cachedContentTokenCount field in the metadata of a response.
Explicit caching
Explicit caching is not enabled by default; it's an optional capability of Gemini models that you set up yourself.
Here's how you can set up and work with explicit content caches:
- Create and use an explicit cache
- Manage explicit caches, including listing them, getting their metadata, updating their TTL or expiration time, and deleting them
Note that explicit content caches interact with implicit caching, potentially leading to additional caching beyond the explicit cached content. You can prevent cache data retention by disabling implicit caching and not creating explicit caches. For more information, see Enable and disable caching.
Create and use an explicit cache
Creating and using an explicit content cache involves three steps: create the cache, reference it in a server prompt template, and reference that template in the request from your app.
Important information about creating and using an explicit cache
Your cache must be aligned with your app's prompt requests and your server prompt template:
- The cache is specific to a Gemini API provider. Your app's prompt request must use the same provider.
  For Firebase AI Logic, we strongly recommend using explicit content caches only with the Vertex AI Gemini API. All the information and examples on this page are specific to that Gemini API provider.
- The cache is specific to a Gemini model. Your app's prompt request must use the same model.
- The cache is specific to a location when using the Vertex AI Gemini API.
  The location for the explicit cache must match the location of the server prompt template and the location where you access the model in your app's prompt request.
Also, be aware of the following limitations and requirements for explicit caching:
- Once an explicit cache is created, you can't change anything about the cache except its TTL or expiration time.
- You can cache any supported input file MIME type, or even just text provided within the cache creation request.
- If you want to include a file in the cache, you must provide the file as a Cloud Storage URI. It can't be a browser URL or a YouTube URL.
  Additionally, access restrictions on the file are checked at cache creation time; they are not checked again at request time. For this reason, make sure that any data included in the explicit cache is suitable for any user making a request that includes that cache.
- If you want to use system instructions or tools (like code execution, URL context, or grounding with Google Search), then the cache itself must contain their configurations. They cannot be configured in the server prompt template or in your app's prompt request. Note that server prompt templates do not yet support function calling (or chat). For details about how to configure system instructions and tools in your cache, see the REST API of the Vertex AI Gemini API.
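As a sketch of that last requirement, here's what a cache-creation body with a baked-in system instruction could look like. Field names follow the Vertex AI `cachedContents` REST resource; `PROJECT_ID`, `LOCATION`, `MODEL_ID`, and all text values are placeholders, and this is an unverified illustration rather than a complete request.

```shell
# Sketch: a cache-creation body that includes the system instruction in the
# cache itself (it can't go in the template or the app's prompt request).
# All values are placeholders.
cat > cache_body.json <<'EOF'
{
  "model": "projects/PROJECT_ID/locations/LOCATION/publishers/google/models/MODEL_ID",
  "systemInstruction": {
    "role": "system",
    "parts": [{ "text": "You are a support agent. Answer using the cached manual only." }]
  },
  "contents": [
    { "role": "user", "parts": [{ "text": "LARGE_COMMON_CONTENT" }] }
  ],
  "ttl": "3600s"
}
EOF
# This body would be POSTed to the same cachedContents endpoint shown in Step 1
grep -c '"systemInstruction"' cache_body.json
```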
Step 1: Create the cache
Create the cache by directly using the REST API of the Vertex AI Gemini API.
The following is an example that creates an explicit cache with a PDF file as its content.
Syntax:
PROJECT_ID="PROJECT_ID"
MODEL_ID="GEMINI_MODEL" # for example, gemini-3-flash-preview
LOCATION="LOCATION" # location for both the cache and the model
MIME_TYPE="MIME_TYPE"
CACHED_CONTENT_URI="CLOUD_STORAGE_FILE_URI" # must be a Cloud Storage URI
CACHE_DISPLAY_NAME="CACHE_DISPLAY_NAME" # optional
TTL="CACHE_TIME_TO_LIVE" # optional (if not specified, defaults to 3600s)
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents \
-d @- <<EOF
{
"model":"projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}",
"contents": [
{
"role": "user",
"parts": [
{
"fileData": {
"mimeType": "${MIME_TYPE}",
"fileUri": "${CACHED_CONTENT_URI}"
}
}
]
}
],
"displayName": "${CACHE_DISPLAY_NAME}",
"ttl": "${TTL}"
}
EOF
Example request:
PROJECT_ID="my-amazing-app"
MODEL_ID="gemini-3-flash-preview"
LOCATION="global"
MIME_TYPE="application/pdf"
CACHED_CONTENT_URI="gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf"
CACHE_DISPLAY_NAME="Gemini - A Family of Highly Capable Multimodal Models (PDF)"
TTL="7200s"
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents \
-d @- <<EOF
{
"model":"projects/${PROJECT_ID}/locations/${LOCATION}/publishers/google/models/${MODEL_ID}",
"contents": [
{
"role": "user",
"parts": [
{
"fileData": {
"mimeType": "${MIME_TYPE}",
"fileUri": "${CACHED_CONTENT_URI}"
}
}
]
}
],
"displayName": "${CACHE_DISPLAY_NAME}",
"ttl": "${TTL}"
}
EOF
Example Response:
The response includes a fully-qualified resource name that is globally
unique to the cache (note that the last segment is the cache ID). You'll use
this entire name value in the next step of the workflow.
{
"name": "projects/861083271981/locations/global/cachedContents/4545031458888089601",
"model": "projects/my-amazing-app/locations/global/publishers/google/models/gemini-3-flash-preview",
"createTime": "2024-06-04T01:11:50.808236Z",
"updateTime": "2024-06-04T01:11:50.808236Z",
"expireTime": "2024-06-04T02:11:50.794542Z"
}
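Since the cache ID (the last segment of the `name`) is needed later for the management calls, you can split it off with plain shell parameter expansion. The example value below is the `name` from the response above.

```shell
# The fully-qualified resource name returned at cache creation
CACHE_NAME="projects/861083271981/locations/global/cachedContents/4545031458888089601"

# The cache ID is the final path segment; you'll need it later to get,
# update, or delete the cache. ${var##*/} strips everything up to the last "/"
CACHE_ID="${CACHE_NAME##*/}"
echo "$CACHE_ID"  # 4545031458888089601
```

Note that Step 2 below uses the entire `CACHE_NAME` value, not just the ID.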
Step 2: Reference the cache in a server prompt template
After creating the cache, reference it by name within the cachedContent
property of a
server prompt template.
Make sure you follow these requirements when creating your server prompt template:
- Use the fully-qualified resource name from the response when you created the cache. This is not the optional display name that you specified in the request.
- The location for the server prompt template must match the location of the cache.
- To use system instructions or tools, they must be configured as part of the cache, not as part of the server prompt template.
Syntax:
{{cachedContent name="YOUR_CACHE_RESOURCE_NAME"}}
{{role "user"}}
{{userPrompt}}
Example:
{{cachedContent name="projects/861083271981/locations/global/cachedContents/4545031458888089601"}}
{{role "user"}}
{{userPrompt}}
Alternatively, the value of the name parameter in the server prompt template
can be a dynamic input variable, for example,
{{cachedContent name=someVariable}}. You then provide the name of the cache
as an input for the request from your app.
Step 3: Reference the server prompt template in the request from your app
Be very careful about the following when writing your request:
Use the Vertex AI Gemini API since the cache was created with that Gemini API provider.
The location where you access the model in your app's prompt request must match the location of the server prompt template and the cache.
Swift
// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
let model = FirebaseAI.firebaseAI(backend: .vertexAI(location: "LOCATION"))
.templateGenerativeModel()
do {
let response = try await model.generateContent(
// Specify your template ID
templateID: "TEMPLATE_ID"
)
if let text = response.text {
print("Response Text: \(text)")
}
} catch {
print("An error occurred: \(error)")
}
print("\n")
Kotlin
// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
val model = Firebase.ai(backend = GenerativeBackend.vertexAI(location = "LOCATION"))
.templateGenerativeModel()
val response = model.generateContent(
// Specify your template ID
"TEMPLATE_ID",
)
val text = response.text
println(text)
Java
// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
TemplateGenerativeModel generativeModel = FirebaseAI.getInstance().templateGenerativeModel();
TemplateGenerativeModelFutures model = TemplateGenerativeModelFutures.from(generativeModel);
Future<GenerateContentResponse> response = model.generateContent(
// Specify your template ID
"TEMPLATE_ID"
);
Futures.addCallback(
    response,
    new FutureCallback<GenerateContentResponse>() {
      @Override
      public void onSuccess(GenerateContentResponse result) {
        System.out.println(result.getText());
      }

      @Override
      public void onFailure(Throwable t) {
        reportError(t);
      }
    },
    executor);
Web
// ...
// Initialize the Vertex AI Gemini API backend service
// Make sure to specify the same location as the server prompt template and the cache
const ai = getAI(app, { backend: new VertexAIBackend('LOCATION') });
// Create a `TemplateGenerativeModel` instance
const model = getTemplateGenerativeModel(ai);
const result = await model.generateContent(
// Specify your template ID
'TEMPLATE_ID'
);
const response = result.response;
const text = response.text();
Dart
// ...
// Initialize the Vertex AI Gemini API backend service
// Create a `TemplateGenerativeModel` instance
// Make sure to specify the same location as the server prompt template and the cache
var _model = FirebaseAI.vertexAI(location: 'LOCATION').templateGenerativeModel();
var response = await _model.generateContent(
  // Specify your template ID
  'TEMPLATE_ID',
);
var text = response.text;
print(text);
Unity
// ...
// Initialize the Vertex AI Gemini API backend service
// Make sure to specify the same location as the server prompt template and the cache
var firebaseAI = FirebaseAI.GetInstance(FirebaseAI.Backend.VertexAI(location: "LOCATION"));
// Create a `TemplateGenerativeModel` instance
var model = firebaseAI.GetTemplateGenerativeModel();
try
{
var response = await model.GenerateContentAsync(
// Specify your template ID
"TEMPLATE_ID"
);
Debug.Log($"Response Text: {response.Text}");
}
catch (Exception e) {
Debug.LogError($"An error occurred: {e.Message}");
}
Manage explicit caches
This section describes managing explicit content caches, including how to list all caches, get metadata about a cache, update the TTL or expiration time of a cache, and delete a cache.
You manage explicit caches using the REST API of the Vertex AI Gemini API.
Once an explicit content cache is created, you can't change anything about the cache except the TTL or expiration time.
List all caches
You can list all the explicit caches available for your project. This command will only return the caches in the specified location.
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
curl \
-X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents
Get metadata about a cache
It's not possible to retrieve or view the actual cached content. However, you
can retrieve metadata about an explicit cache, including name, model,
display_name, usage_metadata, create_time, update_time, and
expire_time.
You need to provide the CACHE_ID, which is the final segment in the
fully-qualified resource name of the cache.
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
curl \
-X GET \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID}
Update the TTL or expiration time for a cache
When you create an explicit cache, you can optionally set the ttl or the
expire_time.
- ttl: The TTL (time-to-live) for the cache, specifically the number of seconds and nanoseconds that the cache lives after it's created or after the ttl is updated. When you set the ttl, the expireTime of the cache is automatically updated.
- expire_time: A Timestamp (like 2024-06-30T09:00:00.000000Z) that specifies the absolute date and time when the cache expires.
If you don't set either of these values, the default TTL is 1 hour. There are no minimum or maximum bounds on the TTL.
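The two fields express the same deadline in different forms. As a rough sketch, you can derive either one in shell; the snippet below uses GNU `date` syntax (on macOS, `date -u -v+2H` would be the equivalent), and the 2-hour window is just an example.

```shell
# TTL as a duration: seconds with the required "s" suffix
TTL="$((2 * 3600))s"
echo "$TTL"  # 7200s

# expire_time as an absolute RFC 3339 timestamp two hours from now
# (GNU date; on macOS use: date -u -v+2H '+%Y-%m-%dT%H:%M:%SZ')
EXPIRE_TIME="$(date -u -d '+2 hours' '+%Y-%m-%dT%H:%M:%SZ')"
echo "$EXPIRE_TIME"
```

Setting `ttl: "7200s"` at creation time yields the same expiration as setting `expire_time` to that computed timestamp.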
For existing explicit caches, you can add or update the ttl or expire_time.
You need to provide the CACHE_ID, which is the final segment in the
fully-qualified resource name of the cache.
Update ttl
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
TTL="CACHE_TIME_TO_LIVE"
curl \
-X PATCH \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID} -d \
'{
"ttl": "'$TTL'"
}'
Update expire_time
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
EXPIRE_TIME="ABSOLUTE_TIME_CACHE_EXPIRES"
curl \
-X PATCH \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID} -d \
'{
"expire_time": "'$EXPIRE_TIME'"
}'
Delete a cache
When an explicit cache is no longer needed, you can delete it.
You need to provide the CACHE_ID, which is the final segment in the
fully-qualified resource name of the cache.
PROJECT_ID="PROJECT_ID"
LOCATION="LOCATION"
CACHE_ID="CACHE_ID" # the final segment in the `name` of the cache
curl \
-X DELETE \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
https://${LOCATION}-aiplatform.googleapis.com/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION}/cachedContents/${CACHE_ID}
Pricing for explicit caching
Explicit caching is a paid feature designed to reduce cost. Pricing is based on the following factors:
- Input tokens for cache creation: For both implicit and explicit caching, you're billed for the input tokens used to create the cache at the standard input token price.
- Storage of the cache: For explicit caching, there are also storage costs based on how long caches are stored. There are no storage costs for implicit caching. For more information, see the pricing for the Vertex AI Gemini API.
- Usage of cached content: Explicit caching guarantees a discount on input tokens that reference an existing cache. For Gemini 2.5 and later models, this discount is 90%.
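To make the discount concrete, here's a back-of-the-envelope calculation. The token count and the $0.30-per-million input rate are made up for illustration; check the current Vertex AI pricing page for real rates, and remember this ignores cache-creation and storage costs.

```shell
# Hypothetical: 100,000 cached input tokens per request, illustrative rate of
# $0.30 per 1M input tokens, 90% discount on cached tokens (Gemini 2.5+).
awk 'BEGIN {
  rate_per_token = 0.30 / 1000000      # illustrative standard input rate
  cached = 100000                      # cachedContentTokenCount
  full = cached * rate_per_token       # cost without caching
  disc = full * (1 - 0.90)             # cost with the 90% cache discount
  printf "without cache: $%.6f  with cache: $%.6f\n", full, disc
}'
# prints: without cache: $0.030000  with cache: $0.003000
```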