AI Edge RAG guide for Android

The AI Edge RAG SDK provides the fundamental components to construct a Retrieval Augmented Generation (RAG) pipeline with the LLM Inference API. A RAG pipeline provides LLMs with access to user-provided data, which can include updated, sensitive, or domain-specific information. With the added information retrieval capabilities from RAG, LLMs can generate more accurate and context-aware responses for specific use cases.

This guide walks you through a basic implementation of a sample application that uses the LLM Inference API with the AI Edge RAG SDK. It focuses on constructing a RAG pipeline. For more information on using the LLM Inference API, see the LLM Inference for Android guide.

You can find the complete sample application on GitHub. To get started, build the application, read through the user-provided data (sample_context.txt), and ask the LLM questions relating to information in the text file.

Run the example application

This guide refers to an example of a basic text generation app with RAG for Android. You can use the sample app as a starting point for your own Android app, or refer to it when modifying an existing app.

The application is optimized for higher-end devices such as Pixel 8, Pixel 9, S23 and S24. Connect an Android device to your workstation and ensure you have a current version of Android Studio. For more information, see the Android setup guide.

Download the application code

The following instructions show you how to create a local copy of the example code using the git command line tool.

Clone the git repository using the following command:

git clone https://github.com/google-ai-edge/ai-edge-apis

After creating a local version of the example code, you can import the project into Android Studio and run the app.

Download a model

The sample application is configured to use Gemma-3 1B. Gemma-3 1B is part of the Gemma family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models. The model contains 1B parameters and open weights.

Download Gemma-3 1B

After downloading Gemma-3 1B from Hugging Face, push the model to your device:

cd ~/Downloads
tar -xvzf gemma3-1b-it-int4.tar.gz
adb shell rm -r /data/local/tmp/llm/ # Remove any previously loaded models
adb shell mkdir -p /data/local/tmp/llm/
adb push output_path /data/local/tmp/llm/model_version.task

You can also use other models with the sample application, but doing so may require additional configuration steps.

Set up an embedder

The embedder takes chunks of text from the user-provided data and turns them into vectorized numeric representations that capture their semantic meaning. The RAG pipeline uses these embeddings to identify the most semantically relevant chunks and incorporates them into the output generated by the LLM.
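
To illustrate what "semantically relevant" means, the following conceptual sketch scores two embedding vectors with cosine similarity, where values closer to 1.0 indicate closer semantic meaning. This is not SDK code; the RAG SDK performs retrieval internally, and the function below only illustrates how relevance between vectors is commonly measured.

import kotlin.math.sqrt

// Conceptual sketch only: scores how similar two embedding vectors are.
// The RAG SDK computes relevance internally; this is for illustration.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
  require(a.size == b.size) { "Embeddings must have the same dimension" }
  var dot = 0f
  var normA = 0f
  var normB = 0f
  for (i in a.indices) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (sqrt(normA) * sqrt(normB))
}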

The sample application is designed to work with two embedders: the Gemini embedder and the Gecko embedder.

Set up with Gecko embedder

By default, the sample app is configured to use the Gecko embedder (GeckoEmbeddingModel), and runs the model completely on-device.

Download Gecko 110m-en

The Gecko embedder is available as float and quantized models, with multiple versions for different sequence lengths. For more information, see the Gecko model card.

The model specifications can be found in the model filename. For example:

  • Gecko_256_fp32.tflite: Float model that supports sequences of up to 256 tokens.
  • Gecko_1024_quant.tflite: Quantized model that supports sequences of up to 1024 tokens.

The sequence length is the maximum chunk size the model can embed. For example, if the Gecko_256_fp32.tflite model is passed a chunk that exceeds the sequence length, the model embeds the first 256 tokens and truncates the remainder of the chunk.

Push the tokenizer model (sentencepiece.model) and the Gecko embedder to your device:

adb push sentencepiece.model /data/local/tmp/sentencepiece.model
adb push Gecko_256_fp32.tflite /data/local/tmp/gecko.tflite

The embedding model is compatible with both CPU and GPU. By default, the sample app is configured to extract embeddings with the Gecko model on GPU.

companion object {
  ...
  private const val USE_GPU_FOR_EMBEDDINGS = true
}

Set up with Gemini Embedder

The Gemini Embedder (GeminiEmbedder) creates embeddings using the Gemini Cloud API. This requires a Google Gemini API key to run the application, which you can obtain from the Google Gemini API setup page.

Get a Gemini API key in Google AI Studio

Add your Gemini API key and set COMPUTE_EMBEDDINGS_LOCALLY to false in RagPipeline.kt:

companion object {
  ...
  private const val COMPUTE_EMBEDDINGS_LOCALLY = false
  private const val GEMINI_API_KEY = "<API_KEY>"
}

How it works

This section provides more in-depth information on the RAG pipeline components of the application. You can view most of the code at RagPipeline.kt.

Dependencies

The RAG SDK uses the com.google.ai.edge.localagents:localagents-rag library. Add this dependency to the build.gradle file of your Android app:

dependencies {
    ...
    implementation("com.google.ai.edge.localagents:localagents-rag:0.1.0")
    implementation("com.google.mediapipe:tasks-genai:0.10.22")
}

User-provided data

The user-provided data in the application is a text file named sample_context.txt, which is stored in the assets directory. The application takes chunks of the text file, creates embeddings of those chunks, and refers to the embeddings when generating output text.

The following code snippet can be found in MainActivity.kt:

class MainActivity : ComponentActivity() {
  lateinit var chatViewModel: ChatViewModel
...
    chatViewModel.memorizeChunks("sample_context.txt")
...
}

Chunking

For simplicity, the sample_context.txt file includes <chunk_splitter> tags that the sample application uses to create chunks. Embeddings are then created for each chunk. In production applications, the size of chunks is a key consideration: if a chunk is too large, the resulting vector lacks the specificity to be useful; if it is too small, the chunk does not carry enough context.

The sample application handles the chunking through the memorizeChunks function in RagPipeline.kt.
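
As a rough illustration of that chunking step, the following sketch reads the asset file and splits it on the <chunk_splitter> tags. The readChunks helper is a hypothetical name, not the sample app's implementation; memorizeChunks in RagPipeline.kt additionally records each chunk into the pipeline's semantic memory.

import android.content.Context

// Hypothetical sketch: read the user-provided asset file and split it into
// chunks on the <chunk_splitter> tags, dropping empty entries.
fun readChunks(context: Context, assetFileName: String): List<String> =
  context.assets.open(assetFileName).bufferedReader().use { it.readText() }
    .split("<chunk_splitter>")
    .map { it.trim() }
    .filter { it.isNotEmpty() }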

Embedding

The application offers two pathways for text embedding:

  • Gecko embedder: Local (on-device) text embedding extraction with the Gecko model.
  • Gemini Embedder: Cloud-based text embedding extraction with the Generative Language Cloud API.

The sample application selects the embedder based on whether the user intends to compute embeddings locally or through Google Cloud. The following code snippet can be found in RagPipeline.kt:

private val embedder: Embedder<String> = if (COMPUTE_EMBEDDINGS_LOCALLY) {
  GeckoEmbeddingModel(
    GECKO_MODEL_PATH,
    Optional.of(TOKENIZER_MODEL_PATH),
    USE_GPU_FOR_EMBEDDINGS,
  )
} else {
  GeminiEmbedder(
    GEMINI_EMBEDDING_MODEL,
    GEMINI_API_KEY
  )
}

Database

The sample application uses SQLite (SqliteVectorStore) to store text embeddings. You can also use the DefaultVectorStore database for non-persistent vector storage.

The following code snippet can be found in RagPipeline.kt:

private val config = ChainConfig.create(
    mediaPipeLanguageModel, PromptBuilder(QA_PROMPT_TEMPLATE1),
    DefaultSemanticTextMemory(
        SqliteVectorStore(768), embedder
    )
)

The sample app sets the embedding dimension to 768, which is the length of each vector stored in the vector database. This value must match the output dimension of the embedding model you use.

Chain

The RAG SDK provides chains, which combine several RAG components into a single pipeline. You can use chains to orchestrate retrieval and query models. The API is based on the Chain interface.

The sample application uses the Retrieval and Inference chain. The following code snippet can be found in RagPipeline.kt:

private val retrievalAndInferenceChain = RetrievalAndInferenceChain(config)

The chain is invoked when the model generates responses:

suspend fun generateResponse(
    prompt: String,
    callback: AsyncProgressListener<LanguageModelResponse>?
): String =
    coroutineScope {
        val retrievalRequest =
            RetrievalRequest.create(
                prompt,
                RetrievalConfig.create(2, 0.0f, TaskType.QUESTION_ANSWERING)
            )
        retrievalAndInferenceChain.invoke(retrievalRequest, callback).await().text
    }
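
As a usage sketch, the following snippet calls generateResponse from a coroutine. The askQuestion function and onAnswer callback are illustrative names, not part of the sample app; passing null as the callback skips streaming progress updates.

import kotlinx.coroutines.CoroutineScope
import kotlinx.coroutines.launch

// Hypothetical caller: generateResponse is a suspend function, so it must run
// inside a coroutine. The final text is delivered once the chain completes.
fun askQuestion(
  ragPipeline: RagPipeline,
  scope: CoroutineScope,
  prompt: String,
  onAnswer: (String) -> Unit,
) {
  scope.launch {
    val answer = ragPipeline.generateResponse(prompt, callback = null)
    onAnswer(answer)
  }
}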