The AI Edge RAG SDK provides the fundamental components to construct a Retrieval Augmented Generation (RAG) pipeline with the LLM Inference API. A RAG pipeline provides LLMs with access to user-provided data, which can include updated, sensitive, or domain-specific information. With the added information retrieval capabilities from RAG, LLMs can generate more accurate and context-aware responses for specific use cases.
The AI Edge RAG SDK is available for Android and can be run completely on-device. Start using the SDK by following the Android guide, which walks you through a basic implementation of a sample application using RAG.
RAG Pipeline
Setting up a RAG pipeline with the AI Edge RAG SDK involves the following key steps (a minimal end-to-end sketch follows the list):
- Import data: Provide the textual data that the LLM will use when generating output.
- Split and index the data: Break the data into small chunks for indexing in a database.
- Generate embeddings: Use an embedder to vectorize the chunks and store them in a vector database.
- Retrieve information: Define how relevant information is identified and retrieved to address user prompts. For a given prompt, the retrieval component searches through the vector database to identify relevant information.
- Generate text with LLM: Use a large language model to generate output text based on the information retrieved from the vector database.
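The following Kotlin sketch ties the five steps together for a single prompt. The interfaces here (Chunker, TextEmbedder, ChunkStore, Llm) and their method names are simplified stand-ins for illustration only, not the SDK's actual types; the concrete APIs are described under Key Modules and in the Android guide.

```kotlin
// Simplified stand-in interfaces; the SDK's real types are listed under Key Modules.
interface Chunker { fun chunk(text: String): List<String> }
interface TextEmbedder { suspend fun embed(text: String): FloatArray }
interface ChunkStore {
    fun insert(embedding: FloatArray, metadata: String)
    fun nearestNeighbors(query: FloatArray, topK: Int): List<String>
}
interface Llm { suspend fun generate(prompt: String): String }

// Runs the five pipeline steps end to end for a single prompt.
suspend fun answerWithRag(
    chunker: Chunker,
    embedder: TextEmbedder,
    store: ChunkStore,
    llm: Llm,
    documentText: String, // step 1: the user-provided data
    prompt: String,
): String {
    // Step 2: split the data into small chunks for indexing.
    val chunks = chunker.chunk(documentText)
    // Step 3: vectorize each chunk and store it in the vector database.
    for (chunk in chunks) store.insert(embedder.embed(chunk), metadata = chunk)
    // Step 4: retrieve the chunks most similar to the prompt.
    val context = store.nearestNeighbors(embedder.embed(prompt), topK = 3).joinToString("\n")
    // Step 5: generate a response grounded in the retrieved context.
    return llm.generate("Context:\n$context\n\nQuestion: $prompt")
}
```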
Key Modules
The AI Edge RAG SDK provides the following key modules and APIs for the RAG pipeline:
- Language Models: LLMs with an open-prompt API, either local (on-device) or server-based. The API is based on the LanguageModel interface.
- Text Embedding Models: Convert structured and unstructured text into embedding vectors for semantic search. The API is based on the Embedder interface.
- Vector Stores: The vector store holds the embeddings and metadata derived from data chunks. It can be queried to get similar chunks or exact matches. The API is based on the VectorStore interface.
- Semantic Memory: Serves as a semantic retriever that returns the top-k chunks most relevant to a given query (see the retrieval sketch after this list). The API is based on the SemanticMemory interface.
- Text Chunking: Splits user data into smaller pieces to facilitate indexing (a toy chunker is sketched below). The API is based on the TextChunker interface.
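To make the vector store and semantic memory roles concrete, here is a toy in-memory store with cosine-similarity top-k retrieval. The shape of this class is an assumption for illustration; the SDK's VectorStore and SemanticMemory interfaces are richer (metadata, exact matches, persistence).

```kotlin
import kotlin.math.sqrt

// Toy in-memory vector store plus top-k semantic retrieval; an illustrative
// assumption, not the SDK's VectorStore or SemanticMemory implementation.
class InMemoryVectorStore {
    private val entries = mutableListOf<Pair<FloatArray, String>>()

    fun insert(embedding: FloatArray, text: String) {
        entries += embedding to text
    }

    // Returns the top-k stored chunks ranked by cosine similarity to the query.
    fun topK(query: FloatArray, k: Int): List<String> =
        entries.sortedByDescending { (emb, _) -> cosine(query, emb) }
            .take(k)
            .map { it.second }

    private fun cosine(a: FloatArray, b: FloatArray): Float {
        var dot = 0f; var na = 0f; var nb = 0f
        for (i in a.indices) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i] }
        return dot / (sqrt(na) * sqrt(nb))
    }
}
```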
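Similarly, a minimal chunker might split text into fixed-size, slightly overlapping windows. This too is an illustrative sketch under assumed names, not the SDK's TextChunker implementation.

```kotlin
// Toy chunker illustrating the Text Chunking module's role: split user data
// into small, slightly overlapping pieces for indexing.
class FixedSizeChunker(private val chunkSize: Int = 512, private val overlap: Int = 64) {
    fun chunk(text: String): List<String> {
        require(overlap < chunkSize) { "overlap must be smaller than chunkSize" }
        val chunks = mutableListOf<String>()
        var start = 0
        while (start < text.length) {
            val end = minOf(start + chunkSize, text.length)
            chunks += text.substring(start, end)
            if (end == text.length) break
            start = end - overlap // overlap preserves context across chunk boundaries
        }
        return chunks
    }
}
```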
The SDK also provides chains, which combine several RAG components in a single pipeline. You can use chains to orchestrate retrieval and query models. The API is based on the Chain interface. To get started, try the Retrieval and Inference chain or the Retrieval chain.
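The sketch below shows the idea behind chain orchestration. The Chain interface name comes from the SDK, but this simplified signature and the RetrieveAndGenerate class are illustrative assumptions, not the real API.

```kotlin
// Simplified, assumed signature for a chain; the SDK's Chain interface may differ.
interface Chain<Request, Response> {
    suspend fun invoke(request: Request): Response
}

// A chain that orchestrates the retriever and the language model in one call,
// in the spirit of the SDK's Retrieval and Inference chain.
class RetrieveAndGenerate(
    private val retrieve: suspend (query: String) -> List<String>, // e.g. semantic memory top-k lookup
    private val generate: suspend (prompt: String) -> String,      // e.g. on-device LLM call
) : Chain<String, String> {
    override suspend fun invoke(request: String): String {
        val context = retrieve(request).joinToString("\n")
        return generate("Context:\n$context\n\nQuestion: $request")
    }
}
```

A retrieval-only chain would stop after the retrieve step and return the matching chunks directly, which is useful when you want to inspect or post-process the retrieved context before generation.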