Language detection guide

Example UI that shows an input sentence in French that is correctly
identified as French in the output.

The MediaPipe Language Detector task lets you identify the language of a piece of text. This task operates on text data with a machine learning (ML) model and outputs a list of predictions, where each prediction consists of an ISO 639-1 language code and a probability.

Get Started

Start using this task by following the implementation guide for your target platform. Each platform-specific guide walks you through a basic implementation of this task, including a recommended model and a code example with recommended configuration options.
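As a rough illustration of what such a basic implementation can look like, the sketch below uses the MediaPipe Tasks Python package. The helper names `detect_languages` and `format_predictions` and the model path passed in by the caller are assumptions made for this example, and the platform guide remains the authoritative reference for setup.

```python
from typing import List, Tuple


def detect_languages(input_text: str, model_path: str) -> List[Tuple[str, float]]:
    """Return (language_code, probability) pairs predicted for `input_text`."""
    # Imported lazily so this module loads even where mediapipe is not installed.
    from mediapipe.tasks import python
    from mediapipe.tasks.python import text as mp_text

    # `model_path` is assumed to point at a downloaded Language Detector model.
    options = mp_text.LanguageDetectorOptions(
        base_options=python.BaseOptions(model_asset_path=model_path)
    )
    # The context manager closes the detector once the prediction is done.
    with mp_text.LanguageDetector.create_from_options(options) as detector:
        result = detector.detect(input_text)
    return [(d.language_code, d.probability) for d in result.detections]


def format_predictions(predictions: List[Tuple[str, float]]) -> str:
    """Render predictions as 'code: probability' pairs for display."""
    return ", ".join(f"{code}: {prob:.2f}" for code, prob in predictions)
```

With a model file saved locally, a call such as `format_predictions(detect_languages("Bonjour tout le monde", "language_detector.tflite"))` would produce a one-line summary of the predictions; the file name here is only a placeholder.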

Task details

This section describes the capabilities, inputs, outputs, and configuration options of this task.

Features

  • Score threshold - Filter results based on prediction scores
  • Label allowlist and denylist - Specify the categories detected

Task inputs

Language Detector accepts the following input data type:

  • String

Task outputs

Language Detector outputs a list of predictions containing:

  • Language code: An ISO 639-1 (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) language / locale code (e.g. "en" for English, "uz" for Uzbek, "ja-Latn" for Japanese (romaji)) as a string.
  • Probability: The confidence score for this prediction, expressed as a floating-point probability between zero and one.
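To make the shape of this output concrete, here is a minimal sketch that models a prediction as a code and probability pair. The names `LanguagePrediction`, `split_language_code`, and `most_likely` are hypothetical helpers invented for this example and are not part of the MediaPipe API.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class LanguagePrediction(NamedTuple):
    language_code: str   # e.g. "en", "uz", or "ja-Latn"
    probability: float   # confidence between 0.0 and 1.0


def split_language_code(code: str) -> Tuple[str, Optional[str]]:
    """Split a code such as 'ja-Latn' into its language and script parts."""
    language, _, script = code.partition("-")
    return language, (script or None)


def most_likely(
    predictions: Sequence[LanguagePrediction],
) -> Optional[LanguagePrediction]:
    """Return the prediction with the highest probability, or None if empty."""
    return max(predictions, key=lambda p: p.probability, default=None)
```

Splitting on the hyphen is a simplifying assumption that covers codes of the form shown above, where an optional script suffix follows the language code.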

Configuration options

This task has the following configuration options:

| Option Name | Description | Value Range | Default Value |
| --- | --- | --- | --- |
| max_results | Sets the optional maximum number of top-scored language predictions to return. If this value is less than zero, all available results are returned. | Any positive number | -1 |
| score_threshold | Sets the prediction score threshold that overrides the one provided in the model metadata (if any). Results below this value are rejected. | Any float | Not set |
| category_allowlist | Sets the optional list of allowed language codes. If non-empty, language predictions whose language code is not in this set are filtered out. This option is mutually exclusive with category_denylist, and using both results in an error. | Any strings | Not set |
| category_denylist | Sets the optional list of language codes that are not allowed. If non-empty, language predictions whose language code is in this set are filtered out. This option is mutually exclusive with category_allowlist, and using both results in an error. | Any strings | Not set |
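The way these options interact can be sketched in plain Python. This is an illustration of the semantics described in the table, not MediaPipe's actual implementation: the function `filter_predictions` is hypothetical, and the order in which the filters are applied is an assumption.

```python
from typing import List, Optional, Sequence, Tuple

Prediction = Tuple[str, float]  # (language_code, probability)


def filter_predictions(
    predictions: Sequence[Prediction],
    max_results: int = -1,
    score_threshold: Optional[float] = None,
    category_allowlist: Optional[Sequence[str]] = None,
    category_denylist: Optional[Sequence[str]] = None,
) -> List[Prediction]:
    """Apply the documented option semantics to a list of predictions."""
    # The two lists are mutually exclusive; supplying both is an error.
    if category_allowlist and category_denylist:
        raise ValueError(
            "category_allowlist and category_denylist are mutually exclusive"
        )

    # Order by probability, highest first, so truncation keeps the top scores.
    kept = sorted(predictions, key=lambda p: p[1], reverse=True)

    # Reject results that score below the threshold, if one is set.
    if score_threshold is not None:
        kept = [p for p in kept if p[1] >= score_threshold]

    # Keep only allowed codes, or drop denied ones.
    if category_allowlist:
        allowed = set(category_allowlist)
        kept = [p for p in kept if p[0] in allowed]
    if category_denylist:
        denied = set(category_denylist)
        kept = [p for p in kept if p[0] not in denied]

    # A negative max_results means "return everything".
    if max_results >= 0:
        kept = kept[:max_results]
    return kept
```

For example, `filter_predictions(preds, max_results=1, category_denylist=["fr"])` keeps only the single highest-scoring prediction whose code is not "fr".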

Models

We offer a default, recommended model when you start developing with this task.

This model is built to be lightweight (315 KB) and uses an embedding-based neural network classification architecture. It identifies each language by its ISO 639-1 language code and can identify 110 languages. For the list of supported languages, see the label file, which lists them by ISO 639-1 code.

| Model name | Input shape | Quantization type | Model card | Versions |
| --- | --- | --- | --- | --- |
| Language Detector | string UTF-8 | none (float32) | info | Latest |

Task benchmarks

Here are the task benchmarks for the whole pipeline, based on the pre-trained model above. The latency result is the average latency on a Pixel 6 using CPU / GPU.

| Model Name | CPU Latency | GPU Latency |
| --- | --- | --- |
| Language Detector | 0.31ms | - |