Google Releases EmbeddingGemma - State Of The Art On-Device Embedding
Written by Nikos Vaggalis   
Thursday, 02 October 2025

Google has released a small, specialized but powerful embedding model that can run on low-resource devices.

EmbeddingGemma is a new open embedding model that delivers a lot of value for its size. Based on the Gemma 3 architecture, it is trained on 100+ languages and, with quantization, is small enough to run in less than 200MB of RAM.

At this point, let's not conflate an embedding model with a Large Language Model. The embedding step comes before the LLM takes action. For instance, when doing RAG you generate the embedding of a user's prompt and calculate its similarity with the embeddings of all the documents in question. When the relevant chunks are found, they are passed together with the user's query to the LLM to let it perform its GenAI magic and give an answer back to the user.
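To make the similarity step concrete, here is a minimal sketch in Dart, assuming the embeddings are already available as plain vectors of doubles; the cosineSimilarity and topMatches helpers are illustrative, not part of any particular library:

import 'dart:math';

// Cosine similarity between two embedding vectors; values close to 1.0
// mean the texts are semantically similar.
double cosineSimilarity(List<double> a, List<double> b) {
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (sqrt(normA) * sqrt(normB));
}

// Rank document chunks by similarity to the query embedding and return the
// indices of the k best matches to pass to the LLM with the user's query.
List<int> topMatches(List<double> query, List<List<double>> docs, int k) {
  final indices = List<int>.generate(docs.length, (i) => i);
  indices.sort((x, y) => cosineSimilarity(query, docs[y])
      .compareTo(cosineSimilarity(query, docs[x])));
  return indices.take(k).toList();
}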

The model itself has around 300 million parameters, produces embeddings of up to 768 dimensions, and is tuned for performance and minimal resource consumption. Usually when doing RAG, even with a local LLM run through, say, Ollama, you generate the embeddings first by calling the embedding API of a remote service like OpenAI's. This is because generating embeddings is a resource-hungry task that requires strong hardware, so it's easier to offload it to a server. Being able to generate your embeddings efficiently on your mobile phone itself is a game changer. It is also crucial for complete privacy and control, as models using private data can run locally without connecting to external servers.

As such, it is a good choice for building local, on-device AI-powered applications. Of course, for that you also need a framework like Cactus, which we covered in Cactus Lets You Build LLM Powered Applications On Your Mobile Phone:

Cactus is also cross-platform, so you can build AI applications using popular frameworks like Flutter, React Native, and Kotlin Multiplatform. Key features are:

  • Supports GGUF Models: Works with any GGUF model from Hugging Face, including Qwen, Gemma, Llama, and DeepSeek.
  • Multi-Modal AI: Run various models including LLMs, VLMs, Embedding Models, and TTS (Text-to-Speech) models.

It can also generate embeddings offline. For instance, using Flutter:

import 'package:cactus/cactus.dart';

// Download the GGUF embedding model and enable embedding generation.
final lm = await CactusLM.download(
  modelUrl: 'https://huggingface.co/nomic-ai/nomic-embed-text-v2-moe-GGUF/resolve/main/nomic-embed-text-v2-moe.Q8_0.gguf',
  contextSize: 2048,
  generateEmbeddings: true,
);
await lm.init();

final text = 'Your text to embed';
final result = await lm.embedding(text);

The embedding parameters include:

mode:

  • "local": Only use device model
  • "remote": Only use cloud API
  • "localfirst": Try local, fallback to cloud if it fails
  • "remotefirst": Try cloud, fallback to local if it fails, so that you can fallback to a cloud embedding API.

Now you can replace "nomic-embed-text-v2-moe.Q8_0.gguf" with the EmbeddingGemma GGUF file from Hugging Face.
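Sketching the change against the download call above, with a placeholder URL since the exact repository and file name on Hugging Face may vary:

// Hypothetical: swap modelUrl for an EmbeddingGemma GGUF build.
// Replace <repo> and the file name with the actual Hugging Face values.
final lm = await CactusLM.download(
  modelUrl: 'https://huggingface.co/<repo>/resolve/main/embeddinggemma-300m.Q8_0.gguf',
  contextSize: 2048,
  generateEmbeddings: true,
);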

Google's engineers themselves make the following suggestions:

  • For on-device, offline use cases: EmbeddingGemma is your best choice, optimized for privacy, speed, and efficiency.

  • For most large-scale, server-side applications: Explore our state-of-the-art Gemini Embedding model via the Gemini API for highest quality and maximum performance.

So if you're building local-first, AI-powered applications, EmbeddingGemma is the way to go.

 

More Information

Introducing EmbeddingGemma: The Best-in-Class Open Model for On-Device Embeddings  

Related Articles

Cactus Lets You Build LLM Powered Applications On Your Mobile Phone 

 

To be informed about new articles on I Programmer, sign up for our weekly newsletter, subscribe to the RSS feed and follow us on Twitter, Facebook or Linkedin.


Last Updated ( Thursday, 02 October 2025 )