Gemma 4 Multimodal Capabilities and Limitations
{{Gemma 4}}'s native audio is exclusive to the {{E2B}} and {{E4B}} edge variants, capped at 30-second clips and 60-second video at 1 FPS; the larger {{26B A4B}} ({{Mixture of Experts}}) and {{31B dense}} models accept text, image, and video but no audio. Best practice is to place media before text in the prompt.
## Modality matrix Gemma 4 Model Family Overview ships four variants released April 2, 2026 under Apache 2.0. Their multimodal support is asymmetric: | Variant | Effective params | Text | Image | Video | Audio | |---|---|---|---|---|---| | E2B | ~2.3B (5.1B total) | Yes | Yes | Yes (+audio track) | Yes | | E4B | ~4.5B (8B total) | Yes | Yes | Yes (+audio track) | Yes | | 26B A4B (Mixture of Experts (MoE), ~4B active) | 26B | Yes | Yes | Yes (silent) | No | | 31B dense | 31B | Yes | Yes | Yes (silent) | No | Native audio is the dividing line: only the small Per-Layer Embedding edge models inherited the Gemma 3n audio stack. The larger models will happily process a video file but ignore its soundtrack — set `load_audio_from_video=False` in the Hugging Face processor for those variants. ## Audio constraints - **30-second maximum per clip.** Longer recordings must be chunked before submission. The official guidance is to use voice activity detection (VAD) — e.g. Silero VAD or WebRTC VAD — to cut on speech boundaries rather than fixed time intervals, which would otherwise slice mid-word. - **Speech only.** Training data was speech in 140+ languages; music and non-speech sounds were not part of training, so genre detection, sound-event classification, or instrument identification work poorly or not at all. - **No speaker diarization.** The model cannot natively tag "Speaker A / Speaker B". For multi-party transcripts, run a dedicated diarizer (pyannote.audio, NeMo) first and pass per-speaker clips. - **No word-level timestamps.** Output is plain transcribed text. If you need karaoke-style alignment, use Whisper or a forced-aligner like WhisperX on the same clip. - **Encoder.** The audio path is a redesigned USM-style conformer encoder, ~50% smaller than the Gemma 3n encoder, with a 40 ms frame duration (25 frames/sec) tuned for lower-latency on-device ASR. Input is converted to log-mel spectrograms before the conformer stack. - **Tasks.** Beyond transcription, the encoder supports spoken question-answering, audio summarization, automatic speech translation (AST), and multi-turn voice conversation — not just one-shot ASR. ## Video constraints - **60-second maximum**, at the default sampling rate. - **1 frame per second.** This is the implicit default — 60 seconds = 60 frames. For motion-heavy content (sports, gestures, fast cuts) you will lose information between frames. - **Higher effective FPS via manual extraction.** Use ffmpeg to dump frames at your target rate, then submit them as an ordered image list. You trade context window for temporal resolution: 30 frames at 2 FPS still covers 15 seconds, etc. - **Variable resolution.** Images (including extracted video frames) accept configurable token budgets of 70, 140, 280, 560, or 1120 tokens per image. Lower budgets are appropriate for classification and video frames; reserve the 1120 budget for OCR and document parsing. - **No soundtrack on big models.** As above, 26B and 31B drop the audio track silently. ## Prompt construction Place **media before text** in every prompt. The model card is explicit: image, audio, and video segments should precede the textual question or instruction. Inverting the order measurably degrades quality on visual question answering and ASR-conditioned tasks. In the Hugging Face chat template this means listing the `{type: "image"|"audio"|"video", ...}` content blocks first and the `{type: "text", ...}` block last within a single user turn. ## Context window The edge E2B/E4B variants have a 128K context window; the 26B A4B and 31B dense variants support 256K. Long multimodal prompts can hit this ceiling quickly — a single 60-second video at 1 FPS with the 1120-token image budget consumes ~67K tokens before any text. Tune the image token budget down for video workloads. ## Pitfalls - Submitting >30 s audio without chunking truncates silently rather than erroring on some runtimes. - Using `load_audio_from_video=True` on the 26B/31B models is a no-op and wastes preprocessing time. - Music transcription (lyrics, song identification) consistently fails — out of distribution. - Speaker change detection in multi-party audio is unreliable; the model often attributes all dialogue to one speaker. - Putting the text instruction before media often produces generic answers that ignore the attached file.