May 20, 2026 · 7 min read

How to translate any image in your browser (2026 guide)

Beyond manga: vision-capable AI translates speech bubbles, doujin captions, foreign tweets, JRPG screenshots, light-novel covers, signs and labels. The 2026 setup for in-browser image translation.

The browser-translation problem isn’t just manga. It’s anything where the text you want to read lives as pixels, not characters: doujin captions, foreign-language tweets in screenshots, JRPG dialogue, Pixiv tags, light-novel covers, product labels on foreign-language shopping pages. Web browsers can translate paragraph text natively, but they can’t see inside an image. Until recently, reading that text meant working around the browser.

Why text-in-image is a different problem than text-in-HTML

Google Translate, DeepL, every browser’s built-in translator — they all do the same trick: walk the page’s HTML, find text nodes, send those strings to a backend, swap them inline. None of that works for image content. The text isn’t in the DOM. It’s a pattern of pixels inside a JPEG or PNG.
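
To make the limitation concrete, here is a minimal sketch of that text-node walk over a simplified page model. The PageNode type and the collectTextNodes function are hypothetical stand-ins invented for this example, not a real browser API; an actual extension would more likely use the DOM's TreeWalker. The only point is that an image element hands the translator no strings at all.

```typescript
// Simplified stand-in for a DOM tree (hypothetical; not the real DOM types).
type PageNode =
  | { kind: "text"; value: string }
  | { kind: "element"; tag: string; children: PageNode[] };

// Depth-first walk that gathers every non-empty text string in document order.
function collectTextNodes(root: PageNode): string[] {
  if (root.kind === "text") {
    const trimmed = root.value.trim();
    return trimmed.length > 0 ? [trimmed] : [];
  }
  const found: string[] = [];
  for (const child of root.children) {
    for (const text of collectTextNodes(child)) {
      found.push(text);
    }
  }
  return found;
}

// A page with one paragraph and one image. The image may show a full
// speech bubble, but it has no text children, so the walk finds nothing in it.
const samplePage: PageNode = {
  kind: "element",
  tag: "body",
  children: [
    { kind: "element", tag: "p", children: [{ kind: "text", value: "Chapter 12" }] },
    { kind: "element", tag: "img", children: [] },
  ],
};

const translatableStrings = collectTextNodes(samplePage);
console.log(translatableStrings); // only the paragraph text is collected
```

Whatever dialogue is drawn inside the image never reaches the translation backend, because the walk has nothing to pick up there.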

Classic workarounds tried to bridge the gap and mostly failed:

  • Page-OCR translators ran optical character recognition on the whole image, then dumped a wall of translated text in a sidebar. Wrong reading order, no per-bubble context, ugly.
  • Browser-bundled OCR translation (a few extensions tried this) was either accurate-but-slow on big images or fast-but-character-set-limited (worked for English logos, choked on dense Japanese kanji).
  • Tab-out workflows (Google Lens, phone camera apps, desktop OCR utilities) meant you had to leave the page you were on. Friction killed adoption.

The 2026 pivot: image-edit AI

Two model categories changed the game. First, vision-capable LLMs — from Google, OpenAI, Anthropic and others — can see images natively. They don’t just transcribe characters; they understand which text belongs to which element, the relationship between text regions, and what tone the translation should carry. Second, and more important for in-panel translation, Google’s Gemini image-edit family can re-render an image with the translated text drawn back where the original text was — same bubble, same hand, art untouched. The OCR-then-translate-then-overlay pipeline collapses into a single model call.

For image translation specifically, this means:

  • Reading order is solved without manual annotation. The model picks up that vertical Japanese runs top to bottom in columns ordered right to left, and that horizontal Korean reads left to right.
  • Per-region context is preserved. A speech bubble is a unit. A sound effect is a unit. A caption is a unit. Each gets translated in its own framing, not concatenated into a paragraph.
  • Tone is tunable. “Formal samurai dialogue” reads differently from “casual rom-com” and the model can be prompted to match. Old OCR pipelines couldn’t see tone at all.
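
For contrast, this is roughly the bookkeeping a hand-built OCR pipeline had to do and a vision model now infers on its own. It is a sketch under stated assumptions: the TextRegion shape, the sortByReadingOrder function, and the 100-pixel row height are all illustrative, not taken from any particular tool.

```typescript
// Each detected piece of text is kept as its own unit (hypothetical shape).
type RegionKind = "bubble" | "sfx" | "caption";

interface TextRegion {
  kind: RegionKind;
  x: number; // left edge, in pixels
  y: number; // top edge, in pixels
  text: string;
}

type Direction = "rtl" | "ltr";

// Order regions row by row, top to bottom. Within a row, go right-to-left
// or left-to-right. Regions whose top edges fall in the same `rowHeight`
// band count as one row (an assumed, deliberately crude heuristic).
function sortByReadingOrder(
  regions: TextRegion[],
  direction: Direction,
  rowHeight: number = 100
): TextRegion[] {
  const rowOf = (r: TextRegion): number => Math.floor(r.y / rowHeight);
  return regions.slice().sort((a, b) => {
    const rowDiff = rowOf(a) - rowOf(b);
    if (rowDiff !== 0) return rowDiff;
    return direction === "rtl" ? b.x - a.x : a.x - b.x;
  });
}

// Three regions as a detector might return them, in no useful order.
const panelRegions: TextRegion[] = [
  { kind: "caption", x: 200, y: 300, text: "third" },
  { kind: "bubble", x: 50, y: 30, text: "second" },
  { kind: "bubble", x: 400, y: 20, text: "first" },
];

const rtlOrder = sortByReadingOrder(panelRegions, "rtl").map((r) => r.text);
const ltrOrder = sortByReadingOrder(panelRegions, "ltr").map((r) => r.text);
console.log(rtlOrder.join(" | ")); // first | second | third
console.log(ltrOrder.join(" | ")); // second | first | third
```

A fixed row height like this breaks on irregular panel layouts, which is part of why manual ordering rules were fragile in the first place.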

Use cases beyond manga

The same engine that handles a manga panel works on any image with text. The use cases that benefit:

  • Manga / manhwa / manhua raws. The loud use case. Speech bubbles, narration boxes, sound effects, all rendered into the original layout.
  • Doujinshi and fan works. Same image-text shape as manga but for content the official translators will never touch.
  • Pixiv captions and artist tags. Most Pixiv text lives inside posted images. Auto-translating those unlocks an enormous amount of Japanese art context.
  • Foreign-language tweets / Reddit posts. People screenshot tweets constantly. The text is image content from the browser’s perspective.
  • JRPG / visual-novel / Asian-game screenshots. Whether you’re looking up a guide for an untranslated game or just want to read the dialogue in a screenshot someone posted, vision-LLM translation handles it cleanly.
  • Light-novel covers and book spines. Title text on covers is image content. Knowing what you’re looking at is half of curation.
  • Foreign signs, menus, product labels in screenshots. Travel research, online shopping from foreign stores, recipe sites in other languages.

Getting good results

A few practical notes for any image-translation workflow, regardless of which tool you use:

  • Pick an image-edit-capable model. For translation that puts English back where the bubble was, rather than just transcribing the text to a sidebar, you need a model that can re-render the image with edits. Google’s Gemini image-edit family is the standout in 2026; most other vision models can describe what they see but can’t re-draw the image with new text. MochiTranslate uses Gemini specifically for this reason.
  • Give the model context for tone. A samurai-period manga doesn’t sound like a slice-of-life comedy. Most tools let you pick a tone preference.
  • Translate a sample first if quality matters. Cheaper models are good enough for casual reading; for art books, poetry, or any text where word-choice carries weight, use a premium model for the parts you care about.
  • Build a glossary for long series. Recurring proper nouns and attack names benefit from a consistent translation across chapters. Some translators support glossaries; for ones that don’t, you can usually paste names into the prompt context.
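
The tone and glossary tips above can be sketched as a small prompt builder. This is a minimal illustration, not how any specific tool assembles its prompts: the buildTranslationPrompt function, its option names, the prompt wording, and the sample glossary entry are all assumptions made up for this example.

```typescript
// One recurring term and the translation it should always receive.
interface GlossaryEntry {
  source: string;
  target: string;
}

interface PromptOptions {
  targetLanguage: string;
  tone?: string; // e.g. "formal samurai dialogue"
  glossary?: GlossaryEntry[];
}

// Assemble a plain-text instruction block to send alongside the image.
// The wording is illustrative; tune it for whichever model you use.
function buildTranslationPrompt(options: PromptOptions): string {
  const lines: string[] = [
    `Translate all text in this image into ${options.targetLanguage}.`,
    "Keep each speech bubble, caption, and sound effect as its own unit.",
  ];
  if (options.tone) {
    lines.push(`Tone: ${options.tone}.`);
  }
  const glossary = options.glossary ?? [];
  if (glossary.length > 0) {
    lines.push("Always use these translations for recurring names:");
    for (const entry of glossary) {
      lines.push(`- ${entry.source} => ${entry.target}`);
    }
  }
  return lines.join("\n");
}

// Hypothetical series settings: a period tone plus one fixed attack name.
const examplePrompt = buildTranslationPrompt({
  targetLanguage: "English",
  tone: "formal samurai dialogue",
  glossary: [{ source: "雷神剣", target: "Thunder God Blade" }],
});
console.log(examplePrompt);
```

Keeping the glossary as data rather than hand-typed prompt text makes it easy to reuse the same entries across every chapter of a series.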

Whatever’s in the image, in your language

Image-text translation in the browser finally works the way it should in 2026: no leaving the page, no desktop pipeline, no guessing at the source script. Hover an image, read it in your language.

The tool we built, MochiTranslate, runs on any <img> ≥ 200px on any site. Hover, translate. Whatever language is on the image, it comes back as English in the same layout. Same studio as MochiDim, same one-time pricing, same no-tracking philosophy.
