COMPUTER VISION / ZERO-SHOT

Vision Anything

Drop an image, type the labels you care about, and a CLIP-class model ranks them in your browser. Then switch to full attention to inspect a per-pixel relevance map for each label. No backend: nothing leaves your machine.

Drop an image, or click to upload (PNG · JPG · WebP)

Try a sample

Labels

CLIP works best with full prompts. Try "a photo of a" prefixes.

a photo of a cat · a photo of a dog · a photo of a bird · a photo of a person · a photo of food · a photo of a car · a photo of a building · a photo of a flower
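If you would rather type bare labels and have the prefix added for you, something like the following would do it. This is only a sketch: the names `toPrompt` and `toPrompts` and the default prefix are hypothetical and are not taken from this page's code. You supply the article yourself ("a cat", "food") so that uncountable nouns still read naturally.

```typescript
// Hypothetical helpers; the names and the default prefix are assumptions.
const DEFAULT_PREFIX = "a photo of";

// Turn one label into a full prompt, leaving existing full prompts untouched.
function toPrompt(label: string, prefix: string = DEFAULT_PREFIX): string {
  // Collapse runs of whitespace and trim the ends.
  const cleaned = label.trim().replace(/\s+/g, " ");
  if (cleaned.length === 0) {
    throw new Error("label must not be empty");
  }
  // A label that already starts with the prefix is returned as-is.
  if (cleaned.toLowerCase().indexOf(prefix.toLowerCase() + " ") === 0) {
    return cleaned;
  }
  return `${prefix} ${cleaned}`;
}

// Convert a list of labels, silently skipping blank entries.
function toPrompts(labels: string[]): string[] {
  return labels
    .filter((label) => label.trim().length > 0)
    .map((label) => toPrompt(label));
}
```

For example, `toPrompts(["a cat", "food", "a photo of a dog"])` returns `["a photo of a cat", "a photo of food", "a photo of a dog"]`.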

Classification uses clip-vit-base-patch16 (~150 MB). The attention map uses clipseg-rd64-refined (~140 MB), a small Transformer decoder bolted on top of CLIP that turns a text prompt into a per-pixel probability map. Both run via WebAssembly and cache after the first load.
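For the curious, the post-processing behind both stages is small. The sketch below assumes the usual CLIP recipe (cosine similarity between the image embedding and each prompt embedding, scaled and passed through a softmax) and the usual CLIPSeg recipe (a sigmoid over per-pixel logits). The function names are hypothetical, and `logitScale = 100` is an assumed stand-in for CLIP's learned temperature; neither is taken from this page's code.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("embedding sizes differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface RankedLabel {
  label: string;
  score: number; // softmax probability over the supplied labels
}

// Rank labels by softmax over scaled cosine similarities.
// logitScale = 100 is an assumption, not a value read from the model.
function rankLabels(
  imageEmbedding: number[],
  textEmbeddings: number[][],
  labels: string[],
  logitScale: number = 100
): RankedLabel[] {
  if (textEmbeddings.length !== labels.length) {
    throw new Error("need one text embedding per label");
  }
  const logits = textEmbeddings.map((t) => logitScale * cosine(imageEmbedding, t));
  // Subtract the max logit before exponentiating for numerical stability.
  const maxLogit = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - maxLogit));
  const total = exps.reduce((sum, x) => sum + x, 0);
  return labels
    .map((label, i) => ({ label, score: exps[i] / total }))
    .sort((a, b) => b.score - a.score);
}

// Map per-pixel logits to probabilities in (0, 1) with a sigmoid.
function logitsToProbabilities(logits: number[]): number[] {
  return logits.map((x) => 1 / (1 + Math.exp(-x)));
}
```

Because the softmax runs only over the labels you typed, the scores are relative: adding or removing a label changes every other label's score. The sigmoid map is independent per pixel and per prompt, so those values do not sum to one.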