COMPUTER VISION / ZERO-SHOT
Vision Anything
Drop an image, type the labels you care about, and a CLIP-class model ranks them in your browser. Then switch to the full attention view to inspect a per-pixel relevance map for each label. Everything runs client-side; there is no backend.
Labels
CLIP works best with full prompts rather than bare words. Try prefixing each label with "a photo of a".
a photo of a cat
a photo of a dog
a photo of a bird
a photo of a person
a photo of food
a photo of a car
a photo of a building
a photo of a flower
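As a rough illustration of the prompt advice above, here is a minimal sketch of how bare labels could be wrapped in a full prompt before they are sent to the text encoder. The function name `build_prompts` and its `template` parameter are hypothetical and are not taken from this page's code; the template simply mirrors the "a photo of a" prefix suggested above.

```python
def build_prompts(labels, template="a photo of a {}"):
    """Wrap bare labels in a sentence-style prompt (hypothetical helper).

    Labels that already start with "a photo of" are kept unchanged so that
    hand-written prompts such as "a photo of food" are not double-wrapped.
    """
    prompts = []
    for label in labels:
        text = label.strip()
        if not text:
            continue  # ignore empty entries
        if text.lower().startswith("a photo of"):
            prompts.append(text)
        else:
            prompts.append(template.format(text))
    return prompts


if __name__ == "__main__":
    for prompt in build_prompts(["cat", " dog ", "a photo of food"]):
        print(prompt)
```

Uncountable nouns such as "food" read better without the article, which is why the sketch passes fully written prompts through untouched.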
Classification uses clip-vit-base-patch16 (~150 MB). The attention map uses clipseg-rd64-refined (~140 MB), a small Transformer decoder bolted on top of CLIP that turns a text prompt into a per-pixel probability map. Both run via WebAssembly and cache after the first load.
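To make the two outputs concrete, the sketch below shows the arithmetic that typically sits behind them, written in plain Python rather than the page's actual in-browser code. It assumes the usual CLIP recipe for ranking (cosine similarity between the image embedding and each prompt embedding, scaled and passed through a softmax) and assumes the segmentation decoder emits per-pixel logits that a sigmoid turns into probabilities. The names `rank_labels` and `relevance_map` are hypothetical, and the default `logit_scale` of 100 is an assumed value, not one read from this model.

```python
import math


def rank_labels(image_emb, text_embs, labels, logit_scale=100.0):
    """Rank labels by softmax over scaled cosine similarities (sketch)."""

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(txt)))
        for txt in text_embs
    ]
    # Subtract the maximum logit before exponentiating for numerical stability.
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


def relevance_map(logit_grid):
    """Turn a 2-D grid of per-pixel logits into probabilities (sketch)."""

    def sigmoid(z):
        # Branch on the sign so math.exp never receives a large positive value.
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)

    return [[sigmoid(z) for z in row] for row in logit_grid]


if __name__ == "__main__":
    # Toy 2-D embeddings, made up purely for illustration.
    ranked = rank_labels([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], ["dog", "cat"])
    print(ranked[0][0])
    print(relevance_map([[0.0, 2.0], [-2.0, 0.0]]))
```

Because the softmax runs over only the labels you supply, the scores are relative: adding or removing a label changes every other label's probability. The sigmoid map has no such coupling, since each pixel is scored independently for one prompt.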