Lesson 5 · 10 min
Screenshot analysis and UI agents
Reading a UI screenshot — and the early-2026 reality of agents that click pixels. Where it works, where it still fails.
Two distinct workloads
Screenshot understanding: the model reads a screenshot and answers questions about it. Mature in 2026 and dependable for support tickets, debugging help, and dashboard summarization.
UI agents: a model operates a UI by emitting (action, coordinates) commands such as click, type, and scroll. Anthropic's Computer Use, OpenAI Operator, and several open-source projects all ship this. It works narrowly, for repetitive workflows on a stable UI, and fails fast outside that envelope.
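To make the (action, coordinates) idea concrete, here is a minimal sketch of how a harness might represent and validate one command before executing it. The names `UIAction` and `parse_action` and the dictionary keys (`action`, `x`, `y`, `text`) are hypothetical, chosen for illustration; they are not the schema of Computer Use, Operator, or any other product.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action vocabulary, mirroring the click/type/scroll examples above.
ALLOWED_ACTIONS = {"click", "type", "scroll"}


@dataclass(frozen=True)
class UIAction:
    kind: str                   # one of ALLOWED_ACTIONS
    x: int                      # pixel column in the screenshot
    y: int                      # pixel row in the screenshot
    text: Optional[str] = None  # only used by "type"


def parse_action(raw: dict, width: int, height: int) -> UIAction:
    """Validate one model-emitted command against the screenshot size."""
    kind = raw.get("action")
    if kind not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {kind!r}")

    x, y = raw.get("x"), raw.get("y")
    # Require plain ints; this also rejects bools, floats, and missing keys.
    if type(x) is not int or type(y) is not int:
        raise ValueError("coordinates must be integers")
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"({x}, {y}) is outside the {width}x{height} screenshot")

    text = raw.get("text")
    if kind == "type" and not isinstance(text, str):
        raise ValueError("'type' needs a text string")

    return UIAction(kind, x, y, text if kind == "type" else None)


if __name__ == "__main__":
    action = parse_action({"action": "click", "x": 640, "y": 360}, 1280, 720)
    print(action)
```

Rejecting out-of-bounds coordinates up front is one cheap guard for the failure mode described above: when the UI shifts, a model can emit coordinates that no longer match the screen it was shown.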