
TII Releases Falcon Perception Open-Source Vision Model

Falcon Perception introduces an early-fusion Transformer architecture that outperforms Meta's SAM 3 in dense image segmentation and OCR-guided grounding.

The Technology Innovation Institute (TII) has released Falcon Perception and Falcon OCR, two open-source vision models built on an early-fusion Transformer architecture. For developers integrating vision into applications, these models eliminate the standard pipeline of routing image data through a frozen vision backbone before passing it to a language decoder. Both models are available on Hugging Face under the permissive TII Falcon License 2.0.

Early-Fusion Architecture

Falcon Perception replaces the traditional modular vision pipeline with an early-fusion Transformer architecture. It processes image patches and text tokens in a shared parameter space from the first layer, using a single unified 0.6-billion-parameter backbone.

The model applies a hybrid attention mask to manage context: image tokens use bidirectional attention to establish global visual context, while prediction tokens use causal attention during autoregressive generation. A lightweight token interface handles continuous spatial outputs, allowing the model to generate parallel high-resolution mask predictions.
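As a rough illustration of the hybrid mask described above, the pattern can be expressed as a boolean legality matrix. This is a minimal sketch of the attention pattern only, not code from the release; the token layout (image tokens first, then text/prediction tokens) is an assumption.

```python
def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> list[list[bool]]:
    """Build a (q, k) legality matrix: True means query q may attend to key k.

    Image tokens attend bidirectionally among themselves; text/prediction
    tokens see every image token plus earlier text tokens (causal).
    """
    n = num_image_tokens + num_text_tokens
    allowed = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < num_image_tokens:
                # Bidirectional block: image queries see all image keys only.
                allowed[q][k] = k < num_image_tokens
            else:
                # Text queries: full view of the image, causal over text.
                allowed[q][k] = k < num_image_tokens or k <= q
    return allowed
```

In practice this pattern would be realized inside the attention kernel rather than as a dense matrix, but the matrix makes the two attention regimes easy to see.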

Benchmark Performance

Falcon Perception targets dense image segmentation and open-vocabulary grounding driven by natural language instructions. On the SA-Co benchmark, the 0.6B-parameter model scored 68.0 Macro-F1, outperforming Meta’s SAM 3 at 62.3. TII also introduced PBench, a diagnostic benchmark of compositional prompts that test spatial constraints, object relations, and text-reading capabilities. Falcon Perception averaged 57.0 Macro-F1 on PBench, against 44.4 for SAM 3 and 52.7 for the larger Qwen3-VL-30B.
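For reference, Macro-F1 is the unweighted mean of per-class F1 scores, so rare classes count as much as common ones. A minimal sketch of the metric itself (not of either benchmark's evaluation harness):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 score from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(per_class_counts: list[tuple[int, int, int]]) -> float:
    """Macro-F1: average the per-class F1 scores with equal weight per class."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(scores) / len(scores)
```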

In the PBench Dense split for crowded scenes, Falcon Perception scored 72.6, well ahead of Qwen3-VL-30B’s 8.9. The architecture excels at OCR-guided grounding tasks. It can disambiguate specific objects by reading text directly off them, a task where traditional segmentation models struggle.

Document Intelligence

TII released Falcon OCR alongside the perception model. This 300-million-parameter model focuses entirely on document text recognition, including multi-column layouts. It scored 80.3 on olmOCR and 88.64 on OmniDocBench, and TII reports the highest throughput among currently available open-source OCR models. If you build a RAG application that deals with complex PDF layouts, this model provides an efficient text extraction layer.
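A sketch of how such a model could slot in as the extraction layer of a RAG pipeline. The `ocr` callable stands in for an actual Falcon OCR inference call (not shown here, since the model's exact inference API is not covered in this announcement); the chunk sizes are illustrative.

```python
from typing import Callable

def build_extraction_layer(ocr: Callable[[bytes], str],
                           chunk_chars: int = 800,
                           overlap: int = 80) -> Callable[[list[bytes]], list[str]]:
    """Wrap a page-level OCR function (e.g. a Falcon OCR call) into a
    document -> chunks extractor suitable for feeding a RAG index."""
    def extract(pages: list[bytes]) -> list[str]:
        # Run OCR per page, then split the joined text into overlapping
        # chunks so embeddings retain context across chunk boundaries.
        text = "\n".join(ocr(page) for page in pages)
        step = chunk_chars - overlap
        return [text[i:i + chunk_chars] for i in range(0, len(text), step)]
    return extract
```

Keeping the OCR call behind a plain callable also makes it trivial to swap extraction backends without touching the chunking or indexing code.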

Optimized Inference Stack

The custom attention patterns in Falcon Perception require specific optimizations for AI inference deployments. The release includes code for a vLLM Docker server and an MLX integration for running the models locally on Apple Silicon. The server relies on PyTorch’s FlexAttention to process variable-length sequences efficiently.
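FlexAttention expresses custom masks as a predicate over token indices rather than a materialized dense matrix, which is what makes variable-length hybrid patterns cheap to handle. A sketch of what the hybrid mask above could look like in that style (the image-token count and the predicate itself are assumptions for illustration, not TII's implementation):

```python
NUM_IMAGE = 256  # assumed number of image tokens at the start of the sequence

def hybrid_mask_mod(b, h, q_idx, kv_idx):
    """FlexAttention-style mask predicate: True where attention is allowed.

    Written with bitwise ops so it works on plain ints here and on index
    tensors when handed to FlexAttention.
    """
    image_q = q_idx < NUM_IMAGE
    image_kv = kv_idx < NUM_IMAGE
    text_q = q_idx >= NUM_IMAGE
    # Image queries: bidirectional within the image block.
    # Text queries: see every image token, causal over text.
    return (image_q & image_kv) | (text_q & (image_kv | (kv_idx <= q_idx)))

# With PyTorch >= 2.5 this predicate would be compiled into a block mask,
# roughly (sketch, untested):
#   from torch.nn.attention.flex_attention import create_block_mask
#   block_mask = create_block_mask(hybrid_mask_mod, B=None, H=None,
#                                  Q_LEN=seq_len, KV_LEN=seq_len)
```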

A paged inference engine utilizes virtual page tables to eliminate memory waste from padding. For repeated queries on the same image, an LRU High-Resolution Feature Cache skips redundant upsampling steps. On an NVIDIA H100, latencies measure roughly 100ms for prefill, 200ms for upsampling, and 50ms per instance for decoding. A cached upsample reduces the 200ms step to zero.
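The caching behavior can be sketched as a minimal LRU keyed by image id. The capacity and data layout here are assumptions; the actual engine caches high-resolution features inside the inference stack.

```python
from collections import OrderedDict
from typing import Callable

class FeatureCache:
    """Minimal LRU cache sketch for high-resolution image features."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self._store: OrderedDict = OrderedDict()

    def get_or_compute(self, image_id: str, compute: Callable):
        """Return (features, cache_hit). On a hit, the expensive upsampling
        step represented by `compute` is skipped entirely."""
        if image_id in self._store:
            self._store.move_to_end(image_id)   # mark as most recently used
            return self._store[image_id], True
        features = compute()                     # cold path: run upsampling
        self._store[image_id] = features
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)      # evict least recently used
        return features, False
```

On the figures above, a cold single-instance query costs roughly 100 + 200 + 50 = 350 ms; a repeat query on a cached image skips the 200 ms upsample and lands near 150 ms.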

Integrating these models requires replacing multi-step vision pipelines with a single unified call. Update your inference infrastructure to support FlexAttention and paged KV caching to benefit from the zero-latency upsampling on repeated image queries.
