Details
- H Company released Holotron-12B, a 12B-parameter multimodal computer-use model. It is post-trained from NVIDIA's open Nemotron-Nano-12B-v2-VL-BF16 on proprietary data for screen understanding, grounding, and UI interaction, with training covering 14 billion tokens.
- The release involves H Company (an NVIDIA Inception Program participant) and NVIDIA research labs; the model is available on Hugging Face under the NVIDIA Open Model License.
- The model inherits Nemotron's hybrid SSM-attention architecture, which supports high-throughput inference over long contexts containing multiple images. On the WebVoyager benchmark it achieves 2x higher throughput than Holo2-8B, at 8.9k tokens/s on a single H100 GPU.
- It improves over both the base Nemotron model (WebVoyager from 35.1% to 80.5%) and Holo2-8B, and it also comes out ahead on localization benchmarks such as OS-World-G, GroundUI, and WebClick. Unlike static vision models, it is optimized for interactive agentic workloads.
- NVIDIA's Nemotron Nano 12B v2 VL supports multi-image document intelligence (up to 4 images at 1k x 2k) and video understanding. Its design aligns with the hybrid MoE approach of Nemotron 3 Super, aimed at agentic efficiency in multi-agent systems.
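To put the quoted figures in perspective, here is a minimal arithmetic sketch in Python. It uses only the numbers reported above (35.1%, 80.5%, 8.9k tokens/s); the variable names are illustrative, and the hourly token count assumes the quoted rate is sustained, which the summary does not claim.

```python
# Figures quoted in the summary above.
base_webvoyager = 35.1       # percent, base Nemotron model
holotron_webvoyager = 80.5   # percent, Holotron-12B
holotron_tokens_per_s = 8.9e3  # tokens/s on a single H100

# Absolute gain in percentage points and relative gain as a ratio.
abs_gain_points = round(holotron_webvoyager - base_webvoyager, 1)
rel_gain_ratio = round(holotron_webvoyager / base_webvoyager, 2)

# Tokens processed per hour IF the quoted rate were sustained (assumption).
tokens_per_hour = holotron_tokens_per_s * 3600

print(f"WebVoyager gain: +{abs_gain_points} points ({rel_gain_ratio}x the base score)")
print(f"Tokens per hour at the quoted rate: {tokens_per_hour / 1e6:.2f}M")
```

The gain is best read in percentage points (a difference of two success rates), while the ratio shows how many times higher the post-trained score is than the base score.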
Impact
Holotron-12B advances efficient multimodal agents for production-scale computer use, building on NVIDIA's Nemotron foundation to cut memory costs and boost throughput for workloads such as data generation and RL. It positions H Company to compete with models from OpenAI and Anthropic in agentic AI, and it signals a shift toward hybrid architectures for enterprise deployments amid NVIDIA's Nemotron 3 push.