Spatial Autoregressive Modeling of DINOv3 Embeddings for Unsupervised Anomaly Detection

Ertunc Erdil*, Nico Schulthess, Guney Tombak, Ender Konukoglu
Computer Vision Lab., ETH Zurich, Zurich, Switzerland

Abstract

DINO models provide rich patch-level representations that have recently enabled strong performance in unsupervised anomaly detection (UAD). Most existing methods extract patch embeddings from “normal” images and model them independently, ignoring spatial and neighborhood relationships between patches. In addition, the normative distribution is often modeled with memory banks or prototype-based representations, which require storing large numbers of features and performing costly comparisons at inference time, leading to substantial memory and computational overhead. In this work, we address both limitations by proposing a simple and efficient framework that explicitly models spatial and contextual dependencies between patch embeddings using a 2D autoregressive (AR) model. Instead of storing embeddings or clustering prototypes, our approach learns a compact parametric model of the normative distribution via an AR convolutional neural network (CNN). At test time, anomaly detection reduces to a single forward pass through the network, enabling fast and memory-efficient inference. We evaluate our method on the BMAD benchmark, which comprises three medical imaging datasets, and compare it against existing work, including recent DINO-based methods. Experimental results demonstrate that explicitly modeling spatial dependencies achieves competitive anomaly detection performance while substantially reducing inference time and memory requirements.
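To illustrate the idea, here is a minimal sketch of 2D autoregressive modeling over a grid of patch embeddings. The paper fits an AR CNN to DINOv3 features; this toy version instead fits a linear least-squares predictor of each patch embedding from its causal neighbors (left and top), and scores anomalies by the per-patch prediction error. The function names (`fit_linear_ar`, `anomaly_map`) and the linear, two-neighbor context are simplifying assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def fit_linear_ar(patches):
    """Fit a linear 2D AR model on 'normal' patch-embedding grids.

    patches: array of shape (N, H, W, D) — N normal images, each an
    H x W grid of D-dimensional patch embeddings (e.g. from a ViT).
    Returns a (2D, D) weight matrix predicting the center embedding
    from the concatenated [left, top] neighbor embeddings.
    """
    N, H, W, D = patches.shape
    X, Y = [], []
    for n in range(N):
        for i in range(1, H):
            for j in range(1, W):
                # causal context: left and top neighbors only
                ctx = np.concatenate([patches[n, i, j - 1], patches[n, i - 1, j]])
                X.append(ctx)
                Y.append(patches[n, i, j])
    X, Y = np.asarray(X), np.asarray(Y)
    W_ar, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W_ar

def anomaly_map(patch_grid, W_ar):
    """Per-patch anomaly score = squared AR prediction error.

    patch_grid: (H, W, D) embedding grid of a test image.
    Border patches (no full causal context) are scored 0 here.
    """
    H, W, D = patch_grid.shape
    scores = np.zeros((H, W))
    for i in range(1, H):
        for j in range(1, W):
            ctx = np.concatenate([patch_grid[i, j - 1], patch_grid[i - 1, j]])
            pred = ctx @ W_ar
            scores[i, j] = np.sum((patch_grid[i, j] - pred) ** 2)
    return scores
```

A patch whose embedding is inconsistent with its spatial context under the learned normative model receives a high score, which is the same principle the AR CNN applies with a larger receptive field and a richer predictive distribution.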

Results

Interactive figures: AUROC vs. runtime and AUPR vs. runtime on BraTS2021, BTCV+LiTs, and RESC.