Through the PRISM

Abstract

Effective visual communication stems from the harmony of multiple design principles, such as readability, contrast, alignment, overlap, and coherence, which collectively govern how clearly a design conveys the communicator's intent. While human designers reason holistically over these principles, machine agents typically condense them into a single heuristic score, offering limited interpretability and diagnostic precision.

To address this gap, we introduce PRISM (PRinciple-aware, Interpretable, and Structure-guided Design Modifications), a benchmark that systematically perturbs professional layouts from the Crello dataset along measurable design principles. The benchmark comprises 100K perturbed training samples and 10K perturbed validation designs, each isolating a specific principle violation for controlled analysis of multimodal reasoning about design quality. We show that models like Qwen-2.5-VL and GPT-4o-mini are largely insensitive to targeted principle degradations, whereas GPT-4o exhibits global awareness without fine-grained disentanglement.

Building on these insights, we propose a multi-scale evaluation framework that integrates lightweight scorers for quantitative assessment, instruction-tuned vision-language models for localised feedback, and prompt-based methods for global reasoning. Our framework provides interpretable explanations of design failures. Using these localised insights, we show targeted refinements that improve layout quality. Together, PRISM and our framework lay the foundation for interpretable design-literate multimodal reasoning systems.

PRISM Perturbation Example

Unperturbed

Original composed layout from Crello.

Coherence

Text replaced with a different theme to break thematic consistency.

Readability

Font attributes like size, color and line height modified to reduce legibility.

Contrast

Element colors adjusted to reduce perceptual contrast.

Alignment

Text boxes and icons displaced to disrupt alignment.

Overlap

Icons, text, and other elements overlapped to induce occlusion.
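To make one of these perturbations concrete, here is a minimal sketch of a contrast degradation. It is an illustration under stated assumptions, not PRISM's actual procedure: the page does not say which contrast measure the benchmark uses, so the sketch assumes the WCAG 2.x contrast ratio, and `reduce_contrast` and its `strength` parameter are hypothetical names.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) colour with 8-bit channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """WCAG contrast ratio; ranges from 1 (identical) to 21 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def reduce_contrast(fg, bg, strength):
    """Hypothetical perturbation: blend the foreground colour toward the background.

    strength=0 leaves fg unchanged; strength=1 makes fg equal to bg.
    """
    return tuple(round(f + strength * (b - f)) for f, b in zip(fg, bg))


# Toy example: dark text on a white background, degraded by 70%.
text_colour, background = (20, 20, 20), (255, 255, 255)
perturbed = reduce_contrast(text_colour, background, strength=0.7)
before = contrast_ratio(text_colour, background)
after = contrast_ratio(perturbed, background)
```

Because only one element's colour changes, a perturbation of this kind touches contrast while leaving position and content alone, which is the kind of isolation the benchmark aims for.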

Framework

Building on PRISM, we develop two complementary modules for design evaluation. The Scorer quantitatively assesses whether each principle is satisfied. The Localiser provides spatially grounded feedback identifying regions responsible for each violation.

Method overview showing Scorer and Localiser pipeline.

Scorer. Independent binary classifiers, one per principle, trained on a design-aware SigLIP-v2 backbone. Each scorer outputs a probability of principle adherence, providing interpretable decomposition of overall design quality into measurable dimensions aligned with human reasoning.
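The per-principle decomposition can be sketched as a set of independent logistic heads over a shared image embedding. This is a simplified illustration, not the authors' implementation: it assumes the backbone embedding is already computed, and the `PrincipleHead` class, the toy 3-dimensional embedding, and the weights are all made up for the example.

```python
import math

PRINCIPLES = ("coherence", "readability", "contrast", "alignment", "overlap")


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class PrincipleHead:
    """Hypothetical binary head: one linear layer followed by a sigmoid."""

    def __init__(self, weights, bias=0.0):
        self.weights = list(weights)
        self.bias = bias

    def adherence_probability(self, embedding) -> float:
        logit = sum(w * x for w, x in zip(self.weights, embedding)) + self.bias
        return sigmoid(logit)


def score_design(embedding, heads):
    """Return one adherence probability per principle, each computed independently."""
    return {name: head.adherence_probability(embedding) for name, head in heads.items()}


# Toy weights for illustration only; real heads would be trained on PRISM.
toy_heads = {name: PrincipleHead([0.5 * (i - 2), 1.0, -1.0]) for i, name in enumerate(PRINCIPLES)}
toy_embedding = [1.0, 0.2, 0.4]  # stands in for a backbone feature vector
scores = score_design(toy_embedding, toy_heads)
```

Since the heads share no parameters beyond the backbone features, a low score on one principle can be read on its own rather than being folded into a single aggregate number.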

Localiser. Qwen-2.5-VL instruction-tuned on PRISM's element-level annotations. Rather than generating free-form text, the localiser outputs the exact IDs of elements violating the queried principle, which improves reliability and reproducibility over generative responses.

Model Sensitivity Analysis

We evaluate GPT-4o, GPT-4o-mini, and Qwen-2.5-VL on PRISM's validation set. GPT-4o shows global sensitivity to principle violations but lacks fine-grained disentanglement. In contrast, GPT-4o-mini and Qwen-2.5-VL are largely insensitive to targeted degradations, highlighting the need for design-aware training to achieve interpretable multimodal reasoning.

Localisation performance across principles

Model	Readability IoU	Readability F1-score	Contrast IoU	Contrast F1-score	Overlap IoU	Overlap F1-score
QwenBase	0.3645	0.4312	0.2952	0.3804	0.3215	0.3770
QwenPrompted	0.3603	0.4147	0.3186	0.4049	0.3297	0.3984
GPT-4o-mini	0.4028	0.4718	0.3456	0.4478	0.3729	0.4526
GPT-4o	0.5532	0.6037	0.5164	0.6017	0.5417	0.6210
QwenExpert (Ours)	0.7833	0.7998	0.6761	0.7196	0.7328	0.7730
Δ Ours − GPT-4o	0.230↑	0.196↑	0.160↑	0.118↑	0.191↑	0.152↑

Localisation performance across principles. Fine-tuning Qwen-2.5-VL with PRISM annotations leads to substantial improvements in IoU and F1 over the base and prompted Qwen-2.5-VL variants, GPT-4o-mini, and GPT-4o.
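Since the Localiser outputs element IDs, one natural way to compute IoU and F1 is over sets of IDs per design. The sketch below assumes that set-based definition; the page does not state the exact formulation or averaging scheme used for the table, and the element IDs are invented for the example.

```python
def iou_and_f1(predicted_ids, ground_truth_ids):
    """Set-based IoU and F1 between predicted and annotated violating elements.

    Assumption: both inputs are iterables of element IDs for a single design.
    Returns (1.0, 1.0) when both sets are empty, i.e. nothing to find and nothing predicted.
    """
    pred, gt = set(predicted_ids), set(ground_truth_ids)
    if not pred and not gt:
        return 1.0, 1.0
    true_positives = len(pred & gt)
    iou = true_positives / len(pred | gt)
    f1 = 2 * true_positives / (len(pred) + len(gt))
    return iou, f1


# Toy example: two of three predicted elements match the annotation.
iou, f1 = iou_and_f1({"e1", "e2", "e3"}, {"e2", "e3", "e4"})
print(f"IoU={iou:.3f}  F1={f1:.3f}")
```

Averaging these per-design values over the validation set would give one number per principle, comparable in form to the table entries.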

Editing Pipeline

Using feedback from the Localiser, we implement a beam-search-based editing pipeline that proposes and scores candidate refinements at each iteration. The Scorer evaluates each candidate, retaining the top-k edits that best improve the targeted principle while preserving the overall layout structure. As a representative demonstration, we show coherence editing: replacing a thematically inconsistent background or removing an irrelevant icon without disrupting other design attributes. The pipeline generalises naturally to all five principles.

Coherence editing example: (a) Incoherent design, score 0.19 → (b) Background replaced, score 0.62 → (c) score 0.66.

Starting from an incoherent design (a), the pipeline first identifies the background as a key source of incoherence and proposes a replacement (b) that significantly improves the coherence score. A subsequent edit removes an irrelevant icon, further enhancing thematic consistency (c). This example demonstrates how targeted, principle-aware edits can iteratively refine design quality while preserving overall structure.
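The propose-score-retain loop described above can be sketched as a small generic beam search. This is a toy under stated assumptions rather than the authors' pipeline: `beam_search_edit`, `toy_propose`, and `toy_score` are hypothetical names, a design is reduced to a set of element names, the only edit is removing one element, and the structure-preservation constraint is not modelled.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Design = TypeVar("Design")


def beam_search_edit(
    initial: Design,
    propose: Callable[[Design], Iterable[Design]],
    score: Callable[[Design], float],
    beam_width: int = 2,
    iterations: int = 2,
) -> Tuple[float, Design]:
    """Keep the top `beam_width` designs per iteration; return the best (score, design)."""
    beam: List[Tuple[float, Design]] = [(score(initial), initial)]
    for _ in range(iterations):
        # Carry the current beam forward so the best score never decreases.
        candidates = list(beam)
        for _, design in beam:
            for edited in propose(design):
                candidates.append((score(edited), edited))
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        beam = candidates[:beam_width]
    return beam[0]


# Toy stand-ins for the Localiser-guided proposals and the coherence Scorer.
ON_THEME = {"title", "beach_photo", "sun_icon"}


def toy_score(design: frozenset) -> float:
    """Fraction of elements that match the (made-up) theme."""
    return len(design & ON_THEME) / len(design) if design else 0.0


def toy_propose(design: frozenset):
    """Propose every design obtainable by removing a single element."""
    return [design - {element} for element in design]


start = frozenset({"title", "beach_photo", "sun_icon", "snowflake_icon", "tax_form_icon"})
best_score, best_design = beam_search_edit(start, toy_propose, toy_score)
```

In this toy run the two off-theme icons are removed over two iterations, mirroring the step-by-step improvement in the figure; a real pipeline would replace `toy_propose` with edits targeted at the elements the Localiser flags.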

BibTeX

@inproceedings{gandhi2026prism,
  author    = {Gandhi, Mona and Joseph, K J and Parthasarathy, Srinivasan and Nag, Sayan},
  title     = {Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
  year      = {2026},
}
