Visual Perception in AI & Data Science — Research Repository

AI & Data Science Preprint PDF DOI

Representation Fréchet Loss for Visual Generation

Jiawei Yang, Zhengyang Geng, Xuan Ju, Yonglong Tian, Yue Wang · 2026

We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation space. Our idea is simple: decouple the population…

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, Bin Wang · 2026

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, …

Stop Holding Your Breath: CT-Informed Gaussian Splatting for Dynamic Bronchoscopy

Andrea Dunn Beltran, Daniel Rho, Aarav Mehta, Xinqi Xiong, Raul San Jose Estepar, Ron Alterovitz, Marc Niethammer, Roni Sengupta · 2026

Bronchoscopic navigation relies on registering endoscopic video to a preoperative CT scan, but respiratory motion deforms the airway by 5-20 mm, creating CT-to-body divergence that limits localization…

LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis

Lincan Li, Zheng Chen, Yushun Dong · 2026

Electroencephalogram (EEG) signals are vital for automated seizure detection, but their inherent noise makes robust representation learning challenging. Existing graph construction methods, whether co…

Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering

Furkan Kınlı · 2026

Night Photography Rendering (NPR) poses a significant challenge due to the extreme contrast between dark and illuminated areas in scenes, stemming from concurrent capture of severely dark regions alon…

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin · 2026

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). H…

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Andrew Bond, Ilkin Umut Melanlioglu, Erkut Erdem, Aykut Erdem · 2026

Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physic…

AesRM: Improving Video Aesthetics with Expert-Level Feedback

Yujin Han, Yujie Wei, Yefei He, Xinyu Liu, Tianle Li, Zichao Yu, Andi Han, Shiwei Zhang, Tingyu Weng, Difan Zou · 2026

Despite rapid advances in photorealistic video generation, real-world applications such as filmmaking require video aesthetics, e.g., harmonious colors and cinematic lighting, beyond visual fidelity. …

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Neemias B da Silva, Rodrigo Minetto, Daniel Silver, Thiago H Silva · 2026

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral d…

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

Jialu Shen, Han Lyu, Suyang Zhong, Hanzheng Li, Haoyi Tao, Nan Wang, Changhong Chen, Xi Fang · 2026

Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-spec…

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa, Hugo Proenca, Tiago Roxo · 2026

Current DeepFake detection scenarios are mostly binary, yet data manipulation can vary across audio, video, or both, whose variability is not captured in binary settings. Four-class audio-visual formu…

Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

Jing Zhang, Wentao Jiang, Tao Huang, Zhiwei Wang, Jianxin Liu, Jian Chen, Ping Ye, Gang Wang, Zengmao Wang, Bo Du, Dacheng Tao · 2026

Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer…

A Pattern Language for Resilient Visual Agents

Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll · 2026

Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and n…

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Ce Chen, Yi Ren, Yuanming Li, Viktor Goriachko, Zhenhui Ye, Zujin Guo, Zhibin Hong, Mingming Gong · 2026

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this f…

FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Xiuying Chen · 2026

Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-st…

From LLM-Driven Trading Card Generation to Procedural Relatedness: A Pokémon Case Study

Johannes Pfau, Panagiotis Vrettis · 2026

Since the dawn of Trading Card Games, the genre has grown into a multi-billion-dollar industry engaging millions of analog and digital players worldwide. Popular TCGs rely on regular updates, balance …

ClimateVID -- Social Media Videos Analysis and Challenges Involved

Shiqi Xu, Moritz Burmester, Katharina Prasse, Isaac Bravo, Stefanie Walter, Margret Keuper · 2026

The pervasive growth of digital content, specifically short videos on social media platforms, has significantly altered how topics are discussed and understood in public discourse. In this work, we ad…

TripVVT: A Large-Scale Triplet Dataset and a Coarse-Mask Baseline for In-the-Wild Video Virtual Try-On

Dingbao Shao, Song Wu, Shenyi Wang, Ye Wang, Ziheng Tang, Fei Liu, Jiang Lin, Xinyu Chen, Qian Wang, Ying Tai, Jian Yang, Zili Yi · 2026

Due to the scarcity of large-scale in-the-wild triplet data and the improper use of masks, the performance of video virtual try-on models remains limited. In this paper, we first introduce TripVVT-1…

The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

Kenneth J. K. Ong · 2026

As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effe…

Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction

Shipeng Liu, Liang Zhao, Dengfeng Chen, Zhanping Song · 2026

Tunnel inspection requires outputs that can support defect localization, measurement, severity grading, and engineering documentation. Existing training-free foundation-model pipelines usually stop at…
