Visual Perception — Research Repository

Engineering Preprint PDF DOI

OmniRobotHome: A Multi-Camera Platform for Real-Time Multiadic Human-Robot Interaction

Junyoung Lee, Sookwan Han, Jeonghwan Kim, Inhee Lee, Mingi Choi, Jisoo Kim, Wonjung Woo, Hanbyul Joo · 2026

Human-robot collaboration has been studied primarily in dyadic or sequential settings. However, real homes require multiadic collaboration, where multiple humans and robots share a workspace, acting c…

AI & Data Science Preprint PDF DOI

Representation Fréchet Loss for Visual Generation

Jiawei Yang, Zhengyang Geng, Xuan Ju, Yonglong Tian, Yue Wang · 2026

We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation space. Our idea is simple: decouple the population…

AI & Data Science Preprint PDF DOI

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, Bin Wang · 2026

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, …

AI & Data Science Preprint PDF DOI

Stop Holding Your Breath: CT-Informed Gaussian Splatting for Dynamic Bronchoscopy

Andrea Dunn Beltran, Daniel Rho, Aarav Mehta, Xinqi Xiong, Raul San Jose Estepar, Ron Alterovitz, Marc Niethammer, Roni Sengupta · 2026

Bronchoscopic navigation relies on registering endoscopic video to a preoperative CT scan, but respiratory motion deforms the airway by 5-20 mm, creating CT-to-body divergence that limits localization…

AI & Data Science Preprint PDF DOI

LLM as Clinical Graph Structure Refiner: Enhancing Representation Learning in EEG Seizure Diagnosis

Lincan Li, Zheng Chen, Yushun Dong · 2026

Electroencephalogram (EEG) signals are vital for automated seizure detection, but their inherent noise makes robust representation learning challenging. Existing graph construction methods, whether co…

Computer Science Preprint PDF DOI

Essential, Yet Overlooked: Identity Verification Barriers for Blind and Low Vision People in Government Services

Ryan John Oommen, Tanusree Sharma · 2026

Identity verification is a critical gateway to accessing government services and public benefits, yet contemporary systems are typically designed around visual interaction, leaving blind and low visio…

Physics Preprint PDF DOI

Deeply virtual pion production through two-loop order

Wen Chen, Feng Feng, Yu Jia, Qing-Tao Song, Guang Tang, Zhe-Yu Wang · 2026

Deeply virtual meson production (DVMP) is among the most prominent channels to extract the nucleon's generalized parton distributions (GPDs) at ep scattering facilities such as JLab and the up…

AI & Data Science Preprint PDF DOI

Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering

Furkan Kınlı · 2026

Night Photography Rendering (NPR) poses a significant challenge due to the extreme contrast between dark and illuminated areas in scenes, stemming from concurrent capture of severely dark regions alon…

AI & Data Science Preprint PDF DOI

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin · 2026

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). H…

AI & Data Science Preprint PDF DOI

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Andrew Bond, Ilkin Umut Melanlioglu, Erkut Erdem, Aykut Erdem · 2026

Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physic…

AI & Data Science Preprint PDF DOI

AesRM: Improving Video Aesthetics with Expert-Level Feedback

Yujin Han, Yujie Wei, Yefei He, Xinyu Liu, Tianle Li, Zichao Yu, Andi Han, Shiwei Zhang, Tingyu Weng, Difan Zou · 2026

Despite rapid advances in photorealistic video generation, real-world applications such as filmmaking require video aesthetics, e.g., harmonious colors and cinematic lighting, beyond visual fidelity. …

AI & Data Science Preprint PDF DOI

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

Neemias B da Silva, Rodrigo Minetto, Daniel Silver, Thiago H Silva · 2026

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral d…

Engineering Preprint PDF DOI

LiDAR-based Dynamic Blockage Prediction: A Data-driven Approach for Learning Interactive Bayesian Models

Saleemullah Memon, Ali Krayani, Pamela Zontone, Lucio Marcenaro, David Martin Gomez, Carlo Regazzoni · 2026

Vehicular sensing-based intelligence has made substantial progress in transportation systems, leading to higher levels of safety and sustainability for smart cities and autonomous systems. This paper …

AI & Data Science Preprint PDF DOI

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

Jialu Shen, Han Lyu, Suyang Zhong, Hanzheng Li, Haoyi Tao, Nan Wang, Changhong Chen, Xi Fang · 2026

Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-spec…

AI & Data Science Preprint PDF DOI

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa, Hugo Proenca, Tiago Roxo · 2026

Current DeepFake detection scenarios are mostly binary, yet data manipulation can vary across audio, video, or both, whose variability is not captured in binary settings. Four-class audio-visual formu…

AI & Data Science Preprint PDF DOI

Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

Jing Zhang, Wentao Jiang, Tao Huang, Zhiwei Wang, Jianxin Liu, Jian Chen, Ping Ye, Gang Wang, Zengmao Wang, Bo Du, Dacheng Tao · 2026

Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer…

Neuroscience Preprint PDF DOI

Multisensory learning recruits visual neurons into an olfactory memory engram

Zeynep Okray, Nils Otto, Anna A. Cook, Clifford Talbot, Ashwin Miriyala, Martin Klappenbach, Ciara Stern, Kieran Desmond, Paola Vargas-Gutierrez, Scott Waddell · 2026

Associating multiple sensory cues with a single experience or object is a fundamental process that improves object recognition and memory performance. However, neural mechanisms that bind sensory feat…

AI & Data Science Preprint PDF DOI

A Pattern Language for Resilient Visual Agents

Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll · 2026

Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and n…

Computer Science Preprint PDF DOI

When and How AI Should Assist Brainstorming for AI Impact Assessment

Jarod Govers, Sanja Scepanovic, Daniele Quercia · 2026

A key task in AI practice is to assess potential impacts to prevent harm. Current AI tools assisting AI impact assessment have not been designed or evaluated for collaborative team brainstorming, and …

Engineering Preprint PDF DOI

Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA

Feeza Khan Khanzada, Jaerock Kwon · 2026

Learned driving agents often degrade when deployed in unseen environments. This paper studies a deliberately bounded instance of that problem in the CARLA simulator: zero-shot transfer of a closed-loo…
