Bharat Biswal — Research Repository

AI & Data Science · Preprint

Representation Fréchet Loss for Visual Generation

Jiawei Yang, Zhengyang Geng, Xuan Ju, Yonglong Tian, Yue Wang · 2026

We show that Fréchet Distance (FD), long considered impractical as a training objective, can in fact be effectively optimized in the representation space. Our idea is simple: decouple the population…

AI & Data Science · Preprint

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, Bin Wang · 2026

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, …

Computer Science · Preprint

Essential, Yet Overlooked: Identity Verification Barriers for Blind and Low Vision People in Government Services

Ryan John Oommen, Tanusree Sharma · 2026

Identity verification is a critical gateway to accessing government services and public benefits, yet contemporary systems are typically designed around visual interaction, leaving blind and low visio…

AI & Data Science · Preprint

Beyond Pixel Fidelity: Minimizing Perceptual Distortion and Color Bias in Night Photography Rendering

Furkan Kınlı · 2026

Night Photography Rendering (NPR) poses a significant challenge due to the extreme contrast between dark and illuminated areas in scenes, stemming from concurrent capture of severely dark regions alon…

AI & Data Science · Preprint

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin · 2026

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). H…

AI & Data Science · Preprint

Beyond Gaussian Bottlenecks: Topologically Aligned Encoding of Vision-Transformer Feature Spaces

Andrew Bond, Ilkin Umut Melanlioglu, Erkut Erdem, Aykut Erdem · 2026

Modern visual world modeling systems increasingly rely on high-capacity architectures and large-scale data to produce plausible motion, yet they often fail to preserve underlying 3D geometry or physic…

AI & Data Science · Preprint

AesRM: Improving Video Aesthetics with Expert-Level Feedback

Yujin Han, Yujie Wei, Yefei He, Xinyu Liu, Tianle Li, Zichao Yu, Andi Han, Shiwei Zhang, Tingyu Weng, Difan Zou · 2026

Despite rapid advances in photorealistic video generation, real-world applications such as filmmaking require video aesthetics, e.g., harmonious colors and cinematic lighting, beyond visual fidelity. …

AI & Data Science · Preprint

SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images

Jialu Shen, Han Lyu, Suyang Zhong, Hanzheng Li, Haoyi Tao, Nan Wang, Changhong Chen, Xi Fang · 2026

Spectra are a prevalent yet highly information-dense form of scientific imagery, presenting substantial challenges to multimodal large language models (MLLMs) due to their unstructured and domain-spec…

AI & Data Science · Preprint

Are DeepFakes Realistic Enough? Exploring Semantic Mismatch as a Novel Challenge

Sharayu Nilesh Deshmukh, Kailash A. Hambarde, Joana C. Costa, Hugo Proenca, Tiago Roxo · 2026

Current DeepFake detection scenarios are mostly binary, yet data manipulation can vary across audio, video, or both, whose variability is not captured in binary settings. Four-class audio-visual formu…

AI & Data Science · Preprint

Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation

Jing Zhang, Wentao Jiang, Tao Huang, Zhiwei Wang, Jianxin Liu, Jian Chen, Ping Ye, Gang Wang, Zengmao Wang, Bo Du, Dacheng Tao · 2026

Ultrasound interpretation requires both precise lesion localization and holistic clinical reasoning, yet existing methods typically excel at only one of these capabilities: specialized detectors offer…

Neuroscience · Preprint

Multisensory learning recruits visual neurons into an olfactory memory engram

Zeynep Okray, Nils Otto, Anna A. Cook, Clifford Talbot, Ashwin Miriyala, Martin Klappenbach, Ciara Stern, Kieran Desmond, Paola Vargas-Gutierrez, Scott Waddell · 2026

Associating multiple sensory cues with a single experience or object is a fundamental process that improves object recognition and memory performance. However, neural mechanisms that bind sensory feat…

AI & Data Science · Preprint

A Pattern Language for Resilient Visual Agents

Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll · 2026

Integrating multimodal foundation models into enterprise ecosystems presents a fundamental software architecture challenge. Architects must balance competing quality attributes: the high latency and n…

Engineering · Preprint

Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA

Feeza Khan Khanzada, Jaerock Kwon · 2026

Learned driving agents often degrade when deployed in unseen environments. This paper studies a deliberately bounded instance of that problem in the CARLA simulator: zero-shot transfer of a closed-loo…

AI & Data Science · Preprint

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

Ce Chen, Yi Ren, Yuanming Li, Viktor Goriachko, Zhenhui Ye, Zujin Guo, Zhibin Hong, Mingming Gong · 2026

Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this f…

AI & Data Science · Preprint

FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting

Fengxian Ji, Jingpu Yang, Zirui Song, Yuanxi Wang, Zhexuan Cui, Yuke Li, Qian Jiang, Xiuying Chen · 2026

Despite the rapid progress of large vision-language models (LVLMs), fine-grained, state-conditioned GUI interaction remains challenging. Current evaluations offer limited coverage, imprecise target-st…

Computer Science · Preprint

From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation

Guang Yang, Xing Hu, Xiang Chen, Xin Xi · 2026

Multimodal large language models (MLLMs) are increasingly used to translate visual artifacts into code, from UI mockups into HTML to scientific plots into Python scripts. A circuit diagram can be view…

AI & Data Science · Preprint

ClimateVID – Social Media Videos Analysis and Challenges Involved

Shiqi Xu, Moritz Burmester, Katharina Prasse, Isaac Bravo, Stefanie Walter, Margret Keuper · 2026

The pervasive growth of digital content, specifically short videos on social media platforms, has significantly altered how topics are discussed and understood in public discourse. In this work, we ad…

AI & Data Science · Preprint

The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models

Kenneth J. K. Ong · 2026

As Vision-Language Models (VLMs) become increasingly integrated into decision-making systems, it is essential to understand how visual inputs influence their behavior. This paper investigates the effe…

AI & Data Science · Preprint

Training-Free Tunnel Defect Inspection and Engineering Interpretation via Visual Recalibration and Entity Reconstruction

Shipeng Liu, Liang Zhao, Dengfeng Chen, Zhanping Song · 2026

Tunnel inspection requires outputs that can support defect localization, measurement, severity grading, and engineering documentation. Existing training-free foundation-model pipelines usually stop at…

Physics · Preprint

The Large Array Survey Telescope-Pipeline. II. Image Subtraction and Transient Detection

R. Konno, E. O. Ofek, A. Krassilchtchikov, Y. Shvartzvald, S. Ben-Ami, D. Polishook, C. Tishler, E. Segre, S. Garrappa, E. A. Zimmermann, A. Horowicz, P. Chen, A. Gal-Yam, M. Engel, Y. M. Shani, S. A. Spitzer, S. Fainer, O. Yaron, A. Blumenzweig · 2026

Context. The Large Array Survey Telescope (LAST) is a wide-field visual-band survey designed to explore the variable and transient sky with high cadence. Its raw data stream is automatically processed…
