Visual Perception — Research Repository

AI & Data Science Preprint PDF DOI

RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation

Yucheng Chen, Yang Yu, Yufei Shi, Conghao Xiong, Xulei Yang, Si Yong Yeo · 2026

Radiology report generation (RRG) has emerged as a promising approach to alleviate radiologists' workload and reduce human errors by automatically generating diagnostic reports from medical images. A …

Physics Preprint PDF DOI

Computation of frequency- and time-domain Jacobians in optical tomography with Monte Carlo simulations

Pauliina Hirvi, Jaakko Olkkonen, Qianqian Fang, Ilkka Nissila · 2026

Significance: Jacobians, or spatially resolved sensitivity profiles, are central to image reconstruction in model-based optical tomography of biological tissue. Although Monte Carlo (MC) simulations a…

AI & Data Science Preprint PDF DOI

SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation

Song Tang, Kaiyong Zhao, Yuliang Li, Qingsong Yan, Penglei Sun, Junyi Zou, Qiang Wang, Xiaowen Chu · 2026

Automatically generating interactive 3D indoor scenes from natural language is crucial for virtual reality, gaming, and embodied AI. However, existing LLM-based approaches often suffer from spatial er…

AI & Data Science Preprint PDF DOI

Revealing the Impact of Visual Text Style on Attribute-based Descriptions Produced by Large Visual Language Models

Xiaomeng Wang, Martha Larson, Zhengyu Zhao · 2026

When the visual style of text is considered, a wide variety can be observed in font, color, and size. However, when a word is read, its meaning is independent of the style in which it has been written…

AI & Data Science Preprint PDF DOI

Residual Gaussian Splatting for Ultra Sparse-View CBCT Reconstruction

Jian Lin, Jiancheng Fang, Shaoyu Wang, Changan Lai, Yikun Zhang, Yang Chen, Qiegen Liu · 2026

While 3D Gaussian splatting (3DGS) offers explicit and efficient scene representations for cone-beam computed tomography reconstruction, conventional photometric optimization inherently suffers from s…

AI & Data Science Preprint PDF DOI

HATS: An Open data set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

Thibault Baneras Roux, Jane Wottawa, Mickael Rouvier, Teva Merlin, Richard Dufour · 2026

Conventionally, Automatic Speech Recognition (ASR) systems are evaluated on their ability to correctly recognize each word contained in a speech signal. In this context, the word error rate (WER) metr…

AI & Data Science Preprint PDF DOI

Self-Supervised Learning of Plant Image Representations

Ilyass Moummad, Kawtar Zaher, Herve Goeau, Jean-Christophe Lombardo, Pierre Bonnet, Alexis Joly · 2026

Automated plant recognition plays a crucial role in biodiversity monitoring and conservation, yet current approaches rely heavily on supervised learning, which is limited by the availability of expert…

AI & Data Science Preprint PDF DOI

REVIVE 3D: Refinement via Encoded Voluminous Inflated prior for Volume Enhancement

Hankyeol Lee, Wooyeol Baek, Seongdo Kim, Jongyoo Kim · 2026

Recent generative models have shown strong performance in generating diverse 3D assets from 2D images, a fundamental research topic in computer vision and graphics. However, these models still struggl…

AI & Data Science Preprint PDF DOI

Towards All-Day Perception for Off-Road Driving: A Large-Scale Multispectral Dataset and Comprehensive Benchmark

Shuo Wang, Jilin Mei, Wenfei Guan, Shuai Wang, Yan Xing, Chen Min, Yu Hu · 2026

Off-road nighttime autonomous driving suffers from unreliable visible-light perception, making infrared modality crucial for accurate freespace detection. However, progress remains limited due to the …

AI & Data Science Preprint PDF DOI

Uni-HOI: A Unified framework for Learning the Joint distribution of Text and Human-Object Interaction

Mengfei Zhang, Jinlu Zhang, Zhigang Tu · 2026

Modeling 4D human-object interaction (HOI) is a compelling challenge in computer vision and an essential technology powering virtual and mixed-reality applications. While existing works have achieved …

AI & Data Science Preprint PDF DOI

Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO

Yu Tian, Jiawei Chen, Lifan Zheng, Mingxiang Tao, Xinyi Zeng, Zhaoxia Yin, Hang Su, Xian Sun · 2026

We introduce Skills-Coach, a novel automated framework designed to significantly enhance the self-evolution of skills within Large Language Model (LLM)-based agents. Addressing the current fragmentati…

AI & Data Science Preprint PDF DOI

PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

Yang Zhang, Jiangyuan Zhao, Chenyou Fan, Fangzheng Yan, Tian Li, Haitong Tang, Sen Fu, Xuan'er Wu, Qizhen Weng, Weinan Zhang, Xiu Li, Chi Zhang, Chenjia Bai, Xuelong Li · 2026

Vision-Language-Action (VLA) models advance robotic control via strong visual-linguistic priors. However, existing VLAs predominantly frame pretraining as supervised behavior cloning, overlooking the …

Computer Science Preprint PDF DOI

ReVo: A Cross-Layer Reliable Volumetric Videoconferencing System

Ankur Aditya, Diptyaroop Maji, Lingdong Wang, Bhavya Ramakrishna, Ramesh Sitaraman, Prashant Shenoy · 2026

Volumetric videoconferencing enables immersive six Degrees of Freedom interactions by jointly transmitting visual appearance and 3D geometry. However, delivering volumetric video over today's networks…

Engineering Preprint PDF DOI

BUT System Description for CHiME-9 MCoRec Challenge

Dominik Klement, Alexander Polok, Nguyen Hai Phong, Prachi Singh, Lukas Burget · 2026

Multi-talker automatic speech recognition (ASR) in conversational recordings remains an open problem, particularly in scenarios with a large portion of overlapping speech where identifying and transcrib…

AI & Data Science Preprint PDF DOI

InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?

Qiyao Wang, Haoran Hu, Longze Chen, Hongbo Wang, Hamid Alinejad-Rokny, Yuan Lin, Min Yang · 2026

With the advancement of multimodal large language models (MLLMs) and coding agents, website development has shifted from manual programming to agent-based project-level code synthesis. Existing be…

AI & Data Science Preprint PDF DOI

Understanding Adversarial Transferability in Vision-Language Models for Autonomous Driving: A Cross-Architecture Analysis

David Fernandez, Pedram MohajerAnsari, Amir Salarpour, Mert D. Pese · 2026

Vision-language models (VLMs) are increasingly used in autonomous driving because they combine visual perception with language-based reasoning, supporting more interpretable decision-making, yet their…

AI & Data Science Preprint PDF DOI

Detecting is Easy, Adapting is Hard: Local Expert Growth for Visual Model-Based Reinforcement Learning under Distribution Shift

Haiyang Zhao · 2026

Visual model-based reinforcement learning (MBRL) agents can perform well on the training distribution, but often break down once the test environment shifts. In visual MBRL, recognizing that a shift h…

Biology & Life Sciences Preprint PDF DOI

Personalizing Cancer Models under Data Scarcity via Parameter Decomposition

Logan Rose, Jonathan Martinez, Juho Kim, Jing Qin, Boris Aguilar, David Murrugarra · 2026

Personalized cancer modeling for clinical applications requires robust and efficient parameter calibration, particularly in settings with limited patient data. This need is especially critical for med…

AI & Data Science Preprint PDF DOI

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, Yuyang Wen, Huiping Liu, Rongkang Wang, Yaqi Zhang, Hongliang Wei, Chi Chen, You Li, Kechen Fang, Jie Zhou, Yuxuan Li, Guoyang Zeng, Chaojun Xiao, Yankai Lin, Xu Han, Maosong Sun, Zhiyuan Liu, Yuan Yao · 2026

Recent progress in multimodal large language models (MLLMs) has brought AI capabilities from static offline data processing to real-time streaming interaction, yet they still remain far from human-lev…

AI & Data Science Preprint PDF DOI

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Bingli Wang, Huanze Tang, Haijun Lv, Zhishan Lin, Lixin Gu, Lei Feng, Qipeng Guo, Kai Chen · 2026

In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks mainly focus on s…
