Solomon Endlich in Engineering — Research Repository

Engineering Preprint PDF DOI

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

Li Li, Ming Cheng, Weixin Zhu, Yannan Wang, Juan Liu, Ming Li · 2026

Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and s…

An Explainable Approach to Document-level Translation Evaluation with Topic Modeling

Hyeokmin Lee, Youngkyu Kim, Byounghyun Yoo · 2026

The advent of NMT has expanded the scope of translation beyond isolated sentences, enabling context to be preserved across paragraphs and documents. However, current evaluation metrics largely remain …

Indic-CodecFake meets SATYAM: Towards Detecting Neural Audio Codec Synthesized Speech Deepfakes in Indic Languages

Girish, Mohd Mujtaba Akhtar, Orchid Chetia Phukan, Arun Balaji Buduru · 2026

The rapid advancement of Audio Large Language Models (ALMs), driven by Neural Audio Codecs (NACs), has led to the emergence of highly realistic speech deepfakes, commonly referred to as CodecFakes (CF…

Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization

Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg · 2026

Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings rema…

MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model

Zhiwei Liu, Yuyan Wang, Yuechen Jiang, Yupeng Cao, Tianlei Zhu, Xiaorui Guo, Zhiyang Deng, Zhiyuan Yao, Xiao-Yang Liu, Jimin Huang, Sophia Ananiadou · 2026

Financial misinformation poses significant threats to financial market stability and individuals' investment decisions. The multilingual environment and the inherent complexity of financial informatio…

4D Radar Gaussian Modeling and Scan Matching with RCS

Fernando Amodeo, Luis Merino, Fernando Caballero · 2026

4D millimeter-wave (mmWave) radars are increasingly used in robotics, as they offer robustness against adverse environmental conditions. Besides the usual XYZ position, they provide Doppler velocity m…

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

Qixi Zheng, Yuxiang Zhao, Tianrui Wang, Wenxi Chen, Kele Xu, Yikang Li, Qinyuan Chen, Xipeng Qiu, Kai Yu, Xie Chen · 2026

Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion …

An Ultra-Low Latency, End-to-End Streaming Speech Synthesis Architecture via Block-Wise Generation and Depth-Wise Codec Decoding

Tianhui Su, Tien-Ping Tan, Salima Mdhaffar, Yannick Esteve, Aghilas Sini · 2026

Real-time speech synthesis requires balancing inference latency and acoustic fidelity for interactive applications. Conventional continuous text-to-speech pipelines require computationally intensive n…

Utterance-Level Methods for Identifying Reliable ASR-Output for Child Speech

Gus Lathouwers, Lingyun Gao, Catia Cucchiarini, Helmer Strik · 2026

Automatic Speech Recognition (ASR) is increasingly used in applications involving child speech, such as language learning and literacy acquisition. However, the effectiveness of such applications is l…

Phonemes vs. Projectors: An Investigation of Speech-Language Interfaces for LLM-based ASR

Ziwei Li, Lukuang Dong, Saierdaer Yusuyin, Xianyu Zhao, Zhijian Ou · 2026

Integrating pretrained speech encoders with large language models (LLMs) is promising for ASR, but performance and data efficiency depend on the speech-language interface. A common choice is a learned…

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

Changi Hong, Yoonah Song, Hwayoung Park, Chaewoon Bang, Dayeon Gu, Do Hyun Lee, Hong Kook Kim · 2026

Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, nat…

Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

Yuan Xie, Jiaqi Song, Guang Qiu, Xianliang Wang, Ming Lei, Jie Gao, Jie Wu · 2026

Integrating large language models (LLMs) into automatic speech recognition (ASR) has become a dominant paradigm. Although recent LLM-based ASR models have shown promising performance on public benchma…

XR-CareerAssist: An Immersive Platform for Personalised Career Guidance Leveraging Extended Reality and Multimodal AI

N.D. Tantaroudas, A.J. McCracken, I. Karachalios, E. Papatheou, V. Pastrikakis · 2026

Conventional career guidance platforms rely on static, text-driven interfaces that struggle to engage users or deliver personalised, evidence-based insights. Although Computer-Assisted Career Guidance…

Semantic Communication with an LLM-enabled Knowledge Base

Wuxia Hu, Caili Guo, Yang Yang, Chunyan Feng, Kuiyuan Ding, Shiwen Mao · 2026

Semantic communication (SC) can achieve superior coding and transmission performance based on the knowledge contained in the semantic knowledge base (KB). However, conventional KBs consist of source K…

Towards Considerate Human-Robot Coexistence: A Dual-Space Framework of Robot Design and Human Perception in Healthcare

Yuanchen Bai, Zijian Ding, Ruixiang Han, Niti Parikh, Wendy Ju, Angelique Taylor · 2026

The rapid advancement of robotics, spanning expanded capabilities, more intuitive interaction, and more integration into real-world workflows, is reshaping what it means for humans and robots to coexi…

MIMO Capacity Enhancement by Grating Walls: A Physics-Based Proof of Principle

Xiaolu Yang, Oscar Cespedes Vicente, Christophe Caloz · 2026

This paper investigates the passive enhancement of MIMO spectral efficiency through boundary engineering in a simplified two dimensional indoor proof of principle model. The propagation channel is con…

T5Gemma-TTS Technical Report

Chihiro Arata, Kiyoshi Kurihara · 2026

Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence…

HARNESS: Lightweight Distilled Arabic Speech Foundation Models

Vrunda N. Sukhadia, Shammur Absar Chowdhury · 2026

Large self-supervised speech (SSL) models achieve strong downstream performance, but their size limits deployment in resource-constrained settings. We present HArnESS, an Arabic-centric self-supervise…

Can Hierarchical Cross-Modal Fusion Predict Human Perception of AI Dubbed Content?

Ashwini Dasare, Nirmesh Shah, Ashishkumar Gudmalwar, Pankaj Wasnik · 2026

Evaluating AI generated dubbed content is inherently multi-dimensional, shaped by synchronization, intelligibility, speaker consistency, emotional alignment, and semantic context. Human Mean Opinion S…

AutoSiMP: Autonomous Topology Optimization from Natural Language via LLM-Driven Problem Configuration and Adaptive Solver Control

Shaoliang Yang, Jun Wang, Yunsheng Wang · 2026

We present AutoSiMP, an autonomous pipeline that transforms a natural-language structural problem description into a validated, binary topology without manual configuration. The pipeline comprises fiv…
