Gerhard Bauch in Computer Science — Research Repository

Computer Science · Preprint

COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training

Akhmed Sakip, Erland Hilman Fuadi, Omar Sayedelahl, Zonghang Li, Jianshu She, Alham Fikri Aji, Steve Liu, Eric Xing, Qirong Ho · 2026

Training large language models requires jointly configuring two interdependent aspects of the system: the global batch size, which governs statistical efficiency, and the 3D parallelism strategy, whic…

Computer Science · Preprint

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

Hanna Foerster, Ilia Shumailov, Cheng Zhang, Yiren Zhao, Jamie Hayes, Robert Mullins · 2026

Dynamic quantization emerged as a practical approach to increase the utilization and efficiency of the machine learning serving flow. Unlike static quantization, which applies quantization offline, dy…

Computer Science · Preprint

Differentially Private Contrastive Learning via Bounding Group-level Contribution

Kecen Li, Chen Gong, Zinan Lin, Tianhao Wang, Xiaokui Xiao · 2026

Differentially private (DP) contrastive learning aims to learn general-purpose representations from sensitive data, alleviating the privacy leakage concerns of organizations deploying or sharing embed…

Computer Science · Preprint

Hands-on PDC in Undergraduate Computing Education

Hala ElAarag, Anas Gamal Aly · 2026

Parallel and Distributed Computing (PDC) is a critical yet conceptually challenging area of the undergraduate computer science curriculum. While students often encounter these concepts in theory, few …

Computer Science · Preprint

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

Mingbo Hao, Changwei Yan, Haoyu Cui, Zhihao Yan, Yizhi Ding, Zhangrui Qian, Weiwei Shan · 2026

The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out…

Computer Science · Preprint

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

Sean Nian, Jiahao Fang, Qilong Feng, Zhiyu Wu, Fan Lai · 2026

KV cache restoration has emerged as a dominant bottleneck in serving long-context LLM workloads, including multi-turn conversations, retrieval-augmented generation, and agentic pipelines. Existing app…

Computer Science · Preprint

Versioned Late Materialization for Ultra-Long Sequence Training in Recommendation Systems at Scale

Liang Guo, Ge Song, Litao Deng, Jianhui Sun, Chufeng Hu, Lu Zhang, Zhen Ma, Shouwei Chen, Weiran Liu, Sarang Masti Sreeshylan, Xiaoxuan Meng · 2026

Modern Deep Learning Recommendation Models (DLRMs) follow scaling laws with sequence length, driving the frontier toward ultra-long User Interaction History (UIH). However, the industry-standard "Fat …

Computer Science · Preprint

Scalable LLM-based Coding of Dialogue in Healthcare Simulation: Balancing Coding Performance, Processing Time, and Environmental Impact

Kiyoshige Garces, Gloria Milena Fernandez-Nieto, Linxuan Zhao, Sachini Samaraweera, Dragan Gasevic, Roberto Martinez-Maldonado, Vanessa Echeverria · 2026

Research shows that dialogue, the interactive process through which participants articulate their thinking, plays a central role in constructing shared understanding, coordinating action, and shaping …

Computer Science · Preprint

Efficient Batch Search Algorithm for B+ Tree Index Structures with Level-Wise Traversal on FPGAs

Max Tzschoppe, Martin Wilhelm, Sven Groppe, Thilo Pionteck · 2026

This paper introduces a search algorithm for index structures based on a B+ tree, specifically optimized for execution on a field-programmable gate array (FPGA). Our implementation efficiently travers…

Computer Science · Preprint

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving

Wenyan Chen, Chengzhi Lu, Yanying Lin, Dmitrii Ustiugov · 2026

Speculative decoding (SD) is a widely used approach for accelerating decode-heavy LLM inference workloads. While online inference workloads are highly dynamic, existing SD systems are rigid and take a…

Computer Science · Preprint

DiP-SD: Distributed Pipelined Speculative Decoding for Efficient LLM Inference at the Edge

Yaodan Xu, Sheng Zhou, Zhisheng Niu · 2026

Speculative decoding has emerged as a promising technique for large language model (LLM) inference by accelerating autoregressive decoding via draft-then-verify. This paper studies a new edge scenario…

Computer Science · Preprint

A GPU-Accelerated Framework for Multi-Attribute Range Filtered Approximate Nearest Neighbor Search

Zhonggen Li, Haoran Yu, Zixuan Xu, Yifan Zhu, Yunjun Gao · 2026

Range-filtered approximate nearest neighbor search (RFANNS) is increasingly critical for modern vector databases. However, existing solutions suffer from severe index inflation and construction overhe…

Computer Science · Preprint

FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

Shuyao Qi, Haoyuan Liu, Shizhen Zhao · 2026

Fine-grained, per-micro-batch load balancing is essential for efficient Mixture-of-Experts (MoE) training, yet every prior dynamic scheduling scheme pays for it with extra communication that is hard t…

Computer Science · Preprint

ReaLB: Real-Time Load Balancing for Multimodal MoE Inference

Yingping Wang, Yi Wu, Xiangyu Wu, Junwei Cui, Weilin Cai, Zhijiang Guo, Jiayi Huang · 2026

Mixture-of-Experts (MoE) architectures are widely used in modern large language models and multimodal models. However, inference efficiency is often limited by highly dynamic and skewed expert workloa…

Computer Science · Preprint

sumo3Dviz: A three dimensional traffic visualisation

Kevin Riehl, Julius Schlapbach, Anastasios Kouvelas, Michail A. Makridis · 2026

Traffic microsimulation software such as SUMO generates rich spatio-temporal data describing individual vehicle movements and interactions, and supports the development of control strategies. While numeric…

Computer Science · Preprint

Design Rules for Extreme-Edge Scientific Computing on AI Engines

Zhenghua Ma, G Abarajithan, Dimitrios Danopoulos, Olivia Weng, Francesco Restuccia, Ryan Kastner · 2026

Extreme-edge scientific applications use machine learning models to analyze sensor data and make real-time decisions. Their stringent latency and throughput requirements demand small batch sizes and r…

Computer Science · Preprint

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

Ziyang Liu · 2026

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since …

Computer Science · Preprint

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

Sanjeev Rao Ganjihal · 2026

Key-value (KV) cache memory management is the primary bottleneck limiting throughput and cost-efficiency in large-scale GPU inference serving. Current systems suffer from three compounding inefficienc…

Computer Science · Preprint

Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems

Yuji Yamamoto, Satoshi Matsuura · 2026

Rowhammer on GPU DRAM has enabled adversarial bit flips in model weights; shared KV-cache blocks in LLM serving systems present an analogous but previously unexamined target. In vLLM's Prefix Caching,…

Computer Science · Preprint

Towards Deep Encrypted Training: Low-Latency, Memory-Efficient, and High-Throughput Inference for Privacy-Preserving Neural Networks

Nges Brian Njungle, Eric Jahns, Michel A. Kinsy · 2026

Privacy-preserving machine learning (PPML) has become increasingly important in applications where sensitive data must remain confidential. Homomorphic Encryption (HE) enables computation directly on …
