Kevin Hemery in Computer Science — Research Repository

Computer Science Preprint PDF DOI

FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

Yanting Wang, Chenlong Yin, Ying Chen, Jinyuan Jia · 2026

Long-context large language models (LLMs), for example Gemini-3.1-Pro and Qwen-3.5, are widely used to empower many real-world applications, such as retrieval-augmented generation, autonomous agents, a…

Computer Science Preprint PDF DOI

Efficient Multivector Retrieval with Token-Aware Clustering and Hierarchical Indexing

Silvio Martinico, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini · 2026

Multivector retrieval models achieve state-of-the-art effectiveness through fine-grained token-level representations, but their deployment incurs substantial computational and memory costs. Current so…

Computer Science Preprint PDF DOI

Succinct Graph Representations and Algorithmic Applications

Ahammed Ullah, Alex Pothen · 2026

We propose new graph representations that exploit dense local structure to improve time and space simultaneously. Given an undirected graph $G$, we define a dual clique cover (DCC) representation of $…

Computer Science Preprint PDF DOI

Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

Milan Shah, Sheng Di, Michela Becchi · 2026

In recent years, novel AI accelerators have emerged as promising alternatives to GPUs for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on …

Computer Science Preprint PDF DOI

Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

Jin Xin Ng, Ori Livneh, Richard O'Grady, Josh Don, Peng Ding, Samuel Grossman, Luis Otero, Chris Kennelly, David Lo, Carlos Villavieja · 2026

Modern large multicore systems often run multiple workloads that share CPUs under schedulers such as Linux CFS. To keep CPUs busy, these schedulers load-balance runnable work, causing each workload to…

Computer Science Preprint PDF DOI

ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

Wenxiang Lin, Xinglin Pan, Ruibo Fan, Shaohuai Shi, Xiaowen Chu · 2026

Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the poten…

Computer Science Preprint PDF DOI

AME-PIM: Can Memory be Your Next Tensor Accelerator?

Emanuele Venieri, Simone Manoni, Alberto Florian, Jaehyun Park, Kyomin Sohn, Andrea Bartolini · 2026

High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited…

Computer Science Preprint PDF DOI

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Zi-Wei Lin, Tian-Sheuan Chang · 2026

Deploying Large Language Models (LLMs) on resource-constrained edge devices faces critical bottlenecks in memory bandwidth and power consumption. While ternary quantization (e.g., BitNet b1.58) signif…

Computer Science Preprint PDF DOI

RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

Yan-Cheng Guo, Tian-Sheuan Chang, Jian-Wei Su · 2026

Digital computing-in-memory (DCIM) has emerged as a promising solution for large language model (LLM) acceleration by minimizing data transfers between external DRAM and on-chip accelerators while mai…

Computer Science Preprint PDF DOI

Low-Complexity Run-Length-Limited ISI-Mitigation (RLIM) Codes for Molecular Communication

Melih Sahin, Ozgur B. Akan · 2026

Molecular communication suffers from severe inter-symbol interference, which makes constrained coding essential for reliable transmission. Run-length-limited ISI-mitigation codes are attractive becaus…

Computer Science Preprint PDF DOI

Efficient Training on Multiple Consumer GPUs with RoundPipe

Yibin Luo, Shiwei Gao, Huichuan Zheng, Youyou Lu, Jiwu Shu · 2026

Fine-tuning Large Language Models (LLMs) on consumer-grade GPUs is highly cost-effective, yet constrained by limited GPU memory and slow PCIe interconnects. Pipeline parallelism combined with CPU offl…

Computer Science Preprint PDF DOI

Adaptive Self-Organization in Anonymous Dynamic Networks

Garrett Parzych, Joshua J. Daymude · 2026

We introduce the problem of adaptive self-organization in which the nodes of an anonymous, synchronous dynamic network must distributively change the collective distribution of their responses (or "co…

Computer Science Preprint PDF DOI

Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

Yuang Yan, Ian Karlin, Ryan Grant · 2026

For NVIDIA GPUs, CUDA is the primary interface through which applications orchestrate GPU execution, yet much of the logic that realizes CUDA operations resides in NVIDIA's closed-source userspace dri…

Computer Science Preprint PDF DOI

FaaSMoE: A Serverless Framework for Multi-Tenant Mixture-of-Experts Serving

Minghe Wang, Trever Schirmer, Mohammadreza Malekabbasi, David Bermbach · 2026

Mixture-of-Experts (MoE) models offer high capacity with efficient inference cost by activating a small subset of expert models per input. However, deploying MoE models requires all experts to reside …

Computer Science Preprint PDF DOI

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel

Yiqi Liu, Noelle Crawford, Michael Wang, Jilong Xue, Jian Huang · 2026

To overcome the well-known memory bottleneck of AI chips, 3D-stacked architectures that employ advanced packaging technology with high-density through-silicon via (TSV) pins have proven to be a prom…

Computer Science Preprint PDF DOI

Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

Hyunsung Yoon, Sungju Ryu, Jae-Joon Kim · 2026

As the size of Deep Neural Networks (DNNs) increases dramatically to achieve high accuracy, DNNs require a large amount of computation and a large memory footprint. Pruning, which produces a sparse neura…

Computer Science Preprint PDF DOI

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Bodon Jeong, Hongsu Byun, Youngjae Kim, Weikuan Yu, Kyungkeun Lee, Jihoon Yang, Sungyong Park · 2026

The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which of…

Computer Science Preprint PDF DOI

FloatSOM: GPU-Accelerated, Distributed, Topology-Flexible Self-Organizing Maps

Tony Xu, Sarah Klamt, Katherine Turner, Anne Brustle, Felix Marsh-Wakefield, Givanna Putri · 2026

GPU-accelerated Self-Organizing Map (SOM) implementations are among the most competitive options for large-scale SOM analysis, but growing dataset sizes increasingly challenge their practical use beca…

Computer Science Preprint PDF DOI

Quantamination: Dynamic Quantization Leaks Your Data Across the Batch

Hanna Foerster, Ilia Shumailov, Cheng Zhang, Yiren Zhao, Jamie Hayes, Robert Mullins · 2026

Dynamic quantization has emerged as a practical approach to increase the utilization and efficiency of the machine learning serving flow. Unlike static quantization, which applies quantization offline, dy…

Computer Science Preprint PDF DOI

Compressing ACAS-Xu Lookup Tables with Binary Decision Diagrams

Martin Boniol (ISAE-SUPAERO), Julien Brunel, Jean-Baptiste Chaudron (ISAE-SUPAERO), Christophe Garion (ISAE-SUPAERO), Xavier Thirioux (ISAE-SUPAERO) · 2026

The Airborne Collision Avoidance System Xu (ACAS-Xu) relies on large certified Look-Up Tables (LUTs) that encode the exact decision logic used in operation. Neural-network-based approximations have be…
