Koen Bertels in Computer Science — Research Repository

Computer Science · Preprint

Exploring Sparse Matrix Multiplication Kernels on the Cerebras CS-3

Milan Shah, Sheng Di, Michela Becchi · 2026

In recent years, novel AI accelerators have emerged as promising alternatives to GPUs for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on …

Computer Science · Preprint

ZipCCL: Efficient Lossless Data Compression of Communication Collectives for Accelerating LLM Training

Wenxiang Lin, Xinglin Pan, Ruibo Fan, Shaohuai Shi, Xiaowen Chu · 2026

Communication has emerged as a critical bottleneck in the distributed training of large language models (LLMs). While numerous approaches have been proposed to reduce communication overhead, the poten…

Computer Science · Preprint

AME-PIM: Can Memory be Your Next Tensor Accelerator?

Emanuele Venieri, Simone Manoni, Alberto Florian, Jaehyun Park, Kyomin Sohn, Andrea Bartolini · 2026

High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited…

Computer Science · Preprint

GenAI in Software Engineering: The Role of Technology Acceptance Models

Oscar Johansson, Jurgen Borstler, Nauman bin Ali · 2026

Context: Many organizations are keen to incorporate generative AI (GenAI) into their software development processes. Technology acceptance models, such as the Unified Theory of Acceptance and Use of T…

Computer Science · Preprint

SandSim: Curve-Guided Gaussian Splatting for Reconstructing Sand Painting Processes

Yilin Wang, Haojie Huang, Chen Li, Yang Li, Changbo Wang, Chenhui Li · 2026

Sand painting is a process-driven art where visual appearance emerges from granular accumulation. Given a single image, reconstructing a plausible sand painting process requires modeling coherent stro…

Computer Science · Preprint

Static Attribution of Android Residential Proxy Malware Using Graph Kernels

Peter Clark, Yong Guan, Zhonghao Liao · 2026

Android residential proxy applications represent a growing class of potentially unwanted programs (PUPs) that covertly route third-party traffic through end-user devices, enabling ad fraud, credential…

Computer Science · Preprint

FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

Sina Heidari, Dimitrios S. Nikolopoulos · 2026

Deep learning compilers and vendor libraries deliver strong baseline performance but are bounded by finite, engineer-curated catalogs. When these omit needed optimizations, practitioners substitute ha…

Computer Science · Preprint

TimeMM: Time-as-Operator Spectral Filtering for Dynamic Multimodal Recommendation

Wei Yang, Rui Zhong, Zihan Lin, Xiaodan Wang, Cheng Chen, Huan Ren, Yao Hu · 2026

Multimodal recommendation improves user modeling by integrating collaborative signals with heterogeneous item content. In real applications, user interests evolve over time and exhibit nonstationary d…

Computer Science · Preprint

AI Observability for Large Language Model Systems: A Multi-Layer Analysis of Monitoring Approaches from Confidence Calibration to Infrastructure Tracing

Twinkll Sisodia · 2026

The deployment of large language models (LLMs) in production environments has created an urgent need for observability systems that span the full stack, from model internals to GPU kernels. Yet exis…

Computer Science · Preprint

CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Huriyeh Babak, Melanie Schaller · 2026

Efficient GPU execution of convolution operators is governed by memory-access efficiency, on-chip data reuse, and execution mapping rather than arithmetic throughput alone. This paper presents a contr…

Computer Science · Preprint

Polynomial Kernels for Spanning Tree with Diversity Requirements

Petr A. Golovach, Diptapriyo Majumdar, Saket Saurabh · 2026

Given a connected undirected graph $G$, a spanning tree is a subgraph $T$ of $G$ such that $V(T) = V(G)$ and $T$ is a tree. A collection of $\ell$ spanning trees $T_1,\ldots,T_\ell$ is pairwise $k$-di…

Computer Science · Preprint

Opto-Atomic Spatio-Temporal Holographic Correlators for High-Speed 3D CNNs

Xi Shen, Bowen Qi, Tabassom Hamidfar, Selim M. Shahriar · 2026

Three-dimensional convolutional neural networks (3D CNNs) have demonstrated remarkable performance in video recognition tasks by processing both spatial and temporal features. However, the cubic scali…

Computer Science · Preprint

A more versatile model for enumerative kernelization: a case study for Vertex Cover

Marin Bougeret, Guilherme C. M. Gomes, Ignasi Sau · 2026

Enumerative kernelization is a recent, promising research area at the intersection of parameterized complexity and enumeration algorithms, with two proposed models. The first, known as enum-kernels and due to Creign…

Computer Science · Preprint

Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

Long Cheng, Ritchie Zhao, Timmy Liu, Mindy Li, Xianjie Qiao, Kefeng Duan, Yu-Jung Chen, Xiaoming Chen, Bita Darvish Rouhani, June Yang · 2026

Sparse-attention decoders rely on exact Top-K selection to choose the most important key-value entries for each query token. In long-context LLM serving, this Top-K stage runs once per decode query an…

Computer Science · Preprint

Exploiting pre-optimized kernels with polyhedral transformations for CGRA compilation

Yuxuan Wang, Maria Jose Belda, Fernando Castro, Katzalin Olcoz, David Atienza, Giovanni Ansaloni · 2026

Modern computing workloads commonly involve matrix-matrix multiplication (mmul) as a core computing pattern. Coarse-Grained Reconfigurable Arrays (CGRAs) can flexibly and efficiently support it, since…

Computer Science · Preprint

GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems

Baodi Shan, Mauricio Araya-Polo, Barbara Chapman · 2026

Distributed GPU applications increasingly rely on kernel-level, cross-node coordination to reduce launch overheads and improve compute-communication overlap, but such support is lacking. On OFI-based …

Computer Science · Preprint

EnergAIzer: Fast and Accurate GPU Power Estimation Framework for AI Workloads

Kyungmi Lee, Zhiye Song, Eun Kyung Lee, Xin Zhang, Tamar Eilam, Anantha P. Chandrakasan · 2026

As AI workloads drive increases in datacenter power consumption, accurate GPU power estimation is critical for proactive power management. However, existing power models face a scalability bottleneck …

Computer Science · Preprint

Demonstrating a Future for MLIR-native DSL Compilers on a NumPy-like Example

Karl F. A. Friebel, Jascha A. Ohlmann, Jeronimo Castrillon · 2026

Compilers for general-purpose languages have been shown to be at a disadvantage when it comes to specialized application domains as opposed to their Domain-Specific Language (DSL) counterparts. Howeve…

Computer Science · Preprint

FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training

Shuyao Qi, Haoyuan Liu, Shizhen Zhao · 2026

Fine-grained, per-micro-batch load balancing is essential for efficient Mixture-of-Experts (MoE) training, yet every prior dynamic scheduling scheme pays for it with extra communication that is hard t…

Computer Science · Preprint

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

Size Zheng, Xuegui Zheng, Li-wen Chang, Jidong Zhai · 2026

The exponential growth in Large Language Model (LLM) parameters has transformed model training into an increasingly resource-intensive endeavor. With the stagnation of Moore's Law and the widening dis…
