Michael Cochez in Computer Science — Research Repository

Computer Science Preprint PDF DOI

Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

Jin Xin Ng, Ori Livneh, Richard O'Grady, Josh Don, Peng Ding, Samuel Grossman, Luis Otero, Chris Kennelly, David Lo, Carlos Villavieja · 2026

Modern large multicore systems often run multiple workloads that share CPUs under schedulers such as Linux CFS. To keep CPUs busy, these schedulers load-balance runnable work, causing each workload to…

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Zi-Wei Lin, Tian-Sheuan Chang · 2026

Deploying Large Language Models (LLMs) on resource-constrained edge devices faces critical bottlenecks in memory bandwidth and power consumption. While ternary quantization (e.g., BitNet b1.58) signif…

On Coded Caching Systems with Decentralized Linear Coding Placement

Yinbin Ma, Daniela Tuninetti · 2026

Coded caching is a technique that leverages locally cached contents at the end users to reduce the network's peak-time communication load. Coded caching has been shown to achieve significant performan…

A Semantic Quantum Circuit Cache for Scalable and Distributed Quantum-Classical Workflows

Mar Tejedor, Javier Conejero, Rosa M. Badia · 2026

Hybrid quantum-classical workflows often execute large ensembles of circuits that differ syntactically but implement identical operations, leading to substantial redundant computation. To address thi…

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Bodon Jeong, Hongsu Byun, Youngjae Kim, Weikuan Yu, Kyungkeun Lee, Jihoon Yang, Sungyong Park · 2026

The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which of…

CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering

Yushi Sun, Lei Chen · 2026

The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answering (KGQA). However, existing LLM-driven KGQA system…

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

Shouxu Lin, Zhiyuan Guo, Jiaxin Lin · 2026

LLM inference is constrained by GPU memory capacity and bandwidth. Tiered memory architectures mitigate this by allowing the GPU to offload memory to the remote tier. However, existing memory offloadi…

Slice Agent: Identifying and Isolating Slices in Shared Open Radio Unit

Felipe Arnholda, Flavio Rocha, Lucio Prade, Cristiano Bonato Both · 2026

Network Slice as a Service (NSaaS) is a key enabler of Beyond Fifth Generation (5G) and Sixth Generation (6G) networks, supporting next-generation applications such as extended reality (XR), immersive…

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

Mingbo Hao, Changwei Yan, Haoyu Cui, Zhihao Yan, Yizhi Ding, Zhangrui Qian, Weiwei Shan · 2026

The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out…

SimdQuickHeap: The QuickHeap Reconsidered

Johannes Breitling, Ragnar Groot Koerkamp, Marvin Williams · 2026

Priority queues are data structures that maintain a dynamic collection of elements and allow inserting new elements and removing the smallest element. The most widely known and used priority queue is …

CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Huriyeh Babak, Melanie Schaller · 2026

Efficient GPU execution of convolution operators is governed by memory-access efficiency, on-chip data reuse, and execution mapping rather than arithmetic throughput alone. This paper presents a contr…

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

Sean Nian, Jiahao Fang, Qilong Feng, Zhiyu Wu, Fan Lai · 2026

KV cache restoration has emerged as a dominant bottleneck in serving long-context LLM workloads, including multi-turn conversations, retrieval-augmented generation, and agentic pipelines. Existing app…

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

Wang Fan, Wei Cao, Xi Zha, Kedi Ma, MingQian Sun, Jialin Chen, Fengzhe Zhang, Fan Zhang · 2026

Long contexts improve capabilities of large language models but pose serious hardware challenges: compute and memory footprints grow linearly with sequence length. Particularly, the decoding phase con…

Green-Red Watermarking for Recommender Systems

Lei Zhou, Min Gao, Zongwei Wang, Yibing Bai, Wentao Li · 2026

The widespread open-sourcing of advanced recommendation algorithms and the rising threat of model extraction attacks have made safeguarding the intellectual property of recommender systems an imperati…

Tessera: Secure, Near-Line-Rate Weight Streaming for UMA Edge Accelerators

Animan Naskar · 2026

Deploying proprietary Deep Neural Networks (DNNs) on commodity edge devices demands hardware-backed Digital Rights Management (DRM) capable of withstanding both software-level and physical adversaries…

GreenDyGNN: Runtime-Adaptive Energy-Efficient Communication for Distributed GNN Training

Arefin Niam, Tevfik Kosar, M. S. Q. Zulkar Nine · 2026

Distributed GNN training is dominated by remote feature fetching, which can be very costly. Multi-hop neighborhood sampling crosses partition boundaries and triggers fine-grained RPCs whose fixed init…

Source-Code Analysis of iFogSim for Simulating Distributed IoT Architectures: Coverage, Challenges, and Enhancements

Milliam Maxime Zekeng Ndadji · 2026

Simulation is an indispensable tool for validating distributed IoT architectures before physical deployment, and iFogSim has emerged as one of the most widely adopted platforms in the fog and edge comp…

Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale

Kashish Mittal, Di Yu, Roozbeh Ketabi, Arushi Arora, Brendon Lapp, Peng Zhang · 2026

Training massive-scale deep learning models on datasets spanning tens of terabytes presents critical challenges in hardware utilization and training reproducibility. In this paper, we identify and res…

SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference

Hongyao Liu, Liuqun Zhai, Junyi Wang, Zhengru Fang · 2026

Efficient inference for on-device Large Language Models (LLMs) remains challenging due to limited hardware resources and the high cost of the prefill stage, which processes the full input context to c…

Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms

Ari Azarafrooz · 2026

AI-agent guardrails are memoryless: each message is judged in isolation, so an adversary who spreads a single attack across dozens of sessions slips past every session-bound detector because only the …
