Michael Cochez — Research Repository

Computer Science · Preprint

Affinity Tailor: Dynamic Locality-Aware Scheduling at Scale

Jin Xin Ng, Ori Livneh, Richard O'Grady, Josh Don, Peng Ding, Samuel Grossman, Luis Otero, Chris Kennelly, David Lo, Carlos Villavieja · 2026

Modern large multicore systems often run multiple workloads that share CPUs under schedulers such as Linux CFS. To keep CPUs busy, these schedulers load-balance runnable work, causing each workload to…


Computer Science · Preprint

VitaLLM: A Versatile, Ultra-Compact Ternary LLM Accelerator with Dependency-Aware Scheduling

Zi-Wei Lin, Tian-Sheuan Chang · 2026

Deploying Large Language Models (LLMs) on resource-constrained edge devices faces critical bottlenecks in memory bandwidth and power consumption. While ternary quantization (e.g., BitNet b1.58) signif…


AI & Data Science · Preprint

A High-Throughput Compute-Efficient POMDP Hide-And-Seek-Engine (HASE) for Multi-Agent Operations

Timothy Flavin, Sandip Sen · 2026

Reinforcement Learning (RL) algorithms exhibit high sample complexity, particularly when applied to Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs). As a response, projects s…


Computer Science · Preprint

On Coded Caching Systems with Decentralized Linear Coding Placement

Yinbin Ma, Daniela Tuninetti · 2026

Coded caching is a technique that leverages locally cached contents at the end users to reduce the network's peak-time communication load. Coded caching has been shown to achieve significant performan…


AI & Data Science · Preprint

Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving

Zihan Zhao, Baotong Lu, Shengjie Lin, Yizou Chen, Jing Liu, Yanqi Zhang, Ziming Miao, Ming-Chang Yang, Haiying Shen, Qi Chen, Fan Yang · 2026

Long-context LLM serving is bottlenecked by the cost of attending over ever-growing KV caches. Dynamic sparse attention promises relief by accessing only a small, query-dependent subset of the KV stat…


Physics · Preprint

HyPulse: A Pulse Synthesis Framework for Hybrid Qubit-Oscillator Gates on Trapped-Ion Platform

Masoud Hakimi Heris, Yuan Liu, Frank Mueller · 2026

As hybrid qubit-oscillator algorithm development and trapped-ion hardware demonstrations advance in parallel, there is a lack of a compilation layer connecting the two at the pulse level in the vertic…


Computer Science · Preprint

A Semantic Quantum Circuit Cache for Scalable and Distributed Quantum-Classical Workflows

Mar Tejedor, Javier Conejero, Rosa M. Badia · 2026

Hybrid quantum-classical workflows often execute large ensembles of circuits that differ syntactically but implement identical operations, leading to substantial redundant computation. To address thi…


Mathematics · Preprint

Explicit Planar Finite Element Elasticity Complexes and $C^1$ Elements on Barycentric Refinements

Chunyu Chen, Long Chen, Xuehai Huang · 2026

The exact-sequence structure behind the Arnold–Douglas–Gupta family of higher-order mixed finite elements for plane elasticity on barycentric refinements is made explicit. On each macro triangle, th…


Mathematics · Preprint

Non-symmetrically $t$-affine functions revisited

Tibor Kiss, Dora Koroknai · 2026

In 2014, Michal Lewicki and Andrzej Olbryś proved that if a real-valued function $f$ defined on the real line satisfies the conditional functional equation \[ f(tx + (1-t)y) = t f(x) + (1-t) f(y),\q…


Computer Science · Preprint

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference

Bodon Jeong, Hongsu Byun, Youngjae Kim, Weikuan Yu, Kyungkeun Lee, Jihoon Yang, Sungyong Park · 2026

The increasing deployment of Large Language Model (LLM) inference on edge AI systems demands efficient execution under tight memory budgets. A key challenge arises from Key-Value (KV) caches, which of…


AI & Data Science · Preprint

When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

Tianyu Liu, Yuhao Shen, Xinyi Hu, Baolin Zhang, Hengxin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, MingCheng Wan · 2026

Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes t…


AI & Data Science · Preprint

Beyond Fixed Formulas: Data-Driven Linear Predictor for Efficient Diffusion Models

Zhirong Shen, Rui Huang, Jiacheng Liu, Chang Zou, Peiliang Cai, Shikang Zheng, Zhengyi Shi, Liang Feng, Linfeng Zhang · 2026

To address the high sampling cost of Diffusion Transformers (DiTs), feature caching offers a training-free acceleration method. However, existing methods rely on hand-crafted forecasting formulas that…


Computer Science · Preprint

CacheRAG: A Semantic Caching System for Retrieval-Augmented Generation in Knowledge Graph Question Answering

Yushi Sun, Lei Chen · 2026

The integration of Large Language Models (LLMs) with Retrieval-Augmented Generation (RAG) has significantly advanced Knowledge Graph Question Answering (KGQA). However, existing LLM-driven KGQA system…


Mathematics · Preprint

Counterexamples to an Extremal Conjecture for Random Cycle-Factors

Rishikesh Gajjala · 2026

Christoph, Draganić, Girão, Hurley, Michel, and Müyesser conjectured that, when $d\mid n$, the expected number of cycles in a uniformly random cycle-factor of a directed $d$-regular graph …


Computer Science · Preprint

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

Shouxu Lin, Zhiyuan Guo, Jiaxin Lin · 2026

LLM inference is constrained by GPU memory capacity and bandwidth. Tiered memory architectures mitigate this by allowing the GPU to offload memory to the remote tier. However, existing memory offloadi…


Mathematics · Preprint

Some local and global properties of secant varieties of nonsingular projective curves

Lawrence Ein, Wenbo Niu, Jinhyung Park · 2026

The main goal of this paper is to study some local and global properties of secant varieties of algebraic curves. These results complement our previous work [8] by addressing issues given therein and …


AI & Data Science · Preprint

Pythia: Toward Predictability-Driven Agent-Native LLM Serving

Shan Yu, Junyi Shu, Yuanjiang Ni, Kun Qian, Xue Li, Yang Wang, Jinyuan Zhang, Ziyi Xu, Shuo Yang, Lingjun Zhu, Ennan Zhai, Qingda Lu, Jiarong Xing, Youyou Lu, Xin Jin, Xuanzhe Liu, Harry Xu · 2026

As LLM applications grow more complex, developers are increasingly adopting multi-agent architectures to decompose workflows into specialized, collaborative components, introducing structure that cons…


Computer Science · Preprint

Slice Agent: Identifying and Isolating Slices in Shared Open Radio Unit

Felipe Arnholda, Flavio Rocha, Lucio Prade, Cristiano Bonato Both · 2026

Network Slice as a Service (NSaaS) is a key enabler of Beyond Fifth Generation (5G) and Sixth Generation (6G) networks, supporting next-generation applications such as extended reality (XR), immersive…


Computer Science · Preprint

NVLLM: A 3D NAND-Centric Architecture Enabling Edge on-Device LLM Inference

Mingbo Hao, Changwei Yan, Haoyu Cui, Zhihao Yan, Yizhi Ding, Zhangrui Qian, Weiwei Shan · 2026

The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out…


Computer Science · Preprint

SimdQuickHeap: The QuickHeap Reconsidered

Johannes Breitling, Ragnar Groot Koerkamp, Marvin Williams · 2026

Priority queues are data structures that maintain a dynamic collection of elements and allow inserting new elements and removing the smallest element. The most widely known and used priority queue is …

