Chia Yi Su in Computer Science — Research Repository

Computer Science Preprint PDF DOI

AME-PIM: Can Memory be Your Next Tensor Accelerator?

Emanuele Venieri, Simone Manoni, Alberto Florian, Jaehyun Park, Kyomin Sohn, Andrea Bartolini · 2026

High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited…

RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

Yan-Cheng Guo, Tian-Sheuan Chang, Jian-Wei Su · 2026

Digital computing-in-memory (DCIM) has emerged as a promising solution for large language model (LLM) acceleration by minimizing data transfers between external DRAM and on-chip accelerators while mai…

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel

Yiqi Liu, Noelle Crawford, Michael Wang, Jilong Xue, Jian Huang · 2026

To overcome the well-known memory bottleneck of AI chips, 3D-stacked architectures that employ advanced packaging technology with high-density through-silicon via (TSV) pins have proven to be a prom…

Verification and Validation (V&V)-in-the-Loop for RISC-V Design: The Holistic Vision of BZL

Sajjad Ahmed, Alexander Kropotov, Roberto Ignacio Genovese, Bernat Homs, Eloi Merino, Francesco Urbani, Henrique Yano, Ivan Diaz, Joan Gracia Fernandez, Matteo Toselli, Muhammad Imran, Muhammad Abu Bakar Umar Haider Iqbal, Nadeem Yaseen, Quswar Abid, Shaista Cheema, Samuel Sanchez, Daniel Garcia, Joan Cabre, Mostafa Elyasi, Fernando Ayats, Miquel Moreto, Teresa Cervero, Oscar Palomar, Behzad Salami · 2026

The Barcelona Zetascale Lab (BZL) project aims to strengthen Europe's capacity in the design and manufacture of RISC-V-based high-performance computing chips. In this context, we present a holistic…

EMiX: Emulating Beyond Single-FPGA Limits

Alexander Kropotov, Miquel Moreto, Behzad Salami · 2026

FPGA-level emulation is a key step in pre-silicon chip design validation. However, emulating large-scale multi-core systems increasingly exceeds the hardware resource capacity of a single FPGA, limitin…

AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving

Zhongkai Yu, Haotian Ye, Chenyang Zhou, Ohm Rishabh Venkatachalam, Zaifeng Pan, Zhengding Hu, Junsung Kim, Won Woo Ro, Po-An Tsai, Shuyi Pei, Yangwook Kang, Yufei Ding · 2026

All current LLM serving systems place the GPU at the center, from production-level attention-FFN disaggregation to NVIDIA's Rubin GPU-LPU heterogeneous platform. Even academic PIM/PNM proposals still …

CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Huriyeh Babak, Melanie Schaller · 2026

Efficient GPU execution of convolution operators is governed by memory-access efficiency, on-chip data reuse, and execution mapping rather than arithmetic throughput alone. This paper presents a contr…

FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

Zihao Xuan, Jia Chen, Yewen Li, Wei Xuan, Hegan Chen, Xiao Huo, Fengbin Tu · 2026

In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference, with three key innovations: (1) a hybrid CIM p…

How Can Reinforcement Learning Achieve Expert-level Placement?

Ruo-Tong Chen, Ke Xue, Chengrui Gao, Yunqi Shi, Tian Xu, Peng Xie, Siyuan Xu, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou · 2026

Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training primarily focuses on wirelength optimization, and therefore …

The Blahut--Arimoto Algorithm as a Dynamical System with Exact $\chi^2$ Dissipation

Qiao Wang · 2026

This paper uncovers an exact $\chi^2$ dissipation identity for the Blahut--Arimoto (BA) flow and establishes its fundamental information-geometric structure. While prior works have analyzed BA converg…

Profiling Resilient to Change in Probe Position

Elie Bursztein, Michael Gruber, Karel Kral, Jean-Michel Picod, Matthias Probst, Georg Sigl · 2026

Side-channel analysis (SCA) relaxes the black-box assumption of conventional cryptanalysis by incorporating physical measurements acquired during cryptographic operations. Electromagnetic (EM) emissi…

Compilation and Execution of an Embeddable YOLO-NAS on the VTA

Anthony Faure-Gignoux, Kevin Delmas, Adrien Gauffriau, Claire Pagetti · 2026

Deploying complex Convolutional Neural Networks (CNNs) on FPGA-based accelerators is a promising way forward for safety-critical domains such as aeronautics. In previous work, we explored the V…

RefEvo: Agentic Design with Co-Evolutionary Verification for Agile Reference Model Generation

Yifan Zhang, Jianmin Ye, Jiahao Yang, Xi Wang · 2026

As the complexity of System-on-Chip (SoC) designs grows, the shift-left paradigm necessitates the rapid development of high-fidelity reference models (typically written in SystemC) for early architect…

FlowPlace: Flow Matching for Chip Placement

Peng Xie, Ke Xue, Yunqi Shi, Ruo-Tong Chen, Chengrui Gao, Siyuan Xu, Chenjian Ding, Mingxuan Yuan, Chao Qian · 2026

Chip placement plays an important role in physical design. While generative models like diffusion models offer promising learning-based solutions, current methods have the following limitations: they …

ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

ChiHeng Jin, Hongche Yu, Xihui Chen · 2026

Large language model (LLM) decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusio…

Revisiting and Expanding the IPv6 Network Periphery: Global-Scale Measurement and Security Analysis

Zixuan Xie, Zitao Yang, Shurui Fang, Zhaoyang Li, Wenxing Xie, Nannan Fu, Liangyu Dong, Xiang Li · 2026

As IPv6 deployment accelerates, understanding the evolving security posture of network peripheries becomes increasingly important. A DSN 2021 study introduced the first large-scale discovery of IPv6 n…

Design Rules for Extreme-Edge Scientific Computing on AI Engines

Zhenghua Ma, G Abarajithan, Dimitrios Danopoulos, Olivia Weng, Francesco Restuccia, Ryan Kastner · 2026

Extreme-edge scientific applications use machine learning models to analyze sensor data and make real-time decisions. Their stringent latency and throughput requirements demand small batch sizes and r…

Ternary Memristive Logic: Hardware for Reasoning Realized via Domain Algebra

Chao Li · 2026

Memristive crossbars store numerical weights needing aggregation and decoding; a single junction means nothing alone. This paper presents a fundamentally different use: each junction stores a complete…

CHICO-Agent: An LLM Agent for the Cross-layer Optimization of 2.5D and 3D Chiplet-based Systems

Qihang Wu, Aman Arora, Vidya A. Chhabria · 2026

The rapid growth of large language models (LLMs) and AI workloads has pushed monolithic silicon to its reticle and economic limits, accelerating the adoption of 2.5D/3D chiplet systems. However, these…

Homodyne Photonic Tensor Processor exceeds 1,000-TOPS

Lian Zhou, Kaiwen Xue, Yun-Jhu Lee, Chun-Ho Lee, Yuan Li, Kiwon Kwon, Weipeng Zhang, Songlin Zhao, Jason Moraes, Niranjan Bhatia, Ryan Hamerly, Mengjie Yu, Zaijun Chen · 2026

High-performance computing underpins modern artificial intelligence (AI), enabling foundation models, real-time inference and perception in autonomous systems, and data-intensive scientific simulation…
