David Lowry Duda — Research Repository

Economics & Finance Preprint

Fast-Vollib: A Fast Implied Volatility Library for Python with PyTorch, JAX, and CUDA Fused-Kernel Backends

Raeid Saqur · 2026

We present fast-vollib, an open-source Python library that provides high-performance European option pricing, implied volatility (IV) computation, and Greeks under the Black-76, Black-Scholes, and Bla…

Computer Science Preprint

Revealing NVIDIA Closed-Source Driver Command Streams for CPU-GPU Runtime Behavior Insight

Yuang Yan, Ian Karlin, Ryan Grant · 2026

For NVIDIA GPUs, CUDA is the primary interface through which applications orchestrate GPU execution, yet much of the logic that realizes CUDA operations resides in NVIDIA's closed-source userspace dri…

AI & Data Science Preprint

MesonGS++: Post-training Compression of 3D Gaussian Splatting with Hyperparameter Searching

Shuzhao Xie, Junchen Ge, Weixiang Zhang, Jiahang Liu, Chen Tang, Yunpeng Bai, Shijia Ge, Jingyan Jiang, Yuzhi Huang, Fengnian Yang, Cong Zhang, Xiaoyi Fan, Zhi Wang · 2026

3D Gaussian Splatting (3DGS) achieves high-quality novel view synthesis with real-time rendering, but its storage cost remains prohibitive for practical deployment. Existing post-training compression …

Physics Preprint

SPHEREx Ultracool Dwarf spectral Atlas (SUDA): Atmospheric and Fundamental Parameters of Ultracool Dwarfs

Zhijun Tu, Shu Wang, Haomiao Huang, Xiaodian Chen, Jifeng Liu · 2026

We present the SPHEREx Ultracool Dwarf spectral Atlas (SUDA), a homogeneous sample of 1675 ultracool dwarfs with continuous 0.75--5 $\mu$m spectroscopy from SPHEREx QR2. Using the SAND and ATMO2020++ …

Computer Science Preprint

FACT: Compositional Kernel Synthesis with a Three-Stage Agentic Workflow

Sina Heidari, Dimitrios S. Nikolopoulos · 2026

Deep learning compilers and vendor libraries deliver strong baseline performance but are bounded by finite, engineer-curated catalogs. When these omit needed optimizations, practitioners substitute ha…

Physics Preprint

$\texttt{cuSkyrmion}$: A CUDA-OpenGL framework for interactive simulation and visualization of nuclei as Skyrmions

Sven Bjarke Gudnason, Paul Leask · 2026

We introduce $\texttt{cuSkyrmion}$, a 3-dimensional Skyrme model computation and visualization software, that is written in CUDA C for rapid computation and visualization of especially the arrested Ne…

AI & Data Science Preprint

DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

Hanqing Yang, Qiang Zhou, Yongchao Du, Sashuai Zhou, Zhibin Wang, Jun Song, Tiezheng Ge, Cheng Yu, Bo Zheng · 2026

Recent image editing models have achieved strong visual fidelity but often struggle with tasks requiring complex reasoning. To investigate and enhance the reasoning-grounded planning for image editing…

Computer Science Preprint

CUDA Kernel Optimization and Counter-Free Performance Analysis for Depthwise Convolution in Cloud Environments

Huriyeh Babak, Melanie Schaller · 2026

Efficient GPU execution of convolution operators is governed by memory-access efficiency, on-chip data reuse, and execution mapping rather than arithmetic throughput alone. This paper presents a contr…

Mathematics Preprint

A colimit decomposition for the loop homology of polyhedral products

Lewis Stanton, Fedor Vylegzhanin · 2026

We show that the loop homology algebras of polyhedral products of the form $(\underline{X},\underline{*})^{\mathcal{K}}$ can be written as a colimit over the flagification of $\mathcal{K}$, and obtain…

AI & Data Science Preprint

PointTransformerX: Portable and Efficient 3D Point Cloud Processing without Sparse Algorithms

Laurenz Reichardt, Nikolas Ebert, Oliver Wasenmuller · 2026

3D point cloud perception remains tightly coupled to custom CUDA operators for spatial operations, limiting portability and efficiency on non-NVIDIA, AMD, and embedded hardware. We introduce PointTran…

Mathematics Preprint

SUDA-Muon: Structural Design Principles and Boundaries for Fully Decentralized Muon

Hengrui Zhang, Boao Kong, Jiahe Geng, Zhengyang Huang · 2026

Fully decentralized Muon is difficult because its nonlinear matrix-sign operator does not commute with linear gossip averaging. This makes decentralized Muon a structural design problem: in designing …

Mathematics Preprint

Sharp pathwise nonuniqueness for additive SDEs

Elias Hess-Childs, Keefer Rowan · 2026

We construct a family of velocity fields demonstrating the sharpness of the classical Zvonkin--Veretennikov--Davie strong well-posedness by noise regime. We consider stochastic differential equations …

AI & Data Science Preprint

Building a GPU-Accelerated Multivariate Statistics Platform

Mike Crowhurst · 2026

Classical multivariate statistical methods such as covariance estimation and principal component analysis are well understood mathematically, yet their application at extreme data scales remains chall…

AI & Data Science Preprint

ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers

Chih-Chung Hsu, Xin-Di Ma, Wo-Ting Liao, Chia-Ming Lee · 2026

Existing attention accelerators often trade exact softmax semantics, depend on fused Tensor Core kernels, or incur sequential depth that limits FP32 throughput on long sequences. We present \textbf{EL…

Computer Science Preprint

Prompt-Unknown Promotion Attacks against LLM-based Sequential Recommender Systems

Yuchuan Zhao, Tong Chen, Junliang Yu, Zongwei Wang, Lizhen Cui, Hongzhi Yin · 2026

Large language model-powered sequential recommender systems (LLM-SRSs) have recently demonstrated remarkable performance, enabling recommendations through prompt-driven inference over user interaction…

Computer Science Preprint

ClusterFusion++: Expanding Cluster-Level Fusion to Full Transformer-Block Decoding

ChiHeng Jin, Hongche Yu, Xihui Chen · 2026

Large language model (LLM) decoding is latency-sensitive and often bottlenecked by fragmented operator execution and repeated off-chip materialization of intermediate tensors. Prior work expands fusio…

AI & Data Science Preprint

Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference

Divakar Kumar Yadav, Tian Zhao · 2026

Large Language Models (LLMs) have achieved strong performance across natural language and multimodal tasks, yet their practical deployment remains constrained by inference latency and kernel launch ov…

AI & Data Science Preprint

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

Divakar Kumar Yadav, Tian Zhao, Deepak Kumar · 2026

NVIDIA's CUDA Tile (CuTile) introduces a Python-based, tile-centric abstraction for GPU kernel development that aims to simplify programming while retaining Tensor Core and Tensor Memory Accelerator (…

Computer Science Preprint

ARCHES: Adaptive Real-Time Switching of AI Models for the RAN

Neagin Neasamoni Santhi, Davide Villa, Michele Polese, Salvatore D'Oro, Yunseong Lee, Koichiro Furueda, Tommaso Melodia · 2026

Artificial Intelligence (AI) has become a powerful tool for model-free Radio Access Network (RAN) signal processing and optimization. However, designing a single model that generalizes across all radi…

Physics Preprint

gateau: an observation simulator for ground-based submillimeter astronomy with integral field units and kinetic inductance detectors

A. Moerman, N. Soshnin, S. A. Brackenhoff, S. O. Dabironezare, K. Karatsu, L. H. Marting, S. A. H. de Rooij, M. Roos, B. R. Brandl, A. Endo · 2026

Submillimeter (submm) integral field units (IFUs) utilising kinetic inductance detectors (KIDs) are a promising instrument architecture for the study of galaxies, galaxy clusters, and the large-scale …
