Expertini Research

Browse Research Papers

10,721+ open-access research outputs.

Showing 10721 results for "polina kirichenko"
AI & Data Science · Preprint

PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning

Sudong Wang, Weiquan Huang, Xiaomin Yu, Zuhao Yang, Hehai Lin, Keming Wu, Chaojun Xiao, Chen Chen, Wenxuan Wang, Beier Zhu, Yunjian Zhang, Chengwei Qin · 2026

The standard post-training recipe for large multimodal models (LMMs) applies supervised fine-tuning (SFT) on curated demonstrations followed by reinforcement learning with verifiable rewards (RLVR). H…

AI & Data Science · Preprint

Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

Jingcheng Deng, Zihao Wei, Liang Pang, Junhong Wu, Shicheng Xu, Zenghao Duan, Huawei Shen · 2026

Latent reasoning offers a more efficient alternative to explicit reasoning by compressing intermediate reasoning into continuous representations and substantially shortening reasoning chains. However,…

AI & Data Science · Preprint

Knowledge Graph Representations for LLM-Based Policy Compliance Reasoning

Wilder Baldwin, Sepideh Ghanavati · 2026

The risks posed by AI features are increasing as they are rapidly integrated into software applications. In response, regulations and standards for safe and secure AI have been proposed. In this paper…

Engineering · Preprint

Can Tabular Foundation Models Guide Exploration in Robot Policy Learning?

Buqing Ou, Frederike Dumbgen · 2026

Policy optimization in high-dimensional continuous control for robotics remains a challenging problem. Predominant methods are inherently local and often require extensive tuning and carefully chosen …

AI & Data Science · Preprint

Bayesian policy gradient and actor-critic algorithms

Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko · 2026

Policy gradient methods are reinforcement learning algorithms that adapt a parameterized policy by following a performance gradient estimate. Conventional policy gradient methods use Monte-Carlo techn…

AI & Data Science · Preprint

APPSI-139: A Parallel Corpus of English Application Privacy Policy Summarization and Interpretation

Pengyun Zhu, Qiheng Sun, Long Wen, Yanbo Wang, Yang Cao, Junxu Liu, Deyi Xiong, Jinfei Liu, Zhibo Wang, Kui Ren · 2026

Privacy policies are essential for users to understand how service providers handle their personal data. However, these documents are often long and complex, as well as filled with technobabble and le…

AI & Data Science · Preprint

Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings

Tomomasa Hara, Hiroto Kurita, Masaaki Imaizumi, Kentaro Inui, Sho Yokoi · 2026

For constructing text embeddings, mean pooling, which averages token embeddings, is the standard approach. This paper examines whether mean pooling actually works well in real models. First, we note t…

AI & Data Science · Preprint

Co-Evolving Policy Distillation

Naibin Gu, Chenxu Yang, Qingyi Si, Chuanyu Qin, Dingyu Yao, Peng Fu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang · 2026

RLVR and OPD have become standard paradigms for post-training. We provide a unified analysis of these two paradigms in consolidating multiple expert capabilities into a single model, identifying capab…

Mathematics · Preprint

Deep Policy Iteration for High-Dimensional Mean-Field Games with Regenerative Reformulation

Shuixin Fang, Shupeng Wang, Zhen Wu, Hui Zhang, Tao Zhou · 2026

This paper develops a deep policy iteration method for high-dimensional finite-horizon mean-field games. We reformulate the game as a regenerative problem with deterministic cycles, which allows polic…

AI & Data Science · Preprint

Preserving Disagreement: Architectural Heterogeneity and Coherence Validation in Multi-Agent Policy Simulation

Ariel Sela · 2026

Multi-agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial consensus: evaluator agents converge on the same op…

AI & Data Science · Preprint

TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models

Jinho Choo, JunSeung Lee, Jimyeong Kim, Yeeho Song, S. K. Hong, Yeong-Dae Kwon · 2026

Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusi…

AI & Data Science · Preprint

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

Shuning Shang, Hubert Strauss, Stanley Wei, Sanjeev Arora, Noam Razin · 2026

Training language models via reinforcement learning often relies on imperfect proxy rewards, since ground truth rewards that precisely define the intended behavior are rarely available. Standard metri…

Engineering · Preprint

Reference-Augmented Learning for Precise Tracking Policy of Tendon-Driven Continuum Robots

Ziqing Zou, Ke Qiu, Haojian Lu, Rong Xiong, Yue Wang · 2026

Tendon-Driven Continuum Robots (TDCRs) pose significant control challenges due to their highly nonlinear, path-dependent dynamics and non-Markovian characteristics. Traditional Jacobian-based controll…

AI & Data Science · Preprint

Sample-efficient Neuro-symbolic Proximal Policy Optimization

Simone Murari, Celeste Veronese, Daniele Meli · 2026

Deep Reinforcement Learning (DRL) algorithms often require a large amount of data and struggle in sparse-reward domains with long planning horizons and multiple sub-goals. In this paper, we propose a …

Physics · Preprint

Piezomagnetic effect of a rare-earth-based altermagnet TbPt6Al3

Ryohei Oishi, Kazunori Umeo, Takuya Aoyama, Takahiro Onimaru, Kaya Kobayashi · 2026

We have investigated the piezomagnetic (PZM) effect of the rare-earth-based g-wave altermagnet TbPt6Al3 by magnetization measurements of single-crystalline samples under uniaxial stress σ. The mag…

AI & Data Science · Preprint

BARRED: Synthetic Training of Custom Policy Guardrails via Asymmetric Debate

Arnon Mazza, Elad Levi · 2026

Deploying guardrails for custom policies remains challenging, as generic safety models fail to capture task-specific requirements, while prompting LLMs suffers from inconsistent boundary-case performa…

Physics · Preprint

Adaptive Sensing beyond Non-Adaptive Information Limits: End-to-End Co-Design of Geometry, Policy, and Inference

Arvin Keshvari, William Tuxbury, Zin Lin · 2026

Inverse design has made vast physical parameter spaces a substrate for emergent behavior. In sensing, the stakes of this principle are sharpest at the analog-to-digital boundary, where any information…

AI & Data Science · Preprint

Frictive Policy Optimization for LLMs: Epistemic Intervention, Risk-Sensitive Control, and Reflective Alignment

James Pustejovsky, Nikhil Krishnaswamy · 2026

We propose Frictive Policy Optimization (FPO), a framework for learning language model policies that regulate not only what to say, but when and how to intervene in order to manage epistemic and norma…

Computer Science · Preprint

Spark Policy Toolkit: Semantic Contracts and Scalable Execution for Policy Learning in Spark

Zeyu Bai · 2026

Custom policy-learning pipelines in Spark fail for two coupled systems reasons: rowwise Python execution makes inference impractical, and driver-side candidate materialization makes split search fragi…

AI & Data Science · Preprint

DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

Junshuo Zhang, Chengrui Huang, Feng Guo, Zihan Li, Ke Shi, Menghua Jiang, Jiguo Yu, Shuo Shang, Shen Gao · 2026

Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration …
