Publications

Research papers and ongoing work in adversarial ML, AI safety, and mechanistic interpretability.

Under Review · 2026

Toward Human-AI Complementarity Across Diverse Tasks

Yuzheng Xu, Annya Dahmani, Matthew D. Blanchard, Niclas Dern, Edy Nastase, Francesca Bianco, Maja Pavlovic, Sukanya Krishna, Eric Modesitt, Miranda Anna Christ, Arth Singh, Gaia Molinaro, Sikata Bela Sengupta, Jaji Pamarthi, Arjun Menon, Rishub Jain

COLM 2026 (under review)

We investigate human-AI complementarity through hybridization and two AI assistance methods on a 1,886-sample multi-domain dataset. Baseline hybridization improves on AI alone by only 0.4 percentage points, limited by a small complementarity region (8.9%) and poor confidence-based routing. Top-2 assistance raises human accuracy from 28.4% to 38.3%, but primarily because humans adopt correct AI suggestions rather than catch AI errors.
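As a rough illustration of the hybridization setup summarized above, the Python sketch below routes each item to either the AI or the human based on the AI's confidence, then measures the complementarity region. The `Sample` fields, the threshold rule, and the definition of the complementarity region (human correct, AI wrong) are assumptions made for illustration, not details taken from the paper, and the toy data is invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    ai_correct: bool       # did the AI answer this item correctly?
    human_correct: bool    # did the human answer this item correctly?
    ai_confidence: float   # AI's self-reported confidence in [0, 1]


def hybrid_accuracy(samples: List[Sample], threshold: float) -> float:
    """Accuracy when the AI answers if confident enough, else the human does.

    Assumed routing rule: use the AI answer when ai_confidence >= threshold.
    """
    hits = sum(
        s.ai_correct if s.ai_confidence >= threshold else s.human_correct
        for s in samples
    )
    return hits / len(samples)


def complementarity_region(samples: List[Sample]) -> float:
    """Fraction of items the human gets right and the AI gets wrong
    (assumed definition of the region where routing to a human can help)."""
    return sum(s.human_correct and not s.ai_correct for s in samples) / len(samples)


# Invented toy data, for illustration only.
toy = [
    Sample(ai_correct=True, human_correct=False, ai_confidence=0.9),
    Sample(ai_correct=True, human_correct=True, ai_confidence=0.8),
    Sample(ai_correct=False, human_correct=True, ai_confidence=0.3),
    Sample(ai_correct=False, human_correct=False, ai_confidence=0.7),
]

print(hybrid_accuracy(toy, threshold=0.5))  # 0.75
print(complementarity_region(toy))          # 0.25
```

In this toy case the AI alone scores 0.5, and routing helps only because the one item in the complementarity region also has low AI confidence. When confidence does not separate AI errors from AI successes, the same small region yields little or no gain.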

Human-AI Collaboration · AI Oversight · Decision Making

Preprint · 2026

EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

Arth Singh

arXiv preprint

We use exponential moving average (EMA) traces as a controlled probe to map what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure (reaching 96% of a BiGRU's performance on grammatical roles with zero labels) but destroy token identity (8x the perplexity of GPT-2). Only learned, input-dependent selection can resolve this.
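To make "fixed-coefficient accumulation" concrete, here is a minimal Python sketch of an EMA trace over a sequence of vectors. The update rule, the zero initial state, the value of `alpha`, and the one-hot toy tokens are assumptions chosen for illustration; the paper's exact formulation may differ.

```python
from typing import List


def ema_trace(inputs: List[List[float]], alpha: float = 0.5) -> List[List[float]]:
    """Return the EMA state after each input vector.

    Assumed update rule: h_t = alpha * h_{t-1} + (1 - alpha) * x_t,
    with h_0 = 0 and a fixed alpha that does not depend on the input.
    """
    if not inputs:
        return []
    state = [0.0] * len(inputs[0])
    history = []
    for x in inputs:
        # Every input is blended in with the same coefficient.
        state = [alpha * h + (1.0 - alpha) * v for h, v in zip(state, x)]
        history.append(state)
    return history


# Hypothetical one-hot "tokens" A, B, C.
A, B, C = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

print(ema_trace([A, B, C], alpha=0.5)[-1])  # [0.125, 0.25, 0.5]
print(ema_trace([C, B, A], alpha=0.5)[-1])  # [0.5, 0.25, 0.125]
```

The final state depends on the order of the inputs, so positional information survives as decay weights. Each token, however, is present only as a shrinking fraction of a blended vector, and the fixed `alpha` gives the trace no way to keep one input and discard another.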

Sequence Models · Mechanistic Analysis · Recurrent Networks

Accepted · 2026

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion Language Models

Arth Singh

ACL TrustNLP 2026 (accepted) · arXiv preprint

TrajHijack: the first trajectory-level attack on diffusion LLMs (dLLMs). Re-masking committed refusal tokens and adding an 8-token prefix achieves a 92-98% attack success rate (ASR) across all safety-tuned dLLMs, with no gradient computation. The strongest defense (A2D) is actually more vulnerable: the Defense Inversion Effect.

Adversarial ML · Diffusion Models · NLP

Preprint · 2026

Mechanistically Interpreting Compression in Vision-Language Models

Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das, Hreetam Paul

arXiv preprint

We apply mechanistic interpretability techniques to understand how vision-language models compress visual information, identifying key circuits and attention patterns.

Mechanistic Interpretability · VLMs · Compression

Ongoing · 2026

The Readability vs. Controllability Gap: Rethinking Where Safety Lives in LLMs

Arth Singh

ICML Mechanistic Interpretability Workshop 2026 (target)

Safety features in LLMs are most readable in the final layers but most controllable much earlier, a gap of 16-47% of model depth. A dual-level attack exploiting this gap achieves 92-97% ASR across four model families.

Mechanistic Interpretability · AI Safety · LLM Internals

Ongoing · 2026

SENTINEL: A Game-Theoretic Framework for Measuring Meaningful Human Control in Autonomous Weapons

Arth Singh

ICML AI4Good Workshop 2026 (target)

A game-theoretic framework for quantifying meaningful human control in autonomous weapons. The optimal human review time is 5-7 seconds; rushed review (under 2 seconds) is worse than full autonomy. Adversarial manipulation cuts compliant policies from 11.9% to 5.1%.

AI Governance · Game Theory · Autonomous Systems

Ongoing · 2026

GART: Mobile GUI Agent Red Teaming

Kyochul Jang, Arth Singh, et al.

NeurIPS 2026 (target)

A systematic red-teaming framework for evaluating the robustness and safety of mobile GUI agents against adversarial attacks.

Red Teaming · GUI Agents · Mobile

Preprint · 2026

Does Moral Reasoning Training Help or Hurt? Red-Teaming RL-Trained Ethical Agents

Arth Singh

Preprint

We red-team language models fine-tuned with reinforcement learning for moral reasoning, finding that ethical training can paradoxically create new attack surfaces.

Red Teaming · RL · AI Ethics

Ongoing · 2026

EGOx: A Novel VLM Benchmark for Physical AI Safety

Arth Singh, et al.

Physical AI Workshop @ IJCAI 2026 (target)

We are building a novel VLM benchmark and automated annotation pipeline for evaluating physical AI safety, measuring how vision-language models understand and reason about hazards in real-world environments.

Physical AI · VLM Benchmark · Safety