Publications
Research papers and ongoing work in adversarial ML, AI safety, and mechanistic interpretability.

Re-Mask and Redirect: Exploiting Denoising Irreversibility in Diffusion LMs
Arth Singh
ACL TrustNLP 2026
We demonstrate that diffusion-based language models exhibit a fundamental vulnerability: the denoising process is irreversible, and strategic re-masking of intermediate tokens can redirect generation toward adversarial outputs.

Mechanistically Interpreting Compression in Vision-Language Models
Veeraraju Elluru, Arth Singh, Roberto Aguero, Ajay Agarwal, Debojyoti Das, Hreetam Paul
ACL SRW 2026 (under review) · arXiv preprint
We apply mechanistic interpretability techniques to understand how vision-language models compress visual information, identifying key circuits and attention patterns.

The Readability vs. Controllability Gap: Rethinking Where Safety Lives in LLMs
Arth Singh
ICML Mechanistic Interpretability Workshop 2026 (target)
Safety features in LLMs are most readable in the final layers but most controllable much earlier, a gap spanning 16-47% of model depth. A dual-level attack that exploits this gap achieves a 92-97% attack success rate (ASR) across four model families.

SENTINEL: A Game-Theoretic Framework for Measuring Meaningful Human Control in Autonomous Weapons
Arth Singh
ICML AI4Good Workshop 2026 (target)
We develop a game-theoretic framework for quantifying meaningful human control in autonomous weapons systems. Optimal human review takes 5-7 seconds; rushed review (under 2 seconds) performs worse than full autonomy, and adversarial manipulation cuts the share of compliant policies from 11.9% to 5.1%.

GART: Mobile GUI Agent Red Teaming
Kyochul Jang, Arth Singh, et al.
NeurIPS 2026 (target)
A systematic red-teaming framework for evaluating the robustness and safety of mobile GUI agents against adversarial attacks.

Does Moral Reasoning Training Help or Hurt? Red-Teaming RL-Trained Ethical Agents
Arth Singh
Preprint
We red-team language models fine-tuned with reinforcement learning for moral reasoning, finding that ethical training can paradoxically create new attack surfaces.

EGOx: A Novel VLM Benchmark for Physical AI Safety
Arth Singh, et al.
Physical AI Workshop @ IJCAI 2026 (target)
We build a novel VLM benchmark and automated annotation pipeline for evaluating physical AI safety, measuring how vision-language models understand and reason about hazards in real-world environments.