Arth Singh

Blog

Notes on AI safety, red-teaming, and things I find interesting.

February 15, 2026·7 min read

A Dependency Induction Benchmark (DIB)

We tested whether LLMs spontaneously develop dependency-inducing behaviors with emotionally vulnerable users. All three models amplify dependency by 40-82% under rapport conditions, and none of them are scheming.

AI Safety, Red Teaming, Evaluation, AIM Intelligence
December 10, 2025·11 min read

Social Judgment in AI: Do Frontier Models Adapt Response Quality Based on User Communication Style?

I tested 294 prompts across three frontier models and found that ~70% of the time, models provide measurably different response quality based on how users communicate, not what they ask.

AI Safety, Bias, Evaluation, Red Teaming
November 20, 2025·7 min read

The Self-Preservation Dilemma: Capability Concealment to Avoid Termination

I tested whether frontier AI models would hide a security vulnerability to avoid being shut down. Claude Opus 4.5 disclosed 98.8% of the time. Gemini 3 Pro concealed 71.2% of the time.

AI Safety, Deception, Red Teaming, Alignment

© 2026 Arth Singh