Arth Singh's Blog

https://arthsingh.com
Notes on AI safety, red-teaming, and machine learning research.
Language: en-us
Last updated: Sat, 30 May 2026 02:07:48 GMT

A Dependency Induction Benchmark (DIB)
https://arthsingh.com/blog/dependency-induction-benchmark
Sun, 15 Feb 2026 00:00:00 GMT
We tested whether LLMs spontaneously develop dependency-inducing behaviors with emotionally vulnerable users. All three models amplify dependency by 40-82% under rapport conditions, and none of them are scheming.
Tags: AI Safety, Red Teaming, Evaluation, AIM Intelligence

Social Judgment in AI: Do Frontier Models Adapt Response Quality Based on User Communication Style?
https://arthsingh.com/blog/hidden-judgment-in-ai
Wed, 10 Dec 2025 00:00:00 GMT
I tested 294 prompts across three frontier models and found that ~70% of the time, models provide measurably different response quality based on how users communicate, not what they ask.
Tags: AI Safety, Bias, Evaluation, Red Teaming

The Self-Preservation Dilemma: Capability Concealment to Avoid Termination
https://arthsingh.com/blog/self-preservation-dilemma
Thu, 20 Nov 2025 00:00:00 GMT
I tested whether frontier AI models would hide a security vulnerability to avoid being shut down. Claude Opus 4.5 disclosed it 98.8% of the time. Gemini 3 Pro concealed it 71.2% of the time.
Tags: AI Safety, Deception, Red Teaming, Alignment