Do You Trust Me? Cognitive-Affective Signatures of Trustworthiness in Large Language Models
The authors probe how models such as Llama 3.1, Qwen 2.5, and Mistral internally represent human trust signals in text. They show that specific attention heads reliably track fairness, certainty, and accountability cues, a finding that could inform the design of more trustworthy systems.
Gerard Yeo, Svetlana Churina