Relational Linearity is a Predictor of Hallucinations (arxiv.org)
arXiv:2601.11429v2 Announce Type: replace-cross
Abstract: Hallucination is a central failure mode of language models (LMs). We focus on hallucinations in response to questions like: "Which instrument did Glenn Gould play?", but we ask these questions for synthetic entities designed to be unknown to the model. We find that LMs like Gemma-7B-IT frequently hallucinate, i.e., they have difficulty recognizing that the hallucinated fact is not part of their knowledge. Based on the idea of linear relational embeddings, we put forward the following hypothesis. (i) Due to the abstract scheme that is used to represent them, LMs can easily produce plausible objects for non-existing subjects of linear relations, which can lead to hallucinations. (ii) For a nonlinear relation, this mechanism for producing an object is not available and so a hallucination is easier to avoid. To test this hypothesis, we create SyntHal, a synthetic unknown-entity benchmark for 15 relations. We find that across four instruction-tuned models, relational linearity is a strong predictor of models hallucinating an object for an unknown subject vs refusing to give an answer, with correlations $r \in [.58, .84]$.
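Part (i) of the hypothesis can be illustrated with a toy sketch. It is not the authors' method or the SyntHal benchmark: it assumes a relation is exactly an affine map $o = Ws + b$ over synthetic random vectors, and every name in it (`W_true`, `predict_object`, the dimension `d`) is hypothetical. The point is only that an affine map fitted on known subjects returns a well-formed object vector for any input, including a subject it has never seen.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # embedding dimension (arbitrary, for illustration only)
n_train = 200   # number of known subject-object pairs (synthetic)

# Hypothetical ground-truth affine relation: o = W s + b.
W_true = rng.normal(size=(d, d))
b_true = rng.normal(size=d)

S = rng.normal(size=(n_train, d))   # known subject representations
O = S @ W_true.T + b_true           # matching object representations

# Fit the affine map by least squares; a constant column carries the bias.
S_aug = np.hstack([S, np.ones((n_train, 1))])
coef, *_ = np.linalg.lstsq(S_aug, O, rcond=None)   # coef has shape (d + 1, d)
W_hat, b_hat = coef[:-1].T, coef[-1]

def predict_object(s):
    """Apply the fitted affine map to a subject representation."""
    return W_hat @ s + b_hat

# A subject vector that was never part of the training pairs.
s_unknown = rng.normal(size=d)
o_pred = predict_object(s_unknown)

# The map still yields an object vector; find the closest known object.
dists = np.linalg.norm(O - o_pred, axis=1)
nearest = int(np.argmin(dists))
print("index of nearest known object:", nearest)
```

Nothing in the affine map signals that `s_unknown` is unfamiliar, which is the intuition behind why linear relations might make refusal harder. For a relation without such a shared map, part (ii) of the hypothesis says this shortcut is unavailable.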