The issue: Diverse Hallucinations Across LM Outputs
Large language models (LLMs) are prone to generating diverse factually incorrect statements, widely referred to as hallucinations.
Current approaches predominantly focus on coarse-grained automatic hallucination detection or editing, overlooking nuanced error levels.
What is FAVA?
Factuality Verification with Augmented Knowledge (FAVA) is a retrieval-augmented LM built by carefully designing synthetic data generation and fine-tuning an expert LM to detect and correct fine-grained hallucinations according to our novel hallucination taxonomy.
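As a rough usage illustration, the sketch below loads a FAVA-style checkpoint with Hugging Face transformers and asks it to edit a piece of model output given a retrieved evidence passage. The checkpoint name and the prompt layout are assumptions for illustration only; consult the released model card for the exact values.

```python
# Minimal sketch of FAVA-style detection/editing with Hugging Face transformers.
# MODEL_NAME and the prompt template are assumptions, not official release values.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "fava-uw/fava-model"  # assumed checkpoint identifier; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

passage = "Retrieved evidence passage about the topic goes here."
output = "The LM-generated text to verify and edit goes here."

# Assumed prompt layout: retrieved evidence first, then the text to be edited.
prompt = (
    f"Read the following references:\n{passage}\n"
    f"Please identify all the errors in the following text using the "
    f"references provided and suggest edits:\nText: {output}\nEdited: "
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens (the tagged, edited text).
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```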
How good is FAVA?
FAVA outperforms both ChatGPT and Llama 2-13B on editing and detection tasks.
On our benchmark, automatic and human evaluations show that FAVA significantly outperforms ChatGPT on fine-grained hallucination detection, by up to 38%, and that FAVA's edits improve the FActScore of ChatGPT and Llama 2 outputs by 5-10%, measuring editing performance.
Our novel hallucination taxonomy distinguishes six error types: entity, relation, contradictory, invented, subjective, and unverifiable errors.
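To make the taxonomy concrete, here is a small sketch of how one might count the fine-grained errors FAVA marks in its edited output. The tag names mirror the six categories above, and the span-level markup (including the <delete>/<mark> edit suggestions) is an assumption about the output format, shown for illustration.

```python
# Illustrative parser: count spans wrapped in each assumed error-type tag.
import re
from collections import Counter

ERROR_TAGS = ["entity", "relation", "contradictory",
              "invented", "subjective", "unverifiable"]

def count_tagged_errors(edited_text: str) -> Counter:
    """Count how many spans are wrapped in each error-type tag."""
    counts = Counter()
    for tag in ERROR_TAGS:
        counts[tag] = len(re.findall(rf"<{tag}>.*?</{tag}>",
                                     edited_text, flags=re.DOTALL))
    return counts

example = ("Lionel Messi joined <entity><delete>PSG</delete>"
           "<mark>Inter Miami</mark></entity> in 2023.")
print(count_tagged_errors(example))  # entity: 1, all other types: 0
```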
FAVA is trained on high-quality synthetic training data, and at inference, it identifies and fixes fine-grained factual errors, incorporating retrieved knowledge.
Below, we show our automatic data generation pipeline. FAVA outperforms vanilla ChatGPT and Llama 2-Chat on both editing and fine-grained detection tasks.
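The following sketch conveys the synthetic data generation idea: prompt a strong LM to inject one error of a chosen type into a clean passage, so that the tagged original serves as the gold correction. The prompt wording and the call_lm callable are illustrative assumptions, not the released pipeline.

```python
# Hedged sketch of synthetic error insertion for training data generation.
import random

ERROR_TYPES = ["entity", "relation", "contradictory",
               "invented", "subjective", "unverifiable"]

def build_insertion_prompt(passage: str, error_type: str) -> str:
    """Prompt asking an LM to corrupt the passage with one tagged error."""
    return (
        f"Insert exactly one {error_type} error into the passage below. "
        f"Wrap the erroneous span in <{error_type}>...</{error_type}> tags "
        f"and keep the rest of the passage unchanged.\n\nPassage: {passage}"
    )

def make_training_pair(passage: str, call_lm) -> dict:
    """Create one (corrupted text, error type, reference) training example.

    `call_lm` is any function mapping a prompt string to generated text,
    e.g. a wrapper around an API or a local model.
    """
    error_type = random.choice(ERROR_TYPES)
    corrupted = call_lm(build_insertion_prompt(passage, error_type))
    return {"input": corrupted,
            "target_edit_type": error_type,
            "reference": passage}
```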
@article{mishra2024finegrained,
author = {Mishra, Abhika and Asai, Akari and Balachandran, Vidhisha and Wang, Yizhong and Neubig, Graham and Tsvetkov, Yulia and Hajishirzi, Hannaneh},
title = {Fine-grained Hallucination Detection and Editing for Language Models},
journal = {arXiv preprint},
year = {2024},
url = {https://arxiv.org/abs/2401.06855}
}