LEAD

Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

LEAD is a plug-and-play decoding strategy for multimodal reasoning models. When the model enters high-entropy, uncertain reasoning stages, it preserves latent semantic competition in the token distribution; when entropy decreases, it switches back to discrete decoding for stable convergence, while visual anchors pull the reasoning process back toward image-grounded evidence.

Key finding: transition and discourse markers often coincide with high-entropy reasoning states, and hallucinations are more likely to emerge after these moments.
Entropy-aware switching: high-entropy phases use probability-weighted continuous embeddings for latent reasoning, while low-entropy phases return to discrete token decoding for convergence.
Visual anchor injection: visual anchors are introduced at critical uncertain steps to reduce image-detached, language-driven hallucination chains.
Empirical gains: LEAD consistently improves performance on hallucination, general-understanding, and math/science reasoning benchmarks, while also shortening reasoning length and preserving text quality.
Illustration of LEAD-style multimodal reasoning with a robot, book, and reasoning steps

Method Overview

LEAD treats token-level entropy as a signal of reasoning uncertainty. When the model is in a high-entropy state, a single sampled token is often insufficient to represent the true competition among candidate semantics, so LEAD directly builds probability-weighted embeddings from the full token distribution to preserve multiple semantic possibilities. Once entropy falls, the method switches back to standard discrete decoding. At the same time, LEAD injects visual anchors at critical uncertain stages to strengthen grounding and suppress hallucination propagation.
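The switching rule described above can be sketched as follows. This is a minimal illustration using NumPy stand-ins for a real model; the embedding matrix, the entropy threshold `tau`, and the `decode_step` interface are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of entropy-aware switching between latent and discrete
# decoding. All names and the threshold value are illustrative.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 8, 4
E = rng.normal(size=(VOCAB, DIM))  # toy token embedding matrix

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def entropy(probs):
    return -(probs * np.log(probs + 1e-12)).sum()

def decode_step(logits, tau=1.5):
    """Return the next input embedding and the decoding mode used."""
    probs = softmax(logits)
    if entropy(probs) > tau:
        # High entropy: a probability-weighted continuous embedding keeps
        # all competing candidate semantics in play (latent reasoning).
        return probs @ E, "latent"
    # Low entropy: commit to a discrete token for stable convergence.
    tok = int(np.argmax(probs))
    return E[tok], "discrete"

flat_logits = np.zeros(VOCAB)                    # maximal uncertainty
peaked_logits = np.zeros(VOCAB); peaked_logits[3] = 10.0  # near-certain
```

A flat distribution triggers the latent branch, while a sharply peaked one falls back to ordinary discrete decoding.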

Overview figure of the LEAD method
LEAD performs latent decoding during high-entropy stages and returns to discrete decoding during low-entropy stages, while visual and textual tokens jointly drive generation in a unified multimodal reasoning framework.
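The visual-anchor idea can likewise be sketched in a few lines: at an uncertain step, a pooled image feature is blended back into the next input embedding so generation re-attends to visual evidence. Mean pooling and the mixing weight `beta` here are hypothetical choices for illustration, not the paper's exact mechanism.

```python
# Illustrative sketch of visual anchor injection at uncertain steps.
# The pooling scheme and mixing weight are assumptions.
import numpy as np

def inject_visual_anchor(step_embed, visual_feats, beta=0.3):
    """Mix mean-pooled visual features into a reasoning-step embedding."""
    anchor = visual_feats.mean(axis=0)  # simple mean pooling over patches
    return (1.0 - beta) * step_embed + beta * anchor
```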

Key Findings and Motivation

We begin by analyzing the reasoning process itself and studying where hallucinations emerge relative to uncertainty. A direct observation is that, in multimodal reasoning models, hallucinations frequently co-occur with transition words such as because, however, so, and but. Further token-level analysis shows that high-entropy tokens are not negligible noise; instead, they are pivotal branching points that shape the subsequent reasoning trajectory. When these high-entropy tokens are associated with hallucinations, the model also tends to allocate less attention to visual content.

Statistics showing the correlation between hallucinations and transition words
Hallucinations frequently emerge around transition words, and these cases account for a substantial portion of overall hallucination instances, indicating a strong connection between uncertain reasoning nodes and erroneous inferences.
Analysis of high-entropy tokens and visual attention
High-entropy tokens have a larger effect on final performance, and earlier high-entropy tokens more strongly influence the full reasoning trajectory. High-entropy tokens associated with hallucinations also tend to exhibit lower visual attention.

Qualitative Visualization

LEAD not only improves final answers, but also changes the model's reasoning behavior. Qualitatively, the method maintains richer token distributions during high-entropy stages and reallocates visual attention at critical moments, preventing the model from drifting away from image evidence and coasting on linguistic momentum alone.

Qualitative visualization of LEAD
LEAD maintains more stable visual focus across reasoning steps while preserving a more distributed token probability landscape during latent reasoning, reflecting a dynamic exploration-to-convergence transition.

Main Results

Experimental results show that LEAD does not merely improve the final score on a single benchmark; instead, it achieves a better overall balance among sample efficiency, reasoning efficiency, and text quality. It reaches higher Pass@k accuracy with smaller k, attains better correctness with shorter average reasoning length, and maintains or improves fluency, naturalness, grammar, and perplexity.
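Pass@k in such evaluations is commonly computed with the unbiased estimator of Chen et al. (2021); the sketch below shows that standard formula for reference, and is not necessarily the exact evaluation code used here.

```python
# Unbiased Pass@k estimator: probability that at least one of k samples
# drawn without replacement from n generations (c of them correct) passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```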

Pass@k results of LEAD
On RealWorldQA and MathVista, LEAD's Pass@k curves consistently stay above the compared methods and reach high-accuracy regions earlier, demonstrating stronger sample efficiency.
Comparison of reasoning length and accuracy for LEAD
LEAD achieves higher accuracy with shorter average reasoning length, indicating that latent reasoning does not prolong deliberation and instead improves reasoning efficiency.
Text quality evaluation results of LEAD
In GPT-assisted evaluation, LEAD remains stable or improves on fluency, naturalness, grammar, and PPL, showing that the method does not trade text quality for higher accuracy.

Ablation Study

Ablation results further validate the core design choices of LEAD. A dynamic entropy threshold outperforms fixed-threshold strategies, suggesting that reasoning mode should switch adaptively according to uncertainty. In addition, the discrete reasoning window is not simply better when larger; a moderate window provides a better balance between semantic exploration and stable convergence.
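One plausible form of a dynamic entropy threshold, consistent with the ablation above, adapts to a running statistic of recent token entropies rather than a fixed cutoff. The window size and the `alpha` scaling below are illustrative assumptions, not the paper's reported configuration.

```python
# Dynamic threshold sketch: mean + alpha * std over a sliding window of
# recent token entropies. Parameters are illustrative assumptions.
from collections import deque
import statistics

class DynamicThreshold:
    def __init__(self, window: int = 32, alpha: float = 1.0):
        self.history = deque(maxlen=window)
        self.alpha = alpha

    def update(self, h: float) -> float:
        """Record the latest token entropy and return the new threshold."""
        self.history.append(h)
        mean = statistics.fmean(self.history)
        std = statistics.pstdev(self.history) if len(self.history) > 1 else 0.0
        return mean + self.alpha * std
```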

Ablation results for different entropy threshold strategies
On MMHalu and Bingo, the dynamic threshold strategy consistently outperforms fixed thresholds, indicating that adaptive switching is a key factor behind LEAD's success.
Ablation results for different window sizes
Performance peaks at a moderate window size: overly small windows hurt stability, while overly large ones weaken the exploration benefits of latent reasoning.

References

@article{lead2026,
  title   = {Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding},
  author  = {Anonymous},
  journal = {CVPR 2026 Submission},
  year    = {2026}
}