LEAD

Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

LEAD is a plug-and-play decoding strategy for multimodal reasoning models. When the model enters high-entropy, uncertain reasoning stages, it preserves latent semantic competition in the token distribution; when entropy decreases, it switches back to discrete decoding for stable convergence, while visual anchors pull the reasoning process back toward image-grounded evidence.

Key finding: transition and discourse markers often coincide with high-entropy reasoning states, and hallucinations are more likely to emerge after these moments.
Entropy-aware switching: high-entropy phases use probability-weighted continuous embeddings for latent reasoning, while low-entropy phases return to discrete token decoding for convergence.
Visual anchor injection: visual anchors are introduced at critical uncertain steps to reduce image-detached, language-driven hallucination chains.
Empirical gains: LEAD consistently improves performance on hallucination, general-understanding, and math/science reasoning benchmarks, while also shortening reasoning length and preserving text quality.
Illustration of LEAD-style multimodal reasoning with a robot, book, and reasoning steps

Method Overview

LEAD treats token-level entropy as a signal of reasoning uncertainty. When the model is in a high-entropy state, a single sampled token is often insufficient to represent the true competition among candidate semantics, so LEAD directly builds probability-weighted embeddings from the full token distribution to preserve multiple semantic possibilities. Once entropy falls, the method switches back to standard discrete decoding. At the same time, LEAD injects visual anchors at critical uncertain stages to strengthen grounding and suppress hallucination propagation.
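The switching rule described above can be sketched as follows. This is a minimal illustration using NumPy stand-ins for a real model; the embedding matrix, the entropy threshold `tau`, and the `decode_step` interface are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of entropy-aware switching between latent and discrete
# decoding. All names and the threshold value are illustrative.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 8, 4
E = rng.normal(size=(VOCAB, DIM))  # toy token embedding matrix

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def entropy(probs):
    return -(probs * np.log(probs + 1e-12)).sum()

def decode_step(logits, tau=1.5):
    """Return the next input embedding and the decoding mode used."""
    probs = softmax(logits)
    if entropy(probs) > tau:
        # High entropy: a probability-weighted continuous embedding keeps
        # all competing candidate semantics in play (latent reasoning).
        return probs @ E, "latent"
    # Low entropy: commit to a discrete token for stable convergence.
    tok = int(np.argmax(probs))
    return E[tok], "discrete"

flat_logits = np.zeros(VOCAB)                    # maximal uncertainty
peaked_logits = np.zeros(VOCAB); peaked_logits[3] = 10.0  # near-certain
```

A flat distribution triggers the latent branch, while a sharply peaked one falls back to ordinary discrete decoding.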

Overview figure of the LEAD method
LEAD performs latent decoding during high-entropy stages and returns to discrete decoding during low-entropy stages, while visual and textual tokens jointly drive generation in a unified multimodal reasoning framework.
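The visual-anchor idea can likewise be sketched in a few lines: at an uncertain step, a pooled image feature is blended back into the next input embedding so generation re-attends to visual evidence. Mean pooling and the mixing weight `beta` here are hypothetical choices for illustration, not the paper's exact mechanism.

```python
# Illustrative sketch of visual anchor injection at uncertain steps.
# The pooling scheme and mixing weight are assumptions.
import numpy as np

def inject_visual_anchor(step_embed, visual_feats, beta=0.3):
    """Mix mean-pooled visual features into a reasoning-step embedding."""
    anchor = visual_feats.mean(axis=0)  # simple mean pooling over patches
    return (1.0 - beta) * step_embed + beta * anchor
```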

Key Findings and Motivation

We begin by analyzing the reasoning process itself and studying where hallucinations emerge relative to uncertainty. A direct observation is that, in multimodal reasoning models, hallucinations frequently co-occur with transition words such as because, however, so, and but. Further token-level analysis shows that high-entropy tokens are not negligible noise; instead, they are pivotal branching points that shape the subsequent reasoning trajectory. When these high-entropy tokens are associated with hallucinations, the model also tends to allocate less attention to visual content.

Statistics showing the correlation between hallucinations and transition words
Hallucinations frequently emerge around transition words, and these cases account for a substantial portion of overall hallucination instances, indicating a strong connection between uncertain reasoning nodes and erroneous inferences.
Analysis of high-entropy tokens and visual attention
High-entropy tokens have a larger effect on final performance, and earlier high-entropy tokens more strongly influence the full reasoning trajectory. High-entropy tokens associated with hallucinations also tend to exhibit lower visual attention.

Qualitative Visualization

LEAD not only improves final answers, but also changes the model's reasoning behavior. Qualitatively, the method maintains richer token distributions during high-entropy stages and reallocates visual attention at critical moments, preventing the model from drifting away from image evidence and coasting on linguistic momentum alone.

Qualitative visualization of LEAD
LEAD maintains more stable visual focus across reasoning steps while preserving a more distributed token probability landscape during latent reasoning, reflecting a dynamic exploration-to-convergence transition.

Main Results

Experimental results show that LEAD does not merely improve the final score on a single benchmark; instead, it achieves a better overall balance among sample efficiency, reasoning efficiency, and text quality. It reaches higher Pass@k accuracy with smaller k, attains better correctness with shorter average reasoning length, and maintains or improves fluency, naturalness, grammar, and perplexity.
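Pass@k in such evaluations is commonly computed with the unbiased estimator of Chen et al. (2021); the sketch below shows that standard formula for reference, and is not necessarily the exact evaluation code used here.

```python
# Unbiased Pass@k estimator: probability that at least one of k samples
# drawn without replacement from n generations (c of them correct) passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)
```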

Pass@k results of LEAD
On RealWorldQA and MathVista, LEAD's Pass@k curves consistently stay above the compared methods and reach high-accuracy regions earlier, demonstrating stronger sample efficiency.
Comparison of reasoning length and accuracy for LEAD
LEAD achieves higher accuracy with shorter average reasoning length, indicating that latent reasoning does not prolong deliberation and instead improves reasoning efficiency.
Text quality evaluation results of LEAD
In GPT-assisted evaluation, LEAD remains stable or improves on fluency, naturalness, grammar, and PPL, showing that the method does not trade text quality for higher accuracy.

Ablation Study

Ablation results further validate the core design choices of LEAD. A dynamic entropy threshold outperforms fixed-threshold strategies, suggesting that reasoning mode should switch adaptively according to uncertainty. In addition, the discrete reasoning window is not simply better when larger; a moderate window provides a better balance between semantic exploration and stable convergence.
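One plausible form of a dynamic entropy threshold, consistent with the ablation above, adapts to a running statistic of recent token entropies rather than a fixed cutoff. The window size and the `alpha` scaling below are illustrative assumptions, not the paper's reported configuration.

```python
# Dynamic threshold sketch: mean + alpha * std over a sliding window of
# recent token entropies. Parameters are illustrative assumptions.
from collections import deque
import statistics

class DynamicThreshold:
    def __init__(self, window: int = 32, alpha: float = 1.0):
        self.history = deque(maxlen=window)
        self.alpha = alpha

    def update(self, h: float) -> float:
        """Record the latest token entropy and return the new threshold."""
        self.history.append(h)
        mean = statistics.fmean(self.history)
        std = statistics.pstdev(self.history) if len(self.history) > 1 else 0.0
        return mean + self.alpha * std
```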

Ablation results for different entropy threshold strategies
On MMHalu and Bingo, the dynamic threshold strategy consistently outperforms fixed thresholds, indicating that adaptive switching is a key factor behind LEAD's success.
Ablation results for different window sizes
Performance peaks at a moderate window size: overly small windows hurt stability, while overly large ones weaken the exploration benefits of latent reasoning.

References

@article{lead2026,
  title   = {Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding},
  author  = {Anonymous},
  journal = {CVPR 2026 Submission},
  year    = {2026}
}