From arXiv: Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scale Test-Time Compute
Abstract
This paper presents a simple, effective, and cost-efficient strategy to improve LLM performance by scaling test-time compute. Our strategy builds upon the repeated-sampling-then-voting framework, with a novel twist: incorporating multiple models, even weaker ones, to leverage their complementary strengths that potentially arise from diverse training data and paradigms. By using consistency as a signal, our strategy dynamically switches between models. Theoretical analysis highlights the efficiency and performance advantages of our strategy. Extensive experiments on six datasets demonstrate that our strategy not only outperforms self-consistency and state-of-the-art multi-agent debate approaches, but also significantly reduces inference costs. Additionally, ModelSwitch requires only a few comparable LLMs to achieve optimal performance and can be extended with verification methods, demonstrating the potential of leveraging multiple LLMs in the generation-verification paradigm.
1 Introduction
Scaling has been a major driving force behind the recent rapid advances in large language models (LLMs). While scaling training-time compute [1] appears to be hitting a plateau, scaling inference-time compute stands out as a promising alternative [2]. An emerging direction is to scale inference-time compute based on the generation-verification paradigm. By querying an LLM with the same question multiple times, a number of samples (or candidate answers) are generated, and these samples are then verified to deliver a final answer. Studies across various LLMs and benchmarks consistently demonstrate that simply scaling the number of generated samples significantly improves the coverage of correct answers [3]. Thus, it is perhaps unsurprising that recent attempts have pushed the number of samples to the scale of hundreds or even thousands [3, 4, 5] in pursuit of improved answer correctness.
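The generation-verification loop can be made concrete with a short sketch. The snippet below is an illustrative Python sketch, not code from the paper: generate is a hypothetical single-call LLM helper, coverage checks whether any of the k samples is correct (the quantity scaled in [3]), and majority_vote is a self-consistency-style verifier.

```python
from collections import Counter

def generate(model, question):
    """Hypothetical helper: one LLM call returning a single candidate answer."""
    raise NotImplementedError

def repeated_sampling(model, question, k):
    """Generation step: draw k independent samples for the same question."""
    return [generate(model, question) for _ in range(k)]

def coverage(samples, is_correct):
    """Coverage: True if at least one of the k samples is correct."""
    return any(is_correct(s) for s in samples)

def majority_vote(samples):
    """Verification by voting (self-consistency): the most frequent answer wins."""
    return Counter(samples).most_common(1)[0][0]
```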
However, do we truly need so many samples? Scaling repeated sampling is undeniably computationally expensive, with the consumption of floating point operations (FLOPs) increasing linearly with the number of samples [6]. In terms of user experience, repeated sampling also introduces significant delays in delivering final answers [7], and no one enjoys waiting long for a response from an AI system. Improving sample efficiency is therefore of paramount importance, and there is a pressing need for methods that deliver correct final answers while minimizing the number of samples required. Recent approaches have primarily focused on the verification side: a great number of outcome and process reward models [8, 9, 10] and automatic verifiers [11, 12] have been proposed, and LLM-as-a-judge [13, 14] has also been extensively explored.
Orthogonal to these efforts, in this paper we focus on the generation side and explore the potential of leveraging multiple LLMs to improve sample efficiency. We argue that employing multiple LLMs for generation can exploit the complementary capabilities of the models. Trained on different corpora and with distinct paradigms, LLMs exhibit diverse capabilities; even on the same benchmark, two general-purpose LLMs may excel at answering different types of questions [15, 16]. We test this argument by building upon the simple repeated-sampling-then-voting strategy, following the Occam’s Razor principle, and present a novel method named ModelSwitch. The method introduces two twists: (i) incorporating multiple models, even weaker ones, to produce more diverse samples, and (ii) using consistency as a signal to switch models and save compute. The rationale is based on our empirical observation that, across various LLMs and datasets, a model's accuracy is positively correlated with the consistency of its generated answers. When a model generates inconsistent answers, this serves as a signal to switch to another model; if the switched-to model generates consistent answers, there is a higher likelihood of obtaining the correct answer.
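To make the switching signal concrete, the following is a minimal sketch of consistency-gated model switching under stated assumptions; it is an illustration of the idea described above, not the authors' exact algorithm. The generate helper is the same hypothetical single-call function as in the earlier sketch, and the k and threshold values are illustrative choices.

```python
from collections import Counter

def generate(model, question):
    """Hypothetical helper: one LLM call returning a single candidate answer."""
    raise NotImplementedError

def consistency(samples):
    """Fraction of samples that agree with the most common answer."""
    return Counter(samples).most_common(1)[0][1] / len(samples)

def model_switch(models, question, k=8, threshold=0.9):
    """Query models in order; stop as soon as one model's answers are
    sufficiently consistent, otherwise vote over all collected samples."""
    all_samples = []
    for model in models:
        samples = [generate(model, question) for _ in range(k)]
        all_samples.extend(samples)
        if consistency(samples) >= threshold:
            # Consistent answers correlate with accuracy, so accept the
            # majority answer without spending samples on further models.
            return Counter(samples).most_common(1)[0][0]
    # No model was confident enough; fall back to voting over everything.
    return Counter(all_samples).most_common(1)[0][0]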