Transfer of ensemble benefits to open-ended generation

Determine whether the improvements observed on structured text generation tasks, both from consensus-seeking generalized-mean f-ensembles and from better posterior approximations via sequential Monte Carlo, transfer to open-ended generation tasks such as creative writing and dialogue.

Background

The paper introduces f-ensembles—a unified framework for composing multiple LLMs with generalized means—and proposes a Sequential Monte Carlo (SMC) algorithm to sample from global ensemble distributions over strings. Empirically, the authors evaluate on structured text generation tasks (JSON schema validation, word sorting, and text-to-SQL) where correctness can be assessed objectively.

Across these structured tasks, consensus-seeking ensembles (e.g., product and minimum) often outperform coverage-seeking ones, and better posterior approximations correlate with improved expected accuracy. However, evaluation in open-ended generation (e.g., creative writing, dialogue) is inherently more subjective and less standardized, so it remains uncertain whether the same benefits carry over to such domains.
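To make the consensus-seeking versus coverage-seeking distinction concrete, the following is a minimal sketch of a generalized (power) mean applied to two toy distributions over three candidate strings. It is an illustration under stated assumptions, not the paper's implementation: the function names, the equal weighting of models, and the toy probabilities are all hypothetical, and the candidate set is enumerated explicitly so that normalization is exact, whereas the paper targets distributions over all strings and approximates them with SMC.

```python
import math

def power_mean(values, p):
    """Equal-weight generalized mean of strictly positive numbers.

    p = 1 is the arithmetic mean (mixture), p = 0 the geometric mean
    (a normalized product), p = -inf the minimum, p = +inf the maximum.
    """
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / len(values))
    if p == float("-inf"):
        return min(values)
    if p == float("inf"):
        return max(values)
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)

def ensemble(dists, p):
    """Combine distributions over the same finite candidate set, then renormalize."""
    keys = dists[0].keys()
    scores = {k: power_mean([d[k] for d in dists], p) for k in keys}
    total = sum(scores.values())
    return {k: s / total for k, s in scores.items()}

# Hypothetical toy models: each strongly prefers a different string,
# but both assign moderate probability to "y".
model_a = {"x": 0.6, "y": 0.3, "z": 0.1}
model_b = {"x": 0.1, "y": 0.3, "z": 0.6}

mixture = ensemble([model_a, model_b], 1)              # coverage-seeking
product = ensemble([model_a, model_b], 0)              # consensus-seeking
minimum = ensemble([model_a, model_b], float("-inf"))  # consensus-seeking
```

In this toy setting the mixture ranks "x" and "z" above "y", because each is strongly preferred by one model, while the product and minimum ensembles put the most mass on "y", the only string both models find plausible. The sketch assumes strictly positive probabilities; zeros would need special handling for p <= 0.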

References

Whether these benefits transfer to open-ended generation tasks (e.g., creative writing, dialogue) remains an open question, as such settings are harder to evaluate.

Ensembling Language Models with Sequential Monte Carlo (2603.05432 - Chan et al., 5 Mar 2026) in Limitations, Task selection