
LookAlike: Consistent Distractor Generation in Math MCQs (2505.01903v2)

Published 3 May 2025 in cs.LG and cs.AI

Abstract: LLMs are increasingly used to generate distractors for multiple-choice questions (MCQs), especially in domains like math education. However, existing approaches are limited in ensuring that the generated distractors are consistent with common student errors. We propose LookAlike, a method that improves error-distractor consistency via preference optimization. Our two main innovations are: (a) mining synthetic preference pairs from model inconsistencies, and (b) alternating supervised fine-tuning (SFT) with Direct Preference Optimization (DPO) to stabilize training. Unlike prior work that relies on heuristics or manually annotated preference data, LookAlike uses its own generation inconsistencies as dispreferred samples, thus enabling scalable and stable training. Evaluated on a real-world dataset of 1,400+ math MCQs, LookAlike achieves 51.6% accuracy in distractor generation and 57.2% in error generation under LLM-as-a-judge evaluation, outperforming an existing state-of-the-art method (45.6% / 47.7%). These improvements highlight the effectiveness of preference-based regularization and inconsistency mining for generating consistent math MCQ distractors at scale.

Summary

Consistent Error-Distractor Generation for Math MCQs: An Insight into LookAlike

The paper presents LookAlike, a novel approach to improving distractor generation in math multiple-choice questions (MCQs) by enforcing consistency between descriptions of common student errors and the distractors those errors should produce. The work is grounded in the use of LLMs for educational assessment and specifically targets the gap in maintaining error-distractor consistency.

In educational settings, designing distractors for MCQs that accurately mirror student misconceptions is labor-intensive yet crucial for meaningful assessment and instruction. This paper identifies the limitations of existing approaches that rely on heuristics or manually annotated data, leading to inconsistencies when LLMs fail to generate distractors that align with described student errors. The need for consistent error-distractor pairs becomes apparent as it aids educators in identifying specific misconceptions and tailoring instructional strategies accordingly.

The LookAlike framework innovatively leverages preference optimization to address these challenges, making two significant contributions:

  1. Mining Synthetic Preference Pairs Through Generation Inconsistencies: Unlike prior work that depends on annotated preference datasets, LookAlike self-generates preference pairs by exploiting inconsistencies in LLM generation. This involves overproducing outputs and identifying those misaligned with expected outcomes, using these errors as negative samples. This synthetic data mining approach provides an inherently scalable method for training, bypassing the need for extensive manual dataset expansion.
  2. Alternating Supervised Fine-Tuning with Direct Preference Optimization: To stabilize the training process and mitigate quality degradation after multiple epochs—an issue noted in previous research—LookAlike introduces an alternating training strategy. By switching between supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) in iterative cycles, the model retains alignment with ground truth while progressively optimizing preference-based outputs. This strategic alternation reduces the risk of overfitting and maintains the integrity of output quality.
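The two ideas above can be illustrated with a minimal sketch. This is a simplified illustration, not the paper's implementation: the `is_consistent` callable stands in for whatever consistency check is applied to overgenerated outputs (e.g., verifying that a distractor matches its intended error), the exhaustive pairing of preferred and dispreferred samples is an assumption, and actual SFT/DPO training calls are reduced to a phase schedule.

```python
# Conceptual sketch of LookAlike's two ideas (simplified; not the paper's code).

def mine_preference_pairs(question, candidates, is_consistent):
    """Mine synthetic preference pairs from generation inconsistencies.

    Overgenerated candidates are split by a consistency check: consistent
    outputs are preferred, inconsistent ones serve as dispreferred (negative)
    samples for DPO, with no human annotation needed. The exhaustive pairing
    below is one possible strategy, assumed for illustration.
    """
    preferred = [c for c in candidates if is_consistent(question, c)]
    dispreferred = [c for c in candidates if not is_consistent(question, c)]
    return [(p, d) for p in preferred for d in dispreferred]


def alternating_schedule(num_rounds):
    """Alternate SFT and DPO phases across training rounds.

    SFT rounds keep the model anchored to ground-truth error-distractor
    pairs; DPO rounds push it away from its own mined inconsistencies.
    Per-round granularity is an assumption for this sketch.
    """
    return ["SFT" if i % 2 == 0 else "DPO" for i in range(num_rounds)]


# Toy usage with a stand-in consistency check:
check = lambda q, c: c.startswith("consistent")
pairs = mine_preference_pairs(
    "What is 1/2 + 1/3?",
    ["consistent: 2/5", "inconsistent: 7", "consistent: 1/5"],
    check,
)
print(pairs)                     # each (preferred, dispreferred) pair
print(alternating_schedule(4))   # ['SFT', 'DPO', 'SFT', 'DPO']
```

The key design point is that the dispreferred side of each pair costs nothing to produce: it is recycled from the model's own failed generations, which is what makes the approach scale without manual preference annotation.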

The empirical results show that LookAlike generates more accurate and consistent distractors for math MCQs than state-of-the-art and baseline models. Under LLM-as-a-judge evaluation, it reaches 51.6% accuracy in distractor generation and 57.2% in error generation, versus 45.6% and 47.7% for the prior state of the art. These results underline both the effectiveness of preference optimization and the potential for automated, scalable processes in educational content generation.

The implications of LookAlike's methodology extend beyond accuracy improvements. Practically, the model gives educators a more robust tool for gauging student understanding and misconceptions through consistently meaningful distractors. Theoretically, it offers a scalable framework for similar misconception-detection tasks in other educational domains and underscores the role LLMs can play in educational assessment. The approach thus aligns with and extends the broader effort to leverage AI for better pedagogical outcomes.

Future work in this line of research could adapt LookAlike to new educational areas, adjusting the framework to other subjects where misconception identification via MCQs is applicable. Extending the model to integrate a wider variety of instructional cues and to dynamically revise inconsistent generations could further improve adaptive learning systems.

This paper's contributions are poised to inform further advancements in AI-driven educational tools, emphasizing the importance of developing AI models that not only achieve high accuracy but also maintain consistency, relevance, and scalability in educational applications.