MLLM-Assisted Conformity Enhancement (MACE)
- MACE is a prompt-based framework that uses large multimodal language models to enforce strict alignment with visual evidence or group consensus.
- It filters out non-grounded or hallucinated elements by retaining only tokens and aspects that are visually supported by, or inferable from, the input data.
- In multi-agent settings, MACE balances self-confidence and peer influence via confidence-normalized pooling to achieve robust consensus accuracy.
MLLM-Assisted Conformity Enhancement (MACE) is a prompt-based method that leverages large multimodal LLMs (MLLMs) or ensembles of LLM-based agents to enforce strict alignment of multimodal outputs with external schema, visual evidence, or prevailing group judgments. Originally applied for curation and grounding of noisy item listings in e-commerce, and later formalized in the context of collective LLM agent systems, MACE eliminates non-grounded or hallucinated elements, ensuring robust, visually justified, and schema-compliant results (Zhang et al., 13 Aug 2025, Han et al., 9 Jan 2026).
1. Conceptual Foundations
MACE operationalizes “conformity” as the systematic enforcement of agreement between multimodal model outputs and either explicit visual evidence (in e-commerce applications) or group-level consensus (in multi-agent LLM systems). In the e-commerce context, it constitutes a filter removing non-visual or spurious information from raw image–text–aspect triplets—ensuring all retained tokens and key–value pairs are visually supported or inferable. In LLM multi-agent systems, MACE formalizes how agents shift their predictions towards the prevailing judgments in a networked population, balancing self-confidence and peer influence to maximize consensus accuracy while mitigating the risk of collective error cascades (Zhang et al., 13 Aug 2025, Han et al., 9 Jan 2026).
2. Task Formulation and Mathematical Structure
In e-commerce vision–language applications (Zhang et al., 13 Aug 2025), the formal MACE task is specified as follows:
- Inputs:
  - $I$: product image (or frozen visual embedding)
  - $T$: noisy/original title string
  - $A$: set of raw aspect key–value pairs
  - $\mathcal{S}$: platform schema (allowed aspect keys)
- Outputs:
  - $\hat{T}$: title rewritten to include only visually inferable tokens
  - $\hat{A}$: subset of aspects confirmed from $A$
- Visual Grounding Predicates:
  - $g(t, I)$: whether token $t$ is visually supported by $I$
  - $h(k, v, I)$: whether value $v$ is visually confirmable for key $k$
- Mathematical Formulation:

$$\hat{T} = \{\, t \in T : g(t, I) = 1 \,\}, \qquad \hat{A} = \{\, (k, v) \in A : k \in \mathcal{S},\ h(k, v, I) = 1 \,\}$$
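Under these definitions, the MACE filter amounts to a pair of set comprehensions. A minimal sketch (the predicates `grounded_token` and `grounded_aspect` are stand-ins for the VLM's visual-grounding judgments, not part of the original formulation):

```python
def mace_filter(title_tokens, aspects, schema, grounded_token, grounded_aspect):
    """Keep only visually grounded title tokens and schema-compliant,
    visually confirmed aspect key-value pairs."""
    new_title = [t for t in title_tokens if grounded_token(t)]
    new_aspects = {k: v for k, v in aspects.items()
                   if k in schema and grounded_aspect(k, v)}
    return " ".join(new_title), new_aspects
```

For example, a promotional token like "SHIP FAST" fails the grounding predicate and is dropped, while an out-of-schema aspect key is removed regardless of grounding.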
In multi-agent systems (Han et al., 9 Jan 2026), MACE's conformity mechanism is instantiated by confidence-normalized pooling, recursively updating each agent's support score:

$$s_i^{(t+1)} = \frac{\alpha\, s_i^{(t)} + \beta \sum_{j \in N(i)} s_j^{(t)}}{\alpha + \beta\, |N(i)|}$$

where $\alpha$ (self-weight) and $\beta$ (social-weight) modulate the trade-off between self-reliance and conformity, and $N(i)$ denotes the network neighbors of agent $i$.
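Plugging illustrative numbers into the pooling rule (a quick sanity check, not values from the paper): with equal weights $\alpha = \beta = 1$, a single confident dissenter is pulled below the decision threshold by two confident neighbors:

```python
# One pooling step with alpha = beta = 1: a confident "yes"
# (confidence-weighted score 1.0) faces two confident "no" neighbors
# (score 0.0 each).
alpha, beta = 1.0, 1.0
s_self, neighbor_scores = 1.0, [0.0, 0.0]
pooled = (alpha * s_self + beta * sum(neighbor_scores)) / (
    alpha + beta * len(neighbor_scores))
# pooled = 1/3, below a tau = 0.5 threshold, so the agent conforms.
```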
3. MACE Algorithmic Workflows
E-Commerce Conformity Enhancement (Zhang et al., 13 Aug 2025)
MACE is implemented as a deterministic, zero-shot prompt applied to each raw image–title–aspect example:
- For each example $(I, T, A)$:
  - Prompt a large VLM (e.g., InternVL2.5-78B):
    - “Given image $I$ and its title $T$ and aspects $A$,
    - 1) rewrite the title to remove tokens not grounded in the image,
    - 2) remove aspects not visually confirmed.”
  - Parse the response as JSON.
  - Validate against schema $\mathcal{S}$.
  - Add $(I, \hat{T}, \hat{A})$ to the refined dataset.
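The loop above can be sketched as follows; `call_vlm` is a placeholder for the actual InternVL2.5-78B API call, and the prompt wording is paraphrased from the steps above:

```python
import json

PROMPT = (
    "Given the image and its title '{title}' and aspects {aspects}: "
    "1) rewrite the title to remove tokens not grounded in the image; "
    "2) remove aspects not visually confirmed. "
    "Reply as JSON with keys 'title' and 'aspects'."
)

def refine_example(image, title, aspects, schema, call_vlm):
    # call_vlm(image, prompt) -> raw JSON string from the VLM (placeholder).
    raw = call_vlm(image, PROMPT.format(title=title, aspects=aspects))
    parsed = json.loads(raw)
    # Schema validation: drop any aspect key the platform does not allow.
    valid_aspects = {k: v for k, v in parsed["aspects"].items() if k in schema}
    return parsed["title"], valid_aspects
```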
Multi-Agent Conformity Dynamics (Han et al., 9 Jan 2026)
Agents are nodes in a network with specific topology (e.g., star, ring, complete graph):
- At each timestep:
- Each agent updates its internal state by integrating self-score and neighbor inputs using the confidence-normalized pooling rule.
- Update iterations continue until unanimity or a preset maximum number of rounds.
- Two canonical protocols:
- Centralized Aggregation: One-shot; a hub agent computes the final decision from leaf submissions. Used where a trusted “expert” agent is available.
- Distributed Consensus: Iterative peer-to-peer pooling allows group convergence, enhancing robustness but potentially introducing latency and synchronization cost.
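For concreteness, the three canonical topologies can be encoded as adjacency lists (a minimal sketch; treating node 0 as the hub of the star is an assumption):

```python
def star(n):
    # Hub-and-leaf: node 0 is the hub, all leaves attach only to it.
    adj = {0: list(range(1, n))}
    for i in range(1, n):
        adj[i] = [0]
    return adj

def ring(n):
    # Each node has exactly two neighbors.
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def complete(n):
    # Fully connected: every node sees every other node.
    return {i: [j for j in range(n) if j != i] for i in range(n)}
```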
4. Empirical Performance and Metrics
E-Commerce Experiments (Zhang et al., 13 Aug 2025)
Empirical evaluation demonstrates that MACE significantly enhances groundedness and compliance:
| Model | Rouge-L | Aspect F1 | Schema Recall |
|---|---|---|---|
| LLaVA-NeXT-7B Baseline | 0.36 | 0.33 | 0.49 |
| LLaVA-NeXT-7B + MACE | 0.43 | 0.35 | 0.53 |
- MACE yields a +19% improvement in Rouge-L, +6% in aspect F1, +8% in schema recall over baselines.
- Qualitative analysis shows removal of non-visual tokens (e.g., “Size 12”, “SHIP FAST”) and non-confirmable aspects.
Multi-Agent MAS Metrics (Han et al., 9 Jan 2026)
Key performance measures examined include:
- Final Accuracy (FA)
- Time-to-Consensus (TTC)
- Conformity Index (CI), Average CI (ACI)
- Center–Periphery Consistency (CPC)
Topological and parameter dependencies:
- Distributed settings with a moderate number of neighbors strike a robust balance, with final accuracy reaching up to $0.83$.
- Centralized aggregation achieves the highest consensus accuracy; over-conformity (low $\alpha$) increases cascade risk.
5. Design Principles and Topological Trade-offs
Self–Social Weighting
- Moderate $\alpha$: balanced speed and robustness.
- High $\alpha$ ($0.75$–$1.0$): maximizes reliability, slows convergence.
- Low $\alpha$ ($0.25$): prioritizes speed, increases susceptibility to information cascades.
Topology Choices
| Topology | Speed | Robustness | Failure Mode |
|---|---|---|---|
| Centralized (Star/Hierarchy) | Fastest | Hub-dependent | Single-point-of-failure, alignment bias |
| Moderately Connected (Ring) | Balanced | High | Manageable cascade risk |
| Fully Connected (Complete) | Fastest consensus | High conformity | Highest risk of wrong cascades |
Cascade Mitigation
- Limit connectivity (keep each agent's neighborhood small).
- Maintain moderate-to-high self-weight $\alpha$.
- Implement dissent tokens for abrupt surges in agent confidence.
- Periodically inject trusted-oracle signals.
- Employ dynamic decision thresholds (raising $\tau$ in late rounds to require a super-majority).
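The last safeguard can be sketched as a round-dependent threshold schedule (the linear ramp and the $0.5 \to 2/3$ endpoints are illustrative assumptions, not values from the paper):

```python
def decision_threshold(round_idx, max_rounds, tau0=0.5, tau_final=2 / 3):
    # Raise the decision threshold linearly from a simple majority
    # toward a super-majority as rounds progress, so late-stage vote
    # flips require stronger pooled support.
    frac = round_idx / max(max_rounds - 1, 1)
    return tau0 + (tau_final - tau0) * frac
```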
6. Implementation Details and Example Code
E-Commerce Prompting (Zhang et al., 13 Aug 2025)
- Data preparation: ∼1M raw samples yield ∼890k MACE-refined pairs.
- Prompting: InternVL2.5-78B, structured JSON output, single turn per example.
- Downstream tuning: LLM parameters fully fine-tuned on MACE output; vision encoder frozen; LLaVA-NeXT-7B, Qwen2-VL-7B, InternVL2.5-8B evaluated.
Multi-Agent System Code (Han et al., 9 Jan 2026)
Pseudocode for distributed conformity dynamics:
```python
def pooled_score(agent, neighbors, alpha, beta, eps=1e-6):
    # Confidence-weighted self vote and neighbor votes.
    s_self = agent.p * agent.y
    c_neighbors = [nbr.p * nbr.y for nbr in neighbors]
    num = alpha * s_self + beta * sum(c_neighbors)
    den = alpha + beta * len(neighbors) + eps
    return num / den

def update_agents(agents, adjacency, alpha, beta, tau=0.5):
    # Synchronous update: compute all new states before committing any.
    new_states = []
    for i, agent in enumerate(agents):
        nbrs = [agents[j] for j in adjacency[i]]
        s_new = pooled_score(agent, nbrs, alpha, beta)
        y_new = 1 if s_new >= tau else 0
        new_states.append((y_new, float(s_new)))
    for agent, (y_new, p_new) in zip(agents, new_states):
        agent.y, agent.p = y_new, p_new
```
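A self-contained driver for these dynamics (restating the pooling update so the sketch runs standalone; the `Agent` record, the ring wiring, and the concrete confidences are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Agent:
    y: int    # current binary vote
    p: float  # confidence in that vote

def step(agents, adjacency, alpha, beta, tau=0.5, eps=1e-6):
    # One synchronous round of confidence-normalized pooling.
    new = []
    for i, a in enumerate(agents):
        nbrs = [agents[j] for j in adjacency[i]]
        num = alpha * a.p * a.y + beta * sum(n.p * n.y for n in nbrs)
        s = num / (alpha + beta * len(nbrs) + eps)
        new.append((1 if s >= tau else 0, float(s)))
    for a, (y, p) in zip(agents, new):
        a.y, a.p = y, p

def run_to_consensus(agents, adjacency, alpha, beta, max_rounds=20):
    # Returns (consensus vote, time-to-consensus) or (None, max_rounds).
    for t in range(1, max_rounds + 1):
        step(agents, adjacency, alpha, beta)
        votes = {a.y for a in agents}
        if len(votes) == 1:
            return votes.pop(), t
    return None, max_rounds
```

On a 4-agent ring with three confident supporters and one confident dissenter, equal self- and social-weights pull the dissenter over in a single round.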
7. Significance, Limitations, and Generalizations
MACE offers a universal, prompt-based framework for curating and confidence-weighting agent or model outputs under structured conformity constraints. In vision–language alignment, it eliminates hallucinations and enforces schema adherence, narrowing the gap between text and image understanding in downstream model training (Zhang et al., 13 Aug 2025). In LLM-based MAS, MACE enables precise control of convergence, accuracy, and robustness through principled trade-offs among topology, self-social weighting, and consensus thresholds; however, cascade risks and alignment biases must be carefully managed through architectural and procedural safeguards (Han et al., 9 Jan 2026). A plausible implication is that future applications of MACE may generalize to additional modalities or multi-level consensus protocols, embedded in more heterogeneous and adversarial environments.