
Negative Token Merging: Image-based Adversarial Feature Guidance (2412.01339v2)

Published 2 Dec 2024 in cs.CV, cs.AI, cs.GR, cs.LG, and stat.ML

Abstract: Text-based adversarial guidance using a negative prompt has emerged as a widely adopted approach to steer diffusion models away from producing undesired concepts. While useful, performing adversarial guidance using text alone can be insufficient to capture complex visual concepts or avoid specific visual elements like copyrighted characters. In this paper, for the first time we explore an alternate modality in this direction by performing adversarial guidance directly using visual features from a reference image or other images in a batch. We introduce negative token merging (NegToMe), a simple but effective training-free approach which performs adversarial guidance through images by selectively pushing apart matching visual features between reference and generated images during the reverse diffusion process. By simply adjusting the used reference, NegToMe enables a diverse range of applications. Notably, when using other images in the same batch as reference, we find that NegToMe significantly enhances output diversity (e.g., racial, gender, visual) by guiding features of each image away from others. Similarly, when used w.r.t. copyrighted reference images, NegToMe reduces visual similarity to copyrighted content by 34.57%. NegToMe is simple to implement using just a few lines of code, uses only marginally higher (<4%) inference time and is compatible with different diffusion architectures, including those like Flux, which don't natively support the use of a negative prompt. Code is available at https://negtome.github.io

Summary

  • The paper introduces a visual-based adversarial guidance mechanism that replaces text prompts with image features in diffusion models.
  • It reduces visual similarity to copyrighted reference images by 34.57% and substantially boosts output diversity (e.g., racial, gender, visual).
  • NegToMe is a training-free, easily integrated approach that expands the applicability of adversarial guidance across various generative model architectures.

Insights into Negative Token Merging: Image-based Adversarial Feature Guidance

The paper presents an intriguing approach to adversarial guidance for diffusion models, specifically addressing limitations of traditional text-based negative prompts. The proposed method, Negative Token Merging (NegToMe), extends text-to-image (T2I) diffusion models with an image-based adversarial guidance mechanism.

Summary of Contributions

NegToMe represents a departure from existing methods that rely predominantly on textual inputs to steer outputs away from undesirable concepts. It is a training-free technique that leverages visual features from reference images, using them during the reverse diffusion process to adversarially guide feature generation. This approach not only expands the applicability of adversarial guidance but also circumvents a key challenge of textual guidance: the difficulty of capturing complex visual attributes through negative prompts alone.
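
To make the mechanism concrete, here is a minimal sketch of the core push-apart operation, assuming cosine-similarity token matching followed by linear extrapolation (as the abstract describes). The function name negtome_push_apart and the alpha and sim_threshold defaults are illustrative placeholders, not the paper's exact implementation or reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def negtome_push_apart(tgt_tokens: torch.Tensor,
                       ref_tokens: torch.Tensor,
                       alpha: float = 0.9,
                       sim_threshold: float = 0.65) -> torch.Tensor:
    """Push target tokens away from their best-matching reference tokens.

    tgt_tokens: (N, D) transformer tokens of the image being generated.
    ref_tokens: (M, D) tokens of the reference (a copyrighted image, or
                another image in the same batch).
    alpha and sim_threshold are illustrative values, not the paper's.
    """
    # Match: cosine similarity between every target/reference token pair.
    sim = F.normalize(tgt_tokens, dim=-1) @ F.normalize(ref_tokens, dim=-1).T  # (N, M)
    best_sim, best_idx = sim.max(dim=-1)   # closest reference token per target token
    matched = ref_tokens[best_idx]         # (N, D)

    # Push apart: linearly extrapolate each target token away from its match.
    pushed = tgt_tokens + alpha * (tgt_tokens - matched)

    # Apply only to confidently matched tokens; leave the rest unchanged.
    mask = (best_sim > sim_threshold).unsqueeze(-1)
    return torch.where(mask, pushed, tgt_tokens)
```

Because the extrapolation only moves tokens that already resemble the reference, unrelated content is left largely intact, which is consistent with the paper's claim of preserved image quality.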

The paper makes several key contributions:

  1. Visual Feature Guidance Mechanism: The authors propose a method for guiding diffusion model outputs directly through visual features. By selectively pushing apart matching visual features of reference and generated images, the technique enhances the diversity of generated content without requiring model retraining.
  2. Applications in Output Diversity and Copyright Mitigation: The paper highlights the utility of NegToMe across diverse applications. Notably, it enhances image diversity (e.g., racial, gender, and visual variation) and reduces visual similarity to copyrighted reference images by 34.57%, a capability that text-based negative prompts struggle to match.
  3. Broad Applicability and Ease of Integration: NegToMe stands out for its ease of implementation and compatibility with various architectures, including models such as Flux that do not natively support negative prompts. The method adds less than 4% inference-time overhead, making it a practical addition to contemporary workflows (a rough placement sketch follows this list).
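
As a rough illustration of that integration claim, the sketch below shows one plausible placement of the push-apart step inside a transformer-based denoising loop. The toy block, tensor shapes, and step count are stand-ins for a real pipeline's transformer blocks and noise scheduler; negtome_push_apart is the helper sketched earlier.

```python
import torch

# Toy stand-ins; a real pipeline would use the model's own transformer
# blocks and noise scheduler (e.g., in Flux or SDXL) instead.
num_steps, num_tokens, dim = 4, 16, 32
block = torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)

tgt = torch.randn(1, num_tokens, dim)  # tokens of the image being generated
ref = torch.randn(1, num_tokens, dim)  # tokens of the reference image

with torch.no_grad():
    for _ in range(num_steps):  # stand-in for the reverse-diffusion loop
        tgt, ref = block(tgt), block(ref)
        # After the block, nudge the generated tokens away from the reference.
        tgt = negtome_push_apart(tgt.squeeze(0), ref.squeeze(0)).unsqueeze(0)
```

Per the paper, swapping what serves as the reference is what switches behaviors: using the other images in a batch as references yields the diversity enhancement, while using a fixed copyrighted image yields the similarity reduction.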

Analytical Observations

The efficacy of NegToMe is empirically supported, with improvements along both qualitative and quantitative dimensions. The paper presents data indicating improved output diversity, preserved or improved image quality, and effective reduction of similarity to copyrighted characters. The choice to employ visual features over textual prompts addresses a limitation prevalent in contemporary diffusion models: the difficulty of specifying complex visual concepts through text alone.

Implications and Future Directions

The introduction of NegToMe signals a shift towards hybrid guidance systems where textual and visual inputs might coexist to produce more refined and contextually accurate outputs. In practical terms, this development opens avenues for applications requiring high fidelity and diversity of generated images while maintaining compliance with copyright constraints.

Theoretically, this approach invites further exploration of multi-modal inputs for adversarial guidance, potentially enhancing the robustness and interpretability of generative models. Future research might combine textual and visual guidance signals more tightly, or extend feature-level adversarial guidance toward finer-grained semantic control of generated outputs.

NegToMe exemplifies a noteworthy shift in which adversarial guidance evolves beyond text-only constructs, making a substantial case for visually anchored intervention strategies in generative tasks. As diffusion models continue to progress, such methods point toward more adaptive, inclusive, and responsible AI-driven creative systems.
