ZeroGen: Zero-shot Multimodal Controllable Text Generation with Multiple Oracles (2306.16649v1)

Published 29 Jun 2023 in cs.CL

Abstract: Automatically generating textual content with desired attributes is an ambitious task that people have long pursued. Existing works have made a series of advances in incorporating unimodal controls into language models (LMs), whereas how to generate controllable sentences with multimodal signals and high efficiency remains an open question. To tackle this problem, we propose a new paradigm of zero-shot controllable text generation with multimodal signals (ZeroGen). Specifically, ZeroGen leverages controls of text and image successively from token level to sentence level and maps them into a unified probability space at decoding time, which customizes the LM outputs by weighted addition without extra training. To achieve better inter-modal trade-offs, we further introduce an effective dynamic weighting mechanism to regulate all control weights. Moreover, we conduct substantial experiments to probe whether signals from distinct modalities relate in depth or in breadth. Encouraging empirical results on three downstream tasks show that ZeroGen not only outperforms its counterparts on captioning tasks by a large margin but also shows great potential in multimodal news generation with a higher degree of control. Our code will be released at https://github.com/ImKeTT/ZeroGen.
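
The abstract describes a training-free, decoding-time scheme: text and image control signals are mapped into a unified probability space and combined with the LM's next-token distribution by weighted addition, with the weights regulated dynamically. The sketch below illustrates only that general idea; it is not the authors' released implementation, and the entropy-based dynamic-weighting rule, the alpha_max cap, and the randomly generated score tensors are illustrative assumptions (see the linked repository for the actual method).

```python
# Minimal sketch of decoding-time control by weighted addition in a shared
# (log-)probability space. NOT the ZeroGen implementation; the dynamic-weighting
# rule and all numbers below are illustrative assumptions.

import torch


def controlled_next_token_logits(
    lm_logits: torch.Tensor,      # (vocab_size,) raw LM logits for the next token
    text_scores: torch.Tensor,    # (vocab_size,) per-token text-control relevance scores
    image_scores: torch.Tensor,   # (vocab_size,) per-token image-control relevance scores
    alpha_max: float = 4.0,       # assumed cap on the total control strength
) -> torch.Tensor:
    # Map the LM output and both control signals into one log-probability space.
    log_p_lm = torch.log_softmax(lm_logits, dim=-1)
    log_p_text = torch.log_softmax(text_scores, dim=-1)
    log_p_image = torch.log_softmax(image_scores, dim=-1)

    # Dynamic weighting (assumed form): weight each control by how peaked
    # (low-entropy, i.e. confident) its distribution is.
    def confidence(log_p: torch.Tensor) -> torch.Tensor:
        entropy = -(log_p.exp() * log_p).sum()
        return 1.0 / (1.0 + entropy)

    w_text, w_image = confidence(log_p_text), confidence(log_p_image)
    total = w_text + w_image
    alpha = alpha_max * w_text / total
    beta = alpha_max * w_image / total

    # Training-free customization of the LM output by weighted addition.
    return log_p_lm + alpha * log_p_text + beta * log_p_image


if __name__ == "__main__":
    vocab_size = 50257  # GPT-2 vocabulary size, used here only as an example
    scores = controlled_next_token_logits(
        torch.randn(vocab_size), torch.randn(vocab_size), torch.randn(vocab_size)
    )
    print("argmax token id:", scores.argmax().item())
```

In practice the control scores would come from real oracles (e.g. keyword matching for the text control and image-text similarity for the visual control) rather than random tensors, and the combined scores would feed a standard decoding loop.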

Authors (3)
  1. Haoqin Tu (25 papers)
  2. Bowen Yang (55 papers)
  3. Xianfeng Zhao (22 papers)
Citations (5)

