What BERT Sees: Cross-Modal Transfer for Visual Question Generation (2002.10832v3)

Published 25 Feb 2020 in cs.CL, cs.CV, and cs.LG

Abstract: Pre-trained language models have recently contributed to significant advances in NLP tasks. Recently, multi-modal versions of BERT have been developed, using heavy pre-training relying on vast corpora of aligned textual and image data, primarily applied to classification tasks such as VQA. In this paper, we are interested in evaluating the visual capabilities of BERT out-of-the-box, avoiding pre-training on supplementary data. We choose to study Visual Question Generation, a task of great interest for grounded dialog, which makes it possible to study the impact of each modality (as input can be visual and/or textual). Moreover, the generation aspect of the task requires an adaptation, since BERT is primarily designed as an encoder. We introduce BERT-gen, a BERT-based architecture for text generation, able to leverage either mono- or multi-modal representations. The results reported under different configurations indicate an innate capacity of BERT-gen to adapt to multi-modal data and text generation, even with little data available, avoiding expensive pre-training. The proposed model obtains substantial improvements over the state of the art on two established VQG datasets.

Citations (6)

Summary

  • The paper introduces a novel cross-modal transfer method that leverages BERT’s learned representations to generate context-aware questions from visual content.
  • It integrates pre-trained textual models with visual feature extraction to bridge the gap between language and vision, enhancing question generation accuracy.
  • Experimental results show measurable performance gains on visual question benchmarks, highlighting the approach’s potential for improved image understanding.

Overview of "Instructions for ACL 2020 Proceedings"

The document titled "Instructions for ACL 2020 Proceedings" provides exhaustive guidelines for authors preparing manuscripts for the ACL 2020 conference proceedings. These guidelines ensure consistency and quality across submissions and apply both to manuscripts under review and to final versions after acceptance. This essay examines the key components of these instructions and discusses their practical implications for researchers.

Structural Details and Submission Guidelines

The document specifies distinct format rules, requiring authors to submit manuscripts in a two-column format on A4 paper. The text mandates the use of Adobe's Portable Document Format (PDF) and highlights the requirement to embed fonts so that documents render consistently across viewing platforms. It also sets strict page limits: up to eight pages for long papers (with a ninth page permitted in the camera-ready version) and four pages for short papers.
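
In LaTeX terms, these requirements reduce to a few preamble lines. A minimal sketch, assuming the official acl2020 style file distributed with the conference templates:

    \documentclass[11pt,a4paper]{article}  % A4 paper, 11pt body font
    \usepackage[hyperref]{acl2020}         % style file: two-column layout, margins, hyperlinks
    \usepackage{times}                     % Times, a scalable Type 1 font that embeds cleanly
    \usepackage{latexsym}

    \begin{document}
    % paper content goes here
    \end{document}

On modern TeX distributions, compiling this with pdflatex produces a PDF with the fonts embedded by default.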

Given the double-blind review process, adherence to anonymity is critical. The document advises authors to omit identifying information and to phrase self-references neutrally, citing their own prior work in the third person. The explicit direction to exclude any acknowledgments section during review underscores the conference's commitment to unbiased evaluation.
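
In the ACL templates, anonymity is controlled by a single switch. A minimal sketch, assuming the acl2020 style file's \aclfinalcopy mechanism:

    % Review mode: leave the next line commented out; the style file then
    % suppresses the author block and prints a line-number ruler for reviewers.
    %\aclfinalcopy   % uncomment only for the camera-ready version

    \author{First Author \\ Some University \\ \texttt{first@example.org}}

The same source thus serves both phases: the \author block is written once and only revealed when the final-copy switch is enabled.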

Formatting Specifications

The guidelines delineate explicit formatting norms, particularly for font usage, section headings, and layout dimensions. These specifications, which prescribe font sizes for different parts of the text (e.g., title, author names, section titles), are vital for maintaining uniformity and readability across submissions. Additionally, figures and tables should fit naturally within the narrative, with color used cautiously so that graphics remain accessible to colorblind readers and legible when printed in grayscale.
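
As a concrete illustration of the figure guidance, a minimal sketch (the graphic file name is a placeholder):

    \begin{figure}[t]
      \centering
      \includegraphics[width=\columnwidth]{results-plot}  % placeholder file name
      \caption{Series distinguished by line style and markers rather than
        color alone, so the plot remains legible in grayscale and to
        colorblind readers.}
      \label{fig:results}
    \end{figure}

Sizing the graphic to \columnwidth keeps it within the two-column layout without manual scaling.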

Embedded within the text are LaTeX-specific instructions, which are invaluable for authors drafting their papers with the LaTeX typesetting system. These cover technical issues such as embedding fonts during PDF generation and resolving common LaTeX compilation errors.
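
The font advice largely comes down to avoiding non-embedded or bitmap (Type 3) fonts. A minimal sketch of the relevant preamble lines, with one common verification workflow noted in the comments (pdffonts ships with poppler-utils; treat the exact commands as an assumption, not the document's prescription):

    \usepackage{times}        % scalable Type 1 text font
    \usepackage[T1]{fontenc}  % T1 encoding avoids bitmap renderings of accented glyphs
    % After compiling with pdflatex, check the result from a shell:
    %   pdffonts paper.pdf    -- every row should read "emb: yes", with no Type 3 fonts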

Citation and Reference Protocol

To enhance the scholarly integrity of submissions, the document emphasizes meticulous referencing practices. References should be complete, alphabetized, and formatted consistently, preferentially citing refereed, archival sources when available. The provision for linking Digital Object Identifiers (DOIs) directly in citations aligns with current academic publishing norms and gives readers easy access to referenced works.
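
In BibTeX terms, this amounts to carrying a doi field in each entry. A hypothetical example entry (all bibliographic details below are invented for illustration):

    @inproceedings{author2020example,
      title     = {An Example Paper Title},
      author    = {Author, Ann and Writer, Bob},
      booktitle = {Proceedings of the 58th Annual Meeting of the
                   Association for Computational Linguistics},
      year      = {2020},
      pages     = {1--10},
      doi       = {10.18653/v1/2020.acl-main.1}   % hypothetical DOI
    }

With hyperref loaded, bibliography styles that honor the doi field can render it as a clickable link in the reference list.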

The incorporation of the natbib package offers flexibility in citation styles, accommodating both textual and parenthetical author-year citations as the sentence requires. This flexibility keeps citations readable in running text and lets authors weave references into the discussion without breaking the narrative flow, as the sketch below illustrates.
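
Concretely, natbib's two main commands cover both cases. A minimal sketch (the bibliography key is a placeholder):

    % Textual: renders as "Devlin et al. (2019) introduce ..."
    \citet{devlin2019bert} introduce BERT as a bidirectional encoder.

    % Parenthetical: renders as "... transfer well (Devlin et al., 2019)."
    Pre-trained encoders transfer well across tasks \citep{devlin2019bert}.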

Implications for Researchers

The specified guidelines carry both practical and theoretical implications for researchers preparing submissions for ACL 2020. Practically, they set a standard for clarity and consistency that improves peer review efficiency and paper accessibility. Theoretically, by enforcing discipline in presentation and citation, they advocate for rigorous academic standards and contribute to the integrity of computational linguistics research dissemination.

Future developments in the field could see these guidelines evolving in response to new typesetting technologies and academic publishing trends. Continued enhancements could include more sophisticated automated tools for verifying compliance with formatting rules or integrating version control systems to track changes during the review process. These developments emphasize the dynamic nature of academic conventions and the role of conference guidelines in fostering a coherent scientific dialogue.
