
Effective Optimizer for Complex Multimodal Architectures Combining BERT and Cross-Attention

Determine which optimizer is effective for training a complex multimodal deep learning architecture that combines a BERT-based textual encoder with a cross-attention Transformer component on sparse textual data, by rigorously comparing Adam, Nadam, and Adamax on Yelp rating prediction tasks.


Background

The model integrates high-dimensional deep-contextualized word representations from BERT with tabular inputs using cross-attention, creating a large and potentially sparse gradient space. Although Adam is commonly used in deep learning, the authors argue that the best optimizer for such a complex multimodal architecture remains unclear at the outset.
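As a rough illustration only (not the authors' implementation), the sketch below shows one way such a fusion could look in PyTorch: projected tabular features act as the cross-attention query over BERT token representations, and a linear head predicts the rating. All module names, dimensions, and the head design here are assumptions.

```python
# Hypothetical sketch of BERT + tabular fusion via cross-attention (assumed design).
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Tabular features attend over BERT token representations."""

    def __init__(self, text_dim=768, tab_dim=16, hidden_dim=256, num_heads=4):
        super().__init__()
        self.tab_proj = nn.Linear(tab_dim, hidden_dim)    # tabular -> query space
        self.text_proj = nn.Linear(text_dim, hidden_dim)  # BERT tokens -> key/value space
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)              # rating prediction head

    def forward(self, bert_tokens, tabular):
        # bert_tokens: (batch, seq_len, text_dim); tabular: (batch, tab_dim)
        query = self.tab_proj(tabular).unsqueeze(1)       # (batch, 1, hidden_dim)
        keys = values = self.text_proj(bert_tokens)       # (batch, seq_len, hidden_dim)
        fused, _ = self.cross_attn(query, keys, values)   # (batch, 1, hidden_dim)
        return self.head(fused.squeeze(1))                # (batch, 1) predicted rating


# Example shapes only; real inputs would come from a BERT encoder and Yelp metadata.
model = CrossAttentionFusion()
pred = model(torch.randn(8, 128, 768), torch.randn(8, 16))
```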

They therefore design experiments to compare Adam, Nadam, and Adamax, motivated by claims that Adamax can be advantageous with sparse gradients, while explicitly noting that which optimizer is effective in this multimodal setting has not yet been established.
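A minimal sketch of how such a comparison could be run in PyTorch follows; the learning rate, the omitted training loop, and the reuse of the CrossAttentionFusion class from the sketch above are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical optimizer-comparison harness (assumed setup, not the paper's exact protocol).
import torch

optimizers = {
    "Adam":   lambda params: torch.optim.Adam(params, lr=2e-5),
    "Nadam":  lambda params: torch.optim.NAdam(params, lr=2e-5),
    "Adamax": lambda params: torch.optim.Adamax(params, lr=2e-5),
}

for name, make_optimizer in optimizers.items():
    model = CrossAttentionFusion()  # fresh copy of the multimodal model per run
    optimizer = make_optimizer(model.parameters())
    # ... run an identical training loop on the Yelp rating data for each optimizer,
    # logging validation loss so the three runs can be compared directly ...
```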

References

In terms of optimization, many existing studies have adopted Adam as an optimizer; however, as described in H2, it has yet to be clarified what optimizer is effective for a complex architecture of multimodal learning.