E2S2: Encoding-Enhanced Sequence-to-Sequence Pretraining for Language Understanding and Generation (2205.14912v3)
Abstract: Sequence-to-sequence (seq2seq) learning is a popular approach to large-scale pretraining of language models. However, prior seq2seq pretraining models generally focus on reconstructive objectives on the decoder side and neglect the effect of encoder-side supervision, which we argue may lead to sub-optimal performance. To verify our hypothesis, we first empirically study the functionalities of the encoder and decoder in seq2seq pretrained language models, and find that the encoder plays an important yet under-exploited role relative to the decoder in terms of downstream performance and neuron activation. We therefore propose an encoding-enhanced seq2seq pretraining strategy, namely E2S2, which improves seq2seq models by integrating more effective self-supervised information into the encoder. Specifically, E2S2 adopts two self-supervised objectives on the encoder side: 1) locally denoising the corrupted sentence (denoising objective); and 2) globally learning better sentence representations (contrastive objective). With the help of both objectives, the encoder can effectively distinguish noise tokens and capture high-level (i.e., syntactic and semantic) knowledge, thus strengthening the ability of the seq2seq model to perform accurate conditional generation. On a wide variety of downstream natural language understanding and generation tasks, E2S2 consistently improves the performance of its powerful backbone models, e.g., BART and T5. For example, upon the BART backbone, we achieve a +1.1% average gain on the General Language Understanding Evaluation (GLUE) benchmark and a +1.75% F_0.5 score improvement on the CoNLL-2014 dataset. We also provide in-depth analyses showing that the improvement stems from better linguistic representations. We hope that our work will foster future self-supervision research on seq2seq language model pretraining.
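To make the combination of objectives concrete, the sketch below illustrates, at a high level, how the standard decoder-side reconstruction loss can be combined with an encoder-side denoising loss and a SimCSE-style contrastive loss. It assumes a Hugging Face Transformers-style BART/T5 interface; the `denoise_head` module, the mean pooling, the dropout-based positive pairs, and the loss weights are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch (not the authors' implementation): decoder-side reconstruction
# plus two encoder-side objectives, in the spirit of E2S2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderEnhancedSeq2SeqLoss(nn.Module):
    def __init__(self, hidden_size, vocab_size,
                 lambda_denoise=1.0, lambda_contrast=1.0, temperature=0.05):
        super().__init__()
        # Extra head that predicts the original tokens from encoder states
        # (hypothetical component, named here for illustration).
        self.denoise_head = nn.Linear(hidden_size, vocab_size)
        self.lambda_denoise = lambda_denoise
        self.lambda_contrast = lambda_contrast
        self.temperature = temperature

    def forward(self, model, corrupted_ids, attention_mask,
                original_ids, decoder_input_ids, decoder_labels):
        # 1) Standard seq2seq reconstruction loss on the decoder side.
        out = model(input_ids=corrupted_ids, attention_mask=attention_mask,
                    decoder_input_ids=decoder_input_ids)
        rec_loss = F.cross_entropy(
            out.logits.view(-1, out.logits.size(-1)),
            decoder_labels.view(-1), ignore_index=-100)

        enc_h = out.encoder_last_hidden_state  # (batch, seq_len, hidden)

        # 2) Denoising objective: recover the original tokens from the
        #    corrupted input, supervising the encoder directly.
        denoise_logits = self.denoise_head(enc_h)
        den_loss = F.cross_entropy(
            denoise_logits.view(-1, denoise_logits.size(-1)),
            original_ids.view(-1), ignore_index=-100)

        # 3) Contrastive objective (SimCSE-style InfoNCE): a second encoder
        #    pass with a different dropout mask yields the positive pair.
        #    Mean pooling over all positions is a simplification that
        #    ignores padding.
        z1 = enc_h.mean(dim=1)
        z2 = model.get_encoder()(
            input_ids=corrupted_ids,
            attention_mask=attention_mask).last_hidden_state.mean(dim=1)
        sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1)
        labels = torch.arange(sim.size(0), device=sim.device)
        con_loss = F.cross_entropy(sim / self.temperature, labels)

        return (rec_loss
                + self.lambda_denoise * den_loss
                + self.lambda_contrast * con_loss)
```

The key design point the sketch tries to convey is that the encoder receives its own gradient signal (token-level denoising plus sentence-level contrast) instead of being trained only through the decoder's reconstruction loss.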
- P. Ramachandran, P. J. Liu, and Q. Le, “Unsupervised pretraining for sequence to sequence learning,” in EMNLP, 2017.
- K. Song, X. Tan, T. Qin, J. Lu, and T.-Y. Liu, “Mass: Masked sequence to sequence pre-training for language generation,” in ICML, 2019.
- M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in ACL, 2020.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” JMLR, 2020.
- W. Qi, Y. Yan, Y. Gong, D. Liu, N. Duan, J. Chen, R. Zhang, and M. Zhou, “Prophetnet: Predicting future n-gram for sequence-to-sequence pre-training,” in Findings of EMNLP, 2020.
- Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer, “Multilingual denoising pre-training for neural machine translation,” TACL, 2020.
- M. Lewis, M. Ghazvininejad, G. Ghosh, A. Aghajanyan, S. Wang, and L. Zettlemoyer, “Pre-training via paraphrasing,” in NeurIPS, 2020.
- L. Wang, W. Zhao, R. Jia, S. Li, and J. Liu, “Denoising based sequence-to-sequence pre-training for text generation,” in EMNLP, 2019.
- W. Zhou, T. Ge, C. Xu, K. Xu, and F. Wei, “Improving sequence-to-sequence pre-training via sequence span rewriting,” in EMNLP, 2021.
- J. Li, A. Sun, J. Han, and C. Li, “A survey on deep learning for named entity recognition,” TKDE, 2020.
- J. Li, A. Sun, and Y. Ma, “Neural named entity boundary detection,” TKDE, 2020.
- I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NeurIPS, 2014.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv, 2019.
- P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” in ICLR, 2020.
- J. Kasai, N. Pappas, H. Peng, J. Cross, and N. Smith, “Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation,” in ICLR, 2021.
- B. Li, H. Zhou, J. He, M. Wang, Y. Yang, and L. Li, “On the sentence embeddings from bert for semantic textual similarity,” in EMNLP, 2020.
- A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, “Glue: A multi-task benchmark and analysis platform for natural language understanding,” in EMNLP, 2018.
- A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
- P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, 2021.
- L. Wu, H. Lin, C. Tan, Z. Gao, and S. Z. Li, “Self-supervised learning on graphs: Contrastive, generative, or predictive,” TKDE, 2021.
- X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” TKDE, 2021.
- Y. Liu, M. Jin, S. Pan, C. Zhou, Y. Zheng, F. Xia, and P. Yu, “Graph self-supervised learning: A survey,” TKDE, 2022.
- Q. Zhong, L. Ding, J. Liu, B. Du, and D. Tao, “Self-evolution learning for discriminative language model pretraining,” in Findings of ACL, 2023.
- Q. Zhong, L. Ding, J. Liu, X. Liu, M. Zhang, B. Du, and D. Tao, “Revisiting token dropping strategy in efficient bert pretraining,” in ACL, 2023.
- M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy, “Spanbert: Improving pre-training by representing and predicting spans,” TACL, 2020.
- Y. Sun, S. Wang, Y. Li, S. Feng, H. Tian, H. Wu, and H. Wang, “Ernie 2.0: A continual pre-training framework for language understanding,” in AAAI, 2020.
- Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, “Xlnet: Generalized autoregressive pretraining for language understanding,” NeurIPS, 2019.
- T. Gao, X. Yao, and D. Chen, “Simcse: Simple contrastive learning of sentence embeddings,” in EMNLP, 2021.
- Y. Yan, R. Li, S. Wang, F. Zhang, W. Wu, and W. Xu, “Consert: A contrastive framework for self-supervised sentence representation transfer,” in ACL, 2021.
- T. Jiang, S. Huang, Z. Zhang, D. Wang, F. Zhuang, F. Wei, H. Huang, L. Zhang, and Q. Zhang, “Promptbert: Improving bert sentence embeddings with prompts,” in EMNLP, 2022.
- K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” in ICLR, 2019.
- S. Panda, A. Agrawal, J. Ha, and B. Bloch, “Shuffled-token detection for refining pre-trained roberta,” in NAACL: Student Research Workshop, 2021.
- B. Lester, R. Al-Rfou, and N. Constant, “The power of scale for parameter-efficient prompt tuning,” in EMNLP, 2021.
- T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” in NeurIPS, 2020.
- T. Schick and H. Schütze, “Few-shot text generation with pattern-exploiting training,” in EMNLP, 2021.
- T. Schick and H. Schütze, “Exploiting cloze-questions for few-shot text classification and natural language inference,” in ACL, 2021.
- J. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, and Y. Qi, “Textattack: A framework for adversarial attacks, data augmentation, and adversarial training in nlp,” in EMNLP, 2020.
- A. Yamaguchi, G. Chrysostomou, K. Margatina, and N. Aletras, “Frustratingly simple pretraining alternatives to masked language modeling,” in EMNLP, 2021.
- A. Alajrami and N. Aletras, “How does the pre-training objective affect what large language models learn about linguistic properties?” in ACL, 2022.
- E. Nijkamp, B. Pang, Y. N. Wu, and C. Xiong, “Script: Self-critic pretraining of transformers,” in NAACL, 2021.
- K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom, “Teaching machines to read and comprehend,” in NeurIPS, 2015.
- S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization,” in EMNLP, 2018.
- J. Liu, Y. Zou, H. Zhang, H. Chen, Z. Ding, C. Yuan, and X. Wang, “Topic-aware contrastive learning for abstractive dialogue summarization,” in Findings of EMNLP, 2021.
- M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, “fairseq: A fast, extensible toolkit for sequence modeling,” in NAACL-HLT, 2019.
- S. Rothe, S. Narayan, and A. Severyn, “Leveraging pre-trained checkpoints for sequence generation tasks,” TACL, 2020.
- Q. Zhong, L. Ding, L. Shen, P. Mi, J. Liu, B. Du, and D. Tao, “Improving sharpness-aware minimization with fisher mask for better generalization on language models,” in Findings of EMNLP, 2022.
- H. T. Ng, S. M. Wu, T. Briscoe, C. Hadiwinoto, R. H. Susanto, and C. Bryant, “The conll-2014 shared task on grammatical error correction,” in CoNLL, 2014.
- S. Chollampatt and H. T. Ng, “A multilayer convolutional encoder-decoder neural network for grammatical error correction,” in AAAI, 2018.
- D. Dahlmeier and H. T. Ng, “Better evaluation for grammatical error correction,” in NAACL, 2012.
- S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston, “Personalizing dialogue agents: I have a dog, do you have pets too?” in ACL, 2018.
- Y. Li, H. Su, X. Shen, W. Li, Z. Cao, and S. Niu, “Dailydialog: A manually labelled multi-turn dialogue dataset,” in IJCNLP, 2017.
- H. Alamri, V. Cartillier, A. Das, J. Wang, A. Cherian, I. Essa, D. Batra, T. K. Marks, C. Hori, P. Anderson et al., “Audio visual scene-aware dialog,” in CVPR, 2019.
- H. Rashkin, E. M. Smith, M. Li, and Y.-L. Boureau, “Towards empathetic open-domain conversation models: A new benchmark and dataset,” in ACL, 2019.
- K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in ACL, 2002.
- L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H.-W. Hon, “Unified language model pre-training for natural language understanding and generation,” in NeurIPS, 2019.
- A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré, “Snorkel: Rapid training data creation with weak supervision,” in VLDB, 2017.
- X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep neural networks for natural language understanding,” in ACL, 2019.
- Z. Zhang, Y. Wu, H. Zhao, Z. Li, S. Zhang, X. Zhou, and X. Zhou, “Semantics-aware bert for language understanding,” in AAAI, 2020.
- A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, É. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov, “Unsupervised cross-lingual representation learning at scale,” in ACL, 2020.
- M. Collins, P. Koehn, and I. Kučerová, “Clause restructuring for statistical machine translation,” in ACL, 2005.
- L. Ding, L. Wang, X. Liu, D. F. Wong, D. Tao, and Z. Tu, “Progressive multi-granularity training for non-autoregressive translation,” in Findings of ACL, 2021.
- T. Berg-Kirkpatrick, D. Burkett, and D. Klein, “An empirical investigation of statistical significance in nlp,” in EMNLP, 2012.
- A. See, P. J. Liu, and C. D. Manning, “Get to the point: Summarization with pointer-generator networks,” in ACL, 2017.
- Y. Liu and M. Lapata, “Text summarization with pretrained encoders,” in EMNLP, 2019.
- H. Bao, L. Dong, F. Wei, W. Wang, N. Yang, X. Liu, Y. Wang, J. Gao, S. Piao, M. Zhou et al., “Unilmv2: Pseudo-masked language models for unified language model pre-training,” in ICML, 2020.
- K. Krishna, J. P. Bigham, and Z. C. Lipton, “Does pretraining for summarization require knowledge transfer?” in Findings of EMNLP, 2021.
- T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” in EMNLP, 2020.
- A. Conneau and D. Kiela, “Senteval: An evaluation toolkit for universal sentence representations,” in LREC, 2018.
- J. Hao, X. Wang, B. Yang, L. Wang, J. Zhang, and Z. Tu, “Modeling recurrence for transformer,” in NAACL, 2019.
- Qihuang Zhong
- Liang Ding
- Juhua Liu
- Bo Du
- Dacheng Tao