First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models (2311.05020v2)
Abstract: Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on LLMs. After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large $n$-gram models for machine translation (MT). We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. We argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches.
Authors: Naomi Saphra, Eve Fleisig, Kyunghyun Cho, Adam Lopez