AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles (2404.01084v1)
Abstract: In this paper, we outline our submission to the SemEval-2024 Task 9 competition, 'BRAINTEASER: A Novel Task Defying Common Sense'. We participate in both sub-tasks: Sub-task A (Sentence Puzzle) and Sub-task B (Word Puzzle). We evaluate a range of pre-trained transformer-based LLMs of different sizes through fine-tuning, and we then analyze their scores and responses to help future researchers understand and use these models effectively. Our top-performing approaches secured competitive positions on the competition leaderboard across both sub-tasks. In the evaluation phase, our best submission attained an average accuracy of 81.7% on the Sentence Puzzle and 85.4% on the Word Puzzle, significantly outperforming the best neural baseline (ChatGPT) by more than 20% and 30%, respectively.
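To make the fine-tuning setup described in the abstract concrete, the sketch below shows one way a pre-trained encoder can be fine-tuned on a single multiple-choice puzzle item using the HuggingFace Transformers library. The model name, the example riddle, and the bare training step are illustrative assumptions, not the authors' exact configuration or hyperparameters.

```python
# Minimal sketch: fine-tuning a pre-trained encoder for multiple-choice
# brain-teaser answering (illustrative; not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

model_name = "microsoft/deberta-v3-large"  # assumption: any multiple-choice-capable encoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMultipleChoice.from_pretrained(model_name)

# One hypothetical sentence-puzzle item with four candidate answers.
question = "A man shaves several times a day, yet still has a beard. Who is he?"
choices = [
    "He is a barber.",
    "He shaves with a blunt razor.",
    "He only shaves at night.",
    "None of the above.",
]
label = torch.tensor([0])  # index of the correct choice

# Pair the question with every choice; reshape to (batch, num_choices, seq_len).
enc = tokenizer([question] * len(choices), choices,
                truncation=True, padding=True, return_tensors="pt")
batch = {k: v.unsqueeze(0) for k, v in enc.items()}

# The model scores each choice with one logit; cross-entropy loss drives fine-tuning.
out = model(**batch, labels=label)
out.loss.backward()  # in practice: an optimizer step inside a full training loop
pred = out.logits.argmax(dim=-1)
print("predicted choice:", choices[pred.item()])
```

In a full experiment this per-item step would sit inside a training loop (or a `Trainer`) over the task's training split, with held-out accuracy computed per sub-task as reported in the abstract.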
Authors: Ioannis Panagiotopoulos, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou