
AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles (2404.01084v1)

Published 1 Apr 2024 in cs.CL and cs.AI

Abstract: In this paper, we outline our submission for the SemEval-2024 Task 9 competition: 'BRAINTEASER: A Novel Task Defying Common Sense'. We engage in both sub-tasks: Sub-task A-Sentence Puzzle and Sub-task B-Word Puzzle. We evaluate a plethora of pre-trained transformer-based LLMs of different sizes through fine-tuning. Subsequently, we undertake an analysis of their scores and responses to aid future researchers in understanding and utilizing these models effectively. Our top-performing approaches secured competitive positions on the competition leaderboard across both sub-tasks. In the evaluation phase, our best submission attained an average accuracy score of 81.7% in the Sentence Puzzle, and 85.4% in the Word Puzzle, significantly outperforming the best neural baseline (ChatGPT) by more than 20% and 30% respectively.
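The abstract above describes fine-tuning transformer models to answer multiple-choice lateral-thinking puzzles. The page includes no code, but the common setup for such a task can be sketched as follows: score each (question, choice) pair with a model and pick the highest-scoring option, then report accuracy. The scoring function below is a deliberately toy stand-in, not the authors' actual fine-tuned model; all names here are illustrative assumptions.

```python
# Minimal sketch of multiple-choice puzzle scoring, assuming a
# score(question, choice) -> float function stands in for a fine-tuned
# model. This is NOT the authors' implementation, only the general shape.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PuzzleInstance:
    question: str
    choices: List[str]
    label: int  # index of the correct choice


def predict(instance: PuzzleInstance,
            score: Callable[[str, str], float]) -> int:
    """Score each (question, choice) pair and return the argmax index."""
    scores = [score(instance.question, c) for c in instance.choices]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(instances: List[PuzzleInstance],
             score: Callable[[str, str], float]) -> float:
    """Fraction of instances whose top-scoring choice matches the label."""
    correct = sum(predict(i, score) == i.label for i in instances)
    return correct / len(instances)


# Toy stand-in scorer: prefers the longest choice (illustration only;
# a real system would use model logits or log-likelihoods here).
toy_score = lambda q, c: float(len(c))

data = [
    PuzzleInstance(
        "A man shaves several times a day yet still has a beard. How?",
        ["He is bald", "He is a barber shaving other men", "It is magic"],
        label=1,
    ),
]
print(accuracy(data, toy_score))  # 1.0 on this toy example
```

In practice, a fine-tuned multiple-choice head (or per-option language-model likelihood) replaces `toy_score`; the evaluation loop itself stays this simple.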

Authors (4)
  1. Ioannis Panagiotopoulos (6 papers)
  2. Giorgos Filandrianos (26 papers)
  3. Maria Lymperaiou (32 papers)
  4. Giorgos Stamou (55 papers)
Citations (1)
