SMILE: Evaluation and Domain Adaptation for Social Media Language Understanding (2307.00135v1)

Published 30 Jun 2023 in cs.CL

Abstract: We study the ability of transformer-based language models (LMs) to understand social media language. Social media (SM) language is distinct from standard written language, yet existing benchmarks fall short of capturing LM performance in this socially, economically, and politically important domain. We quantify the degree to which social media language differs from conventional language and conclude that the difference is significant both in terms of token distribution and rate of linguistic shift. Next, we introduce a new benchmark for Social MedIa Language Evaluation (SMILE) that covers four SM platforms and eleven tasks. Finally, we show that learning a tokenizer and pretraining on a mix of social media and conventional language yields an LM that outperforms the best similar-sized alternative by 4.2 points on the overall SMILE score.
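The abstract's first contribution, quantifying how far social media language drifts from conventional written language in terms of token distribution, can be illustrated with a minimal sketch. The snippet below computes a Jensen-Shannon divergence between token frequency distributions of two toy corpora; the corpora, whitespace tokenization, and add-one smoothing are illustrative assumptions, not the paper's exact measurement setup.

```python
# Sketch: measuring token-distribution difference between a social media
# corpus and a conventional-text corpus via Jensen-Shannon divergence.
# Corpora, tokenization, and smoothing here are illustrative assumptions.
from collections import Counter
import math

def token_distribution(texts, vocab):
    """Smoothed relative frequency of each vocab token in a list of texts."""
    counts = Counter(tok for text in texts for tok in text.split())
    total = sum(counts.get(tok, 0) for tok in vocab) + len(vocab)  # add-one smoothing
    return {tok: (counts.get(tok, 0) + 1) / total for tok in vocab}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between distributions over the same vocab."""
    def kl(a, b):
        return sum(a[t] * math.log2(a[t] / b[t]) for t in a if a[t] > 0)
    m = {t: 0.5 * (p[t] + q[t]) for t in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical toy corpora standing in for real social media / conventional text.
social_media = ["omg this thread is wild lol", "ngl the vibes r off today"]
conventional = ["the committee reviewed the quarterly report", "results are summarized below"]

vocab = sorted({tok for text in social_media + conventional for tok in text.split()})
p = token_distribution(social_media, vocab)
q = token_distribution(conventional, vocab)
print(f"JS divergence: {js_divergence(p, q):.3f}")  # 0 = identical, 1 = maximally different
```

On real corpora one would tokenize with the learned subword tokenizer and compare distributions over that vocabulary; the whitespace split above simply keeps the sketch self-contained.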
