Papers
Topics
Authors
Recent
2000 character limit reached

Exploring Out-of-Distribution Generalization in Text Classifiers Trained on Tobacco-3482 and RVL-CDIP

Published 5 Aug 2021 in cs.CL | (2108.02684v1)

Abstract: To be robust enough for widespread adoption, document analysis systems involving machine learning models must be able to respond correctly to inputs that fall outside of the data distribution that was used to generate the data on which the models were trained. This paper explores the ability of text classifiers trained on standard document classification datasets to generalize to out-of-distribution documents at inference time. We take the Tobacco-3482 and RVL-CDIP datasets as a starting point and generate new out-of-distribution evaluation datasets in order to analyze the generalization performance of models trained on these standard datasets. We find that models trained on the smaller Tobacco-3482 dataset perform poorly on our new out-of-distribution data, while text classification models trained on the larger RVL-CDIP exhibit smaller performance drops.

Citations (3)

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.