Balancing Between Over-Weighting and Under-Weighting in Supervised Term Weighting (1604.04007v1)

Published 14 Apr 2016 in cs.IR

Abstract: Supervised term weighting could improve the performance of text categorization. A way proven to be effective is to give more weight to terms with more imbalanced distributions across categories. This paper shows that supervised term weighting should not just assign large weights to imbalanced terms, but should also control the trade-off between over-weighting and under-weighting. Over-weighting, a new concept proposed in this paper, is caused by the improper handling of singular terms and too large ratios between term weights. To prevent over-weighting, we present three regularization techniques: add-one smoothing, sublinear scaling and bias term. Add-one smoothing is used to handle singular terms. Sublinear scaling and bias term shrink the ratios between term weights. However, if sublinear functions scale down term weights too much, or the bias term is too large, under-weighting would occur and harm the performance. It is therefore critical to balance between over-weighting and under-weighting. Inspired by this insight, we also propose a new supervised term weighting scheme, regularized entropy (re). Our re employs entropy to measure term distribution, and introduces the bias term to control over-weighting and under-weighting. Empirical evaluations on topical and sentiment classification datasets indicate that sublinear scaling and bias term greatly influence the performance of supervised term weighting, and our re enjoys the best results in comparison with existing schemes.
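The three regularization ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's exact formulation: the function names, the use of raw per-category document counts (which assumes equally sized categories), the base-2 logarithm, and the form `bias + (1 - entropy)` for the entropy-based weight are all assumptions made for illustration.

```python
import math


def smoothed_log_ratio(pos_count, neg_count):
    """Add-one smoothing plus sublinear (log) scaling of a count ratio.

    Adding 1 to both counts keeps the ratio finite for singular terms
    (terms seen in only one category); the log shrinks large ratios.
    Illustrative assumption, not the paper's definition.
    """
    ratio = (pos_count + 1) / (neg_count + 1)
    # Treat imbalance toward either category symmetrically.
    return math.log2(max(ratio, 1 / ratio))


def entropy_weight(pos_count, neg_count, bias):
    """Hypothetical entropy-based weight with a bias term.

    A term spread evenly over both categories has entropy 1 and receives
    only the bias; a highly imbalanced term has entropy near 0 and
    receives close to bias + 1.
    """
    p = (pos_count + 1) / (pos_count + neg_count + 2)  # smoothed share
    entropy = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return bias + (1 - entropy)


def weight_ratio(bias):
    """Ratio between the weights of an imbalanced and a balanced term."""
    return entropy_weight(100, 0, bias) / entropy_weight(5, 5, bias)
```

In this sketch a larger bias shrinks the ratio between the weights of imbalanced and balanced terms, which is the lever against over-weighting described in the abstract; pushing the bias too high flattens the weights and corresponds to under-weighting.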

Authors (2)
  1. Haibing Wu (4 papers)
  2. Xiaodong Gu (62 papers)
Citations (39)
