Model and Optimize Data Quality via Edge Deletions

Incorporate training data quality into the concept–text bipartite framework by modeling text-to-concept edge deletions during sequential concept learning, and develop optimization strategies to improve learning performance within this extended model.

Background

The authors propose that data quality can be represented as deletions of edges between text and concepts in the sequential learning process, which would directly affect the number of concepts learned and subsequent skill composition.

They list this as an open question and suggest that optimization in this setting has analogues in communication systems and fault-tolerant computation, highlighting a promising methodological direction not yet addressed by their analysis.
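
To make the edge-deletion idea concrete, the following is a minimal toy sketch, not the authors' model. It assumes a random text-to-concept bipartite graph, independent deletion of each edge with a fixed probability, and a peeling-style learning rule in which a text teaches a concept only when that concept is its sole unlearned neighbour. The names `make_bipartite`, `delete_edges`, `peel`, and all parameter values are hypothetical and chosen only for illustration.

```python
import random


def make_bipartite(num_texts, num_concepts, degree, rng):
    """Link each text to `degree` distinct concepts chosen uniformly at random."""
    return [set(rng.sample(range(num_concepts), degree)) for _ in range(num_texts)]


def delete_edges(edges, delete_prob, rng):
    """Assumed noise model: drop each text-to-concept edge independently."""
    return [
        {c for c in sorted(nbrs) if rng.random() >= delete_prob}
        for nbrs in edges
    ]


def peel(edges):
    """Assumed learning rule: repeatedly learn from any text that has
    exactly one unlearned concept left among its surviving edges."""
    learned = set()
    progress = True
    while progress:
        progress = False
        for nbrs in edges:
            unknown = nbrs - learned
            if len(unknown) == 1:
                learned |= unknown
                progress = True
    return learned


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    num_texts, num_concepts, degree = 400, 200, 3  # illustrative sizes only
    graph = make_bipartite(num_texts, num_concepts, degree, rng)
    for delete_prob in (0.0, 0.1, 0.3, 0.5, 0.8):
        noisy = delete_edges(graph, delete_prob, rng)
        fraction = len(peel(noisy)) / num_concepts
        print(f"delete_prob={delete_prob:.1f}  concepts learned={fraction:.2f}")
```

Sweeping `delete_prob` in this way gives one possible starting point for the optimization question: choosing graph parameters such as `degree` to maximize the number of concepts learned at a given deletion rate. How deletions should actually enter the authors' framework is left open in the source, so the noise model and learning rule here are assumptions.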

References

There are some open questions and considerations worth exploring. Further, the quality of the training data is related to text-to-concept edge deletions in sequential concept learning, which can be incorporated into our framework. Such optimization is a line of future work that has natural analogues in optimization of communication systems and fault-tolerant computation.

— An Information Theory of Compute-Optimal Size Scaling, Emergence, and Plateaus in Language Models (2410.01243 - Nayak et al., 2024) in Conclusion, final paragraph
