CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation (2506.02306v1)

Published 2 Jun 2025 in cs.LG and stat.ML

Abstract: We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features - captured by column names and text descriptions - to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods - an average $R^2$ gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, at random and completely at random, respectively) - across a diverse range of datasets and missingness conditions. Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.

Summary

  • The paper introduces a novel masked autoencoding framework using Median Truncated Copy Masking and context-aware embeddings to improve imputation accuracy.
  • It robustly handles MNAR conditions, delivering up to 13.4% higher R² compared to state-of-the-art methods across varied datasets.
  • Its modular design paves the way for domain-specific applications, potentially benefiting fields like healthcare and finance.

CACTI: Advancing Tabular Data Imputation with Copy Masking and Context Awareness

The paper "CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation" presents a novel approach to imputing missing values in tabular datasets that harnesses both empirical missingness patterns and the contextual information associated with features. Imputation matters because missing data are ubiquitous in real-world tabular datasets. Traditional methods either fail to adequately account for complex missingness mechanisms or underutilize the rich feature context that is often available, leaving room for improvement in both the accuracy and the reliability of imputed values.

The authors propose CACTI, a masked autoencoding technique enhanced by a novel Median Truncated Copy Masking (MT-CM) strategy and context-aware embeddings. The recurring theme throughout the work is leveraging the inductive biases present within each dataset, a principle central to the proposed imputation framework. Rather than masking entries uniformly at random, MT-CM masks observed values according to the empirical patterns of missingness in the data, so that training conditions resemble the missingness encountered at inference time. The theoretical argument for copy masking outlined in the paper is that mimicking real missingness patterns leads to a more robust imputation function, which is particularly beneficial under Missing Not At Random (MNAR) conditions.
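
To make the masking idea concrete, the sketch below shows one plausible form of copy masking: each training row borrows the missingness pattern of a randomly chosen donor row, and the number of artificially hidden entries per row is capped at the dataset's median per-row missing count. The median-based truncation rule and the function name are illustrative assumptions, not the paper's exact MT-CM procedure.

```python
import numpy as np

def copy_mask_batch(X, observed, rng=None):
    """Illustrative copy-masking step (assumed form, not the exact MT-CM rule).

    X        : (n, d) float array of feature values (NaN where truly missing)
    observed : (n, d) boolean array, True where a value is observed
    Returns a boolean array marking observed entries that are artificially
    hidden during training by copying another row's missingness pattern.
    """
    rng = np.random.default_rng(rng)
    n, _ = X.shape
    donors = rng.integers(0, n, size=n)          # one donor row per target row
    copied_missing = ~observed[donors]           # donor's pattern of missing cells
    train_mask = copied_missing & observed       # only hide cells that are observed

    # Assumed truncation: cap hidden entries per row at the median number of
    # missing values per row across the dataset (at least one).
    cap = max(1, int(np.median((~observed).sum(axis=1))))
    for i in range(n):
        hidden = np.flatnonzero(train_mask[i])
        if hidden.size > cap:
            drop = rng.choice(hidden, size=hidden.size - cap, replace=False)
            train_mask[i, drop] = False
    return train_mask
```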

Contextual information, in the form of column names and text descriptions, is integrated through transformer-based architectures. By embedding this contextual data, CACTI adapts to dataset-specific semantic structure and directly incorporates prior feature knowledge into the learning process, a significant departure from relying solely on relational patterns inferred from limited observations. This dual approach harnesses both missingness structure and contextual semantics, presenting a compelling case for an improved imputation methodology.
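
As a rough illustration of how such context might be injected, the sketch below encodes column names and descriptions with an off-the-shelf sentence encoder and produces one fixed-size vector per feature. The specific encoder (sentence-transformers' all-MiniLM-L6-v2), the example schema, and the concatenation-based fusion mentioned in the comments are assumptions for illustration, not CACTI's actual design.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical schema: feature name -> text description.
columns = {
    "age": "Age of the patient in years",
    "bmi": "Body mass index in kg/m^2",
    "glucose": "Fasting plasma glucose in mg/dL",
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
texts = [f"{name}: {desc}" for name, desc in columns.items()]
context_emb = np.asarray(encoder.encode(texts))  # shape: (num_features, embed_dim)

# One plausible fusion: concatenate each feature's context vector with its
# per-cell value embedding before a transformer encoder processes the row.
```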

Empirical evaluations demonstrate CACTI's advantage over existing state-of-the-art methods across multiple datasets and under varying degrees of simulated missingness (10%, 30%, 50%, and 70%). The paper reports an average $R^2$ gain of 7.8% over the next best method, rising to 13.4% under MNAR conditions. These results underscore the value of injecting inductive biases from both empirical missingness patterns and contextual semantics.
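
For readers less familiar with this kind of evaluation, the short sketch below shows one plausible way to score imputation quality: mask a fraction of a fully observed matrix, impute, and compute $R^2$ only on the artificially hidden entries. The mean-imputation baseline and the MCAR mask are stand-ins for illustration, not the paper's exact protocol.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X_true = rng.normal(size=(500, 8))            # fully observed ground truth
mask = rng.random(X_true.shape) < 0.3         # simulate 30% MCAR missingness

X_obs = X_true.copy()
X_obs[mask] = np.nan                          # matrix handed to the imputer

# Stand-in imputer: per-column means (swap in CACTI or any other method here).
col_means = np.nanmean(X_obs, axis=0)
X_imp = np.where(mask, col_means, X_obs)

# R^2 is computed only on the entries that were artificially hidden.
print(f"Imputation R^2 on masked entries: {r2_score(X_true[mask], X_imp[mask]):.3f}")
```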

The modular design of the MT-CM and context-aware components suggests applicability beyond tabular imputation, potentially extending their utility to other domains where missing data hampers decision-making. The paper also reports sensitivity analyses showing that CACTI is robust to choices of architectural configuration and training parameters, which strengthens confidence in its adaptability to diverse datasets and operational contexts.

Future research at the intersection of machine learning, missing data mechanisms, and domain-specific context embeddings holds promise. Particularly intriguing avenues involve tailoring context embedding models to specific domains, such as medical or financial datasets, where feature relationships are especially complex. Empirically characterizing and exploiting structured missingness remains a rich direction that could yield further gains in imputation accuracy.

In summary, CACTI exemplifies how embracing dataset-specific biases and semantic contexts can propel advancements in machine learning models for data imputation tasks. With promising empirical success and scalability, CACTI stands out as a significant contribution to the toolkit of researchers and practitioners dealing with incomplete datasets across varied real-world applications.
