
A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective (2502.08828v2)

Published 12 Feb 2025 in cs.LG and cs.AI

Abstract: Tabular data is one of the most widely used data formats across various domains such as bioinformatics, healthcare, and marketing. As artificial intelligence moves towards a data-centric perspective, improving data quality is essential for enhancing model performance in tabular data-driven applications. This survey focuses on data-driven tabular data optimization, specifically exploring reinforcement learning (RL) and generative approaches for feature selection and feature generation as fundamental techniques for refining data spaces. Feature selection aims to identify and retain the most informative attributes, while feature generation constructs new features to better capture complex data patterns. We systematically review existing generative methods for tabular data engineering, analyzing their latest advancements, real-world applications, and respective strengths and limitations. This survey emphasizes how RL-based and generative techniques contribute to the automation and intelligence of feature engineering. Finally, we summarize the existing challenges and discuss future research directions, aiming to provide insights that drive continued innovation in this field.

Authors (10)
  1. Wangyang Ying (19 papers)
  2. Cong Wei (16 papers)
  3. Nanxu Gong (12 papers)
  4. Xinyuan Wang (34 papers)
  5. Haoyue Bai (33 papers)
  6. Arun Vignesh Malarkkan (1 paper)
  7. Sixun Dong (13 papers)
  8. Dongjie Wang (53 papers)
  9. Denghui Zhang (33 papers)
  10. Yanjie Fu (93 papers)

Summary

  • The paper surveys data-centric AI for tabular learning, emphasizing feature engineering via reinforcement learning (RL) and generative AI methods.
  • RL approaches are explored for both feature selection (single/multi-agent) and generation (cascaded/graph-based), framing tasks as sequential decision processes.
  • Generative models for feature selection (encoder-decoder) and generation (embedding-optimization) are discussed, contrasting their performance and interpretability with RL methods.

The paper provides an extensive and technical survey on data-centric artificial intelligence (DCAI) for tabular data, with particular emphasis on leveraging reinforcement learning (RL) and generative AI for feature engineering. It systematically reviews state-of-the-art techniques for both feature selection and feature generation, framing these tasks as optimization problems, and in many cases as sequential decision processes, with the goal of enhancing data utility and model performance.

The survey is organized along several key dimensions:

1. Reinforcement Learning Approaches

  • Feature Selection with RL:
    • Multi-Agent RL Frameworks: Each feature is managed by an independent agent. Methods employ sophisticated state representations (e.g., statistical summaries, Graph Convolutional Networks) and inter-agent collaboration to explore large-scale feature spaces. Despite the advantages in scalability, these methods contend with high computational costs.
    • Single-Agent RL Approaches: Consolidate the decision process into a single agent, which explores the feature space sequentially. Enhancements such as Monte Carlo methods and early stopping strategies are introduced to alleviate computational overhead.
    • Hybrid and Domain-Specific RL Techniques: Incorporate elements such as external trainers (e.g., decision trees, domain-specific constraints) to guide the exploration process. This integration aims to improve both the efficiency and robustness of feature selection, especially in high-dimensional datasets.
  • RL for Feature Generation:
    • Cascaded Frameworks: Multiple agents operate in sequence to first select features and then apply transformation operators. This architecture is particularly effective for complex data where feature interactions must be modeled in stages.
    • Graph-Based Exploration: This strategy leverages feature relationships by constructing a state transition graph to track and optimize transformations.
    • Hybrid Methods: These approaches combine RL with techniques like hierarchical modeling or self-optimizing frameworks (including attention and hashing methods) to mitigate issues such as overestimated Q-values and to further refine feature generation.
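The single-agent formulation above can be sketched in a few lines: one agent makes a keep/drop decision per feature, and every decision in an episode is credited with the episode's terminal reward (a simple Monte Carlo update). This is an illustrative toy, not a method from the survey; the closed-form ridge R² reward, the epsilon-greedy schedule, and the synthetic data are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

def evaluate_subset(X, y, mask, reg=1e-3):
    """Reward signal: R^2 of a closed-form ridge fit on the selected features."""
    if not mask.any():
        return 0.0
    Xs = X[:, mask]
    w = np.linalg.solve(Xs.T @ Xs + reg * np.eye(Xs.shape[1]), Xs.T @ y)
    resid = y - Xs @ w
    return 1.0 - resid.var() / y.var()

def single_agent_feature_selection(X, y, episodes=200, eps=0.2, seed=0):
    """One agent decides keep(1)/drop(0) for each feature.
    Q[i, a] is the running-average reward of taking action a at feature i."""
    rng = np.random.default_rng(seed)
    n_feats = X.shape[1]
    Q = np.zeros((n_feats, 2))
    counts = np.zeros((n_feats, 2))
    best_mask, best_r = None, -np.inf
    idx = np.arange(n_feats)
    for _ in range(episodes):
        # Epsilon-greedy: explore a random action, else exploit the current Q.
        actions = np.where(rng.random(n_feats) < eps,
                           rng.integers(0, 2, n_feats),
                           Q.argmax(axis=1))
        mask = actions.astype(bool)
        r = evaluate_subset(X, y, mask)
        # Monte Carlo update: all decisions share the episode's terminal reward.
        counts[idx, actions] += 1
        Q[idx, actions] += (r - Q[idx, actions]) / counts[idx, actions]
        if r > best_r:
            best_r, best_mask = r, mask
    return best_mask, best_r

# Synthetic data: only the first 3 of 10 features are informative.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
y = X[:, 0] + 0.5 * X[:, 1] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=300)
mask, r2 = single_agent_feature_selection(X, y)
print(mask.sum(), round(r2, 3))
```

A multi-agent variant would replace the single Q-table with one independent learner per feature; the cascaded generation frameworks described above extend the action space from keep/drop to transformation operators.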

2. Generative AI Techniques

  • Generative Models for Feature Selection:
    • Encoder-Decoder-Evaluator Frameworks: These systems embed observed feature selection experiences into continuous representations and then generate feature subsets based on learned latent structures.
    • Transformer-Based Variational Autoencoders (VAEs): Employed to capture complex dependencies among features while also reducing overfitting and noise sensitivity.
    • A typical formulation is z = f(x; θ), where x represents the original feature set, z is the latent feature embedding, and θ denotes the parameters of the encoder.
  • Generative Models for Feature Generation:
    • Initially, generated feature sets are embedded as continuous representations encoding domain knowledge or observed transformations.
    • Optimization techniques (including gradient-based strategies) are then applied within the latent space to refine feature interactions before decoding them back into an enhanced feature set.
    • Recent advancements have integrated LLMs into this framework, leveraging in-context learning and reasoning for dynamic adaptation of the feature space.
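The encoder-decoder-evaluator pattern above can be illustrated with a continuous relaxation: a "latent" soft mask over features is optimized by gradient ascent against an evaluator, then decoded back into a discrete subset by thresholding. Everything below is a simplifying assumption for the sketch: the evaluator fixes ridge weights from one full fit and adds a sparsity penalty (`lam`), finite differences stand in for backprop, and a 0.5 threshold stands in for a learned decoder.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_evaluator(X, y, lam=0.05, reg=1e-3):
    """Evaluator: R^2 of predictions from mask-gated features (weights fixed
    by a single ridge fit on the full data) minus a sparsity penalty."""
    w = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
    def score(mask):
        resid = y - (X * mask) @ w
        return 1.0 - resid.var() / y.var() - lam * mask.sum()
    return score

def latent_mask_search(score, dim, steps=200, lr=1.0, eps=1e-4, seed=0):
    """Gradient ascent on a continuous latent mask (finite differences stand
    in for backprop); decoding is a simple 0.5 threshold."""
    rng = np.random.default_rng(seed)
    z = rng.normal(scale=0.1, size=dim)
    for _ in range(steps):
        base = score(sigmoid(z))
        grad = np.empty(dim)
        for i in range(dim):
            zp = z.copy()
            zp[i] += eps
            grad[i] = (score(sigmoid(zp)) - base) / eps
        z += lr * grad
    return sigmoid(z) > 0.5

# Synthetic data: features 0 and 3 carry the signal, the rest are noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 0] - X[:, 3] + 0.1 * rng.normal(size=300)
selected = latent_mask_search(make_evaluator(X, y), dim=8)
print(selected)
```

The key contrast with the RL sketch is that the search happens in a smooth latent space rather than over discrete keep/drop actions, which is exactly the trade-off the comparative analysis below examines.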

3. Comparative Analysis: RL-Based vs. Generative Methods

The survey critically compares RL-based and generative approaches across several dimensions:

  • Performance:
    • RL-Based Methods align feature selection and transformation directly with downstream task performance via reward signals, making them particularly effective in dynamic environments. However, their convergence can be slow due to the discrete nature of the search space.
    • Generative Approaches offer smoother optimization in high-dimensional spaces, albeit at the risk of inheriting biases from the training data. Their stability can sometimes come at the cost of interpretability.
  • Interpretability:
    • RL Methods typically offer more traceable decision pathways (e.g., policy networks, Q-values), enhancing interpretability in critical applications.
    • Generative Methods, by operating in a continuous latent space, obscure direct mappings from input features to generated outputs; hence, attributability remains a challenge.
  • Adaptability and Automation:
    • RL Approaches demonstrate high adaptability in streaming or rapidly evolving data scenarios.
    • Generative Techniques are better suited to cases where label scarcity or unstructured feature spaces necessitate unsupervised or semi-supervised learning.
    • Both approaches, however, must contend with significant computational demands.

4. Practical Strategies for Feature Engineering

The paper also outlines practical guidelines to implement these methodologies effectively:

  • Progressive Complexity: Start with simple models or initial policies before integrating complex RL or deep generative architectures.
  • Domain Knowledge Integration: Utilize expert insights to craft reward functions and constrain latent variables, ensuring that automatic methods do not select irrelevant or biased features.
  • Continuous Validation: Employ attribution techniques and surrogate models (e.g., LIME) for monitoring feature transformation impacts, thus fostering transparency.
  • Scalability Considerations: Leverage hierarchical RL or offline generative training, along with hardware acceleration, to manage high-dimensional datasets efficiently.
  • Ethical and Regulatory Compliance: Incorporate privacy-preserving techniques and maintain detailed transformation logs to support auditability in sensitive domains.
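The continuous-validation guideline can be made concrete with a minimal permutation-importance monitor, a lighter-weight, model-agnostic alternative to surrogate explainers such as LIME: shuffle one feature at a time and measure how much predictive R² degrades. The function name and toy model below are illustrative assumptions, not part of the survey.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Model-agnostic attribution: average drop in R^2 when each
    feature column is shuffled, breaking its link to the target."""
    rng = np.random.default_rng(seed)
    base = 1.0 - np.var(y - predict(X)) / np.var(y)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # shuffle column j in place
            drops.append(base - (1.0 - np.var(y - predict(Xp)) / np.var(y)))
        scores[j] = np.mean(drops)
    return scores

# Toy check: a linear model that only uses feature 0, so only
# feature 0 should register a nonzero importance.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)
imp = permutation_importance(lambda A: 3.0 * A[:, 0], X, y)
print(imp.round(2))
```

Logging such scores after each automated feature transformation gives a cheap audit trail: a feature whose importance collapses (or spikes) after a pipeline change is a candidate for the human-in-the-loop review discussed below.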

5. Challenges and Future Research Directions

The survey concludes by identifying several key research challenges:

  • Automation vs. Domain Customization: Balancing fully automated pipelines with the need for human-in-the-loop interventions.
  • Interpretability and Traceability: Developing methods that provide clear insights into the transformation process while managing complex feature interactions.
  • Computational Efficiency and Scalability: Crafting lightweight algorithms and leveraging parallel processing to mitigate resource constraints.
  • Integration with Multimodal and LLM Technologies: Enhancing feature engineering by integrating tabular data with other modalities, such as text and images, and exploiting the reasoning capabilities of LLMs to generate features adaptively.
  • Privacy Preservation: Addressing data sensitivity through federated learning and differential privacy techniques.

In summary, the paper provides a comprehensive and technically robust review of how RL-based and generative AI methods are being applied to enhance feature engineering for tabular data. It highlights both the strengths and limitations of each approach and outlines a detailed roadmap for future research in data-centric AI.