
Mapping Features to Data Sources

Updated 1 July 2025
  • Feature-to-data source mapping is the process of linking derived features to their original data sources, ensuring clear data lineage and enhanced model interpretability.
  • Automated feature engineering methods like OneBM and SMARTFEAT record transformations to trace each feature back to specific tables or columns in heterogeneous databases.
  • Mapping techniques are pivotal for integrating multi-modal data, aligning learned representations, and bridging abstract concepts with concrete data in complex environments.

Feature-to-data source mapping refers to the process of establishing and understanding the correspondence between derived data elements, such as features used in machine learning models or abstract concepts representing domain knowledge, and the original data sources from which they originate. This mapping is crucial for data lineage, interpretability, data fusion, and efficient data processing, particularly in scenarios involving multi-source, heterogeneous, or complex data environments. It provides a mechanism to trace features back to their raw origins, understand their composition, and manage data from diverse systems without necessarily requiring costly upfront integration. Different methods establish this mapping based on the nature of the data sources, the type of feature derivation, and the intended application.

Mapping in Automated Feature Engineering

In automated feature engineering (AFE), establishing a clear mapping between generated features and their source data is fundamental for understanding and managing the feature space. Systems like One Button Machine (OneBM) (1706.00327) and SMARTFEAT (2309.07856) approach this by explicitly recording the transformations applied to original data elements.

OneBM, designed for relational databases, constructs an entity graph from connected tables. Features are derived by systematically traversing this graph via joining paths and applying transformations. Each resulting feature is precisely mapped back to its data source(s) through the specific joining path that produced it. A joining path is formally described as $p = T_0 \xrightarrow{c_1} T_1 \xrightarrow{c_2} T_2 \cdots \xrightarrow{c_k} T_k \mapsto c$, where the $T_i$ are tables, the $c_i$ are joining columns, and $c$ is the target column. This mapping allows tracing a feature back to the tables and columns involved in its creation. OneBM applies type-aware transformations (e.g., aggregations for multi-paths) and handles temporal aspects to avoid information leakage. This automated mapping enables the system to generate powerful features by systematically covering relationships across multiple tables, reducing manual data exploration time.
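
The idea of deriving a feature along a joining path while recording its provenance can be sketched as follows. The tables, column names, and the `derive_feature` helper are all hypothetical; the real system traverses full relational schemas with many transformation types.

```python
# Sketch of OneBM-style feature derivation along one joining path
# (hypothetical tables and helper; illustration only).
from collections import defaultdict

# Main table (customers) joined to a related table (orders) on customer_id.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 1, "amount": 70.0},
    {"customer_id": 2, "amount": 15.0},
]

def derive_feature(main_rows, related_rows, join_col, value_col, agg):
    """Follow the joining path main --join_col--> related, aggregate value_col,
    and return both the feature values and a provenance record mapping the
    feature back to its path, source column, and transformation."""
    groups = defaultdict(list)
    for r in related_rows:
        groups[r[join_col]].append(r[value_col])
    provenance = {
        "path": f"main --{join_col}--> related",
        "column": value_col,
        "agg": agg.__name__,
    }
    values = {m[join_col]: agg(groups[m[join_col]]) for m in main_rows}
    return values, provenance

feature, prov = derive_feature(customers, orders, "customer_id", "amount", sum)
# feature maps each customer to the sum of their order amounts,
# and prov records exactly which path and transformation produced it.
```

The provenance record is what makes the derived feature traceable: given only `prov`, one can reconstruct which tables, columns, and transformation produced the feature.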

SMARTFEAT leverages Foundation Models (FMs) for automated feature construction from datasets with multiple columns. It uses an iterative, operator-guided approach. When a new feature $A^{cand}_i$ is generated by applying a function $f_i$ (corresponding to an operator) to a set of input columns $\mathcal{D}_{cols}$, the mapping records the relationship $A^{cand}_i = f_i(\mathcal{D}_{cols})$. The system explicitly links each new feature to its source columns and the specific transformation function or operator used. This process is guided by FM interactions at the feature level, translating contextual information and operator types into executable code snippets (like Python or pandas functions). This explicit mapping ensures the provenance of generated features, allowing traceability back to original data and intermediate derivations. The feature-level FM interaction, instead of row-level, significantly improves efficiency, as demonstrated by faster execution times compared to traditional AFE tools and other FM-based approaches (2309.07856).
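
A minimal sketch of recording the $A^{cand}_i = f_i(\mathcal{D}_{cols})$ relationship might look like the following. The `make_feature` helper, the column names, and the BMI example are invented for illustration; in SMARTFEAT the function body would come from an FM-generated code snippet.

```python
# Hypothetical sketch of feature provenance recording, SMARTFEAT-style:
# each generated feature is linked to its source columns and function.
def make_feature(rows, name, source_cols, fn):
    """Apply fn to the named source columns of each row and record the
    mapping A_cand = fn(source_cols) alongside the computed values."""
    values = [fn(*(row[c] for c in source_cols)) for row in rows]
    provenance = {"feature": name, "source_columns": source_cols,
                  "function": fn.__name__}
    return values, provenance

def bmi(height_m, weight_kg):
    # Illustrative transformation an FM might propose for these columns.
    return weight_kg / height_m ** 2

data = [{"height_m": 1.8, "weight_kg": 81.0},
        {"height_m": 1.6, "weight_kg": 64.0}]
values, prov = make_feature(data, "bmi", ["height_m", "weight_kg"], bmi)
# values ≈ [25.0, 25.0]; prov ties the new feature to its exact origins.
```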

Both systems highlight that automated feature construction necessitates robust mechanisms to map derived features back to their data origins for interpretability, debugging, and management of the feature space.

Mapping Across Heterogeneous and Multi-Modal Data Sources

Integrating and processing data from diverse sources, which may have differing schemas, structures, or modalities, requires mapping features or data representations across these variations.

The "Entity and Features" model (1905.01306) provides a formal framework for managing information about entities from diverse, non-integrated Big Data sources, particularly NoSQL databases. It defines Entities ($E$), Features ($F$), and Associations ($n_{e,f}$) between them. Each feature $f$ is mapped to the data source(s) where information about $f$ for a given entity $e$ exists. The model quantifies the "distance" or similarity between entities or data sources based on their shared features using an information-theoretic approach. The importance of a feature $f$ for an entity $e$ is given by $I(e, f) = (1 + \log_2 n_{e,f}) \cdot \log_2 \frac{|E|}{|e(f)|}$, where $|E|$ is the total number of entities and $|e(f)|$ is the number of entities associated with $f$. This allows creating feature vectors for entities or abstracting data sources as providers of features, enabling distance calculations between sources: $d(e_1, e_2) = \sqrt{\sum_{f \in F} \left( V(e_1, f) - V(e_2, f) \right)^2}$, where $V(e, f)$ is a normalized weight. This model maps the availability of features for entities to specific sources, facilitating source selection and redundancy identification without requiring upfront data integration.
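
The importance weight and distance above translate directly into code. This is a sketch under stated assumptions: the counts in the example are invented, and the normalization step of $V(e, f)$ is elided by passing weights in directly.

```python
# Sketch of the Entity-and-Features importance weight and distance measure.
import math

def importance(n_ef, total_entities, entities_with_f):
    """I(e,f) = (1 + log2 n_{e,f}) * log2(|E| / |e(f)|),
    a TF-IDF-like weight: frequent association, rare feature => important."""
    return (1 + math.log2(n_ef)) * math.log2(total_entities / entities_with_f)

def distance(v1, v2, features):
    """Euclidean distance between feature-weight vectors of two entities
    (or two data sources abstracted as feature providers)."""
    return math.sqrt(sum((v1.get(f, 0.0) - v2.get(f, 0.0)) ** 2
                         for f in features))

# Hypothetical corpus: 8 entities total; feature "email" is associated with
# 2 of them, and with this entity 4 times.
w = importance(n_ef=4, total_entities=8, entities_with_f=2)
# (1 + log2 4) * log2(8/2) = 3 * 2 = 6

d = distance({"email": w}, {"email": 0.0}, ["email"])
# an entity lacking the feature entirely sits at distance w from one that has it
```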

In multi-source data fusion for engineering problems, heterogeneous input parameter spaces pose a challenge (2407.11268). The framework presented uses Input Mapping Calibration (IMC) to map inputs from different sources into a unified reference parameter space. This mapping is learned, often as a linear transformation $x_1 = g(x_2; A, b) = A x_2 + b$, by minimizing the difference between predicted outputs from sources in the reference space. After mapping, a Latent Variable Gaussian Process (LVGP) model is used to fuse data from all sources while remaining source-aware via latent variables. This maps the diverse feature sets from original sources into a common, lower-dimensional space for modeling, providing both predictive accuracy and interpretability regarding the relationships between sources in the latent space.
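
The linear form of the IMC map can be sketched with a least-squares fit. Note one simplifying assumption: here $A$ and $b$ are recovered from paired coordinates in both spaces, whereas IMC proper learns them by minimizing output discrepancy between source models; the synthetic $A_{true}$, $b_{true}$ are invented for the demonstration.

```python
# Sketch: recovering a linear input map x1 = A x2 + b by least squares
# (synthetic ground-truth map; illustration of the transformation's form only).
import numpy as np

rng = np.random.default_rng(0)

A_true = np.array([[2.0, 0.0],
                   [0.5, 1.0]])
b_true = np.array([1.0, -1.0])

x2 = rng.normal(size=(50, 2))            # source-2 parameter space
x1 = x2 @ A_true.T + b_true              # reference parameter space

# Augment inputs with a constant column so b is fitted jointly with A.
X = np.hstack([x2, np.ones((len(x2), 1))])
coef, *_ = np.linalg.lstsq(X, x1, rcond=None)
A_hat = coef[:2].T                       # recovered linear part
b_hat = coef[2]                          # recovered offset
```

Once all sources are expressed in the reference space this way, a single source-aware surrogate (the LVGP in the paper) can be trained on the pooled data.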

For Land Use/Land Cover (LULC) mapping using historical and current satellite data, domain shift is a key challenge (2404.11114). The REFeD framework addresses this by disentangling features into domain-invariant and domain-specific components using a pseudo-Siamese architecture and contrastive learning. Domain-invariant features are mapped across different time periods or study sites, aiming to represent intrinsic land cover characteristics regardless of acquisition conditions. This is achieved by minimizing classification loss on LULC classes while maximizing discriminability between domains based on domain-specific features. Contrastive loss is applied at multiple network depths to enforce separation and alignment of class-relevant features across domains. This effectively maps the underlying environmental features to both historical and current data sources by learning representations that are robust to inter-domain variations.

Integrating financial data from internal management, external markets, and online public opinion also involves multi-source feature mapping (2404.12610). The paper uses features derived from each source (e.g., financial ratios, macroeconomic indices, sentiment scores). A key step is feature selection using a hybrid MRMR-SVM-RFE method, which maps the most relevant and least redundant features back to their sources. The empirical results show that the final selected feature set consistently includes features from all three sources, highlighting their joint importance and effectively mapping critical indicators to their diverse origins for improved financial distress prediction.
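
The max-relevance/min-redundancy idea underlying the MRMR stage can be sketched with a greedy correlation-based selector. This is a simplified stand-in, not the paper's hybrid MRMR-SVM-RFE pipeline; the three feature series (one per hypothetical source) and the target are invented.

```python
# Greedy max-relevance min-redundancy sketch: prefer features strongly
# correlated with the target and weakly correlated with already-picked ones.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    vy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (vx * vy) if vx and vy else 0.0

def mrmr_select(features, target, k):
    selected = []
    remaining = dict(features)
    while remaining and len(selected) < k:
        def score(name):
            rel = abs(pearson(remaining[name], target))
            red = (sum(abs(pearson(remaining[name], features[s]))
                       for s in selected) / len(selected)) if selected else 0.0
            return rel - red
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# One invented feature per source: internal ratio, external macro index,
# public-opinion sentiment. The macro index nearly duplicates the ratio.
features = {
    "financial_ratio": [1.0, 2.0, 3.0, 4.0],
    "macro_index":     [2.0, 4.0, 6.0, 7.0],
    "sentiment":       [1.0, 0.0, 1.0, 0.0],
}
target = [1.0, 2.0, 3.0, 4.5]

selected = mrmr_select(features, target, k=2)
# the redundant macro_index loses to the complementary sentiment feature
```

Selecting across sources this way is what lets the final feature set map back to all three origins rather than collapsing onto one highly correlated source.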

Mapping Learned/Latent Representations to Input Data

Interpreting deep learning models often involves understanding what specific features learned by the model correspond to in the original input data space. This can be viewed as a reverse mapping process.

The "Data from Model" (DfM) concept explores the reverse process of training: synthesizing data from a trained model without accessing the original training data (2007.06196). This demonstrates a mapping from the model's internal parameters (which encode learned features) back to synthetic data samples. The process involves optimizing input samples from a background dataset to match model outputs for virtual targets. This cyclical DtM/DfM process reveals that the model's feature mapping encapsulates essential classification-relevant information. Robust models, which encode "robust features," preserve this feature mapping better over repeated cycles. The paper shows that this mapping is architecture-dependent, with models from the same family having more compatible mappings. This implies that learned features have a mapping to data distributions that is influenced by both the training data and the model architecture.
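
The core optimization, synthesizing an input that reproduces a virtual target output, can be illustrated on a toy "model". Everything here is a deliberately simplified stand-in: the paper works with deep networks and background datasets, whereas this sketch uses a fixed linear map so the gradient is exact and convergence is easy to see.

```python
# Toy DfM-style synthesis: gradient descent on an input so that the trained
# model's output matches a chosen virtual target (linear model stand-in).
import numpy as np

# "Trained model": f(x) = W x, with W fixed (hypothetical weights).
W = np.array([
    [1.0, 0.0, 0.0, 0.5, 0.0],
    [0.0, 2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 1.0, 0.0],
])

def f(x):
    return W @ x

y_target = np.array([1.0, -2.0, 0.5])     # virtual target output

# Start from a fixed "background" sample and optimize it toward the target.
x = np.array([0.1, -0.3, 0.2, 0.0, 0.4])
lr = 0.05
for _ in range(500):
    residual = f(x) - y_target
    x -= lr * (2.0 * W.T @ residual)      # gradient of ||f(x) - y_target||^2

loss = float(np.sum((f(x) - y_target) ** 2))
# loss shrinks to ~0: the synthesized x now maps to the requested output
```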

featMAP (2211.09321) provides an interpretable dimensionality reduction method by preserving the mapping from low-dimensional embeddings back to original source features. For each data point, it approximates the local tangent space in the high-dimensional input space using Singular Value Decomposition (SVD). The right singular vectors $V_i$ provide a basis for this local space. The low-dimensional embedding is then constructed by maintaining the alignment of these local tangent spaces. The rows of $V_i$ (or their embedded counterparts) map directly back to the original source features (e.g., pixels), indicating their contribution to the local structure. featMAP quantifies local feature importance as $\left\|f_h^i\right\| = \left(\sum_{l=1}^d \lvert v^{f_h}_{il} \rvert^2\right)^{1/2}$. This method explicitly maps points in the low-dimensional embedding space to interpretations in terms of the original features, addressing the lack of interpretability in many non-linear dimensionality reduction techniques.
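
The first step, estimating a local tangent space via SVD and scoring each source feature's contribution to it, can be sketched as follows. The synthetic neighbourhood is invented so that only the first two of four features carry local variation.

```python
# Sketch of featMAP's tangent-space step: SVD of a centred neighbourhood,
# then per-feature importance as the loading norm on the top-d singular vectors.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical neighbourhood: 20 points in 4-D, variation concentrated
# in features 0 and 1 (features 2 and 3 are near-constant noise).
neighbours = np.zeros((20, 4))
neighbours[:, 0] = rng.normal(scale=2.0, size=20)
neighbours[:, 1] = rng.normal(scale=1.0, size=20)
neighbours[:, 2:] = rng.normal(scale=0.01, size=(20, 2))

centred = neighbours - neighbours.mean(axis=0)
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

d = 2                                  # assumed local tangent dimension
V_d = Vt[:d]                           # rows span the local tangent space
importance = np.sqrt((V_d ** 2).sum(axis=0))   # per-feature loading norm
# features 0 and 1 dominate the tangent plane, so they score highest
```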

More recently, FeatInv (2505.21032) proposes using a conditional diffusion model to map from a model's learned feature space back to the input image space, specifically using spatially resolved feature maps. The method conditions a pretrained diffusion model on unpooled feature maps extracted from a classifier. This provides a probabilistic mapping, generating diverse images that are consistent with the given feature representation. This allows researchers to visualize what information (in the input space) corresponds to specific feature activations, how different concepts are encoded spatially, and how feature spaces can be composed or manipulated (e.g., feature arithmetic). The high fidelity of the conditional diffusion model ensures that the reconstructed images are visually faithful to potential inputs that would generate those features.

These approaches demonstrate different facets of mapping learned representations back to the original data domain, crucial for interpreting complex models and understanding the nature of the features they have extracted.

Mapping Abstract Concepts and Intent to Data

Feature-to-data source mapping can also bridge the gap between high-level user requirements or abstract concepts and the concrete features derived from data.

The concept of "Feature Concepts" (2111.04505) is introduced as an abstract model of the information desired from data, linking user requirements ("why") to methods ("how") and datasets ("what"). Unlike simple variables, feature concepts (e.g., "diversity shift," "islands and bridges") capture the essence of the needed information. The mapping process, particularly in the Innovators Marketplace on Data Jackets (IMDJ) context, involves creative communication among stakeholders to elicit and refine these concepts. Once defined, feature concepts guide the selection, extraction, and combination of data sources and features. For example, in explaining market changes, the feature concept "diversity shift" is mapped to a quantitative measure like Graph-Based Entropy (GBE), $H_g = -\sum_{j} p(\text{cluster}_j) \log p(\text{cluster}_j)$, computed from POS data. In earthquake prediction, "diversity shift" maps to the Regional Entropy on Seismic Information (RESI), $H(S, t)$, of epicenter locations. This approach formalizes the often-implicit process by which domain experts connect abstract ideas to measurable data features across federated datasets.
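
The entropy measure that operationalizes "diversity shift" is straightforward to compute. The cluster counts below are invented, and the base-2 logarithm is an assumption (the choice of base only rescales the measure).

```python
# Entropy over cluster-membership shares as a "diversity shift" measure
# (hypothetical purchase-cluster counts; log base 2 assumed).
import math

def graph_based_entropy(cluster_counts):
    """H = -sum_j p(cluster_j) * log2 p(cluster_j)."""
    total = sum(cluster_counts)
    ps = [c / total for c in cluster_counts if c > 0]
    return -sum(p * math.log2(p) for p in ps)

# Even spread across 4 clusters: maximal diversity, H = 2 bits.
h_even = graph_based_entropy([5, 5, 5, 5])

# Purchases concentrated in one cluster: lower entropy; a rise over time
# would signal a diversity shift in the underlying behaviour.
h_skewed = graph_based_entropy([17, 1, 1, 1])
```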

Mapping in Software Engineering and Geospatial Domains

Feature-to-data source mapping principles are also applied in specific domains like software engineering and geospatial analysis.

In software engineering, feature location aims to find the source code relevant to a particular feature or functionality. The ACIR technique (2402.05711) proposes using changeset descriptions from version control systems as a data source to describe software entities (files or methods). A mapping is established between query terms (representing the feature description) and code artifacts by indexing the text of changesets associated with lines of code. This uses an IR-based approach with TF-IDF weighting: $\text{TF-IDF}(t,d) = tf(t,d) \times \log \frac{N}{df(t)}$. The mapping connects the high-level concept of a software feature (as described in a commit message or issue tracker) to the specific code entities that were changed to implement it. This provides an alternative data source to traditional comments and identifiers, leveraging the intent captured in commit messages.
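
A minimal TF-IDF ranking over changeset text can be sketched as follows. The file names and changeset messages are invented; a real index would also stem, lowercase, and weight at the line or method level.

```python
# Sketch: rank code artifacts against a feature query by TF-IDF over the
# changeset descriptions that touched them (hypothetical corpus).
import math
from collections import Counter

docs = {
    "auth/login.py": "fix login session timeout add remember me option",
    "ui/menu.py":    "refactor menu rendering",
    "auth/token.py": "rotate session token on login",
}

N = len(docs)
tokenized = {name: text.split() for name, text in docs.items()}
# Document frequency: in how many artifacts' changeset text each term appears.
df = Counter(t for toks in tokenized.values() for t in set(toks))

def tfidf_score(query, name):
    """Sum of TF-IDF(t, d) = tf(t, d) * log(N / df(t)) over query terms."""
    tf = Counter(tokenized[name])
    return sum(tf[t] * math.log(N / df[t]) for t in query.split() if t in df)

ranked = sorted(docs, key=lambda name: tfidf_score("login session", name),
                reverse=True)
# the auth files, whose changesets mention "login" and "session", rank first
```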

For geospatial data, mapping features between raster and vector formats at different scales is a common challenge (2407.10599). The proposed general algorithm maps features (e.g., air pollutant concentrations from raster data) to vector components (e.g., road segments). This involves rasterizing the vector map area into a grid matching the raster resolution and then assigning raster values to vector elements based on geographic overlap. For a grid center at (lat, lon), the raster value $P(\text{lat}, \text{lon})$ is obtained from the raster data $D_r$ by looking up the value at the corresponding geographic coordinates within the raster cell. This value is then attached as an attribute to the vector features (edges) that fall within that grid cell. This method provides a scalable way to fuse features from disparate spatial data sources while preserving the integrity of both, enabling analyses like assigning environmental metrics to urban infrastructure.
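
The coordinate-to-cell lookup at the heart of this algorithm can be sketched as below. The grid origin, cell size, raster values, and road segments are all invented; real pipelines would also handle projections, nodata cells, and segments spanning multiple cells.

```python
# Sketch: attach raster values (e.g. pollutant concentration) to vector road
# segments by looking up the grid cell overlapping each segment's midpoint.
raster = [
    [10.0, 12.0, 11.0],
    [ 9.0, 15.0, 13.0],
    [ 8.0, 14.0, 16.0],
]
# Hypothetical georeferencing: top-left corner and cell size in degrees.
lat0, lon0, cell = 40.0, -75.0, 0.01

def raster_value(lat, lon):
    """P(lat, lon): map geographic coordinates to row/column grid indices."""
    row = int((lat0 - lat) / cell)   # latitude decreases going down the rows
    col = int((lon - lon0) / cell)
    return raster[row][col]

segments = [{"id": "road_a", "mid": (39.995, -74.985)},
            {"id": "road_b", "mid": (39.975, -74.995)}]
for seg in segments:
    seg["pollution"] = raster_value(*seg["mid"])
# each road segment now carries the value of the raster cell it falls in
```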

Challenges and Future Directions

Several recurring challenges emerge in feature-to-data source mapping. Handling heterogeneous data structures, semantic alignment across sources, redundancy, and scalability for large datasets are common issues (1905.01306, 2404.12610, 2407.10599). Preserving data integrity and source uniqueness during fusion is critical (2407.10599). For learned representations, maintaining fidelity and interpretability during mapping is challenging, especially with complex non-linear models (2211.09321, 2505.21032). Automated methods face challenges in combinatorial search space and ensuring the quality and relevance of generated features or selected data sources (1706.00327, 2309.07856). Incremental scenarios introduce complexity regarding stability and efficiency (2412.09355).

Solutions often involve formal models to structure the mapping (e.g., Entity-Features model, join paths), explicit transformation learning (IMC), disentanglement techniques (REFeD), preservation of local structures (featMAP), or statistical analysis of feature distributions (StoRe). Leveraging external knowledge sources and LLMs shows promise for generating and mapping features (2309.07856).

Future research directions include developing more sophisticated mappings for complex feature spaces and data structures, enhancing interpretability across diverse modalities, creating frameworks for incremental and dynamic mapping in evolving data environments (2412.09355), and applying these techniques to new domains and problem types (2407.10599, 2111.04505).