- The paper introduces DataMap, a browser-based R/Shiny tool that offers secure, client-side visualization for high-dimensional biomedical datasets.
- It employs guided data transformation methods such as imputation, log transformation, normalization, and outlier capping to prepare data for analysis.
- DataMap produces publication-quality visualizations like heatmaps, PCA, and t-SNE while automatically generating reproducible R code.
DataMap is a browser-based application for visualizing high-dimensional data, aimed at biomedical research fields such as genomics (RNA-seq) and proteomics (2504.08875). It addresses the need for secure, easy-to-use, and reproducible tools for exploring complex datasets. The application runs entirely within the user's web browser, so it needs no server-side processing or software installation, and sensitive data never leaves the local machine.
Implementation:
- Technology: DataMap is built as an R/Shiny application. It uses Shinylive, which runs Shiny apps on WebR, a build of the R interpreter compiled to WebAssembly, so the R code executes directly in the browser.
- Deployment: The application is hosted as static files on GitHub Pages, making it serverless.
- Availability: It's freely accessible via a GitHub page (https://gexijin.github.io/datamap/). The source code is available on GitHub (https://github.com/gexijin/datamap) and can also be installed as a standard R package for local use, particularly beneficial for larger datasets.
Key Features and Functionality:
- Secure Client-Side Processing: All data loading, transformation, and visualization happen locally in the user's browser. This is a major advantage for handling sensitive datasets and bypasses server capacity limitations.
- Versatile Data Import: Supports Excel (.xlsx), CSV, TSV, and TXT files. It detects delimiters automatically and checks for row and column names to simplify data loading. Row and column annotations (e.g., experimental conditions) can be uploaded as separate files; row annotations can alternatively be included in the data file itself.
- Smart Data Transformation: Includes essential preprocessing steps:
  - Missing Value Imputation: Imputes missing values using the row or column mean or median.
  - Log Transformation: Recommended automatically when the data are highly skewed (skewness > 1) and contain no negative values, a pattern common in biological data.
  - Normalization/Scaling: Infers matrix orientation (whether rows or columns are the features) from variability, measured by the Median Absolute Deviation, and suggests appropriate centering or scaling.
  - Outlier Capping: Caps values more than 3 standard deviations from the mean to improve color mapping in heatmaps.
  - Feature Filtering: Allows filtering out rows with low variability.
  - These steps are guided by statistical heuristics, making the tool accessible even to users without deep statistical expertise.
- Visualization Methods: Generates publication-quality visualizations:
  - Heatmaps: Uses the pheatmap R package for hierarchical clustering and heatmap generation. Allows cutting dendrograms to define and visualize clusters.
  - Dimensionality Reduction: Performs Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) for visualizing sample relationships in lower dimensions.
  - Visualizations can be downloaded in PDF or PNG formats.
- Reproducibility: Automatically records all user actions and settings (data import, transformations, visualization parameters) and generates the corresponding R code. This allows users to reproduce the analysis exactly, promoting transparency and collaboration.
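The transformation heuristics listed above can be illustrated with a minimal Python sketch. DataMap itself is implemented in R, and this summary does not give the exact formulas, so the moment-based skewness estimator, the log base and pseudocount, the use of a flat list of values, and all function names below are assumptions made for illustration rather than the tool's actual code.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def recommend_log(values, skew_threshold=1.0):
    """Suggest a log transform for highly skewed data with no negatives."""
    return min(values) >= 0 and skewness(values) > skew_threshold


def log_transform(values, pseudocount=1.0):
    """Apply log2(x + pseudocount); base and pseudocount are assumptions."""
    return [math.log2(v + pseudocount) for v in values]


def cap_outliers(values, n_sd=3.0):
    """Clamp each value to the range mean +/- n_sd sample standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    return [min(max(v, lower), upper) for v in values]


def impute_row_mean(row):
    """Replace None entries with the mean of the observed values in the row."""
    observed = [v for v in row if v is not None]
    if not observed:
        return list(row)  # nothing to estimate from; leave the row unchanged
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in row]


# Hypothetical expression-like values with a long right tail.
counts = [1, 2, 2, 3, 3, 4, 5, 8, 40, 900]
if recommend_log(counts):
    counts = log_transform(counts)
```

In this sketch the recommendation only returns a yes/no answer and leaves the decision to the caller, mirroring the guided rather than automatic workflow described above. Note that capping at 3 standard deviations only has an effect on reasonably large inputs: with the sample standard deviation, no value among n observations can lie more than (n - 1) / sqrt(n) standard deviations from the mean, which is below 3 for n of 10 or fewer.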
Comparison with Existing Tools:
- DataMap joins other browser-based tools like Phantasus and Morpheus in offering client-side processing for enhanced security.
- Unlike server-based tools such as Clustergrammer, it does not require uploading data to a remote server, which preserves data privacy.
- It differentiates itself by offering a comprehensive suite of preprocessing options guided by heuristics, generating reproducible R scripts, and focusing on publication-quality graphics using established R packages.
- One acknowledged drawback is that it may be less interactive than tools built natively in JavaScript or other web technologies.
Limitations and Future Directions:
- Performance: Browser-based execution via WebAssembly is significantly slower than running native R code locally, especially for large datasets. The paper cites an example where heatmap generation took 80 seconds in the browser versus 5 seconds in native R, a 16-fold slowdown. Users with very large datasets are advised to install and run the R package locally.
- Package Dependency: Relies on WebR, which supports only a subset of R packages, and package updates may lag behind the main R ecosystem.
- Future Work: Plans include optimizing performance for browser-based execution, expanding visualization options, and adding more analytical modules.
In summary, DataMap provides a secure, user-friendly, and reproducible platform for visualizing high-dimensional data directly in the browser. Its key strengths lie in its client-side processing, guided data transformation workflow, high-quality visualizations (heatmaps, PCA, t-SNE), and automatic generation of R code for reproducibility, making it a valuable tool for biomedical researchers.