DataMap: A Portable Application for Visualizing High-Dimensional Data (2504.08875v1)

Published 11 Apr 2025 in q-bio.QM, cs.HC, cs.LG, and stat.AP

Abstract: Motivation: The visualization and analysis of high-dimensional data are essential in biomedical research. There is a need for secure, scalable, and reproducible tools to facilitate data exploration and interpretation. Results: We introduce DataMap, a browser-based application for visualization of high-dimensional data using heatmaps, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). DataMap runs in the web browser, ensuring data privacy while eliminating the need for installation or a server. The application has an intuitive user interface for data transformation, annotation, and generation of reproducible R code. Availability and Implementation: Freely available as a GitHub page https://gexijin.github.io/datamap/. The source code can be found at https://github.com/gexijin/datamap, and can also be installed as an R package. Contact: [email protected]

Summary

  • The paper introduces DataMap, a browser-based R/Shiny tool that offers secure, client-side visualization for high-dimensional biomedical datasets.
  • It employs guided data transformation methods such as imputation, log transformation, normalization, and outlier capping to prepare data for analysis.
  • DataMap produces publication-quality visualizations like heatmaps, PCA, and t-SNE while automatically generating reproducible R code.

DataMap is a browser-based application designed for visualizing high-dimensional data, particularly relevant in biomedical research fields like genomics (RNA-seq) and proteomics (2504.08875). It addresses the need for secure, easy-to-use, and reproducible tools for exploring complex datasets. The application runs entirely within the user's web browser, eliminating the need for server-side processing or software installation, thus ensuring data privacy as sensitive information never leaves the local machine.

Implementation:

  • Technology: DataMap is built as an R/Shiny application. It leverages Shinylive, which runs the app on WebR, a build of the R interpreter compiled to WebAssembly, so the R code executes directly in the browser.
  • Deployment: The application is hosted as static files on GitHub Pages, making it serverless.
  • Availability: It's freely accessible via a GitHub page (https://gexijin.github.io/datamap/). The source code is available on GitHub (https://github.com/gexijin/datamap) and can also be installed as a standard R package for local use, which is particularly beneficial for larger datasets.

Key Features and Functionality:

  1. Secure Client-Side Processing: All data loading, transformation, and visualization happen locally in the user's browser. This is a major advantage for handling sensitive datasets and bypasses server capacity limitations.
  2. Versatile Data Import: Supports various file formats like Excel (.xlsx), CSV, TSV, and TXT. It features automatic detection of delimiters and checks for row/column names to simplify data loading. Annotations for rows and columns (e.g., experimental conditions) can be uploaded separately; row annotations can also be included within the data file itself.
  3. Smart Data Transformation: Includes essential preprocessing steps:
    • Missing Value Imputation: Options to impute using row/column mean or median.
    • Log Transformation: Automatically recommended if data shows high skewness (> 1) and contains no negative values, a pattern common in biological data.
    • Normalization/Scaling: Infers matrix orientation (rows vs. columns as features) based on variability (Median Absolute Deviation) and suggests appropriate centering or scaling.
    • Outlier Capping: Caps outliers beyond 3 standard deviations from the mean to improve color mapping in heatmaps.
    • Feature Filtering: Allows filtering out rows with low variability.
    • These features are guided by statistical heuristics, making the tool accessible even for users without deep statistical expertise.
  4. Visualization Methods: Generates publication-quality visualizations:
    • Heatmaps: Uses the pheatmap R package for hierarchical clustering and heatmap generation. Allows cutting dendrograms to define and visualize clusters.
    • Dimensionality Reduction: Performs Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) for visualizing sample relationships in lower dimensions.
    • Visualizations can be downloaded in PDF or PNG formats.
  5. Reproducibility: Automatically records all user actions and settings (data import, transformations, visualization parameters) and generates the corresponding R code. This allows users to reproduce the analysis exactly, promoting transparency and collaboration.
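
The preprocessing heuristics in item 3 can be illustrated with a minimal Python sketch. It is not DataMap's implementation (DataMap is written in R), and the function names, the moment-based skewness formula, and the use of the population standard deviation are all assumptions made for illustration; only the thresholds (skewness above 1, no negative values, capping at 3 standard deviations) come from the summary above.

```python
import math
import statistics


def skewness(values):
    """Moment-based skewness: the mean of cubed z-scores (an assumed formula)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / n


def recommend_log(values, threshold=1.0):
    """Suggest a log transform for non-negative data with skewness above the threshold."""
    return min(values) >= 0 and skewness(values) > threshold


def impute_row_median(row):
    """Replace missing entries (None) with the median of the observed values in the row."""
    observed = [v for v in row if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in row]


def cap_outliers(values, n_sd=3.0):
    """Clamp values to the interval mean +/- n_sd standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD; an assumption
    lo, hi = mean - n_sd * sd, mean + n_sd * sd
    return [min(max(v, lo), hi) for v in values]


# Hypothetical expression-like row: many small values and one extreme value.
row = [1] * 20 + [100]
print(recommend_log(row))            # right-skewed and non-negative
print(max(cap_outliers(row)) < 100)  # the extreme value is pulled in
print(impute_row_median([1, None, 3]))
```

Capping before plotting keeps a single extreme value from stretching the heatmap color scale, which is the motivation the summary gives for that step.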

Comparison with Existing Tools:

  • DataMap joins other browser-based tools like Phantasus and Morpheus in offering client-side processing for enhanced security.
  • Unlike server-based tools such as Clustergrammer, it keeps all data on the user's machine.
  • It differentiates itself by offering a comprehensive suite of preprocessing options guided by heuristics, generating reproducible R scripts, and focusing on publication-quality graphics using established R packages.
  • One drawback the paper notes is that it may be less interactive than tools built natively with JavaScript or other web technologies.

Limitations and Future Directions:

  • Performance: Browser-based execution via WebAssembly is considerably slower than running native R code locally, especially for large datasets. The paper cites an example where heatmap generation took 80 seconds in the browser versus 5 seconds in native R, a 16-fold slowdown. Users with very large datasets are advised to install and run the R package locally.
  • Package Dependency: Relies on WebR, which supports a subset of R packages, and updates might lag behind the main R ecosystem.
  • Future Work: Plans include optimizing performance for browser-based execution, expanding visualization options, and adding more analytical modules.

In summary, DataMap provides a secure, user-friendly, and reproducible platform for visualizing high-dimensional data directly in the browser. Its key strengths lie in its client-side processing, guided data transformation workflow, high-quality visualizations (heatmaps, PCA, t-SNE), and automatic generation of R code for reproducibility, making it a valuable tool for biomedical researchers.
