Optimal dimensionality of biological network embedding spaces

Determine the optimal reduced dimensionality for embedding biological networks, including multi-omics interaction networks such as protein–protein interaction, genetic interaction, and co-expression networks. The learned spaces should be small enough to be computationally efficient yet large enough to preserve the properties of the whole network that accurate analysis requires.

Background

The dimensionality of a network embedding strongly affects model performance: very low dimensions are typically not expressive enough to capture the richness of biological network data, while very high dimensions can lead to overfitting and increased computational cost. Because no principled rule for choosing it is known, most methods treat dimensionality as a tunable hyper-parameter.

Existing practices vary widely. Intrinsic-dimension estimators often yield ultra-low dimensions, data-driven methods use up to 200–250 dimensions, and common defaults in Node2vec and DeepWalk are 128 or 256. Preliminary studies on human protein–protein interaction networks suggest an upper limit around 250–300 dimensions, but a general principle for choosing the dimensionality across biological networks has yet to be established.
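
To make the "tunable hyper-parameter" practice concrete, here is a minimal sketch of a dimension sweep. It is an illustration under stated assumptions, not the procedure of any method cited here: the network is a synthetic stochastic block model rather than real biological data, the embedding is a plain spectral (low-rank) approximation of the adjacency matrix rather than Node2vec or DeepWalk, and the helper names (`sbm_adjacency`, `reconstruction_errors`, `smallest_sufficient_dim`) and the tolerance of 0.5 are hypothetical choices.

```python
import numpy as np


def sbm_adjacency(block_sizes, p_in, p_out, seed=0):
    """Sample a symmetric 0/1 adjacency matrix from a stochastic block model.

    Synthetic stand-in for a biological network (assumption for illustration).
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(len(block_sizes)), block_sizes)
    same_block = labels[:, None] == labels[None, :]
    probs = np.where(same_block, p_in, p_out)
    # Sample the strict upper triangle only, then mirror it (no self-loops).
    upper = np.triu(rng.random(probs.shape) < probs, k=1)
    return (upper | upper.T).astype(float)


def reconstruction_errors(adj, dims):
    """Relative Frobenius error of the best rank-d approximation, per d.

    For a symmetric matrix the best rank-d approximation keeps the d
    eigenvalues of largest magnitude, so the squared error is the sum of
    the remaining squared eigenvalues.
    """
    eigvals = np.linalg.eigvalsh(adj)
    energy = np.sort(eigvals ** 2)[::-1]  # squared eigenvalues, largest first
    total = energy.sum()
    return {
        d: float(np.sqrt(max(total - energy[:d].sum(), 0.0) / total))
        for d in dims
    }


def smallest_sufficient_dim(errors, tol):
    """Return the smallest swept dimension whose error is <= tol, else None."""
    for d in sorted(errors):
        if errors[d] <= tol:
            return d
    return None


# Sweep a handful of candidate dimensions on a 90-node, 3-block toy network.
adj = sbm_adjacency([30, 30, 30], p_in=0.3, p_out=0.02)
errors = reconstruction_errors(adj, dims=[1, 2, 4, 8, 16, 32, 64, 90])
chosen = smallest_sufficient_dim(errors, tol=0.5)  # tol is an arbitrary choice

for d in sorted(errors):
    print(f"d={d:3d}  relative reconstruction error={errors[d]:.3f}")
print("smallest swept dimension within tolerance:", chosen)
```

The sweep shows why this is a trade-off rather than a solved choice: the error can only decrease as the dimension grows, so the selected value depends entirely on the tolerance and on the quality measure, both of which are set by hand here. In practice the measure would usually be a downstream task score rather than reconstruction error.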

References

Another main challenge in graph embeddings is finding an optimal reduced dimension that is small enough to be efficient, but large enough to keep all of the necessary properties of the whole network. As this is an unresolved scientific question, the embedding space dimensionality is considered a hyper-parameter of the model.

— Simplicity within biological complexity (2405.09595 - Przulj et al., 15 May 2024) in Pillar IV. Finding optimal dimensionality of the embedding spaces (Section 2.4)
