
WASJ

World Academy of Sciences Journal

2632-2900 2632-2919

D.A. Spandidos

WASJ-7-2-00315

10.3892/wasj.2025.315

Review

Single‑cell RNA sequencing data dimensionality reduction (Review)

Vasileios L. Zogopoulos 1,2, Ioanna Tsotra 1,2, Demetrios A. Spandidos 3, Vassiliki A. Iconomidou 2, Ioannis Michalopoulos 1

1 Centre of Systems Biology, Biomedical Research Foundation, Academy of Athens, 11527 Athens, Greece
2 Section of Cell Biology and Biophysics, Department of Biology, National and Kapodistrian University of Athens, 15701 Athens, Greece
3 Laboratory of Clinical Virology, Medical School, University of Crete, 71003 Heraklion, Greece

Correspondence to: Dr Ioannis Michalopoulos, Centre of Systems Biology, Biomedical Research Foundation, Academy of Athens, Soranou Efessiou 4, 11527 Athens, Greece. E-mail: imichalop@bioacademy.gr

Abbreviations: GAN, generative adversarial network; PC, principal component; PCA, principal component analysis; scRNA-Seq, single-cell RNA sequencing; t-SNE, t-distributed stochastic neighbour embedding; UMAP, uniform manifold approximation and projection; UMI, unique molecular identifier; VAE, variational autoencoder

Mar-Apr 2025, Volume 7, Issue 2
Published online: 20 January 2025
Received: 21 November 2024; Accepted: 15 January 2025
Copyright: 2025

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose, provided that it is properly attributed. For attribution, the original author(s), title, publication source and either DOI or URL of the article must be cited.

Single-cell RNA sequencing (scRNA-Seq) provides detailed insight into gene expression at the individual cell level, revealing hidden cell diversity. However, scRNA-Seq data pose challenges due to their high dimensionality and sparsity. High dimensionality stems from analysing numerous cells and genes, while sparsity arises from the many zero counts in gene expression data, known as dropout events. Meaningful interpretation therefore requires robust methods for processing scRNA-Seq gene counts. Dimensionality reduction techniques, such as principal component analysis, transform gene count data into lower-dimensional spaces that retain biological information, aiding downstream analyses, while dimensionality reduction-based visualisation methods, such as t-distributed stochastic neighbour embedding and uniform manifold approximation and projection, are used for cell or gene clustering. Deep learning techniques, such as variational autoencoders and generative adversarial networks, compress data and generate synthetic gene expression profiles, augmenting datasets and improving their utility in biomedical research. In recent years, interest in scRNA-Seq dimensionality reduction has markedly increased, leading not only to the development of a multitude of methods, but also to the integration of these approaches into scRNA-Seq data processing pipelines. The present review aimed to list and explain, in layman's terms, the currently popular dimensionality reduction methods, as well as their advancements and software package implementations.

Key words: scRNA-Seq, dimensionality reduction, VAE, GAN, UMAP, t-SNE, PCA

Funding: No funding was received.

1. Single-cell RNA-Seq

The transcriptome is the set of all RNA transcripts of a cell or tissue of an organism, together with their quantities (1). The two main transcriptomic technologies used to obtain gene expression data are microarrays (2) and RNA sequencing (RNA-Seq) (1). The latter can be divided into bulk and single-cell RNA-Seq (scRNA-Seq). Bulk RNA-Seq, the first iteration of this technology, uses the total mRNA extracted from a tissue, providing an average expression level for each gene across the variety of cells included in a sample. By contrast, scRNA-Seq is an emerging RNA-Seq technology which investigates the transcriptome of single cells (3). Despite the large number of different sequencing platforms, the main experimental workflow of scRNA-Seq includes the following steps: i) Single-cell isolation from the tissue of interest; ii) lysis of cells and RNA isolation; iii) reverse transcription of the mRNA and amplification through PCR; and iv) library preparation and sequencing (4). Independent of the sequencing platform used, the final output is a FASTQ file, which constitutes the scRNA-Seq raw data and contains the nucleotide sequences, as well as a PHRED quality score for each base (5). FASTQ files are then computationally pre-processed to produce a gene expression matrix, usually in the form of a gene read count matrix or, in the case of droplet-based platforms (e.g., 10x Genomics Chromium), a unique molecular identifier (UMI) (6) matrix; UMIs were introduced to correct for PCR amplification bias and ensure accurate gene expression quantification. The pipeline for mapping reads to the reference genome is in principle the same as in bulk RNA-Seq, including the following basic steps: i) Quality control and adapter sequence removal; ii) alignment of reads to the reference genome; iii) feature counting; and iv) normalisation (7,8). However, in the case of single-cell data, further pre-processing steps, performed by specialised software, are included to account for the intricacies of single-cell sequencing. These steps include the identification of low-quality cells, count transformation for UMI datasets, the identification of highly variable features (genes), dimensionality reduction and cell clustering, among others (9). Pipelines for the pre-processing of scRNA-Seq data, such as Cell Ranger (10) for 10x Genomics-based data, are already well established in the scientific community.
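The role of UMIs in correcting for PCR amplification bias can be illustrated with a minimal sketch. It assumes that each read has already been assigned a cell barcode, a UMI and a gene; the function name `umi_count_matrix` and all barcodes, UMIs and gene names are hypothetical, and real pipelines additionally correct sequencing errors in barcodes and UMIs, which is omitted here.

```python
from collections import defaultdict


def umi_count_matrix(reads):
    """Collapse reads into UMI counts per (cell, gene) pair.

    `reads` is an iterable of (cell_barcode, umi, gene) tuples. Reads that
    share all three values are treated as PCR duplicates of one molecule.
    """
    molecules = defaultdict(set)
    for cell, umi, gene in reads:
        molecules[(cell, gene)].add(umi)  # a set stores each UMI only once
    return {key: len(umis) for key, umis in molecules.items()}


# Toy input with hypothetical barcodes, UMIs and gene names.
reads = [
    ("CELL_A", "AACG", "GeneX"),
    ("CELL_A", "AACG", "GeneX"),  # PCR duplicate of the read above
    ("CELL_A", "TTGC", "GeneX"),
    ("CELL_A", "AACG", "GeneY"),
    ("CELL_B", "GGTA", "GeneX"),
]
counts = umi_count_matrix(reads)
```

In this toy input, the three GeneX reads of CELL_A collapse to two molecules, because two of them carry the same UMI.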

scRNA-Seq allows for the high-resolution study of gene expression in a cell-specific manner. However, scRNA-Seq gene count data are characterised by high dimensionality, due to the high number of cells that are isolated from an extracted tissue and the high number of genes (both coding and non-coding) that are studied (11). Furthermore, gene expression levels derived from scRNA-Seq are highly sparse, owing to the large number of zero counts (known as ‘dropout events’) for genes that are truly expressed in other cells of the same type. Dropout events may be attributed to the low levels of mRNA extracted from each cell, the stochasticity of gene expression and the cell-specific expression of certain genes (12). To deal with these two major drawbacks of single-cell data, statistical and artificial intelligence methods of dimensionality reduction and imputation have been developed. Certain dimensionality reduction methods also cater for the imputation of zero values (13). Nevertheless, the sparsity inherent in scRNA-Seq data can be overcome using dimensionality reduction alone, as compression to a low-dimensional space combines expression data across cells and naturally deals with data redundancy (14). The present review mainly focuses on the most commonly used methods available for performing dimensionality reduction on scRNA-Seq gene count data.
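As a rough illustration of what sparsity means in practice, the following sketch measures the fraction of zero entries in a small count matrix and the fraction of cells in which each gene is detected. The matrix values and the function names `sparsity` and `detection_rate_per_gene` are hypothetical, and a genes × cells layout is assumed.

```python
import numpy as np


def sparsity(counts):
    """Fraction of entries in a genes x cells count matrix that are zero."""
    counts = np.asarray(counts)
    return float(np.mean(counts == 0))


def detection_rate_per_gene(counts):
    """Fraction of cells in which each gene has at least one count."""
    counts = np.asarray(counts)
    return np.mean(counts > 0, axis=1)


# Toy matrix: 4 genes (rows) x 5 cells (columns), hypothetical values.
toy = np.array([
    [0, 3, 0, 0, 1],
    [5, 0, 0, 2, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 2, 0, 4],
])
overall = sparsity(toy)
per_gene = detection_rate_per_gene(toy)
```

Note that a zero in such a matrix does not reveal whether the gene was truly silent or simply not captured, which is why the measure alone cannot distinguish dropout events from genuine absence of expression.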

2. Dimensionality reduction

In the context of scRNA-Seq data, each cell may be represented as a data point in a Euclidean space with as many dimensions as there are genes in the dataset, where the coordinates of the data point are the expression levels of the genes in that cell. Conversely, each gene may be depicted as a data point in a high-dimensional space with as many dimensions as there are cells, where the point coordinates are the expression levels of the gene in each cell. Consequently, scRNA-Seq count data, albeit stored as a two-dimensional text file with columns (cells) and rows (genes), are actually multidimensional.

Dimensionality reduction refers to the transformation of high-dimensional data to lower dimensions, reducing their size while keeping most of the information present in the original data (15). As the amount of computational resources required to run any algorithm (e.g., for machine learning) depends on the size of the input data, reducing their dimensions results in lower memory requirements and shorter execution times (16).

There are two approaches for dimensionality reduction: Feature selection and feature extraction, where features refer to the dataset dimensions (genes or samples). In feature selection, a certain number of dimensions that provide the most significant information are selected, while the remainder are discarded. Feature extraction focuses on creating a new set of dimensions by combining the original dimensions (15,17).
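The difference between the two approaches can be sketched on a toy cells × genes matrix. In this illustration, which is not taken from any specific package, feature selection keeps the highest-variance genes unchanged, whereas feature extraction builds new dimensions as linear mixtures of all genes; a random projection is used for the latter purely for brevity. The data and the function names `select_features` and `extract_features` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data: 50 cells (rows) x 200 genes (columns).
X = rng.poisson(lam=1.0, size=(50, 200)).astype(float)


def select_features(X, n_keep):
    """Feature selection: keep the n_keep highest-variance genes unchanged."""
    variances = X.var(axis=0)
    keep = np.sort(np.argsort(variances)[-n_keep:])
    return X[:, keep], keep


def extract_features(X, n_new, seed=0):
    """Feature extraction: build n_new dimensions as linear mixes of all genes.

    A random projection is used here purely for illustration.
    """
    local_rng = np.random.default_rng(seed)
    W = local_rng.normal(size=(X.shape[1], n_new)) / np.sqrt(n_new)
    return X @ W


X_sel, kept = select_features(X, 20)   # 20 original genes, 180 discarded
X_ext = extract_features(X, 20)        # 20 new dimensions built from all genes
```

Both outputs have 20 columns, but only the selected matrix still consists of interpretable original genes; each extracted column depends on every gene.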

As high dimensionality in scRNA-Seq is attributed to both the samples and the genes, dimensionality reduction can be performed for either of the two, usually through feature extraction. Reducing the dimensionality of genes in scRNA-Seq data creates a smaller set of latent genes, enabling the efficient clustering of cells and the subsequent identification of cell types, a step which constitutes an essential part of most scRNA-Seq analyses (18). On the other hand, reducing the dimensionality of cells, through the creation of latent samples that contain most of the biological information of the original cells (Fig. 1), facilitates dataset integration for differential gene expression analysis (19). Dimensionality reduction has been established as an integral part of the scRNA-Seq data processing pipeline, bringing the data to a more manageable form before they are used in further downstream analysis or data visualisation (20) (Fig. 2).

3. Common dimensionality reduction techniques in single-cell RNA-Seq

Principal component (PC) analysis (PCA)

PCA is a statistical method used to reduce high-dimensional data (such as scRNA-Seq data) to lower dimensions, while retaining most of the information in the original data (21). PCA is an orthogonal linear transformation of the data points of the original dataset (22) that creates new variables, known as PCs, which are mutually uncorrelated and each of which captures a decreasing proportion of the total variance of the original dataset (23). There are several approaches for determining the number of PCs that need to be kept in order to retain most of the variability of the original dataset, while excluding variability that is caused by noise. One of the most commonly used is to keep the top PCs that together explain an arbitrarily selected percentage of the variability, although this may include a large number of PCs that explain variability attributable to noise. Alternatively, the variability explained by each PC can be plotted and the top PCs selected using the ‘elbow’ method (24); however, in many cases, the ‘elbow’ may not be easily defined. In both cases, the remaining PCs are discarded, thus efficiently reducing the dataset dimensions (25).
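The percentage-of-variance rule described above can be sketched with plain NumPy, computing PCA through the singular value decomposition of the centred matrix. This is a minimal sketch rather than the implementation used by any particular scRNA-Seq package; the function name `pca`, the 90% threshold and the simulated data are assumptions made for illustration.

```python
import numpy as np


def pca(X, var_threshold=0.9):
    """PCA of a cells x genes matrix via singular value decomposition.

    Returns the cell scores on the retained PCs, the explained-variance
    ratio of every PC, and the number of PCs retained.
    """
    Xc = X - X.mean(axis=0)                       # centre each gene
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)               # variance share per PC
    cumulative = np.cumsum(explained)
    # Smallest number of PCs whose cumulative share reaches the threshold.
    n_keep = int(np.searchsorted(cumulative, var_threshold)) + 1
    n_keep = min(n_keep, len(s))
    scores = U[:, :n_keep] * s[:n_keep]           # cell coordinates on the PCs
    return scores, explained, n_keep


rng = np.random.default_rng(1)
# Hypothetical data: 100 cells whose 30 genes are driven by 3 hidden factors.
factors = rng.normal(size=(100, 3))
loadings = rng.normal(size=(3, 30))
X = factors @ loadings + 0.05 * rng.normal(size=(100, 30))

scores, explained, n_keep = pca(X, var_threshold=0.9)
```

Plotting `explained` against the PC index would give the scree plot on which the ‘elbow’ is sought; in this simulated case the curve drops sharply after the third PC because only three factors generate the data.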

When cells in scRNA-Seq data are treated as data points, PCs are linear combinations of genes, known as latent genes (26). As scRNA-Seq data provide no prior information about the identity of each cell, PCA, as an unsupervised method, can capture the linear associations present in the scRNA-Seq gene expression values, producing a low-dimensional dataset with the same number of cells as originally studied, but fewer latent genes than there are genes in the original dataset, while retaining most of its variance (20). The resulting low-dimensional gene expression matrix is commonly used as input to visualisation algorithms or for additional analyses.

Visualisation methods in lower dimensions

To visualise high-dimensional data in a comprehensible form, the data first need to undergo dimensionality reduction and then be mapped into two dimensions if a plot is drawn (20). Alternatively, if 3D-visualisation software is used, the data need to be mapped into three dimensions. For scRNA-Seq data, there are two major methods for dimensionality reduction into two or three dimensions and subsequent visualisation: t-distributed stochastic neighbour embedding (t-SNE) and uniform manifold approximation and projection (UMAP). t-SNE (14) was created as an improvement on the SNE method (27), which uses a Gaussian distribution to determine the similarity of the low-dimensional points and determines the low-dimensional representation by minimising a loss function. t-SNE instead uses a Student-t distribution and an improved loss function, which offer a better spread of the data points and a faster run time, respectively. UMAP (28) constructs a weighted k-nearest-neighbour graph and subsequently computes a lower-dimensional layout of it. UMAP is more recent and was developed as an alternative to t-SNE, having an even lower execution time, while claiming to better preserve the global structure of the data, i.e., the overall arrangement of the clusters.
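The effect of replacing the Gaussian with a Student-t distribution in the low-dimensional space can be sketched as follows. The code computes normalised pairwise similarities between embedded points under both kernels; it is a simplified illustration, assuming a Student-t distribution with one degree of freedom and a single joint normalisation over all pairs, and it omits the high-dimensional similarities and the optimisation itself. The function names and the three example points are hypothetical.

```python
import numpy as np


def pairwise_sq_dists(Y):
    """Squared Euclidean distances between all rows of Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sum(diff**2, axis=-1)


def low_dim_similarities(Y, kernel="student_t"):
    """Normalised pairwise similarities between embedded points.

    kernel="gaussian" mimics SNE; kernel="student_t" (one degree of
    freedom) mimics t-SNE. Self-similarities are excluded.
    """
    d2 = pairwise_sq_dists(Y)
    if kernel == "gaussian":
        w = np.exp(-d2)
    elif kernel == "student_t":
        w = 1.0 / (1.0 + d2)
    else:
        raise ValueError("kernel must be 'gaussian' or 'student_t'")
    np.fill_diagonal(w, 0.0)
    return w / w.sum()


# Three embedded points (hypothetical): two close together, one far away.
Y = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]])
q_gauss = low_dim_similarities(Y, "gaussian")
q_t = low_dim_similarities(Y, "student_t")
```

The heavy tail of the Student-t kernel assigns the distant pair a far larger share of the similarity than the Gaussian does, which is what lets moderately dissimilar points sit further apart in the plot and gives the better spread mentioned above.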

t-SNE and UMAP, as non-linear methods, are commonly used in scRNA-Seq analysis pipelines to visualise cells, as they are able to capture the non-linear relationships in the data. Cells (as data points) with similar expression patterns are placed closer to each other in the low-dimensional space. Subsequently, by colour-coding each cell using given annotations, e.g., cell type or tissue, it is possible to identify novel cell sub-populations with distinct expression patterns through visual exploration (29). Genes may be visualised in a similar manner. In this case, users are able to discover groups of co-expressed genes with similar expression patterns (30), although thorough gene annotations are necessary to define the biologically connected gene clusters.

With proper data initialisation, for which PCA is one option (31), both t-SNE and UMAP are able to preserve the global as well as the local structure of the data, and with parameter tuning they have similar execution times. Thus, when visualising scRNA-Seq data, it is recommended to apply a different dimensionality reduction approach as a pre-processing step, then to try both methods and determine which plot better depicts the organisation of the cell clusters.

PCA, t-SNE and UMAP are established techniques and integral parts of the pre-processing and visualisation of scRNA-Seq data (29), and are included in major processing pipelines and software, such as Seurat (32) and Cell Ranger (10). Thus, these methods are used in the majority of scientific studies that include scRNA-Seq data analysis. Nevertheless, the increasing diversity and dimensionality of scRNA-Seq data have necessitated the use of more advanced techniques for their efficient analysis.

4. Deep learning-based dimensionality reduction methods

Autoencoders

The advancement of neural networks using multiple hidden layers, coupled with increased computing power, has led to the evolution of machine learning into deep learning (33). The ability of deep learning-based methods to be trained on and learn the distribution of the input data has proven valuable for the construction of tools that deal with the high dimensionality and sparsity of scRNA-Seq data. One such tool is scvis (34), an autoencoder-based method for the dimensionality reduction and subsequent visualisation of scRNA-Seq data. Autoencoders are an archetypal deep learning technique consisting of two neural networks with hidden layers: One encoder network and one decoder network. Autoencoders are trained to learn compressed representations of input data (35). At first glance, scvis is similar in functionality to t-SNE and UMAP, as it is mainly used for the visualisation of cells and the detection of new cell subtypes. However, scvis can detect both linear and non-linear associations in the data and has been shown to achieve similar or better grouping of data points, while also scaling better to larger datasets (34). Nevertheless, data initialisation is equally necessary in the case of scvis, in order to preserve both the global and local structure of the original data.
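The encoder-decoder idea can be sketched with a deliberately small autoencoder written in NumPy. This is a generic illustration under stated assumptions and not the architecture of scvis or of any other published tool: a single tanh encoder layer, a linear decoder, a mean squared error reconstruction loss and plain gradient descent. The function name `train_autoencoder`, the hyperparameters and the simulated data are hypothetical.

```python
import numpy as np


def train_autoencoder(X, n_latent=2, lr=0.02, epochs=2000, seed=0):
    """Train a tiny autoencoder (tanh encoder, linear decoder) on X.

    Returns an `encode` function mapping cells to the latent space and the
    reconstruction loss recorded at every epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = 0.1 * rng.normal(size=(d, n_latent))
    b_enc = np.zeros(n_latent)
    W_dec = 0.1 * rng.normal(size=(n_latent, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        Z = np.tanh(X @ W_enc + b_enc)            # encoder: compress
        X_hat = Z @ W_dec + b_dec                 # decoder: reconstruct
        err = X_hat - X
        losses.append(float(np.mean(err**2)))     # mean squared error
        # Backpropagation of the mean squared error.
        g_out = 2.0 * err / err.size
        g_W_dec = Z.T @ g_out
        g_b_dec = g_out.sum(axis=0)
        g_pre = (g_out @ W_dec.T) * (1.0 - Z**2)  # derivative of tanh
        g_W_enc = X.T @ g_pre
        g_b_enc = g_pre.sum(axis=0)
        W_enc -= lr * g_W_enc
        b_enc -= lr * g_b_enc
        W_dec -= lr * g_W_dec
        b_dec -= lr * g_b_dec

    def encode(A):
        return np.tanh(A @ W_enc + b_enc)

    return encode, losses


rng = np.random.default_rng(2)
# Hypothetical data: 200 cells, 10 genes driven by 2 hidden factors.
factors = rng.normal(size=(200, 2))
X = factors @ rng.normal(size=(2, 10)) + 0.1 * rng.normal(size=(200, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise each gene

encode, losses = train_autoencoder(X)
Z = encode(X)                                     # 2-dimensional embedding
```

The two columns of `Z` play the role of the compressed representation; practical tools use deeper networks, count-appropriate loss functions and mini-batch optimisers, typically through a deep learning framework.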

Another autoencoder-based technique is deep count autoencoder (DCA) (36). As opposed to scvis, DCA is used for the denoising of scRNA-Seq data, which refers to the efficient imputation of data, while also aiming to improve the expression estimation of all gene counts (37). DCA exhibits better performance compared to commonly used imputation techniques, such as SAVER (38) and scImpute (39), showcasing the application of autoencoders for performing simultaneous dimensionality reduction and imputation. The rapid advancement of deep learning has enabled further improvements in neural networks, in the form of variational autoencoders (VAEs) and generative adversarial networks (GANs), which have skyrocketed in popularity.

VAEs

VAEs (40) represent a paradigm shift in the field of deep learning, particularly in their application to complex, high-dimensional datasets. At their core, VAEs are an advancement of traditional autoencoders (35), although VAEs diverge significantly by incorporating a probabilistic framework. This framework involves the encoder network mapping input data not to a deterministic point, but to a probability distribution within a latent space. Consequently, the decoder network reconstructs the input data by sampling from this latent distribution. This probabilistic approach is underpinned by the principles of variational inference, enabling the approximation of complex data distributions. The incorporation of stochasticity in the encoding process allows VAEs to generate new data samples by sampling from the learned latent space distribution.
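The probabilistic encoding step can be sketched in isolation. The code below assumes the commonly used setup of a diagonal Gaussian latent distribution with a standard normal prior: the encoder is taken to output a mean and a log-variance per latent dimension, a latent point is drawn through the reparameterisation step, and the Kullback-Leibler term that regularises the latent space is computed in closed form. The function names and numerical values are hypothetical, and the encoder and decoder networks themselves are omitted.

```python
import numpy as np


def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one draw per row."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I), per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)


rng = np.random.default_rng(0)
# Hypothetical encoder output for one cell in a 2-dimensional latent space.
mu = np.array([[1.0, -2.0]])
log_var = np.log(np.array([[0.25, 4.0]]))

# Sampling the same latent distribution repeatedly gives different points.
draws = reparameterise(np.repeat(mu, 20000, axis=0),
                       np.repeat(log_var, 20000, axis=0), rng)
kl = kl_to_standard_normal(mu, log_var)
```

During training, this divergence is added to the reconstruction error of the decoder, so that the latent distributions of different cells stay close enough to the prior for new samples drawn from it to decode into plausible profiles.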

VAEs have proven to be an effective tool for reducing the dimensionality of scRNA-Seq data, while retaining the biological properties of the original dataset (41). VAEs not only compress gene expression data, which can comprise >100,000 cells, into a more manageable latent space, but they also capture the biological variance across cells, while mitigating the impact of the inherent noise and sparsity of scRNA-Seq data (42). Sampling the probabilistic latent space of a VAE yields a different dataset each time, yet properly trained models tend to produce results that exhibit minimal variance among them (40). Furthermore, producing the low-dimensional latent space through non-linear transformations, learned by training on non-linear mappings of the high-dimensional data, could improve data clustering (43). Thus, the low-dimensional gene expression data generated by trained VAEs can facilitate downstream analyses (44), such as cell clustering, gene co-expression or regulatory network inference, or protein-protein association network construction. Applications of VAEs that have been developed include DiffVAE (45) for modelling cell differentiation, BEENE (46) for improved batch correction, β-TCVAE (47), which was used for data integration in single-cell GTEx (48), and FAVA (49) for the inference of high-quality protein-protein association networks. The newest version of STRING used FAVA for the computation of co-expression scores, as the results of this method outperformed the previous ones, being able to capture both linear and non-linear associations in the scRNA-Seq data (50).

GANs

GANs (51) are a class of deep learning algorithms that have garnered significant attention for their ability to generate high-quality, synthetic data samples. A GAN consists of two neural networks, the generator and the discriminator, engaged in a continuous adversarial process. The generator attempts to produce data samples indistinguishable from real data, while the discriminator strives to differentiate between the generator's synthetic data and actual data. This adversarial training encourages the generator to produce increasingly realistic samples, adjusting its parameters to produce data that better model the complex distribution of the input data.
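The opposing objectives of the two networks can be written down compactly. The sketch below assumes the binary cross-entropy formulation with the widely used non-saturating generator loss, and takes as input the probabilities that a discriminator assigns to real and generated samples; the networks themselves are omitted, and the function names and probability values are hypothetical.

```python
import numpy as np


def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Cross-entropy loss of the discriminator.

    d_real and d_fake are the probabilities the discriminator assigns to
    real and generated samples, respectively, of being real.
    """
    return float(-np.mean(np.log(d_real + eps))
                 - np.mean(np.log(1.0 - d_fake + eps)))


def generator_loss(d_fake, eps=1e-12):
    """Non-saturating generator loss: low when fakes are judged real."""
    return float(-np.mean(np.log(d_fake + eps)))


# Hypothetical discriminator outputs at two stages of training.
early_real, early_fake = np.array([0.9, 0.8]), np.array([0.1, 0.2])
late_real, late_fake = np.array([0.5, 0.5]), np.array([0.5, 0.5])

d_early = discriminator_loss(early_real, early_fake)
d_late = discriminator_loss(late_real, late_fake)
g_early = generator_loss(early_fake)
g_late = generator_loss(late_fake)
```

In this toy comparison, the discriminator loss rises and the generator loss falls between the two stages; a discriminator that can do no better than outputting 0.5 for every sample indicates that generated and real data have become indistinguishable to it.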

In the context of scRNA-Seq data analysis, GANs are particularly valuable as they can learn to capture and reproduce the intricate structures and patterns inherent in such data. Instead of performing dimensionality reduction directly, i.e., by performing feature extraction on the genes or samples, GANs generate new datasets with a desired number of dimensions, thus indirectly reducing the dimensionality of the original dataset. GANs can be employed to learn the complex distribution of scRNA-Seq data and, once trained, may generate synthetic, yet biologically plausible, single-cell gene expression profiles (52). These ‘fabricated’ datasets can be used to augment the original dataset as input to other algorithms in cases where data scarcity prevents the easy procurement of training datasets. They can also be utilised in place of a high-dimensional dataset, while providing a similar amount of biological information, or to imitate data derived from specific biological conditions (53). GANs outperform the usual methods for synthetic scRNA-Seq dataset generation when their output is used to construct gene regulatory networks, as GANs generate realistic datasets more efficiently, thus allowing downstream network creation algorithms that perform well on synthetic datasets to generalise well to real data (54). Applications of GANs to scRNA-Seq data include cscGAN (55) and LSH-GAN (56), which are used for dataset generation. Certain methods, such as AGImpute (57), combine both autoencoders and GANs, in this case to perform cell-type-aware imputation of scRNA-Seq data.

5. Comparison between dimensionality reduction methods

Even though a variety of options for dimensionality reduction of scRNA-Seq data were described, each one has specific use-cases, as well as certain advantages and disadvantages (Table I).

Dimensionality reduction techniques such as PCA, t-SNE and UMAP have been established in the scientific community as integral scRNA-Seq analysis steps, thanks to their fast execution times, which stem from their comparatively low need for computational resources, particularly in the case of PCA. However, in recent years, their application has been limited to data pre-processing (PCA) or the visualisation of cells (t-SNE and UMAP). Furthermore, t-SNE and UMAP have been shown to require substantial computational resources with larger input data, while also requiring data initialisation and proper parameter tuning to produce similar plots (31,58).

In comparison, deep learning-based dimensionality reduction techniques have recently been at the centre of attention, owing mostly to their ability to be trained on the input dataset, and have been made more accessible through the development of deep learning packages such as Keras (59) and TensorFlow (60). Deep learning techniques are valued for their ability to capture both linear and non-linear relationships in the input data, in contrast to PCA, which is a linear method, and t-SNE and UMAP, which are non-linear methods. However, deep learning methods require far more computational resources than their statistical or machine learning counterparts, often relying on multiple graphics processing units for optimal execution (61), which renders them less friendly to the average user. Furthermore, advanced knowledge of deep learning is necessary for the construction of optimised VAEs and GANs, including the selection of the best training and validation sets. If these networks are not trained properly, e.g., with a small validation dataset or unbalanced input data, they may overfit and thus produce biased data (62). Thus, ample research and evaluation by the scientific community are still necessary to integrate these techniques into popular data analysis pipelines.

6. Conclusions and future perspectives

The advent of scRNA-Seq has enabled the study of gene expression with unprecedented resolution and cell specificity. However, the high sparsity and high dimensionality of scRNA-Seq data require the use of strategies to bring them to a comprehensible state and extract meaningful biological information. Dimensionality reduction techniques, such as PCA, t-SNE and UMAP, help in the visualisation of such data and are indispensable tools in their visual examination. More advanced techniques, such as VAEs and GANs, bring the data to lower dimensions, while retaining the original biological information. This facilitates the use of the data in downstream analyses, e.g., the identification of co-expressed genes or cell subtypes, and enables the creation of synthetic scRNA-Seq datasets for the training or evaluation of more complex algorithms. The overall volume of scRNA-Seq datasets, in conjunction with readily available software packages which implement such methods, has allowed for a massive influx of research articles based on scRNA-Seq analyses in recent years. The future advancement of deep learning is expected to further improve the speed and fidelity of analyses based on dimensionality reduction.

Acknowledgements

The authors are indebted to Professor Nikolaos Drakoulis (Department of Pharmacy, School of Health Sciences, National and Kapodistrian University of Athens, Athens, Greece) for inviting them to present this work at the 4th International Congress on Pharmacogenomics and Personalized Diagnosis and Therapy.

Availability of data and materials

Not applicable.

Authors' contributions

VLZ performed literature review, wrote the original draft of the manuscript, and wrote, reviewed and edited the manuscript. IT, DAS and VAI wrote, reviewed and edited the manuscript. IM conceptualized and supervised the study, was involved in the writing of the original draft of the manuscript, and also wrote, reviewed and edited the final manuscript. All authors have read and approved the final version of the manuscript. Data authentication is not applicable.

Ethics approval and consent to participate

Not applicable.

Patient consent for publication

Not applicable.

Competing interests

DAS is the Managing Editor of the journal, but had no personal involvement in the reviewing process, or any influence in terms of adjudicating on the final decision, for this article.

References 1

Wang

Gerstein

Snyder

RNA-Seq: A revolutionary tool for transcriptomics

Nat Rev Genet1057632009

19015660

10.1038/nrg2484

Schena

Shalon

Davis

Brown

Quantitative monitoring of gene expression patterns with a complementary DNA microarray

Science2704674701995

7569999

10.1126/science.270.5235.467

Tang

Barbacioru

Wang

Nordman

Lee

Wang

Bodeau

Tuch

Siddiqui

mRNA-Seq whole-transcriptome analysis of a single cell

Nat Methods63773822009

19349980

10.1038/nmeth.1315

Haque

Engel

Teichmann

Lonnberg

A practical guide to single-cell RNA-sequencing for biomedical research and clinical applications

Genome Med9752017

28821273

10.1186/s13073-017-0467-4

Cock

Fields

Goto

Heuer

Rice

The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants

Nucleic Acids Res38176717712010

20015970

10.1093/nar/gkp1137

Kivioja

Vaharautio

Karlsson

Bonke

Enge

Linnarsson

Taipale

Counting absolute numbers of molecules using unique molecular identifiers

Nat Methods972742011

22101854

10.1038/nmeth.1778

Satija

Farrell

Gennert

Schier

Regev

Spatial reconstruction of single-cell gene expression data

Nat Biotechnol334955022015

25867923

10.1038/nbt.3192

Zogopoulos

Saxami

Malatras

Papadopoulos

Tsotra

Iconomidou

Michalopoulos

Approaches in gene coexpression analysis in eukaryotes

Biology (Basel)1110192022

36101400

10.3390/biology11071019

Ilicic

Kim

Kolodziejczyk

Bagger

McCarthy

Marioni

Teichmann

Classification of low quality cells from single-cell RNA-seq data

Genome Biol17292016

26887813

10.1186/s13059-016-0888-1

Zheng

Terry

Belgrader

Ryvkin

Bent

Wilson

Ziraldo

Wheeler

McDermott

Zhu

Massively parallel digital transcriptional profiling of single cells

Nat Commun8140492017

28091601

10.1038/ncomms14049

Zhang

Tools for the analysis of high-dimensional single-cell RNA sequencing data

Nat Rev Nephrol164084212020

32221477

10.1038/s41581-020-0262-0

Qiu

Embracing the dropouts in single-cell RNA-seq analysis

Nat Commun1111692020

32127540

10.1038/s41467-020-14976-9

Imoto

Nakamura

Escolar

Yoshiwaki

Kojima

Yabuta

Katou

Yamamoto

Hiraoka

Saitou

Resolution of the curse of dimensionality in single-cell RNA sequencing data analysis

Life Sci Alliance5e2022015912022

35944930

10.26508/lsa.202201591

Van der Maaten

Hinton

Visualizing data using t-SNE

J Mach Learn Res92008

Nanga

Bawah

Acquaye

Billa

Baeta

Odai

Obeng

Nsiah

Review of dimension reduction methods

J Data Anal Inform Process091892312021

Sarker

Machine learning: Algorithms, Real-world applications and research directions

SN Comput Sci21602021

33778771

10.1007/s42979-021-00592-x

Alpaydin

Introduction to Machine Learning. MIT Press, Cambridge, Massachusetts, London, England, 2020.

Okada

Chung

Hojo

Practical compass of Single-cell RNA-Seq Analysis

Curr Osteoporos Rep224334402024

38019344

10.1007/s11914-023-00840-4

Arora

Opasawatchai

Poonpanichakul

Jiravejchakul

Sungnak

Thailand

Matangkasombut

Teichmann

Matangkasombut

Charoensawan

Single-cell temporal analysis of natural dengue infection reveals skin-homing lymphocyte expansion one day before defervescence

iScience251040342022

35345453

10.1016/j.isci.2022.104034

Linderman

Dimensionality reduction of Single-cell RNA-Seq data

Methods Mol Biol22843313422021

33835451

10.1007/978-1-0716-1307-8_18

Pearson

LIII. On lines and planes of closest fit to systems of points in space

Lond Edinb Dubl Phil Mag25595721901

Jolliffe

Principal Component Analysis. Springer, New York, NY, 2002.

Jolliffe

Cadima

Principal component analysis: A review and recent developments

Philos Trans A Math Phys Eng Sci374201502022016

26953178

10.1098/rsta.2015.0202

Thorndike

Who belongs in the family?

Psychometrika182672761953

Tsuyuzaki

Sato

Nikaido

Benchmarking principal component analysis for large-scale single-cell RNA-sequencing

Genome Biol2192020

31955711

10.1186/s13059-019-1900-3

Dai

Principal component analysis based methods in bioinformatics studies

Brief Bioinform127147222011

21242203

10.1093/bib/bbq090

Hinton

Roweis

Stochastic Neighbor Embedding. In: Advances in Neural Information Processing Systems. Becker S, Thrun S and Obermayer K (eds.) MIT Press, Cambridge, MA, pp857-864, 2003.

McInnes

Healy

Melville

Umap: Uniform manifold approximation and projection for dimension reduction arXiv: 1802.03426, 2018.

Slovin

Carissimo

Panariello

Grimaldi

Bouche

Gambardella

Cacchiarelli

Single-cell RNA sequencing analysis: A Step-by-Step overview

Methods Mol Biol22843433652021

33835452

10.1007/978-1-0716-1307-8_19

Lachmann

Torre

Keenan

Jagodnik

Lee

Wang

Silverstein

Ma'ayan

Massive mining of publicly available RNA-seq data from human and mouse

Nat Commun913662018

29636450

10.1038/s41467-018-03751-6

Kobak

Linderman

Initialization is critical for preserving global data structure in both t-SNE and UMAP

Nat Biotechnol391561572021

33526945

10.1038/s41587-020-00809-z

Hao

Stuart

Kowalski

Choudhary

Hoffman

Hartman

Srivastava

Molla

Madad

Fernandez-Granda

Satija

Dictionary learning for integrative, multimodal and scalable single-cell analysis

Nat Biotechnol422933042024

37231261

10.1038/s41587-023-01767-y

Goodfellow

Bengio

Courville

Deep Learning. An MIT Press book. https://www.deeplearningbook.org/.

Ding

Condon

Shah

Interpretable dimensionality reduction of single cell transcriptome data with deep generative models

Nat Commun920022018

29784946

10.1038/s41467-018-04368-5

Kramer

Nonlinear principal component analysis using autoassociative neural networks

AIChE J372332431991

Eraslan

Simon

Mircea

Mueller

Theis

Single-cell RNA-seq denoising using a deep count autoencoder

Nat Commun 10: 390, 2019

30674886

10.1038/s41467-018-07931-2

Agarwal

Wang

Zhang

Data denoising and Post-denoising corrections in single cell RNA sequencing

Statistical Science 35: 112-128, 2020

Huang

Wang

Torre

Dueck

Shaffer

Bonasio

Murray

Raj

Zhang

SAVER: Gene expression recovery for single-cell RNA sequencing

Nat Methods 15: 539-542, 2018

29941873

10.1038/s41592-018-0033-z

An accurate and robust imputation method scImpute for single-cell RNA-seq data

Nat Commun 9: 997, 2018

29520097

10.1038/s41467-018-03405-7

Kingma

Welling

Auto-encoding variational bayes. arXiv, 2013.

Gronbech

Vording

Timshel

Sonderby

Pers

Winther

scVAE: Variational auto-encoders for single-cell gene expression data

Bioinformatics 36: 4415-4422, 2020

32415966

10.1093/bioinformatics/btaa293

Pan

Long

Pan

ScInfoVAE: Interpretable dimensional reduction of single cell transcription data with variational autoencoders and extended mutual information regularization

BioData Min 16: 17, 2023

37301826

10.1186/s13040-023-00333-1

Hinton

Salakhutdinov

Reducing the dimensionality of data with neural networks

Science 313: 504-507, 2006

16873662

10.1126/science.1127647

Erfanian

Heydari

Feriz

Ianez

Derakhshani

Ghasemigol

Farahpour

Razavi

Nasseri

Safarpour

Sahebkar

Deep learning applications in single-cell genomics and transcriptomics data analysis

Biomed Pharmacother 165: 115077, 2023

37393865

10.1016/j.biopha.2023.115077

Bica

Andres-Terre

Cvejic

Lio

Unsupervised generative and graph representation learning for modelling cell differentiation

Sci Rep 10: 9790, 2020

32555334

10.1038/s41598-020-66166-8

Rahman

Tutul

Sharmin

Bayzid

BEENE: Deep learning-based nonlinear embedding improves batch effect estimation

Bioinformatics 39: btad479, 2023

37561107

10.1093/bioinformatics/btad479

Chen

Grosse

Duvenaud

Isolating sources of disentanglement in VAEs. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems Curran Associates Inc., Montréal Canada, pp2615-2625, 2018.

Eraslan

Drokhlyansky

Anand

Fiskin

Subramanian

Slyper

Wang

Van Wittenberghe

Rouhana

Waldman

Single-nucleus cross-tissue molecular reference maps toward understanding disease gene function

Science 376: eabl4290, 2022

35549429

10.1126/science.abl4290

Koutrouli

Nastou

Piera Lindez

Bouwmeester

Rasmussen

Martens

Jensen

FAVA: High-quality functional association networks inferred from scRNA-seq and proteomics data

Bioinformatics 40: btae010, 2024

38192003

10.1093/bioinformatics/btae010

Szklarczyk

Kirsch

Koutrouli

Nastou

Mehryary

Hachilif

Gable

Fang

Doncheva

Pyysalo

The STRING database in 2023: Protein-protein association networks and functional enrichment analyses for any sequenced genome of interest

Nucleic Acids Res 51: D638-D646, 2023

36370105

10.1093/nar/gkac1000

Goodfellow

Pouget-Abadie

Mirza

Warde-Farley

Ozair

Courville

Bengio

Generative adversarial nets. In: Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2 MIT Press, Montreal, Canada, pp2672-2680, 2014.

Lan

You

Zhang

Fan

Zhao

Zeng

Chen

Zhou

Generative Adversarial Networks and Its Applications in Biomedical Informatics

Front Public Health 8: 164, 2020

32478029

10.3389/fpubh.2020.00164

Lacan

Sebag

Hanczar

GAN-based data augmentation for transcriptomics: Survey and comparative assessment

Bioinformatics 39: i111-i120, 2023

37387181

10.1093/bioinformatics/btad239

Vinas

Andres-Terre

Lio

Bryson

Adversarial generation of gene expression data

Bioinformatics 38: 730-737, 2022

33471074

10.1093/bioinformatics/btab035

Marouf

Machart

Bansal

Kilian

Magruder

Krebs

Bonn

Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks

Nat Commun 11: 166, 2020

31919373

10.1038/s41467-019-14018-z

Lall

Ray

Bandyopadhyay

LSH-GAN enables in-silico generation of cells for small sample high dimensional scRNA-seq data

Commun Biol 5: 577, 2022

35688990

10.1038/s42003-022-03473-y

Zhu

Meng

Wang

Peng

AGImpute: Imputation of scRNA-seq data based on a hybrid GAN with dropouts identification

Bioinformatics 40: btae068, 2024

38317025

10.1093/bioinformatics/btae068

Chari

Pachter

The specious art of single-cell genomics

PLoS Comput Biol 19: e1011288, 2023

37590228

10.1371/journal.pcbi.1011288

Chollet

Keras. https://github.com/fchollet/keras; https://keras.io.

Abadi

Agarwal

Barham

Brevdo

Chen

Citro

Corrado

Davis

Dean

Devin

TensorFlow: Large-scale machine learning on heterogeneous distributed systems. Distributed Parallel Cluster Computing: 16 Mar, 2016.

Mittal

Vaishay

A survey of techniques for optimizing deep learning on GPUs

J Systems Architecture 99: 101635, 2019

Kim

Park

Limited discriminator GAN using explainable AI model for overfitting problem

ICT Express 9: 241-246, 2023

Figure 1

An example of feature-extraction dimensionality reduction of scRNA-Seq gene count data. (A) The original input scRNA-Seq count data. In this case, the full non-normalised scRNA-Seq dataset obtained from the GTEx database is depicted, containing 17,625 genes and 209,216 cells (48). The dataset is characterised not only by high dimensionality, but also by numerous zero gene counts. (B) The same GTEx scRNA-Seq data after dimensionality reduction at the level of cells. The cells are replaced by a far smaller number of latent samples (300), which retain the variance of the original sample set, while the number of genes remains the same. The gene counts are replaced by expression values that retain the biologically relevant information of the original data and fill in the zero values. scRNA-Seq, single-cell RNA sequencing; GTEx database, Genotype-Tissue Expression database.
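The transformation depicted in panel B can be illustrated with a minimal sketch. It assumes a small synthetic count matrix rather than the GTEx data, and it uses a truncated singular value decomposition (i.e., PCA across the cell axis) as a stand-in, since the legend does not state which method produced the 300 latent samples. The function name `reduce_cells` and all matrix sizes are hypothetical.

```python
import numpy as np


def reduce_cells(counts, n_latent):
    """Compress the cell axis of a genes x cells matrix into n_latent columns."""
    # Log-transform to dampen the influence of a few very large counts.
    logged = np.log1p(counts.astype(float))
    # Centre each cell (column) so that the SVD captures variance, not the mean.
    centred = logged - logged.mean(axis=0, keepdims=True)
    # Thin SVD: centred = U @ diag(S) @ Vt, singular values in descending order.
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    # Gene scores on the leading components: one row per gene,
    # one column per latent sample.
    return u[:, :n_latent] * s[:n_latent]


# Toy stand-in for a sparse count matrix (hypothetical sizes, not the GTEx data).
rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(200, 1000))  # 200 genes x 1,000 cells, mostly zeros
latent = reduce_cells(counts, n_latent=10)   # 200 genes x 10 latent samples

print(counts.shape, "->", latent.shape)
print("zero fraction before:", (counts == 0).mean(),
      "after:", (latent == 0).mean())
```

As in the figure, the number of genes (rows) is unchanged, whereas the cell axis shrinks to a handful of latent columns whose values are no longer dominated by zeros. A deep learning model, such as a VAE, would replace the linear SVD step with a learned non-linear encoder.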

Figure 2

Flowchart of a simplified scRNA-Seq pre-processing workflow and the subsequent dimensionality reduction analyses. Starting from raw sequencing data, pre-processing steps (quality control, alignment and gene counting) generate a high-dimensional gene expression matrix. Dimensionality reduction methods, such as PCA, UMAP, t-SNE and advanced deep learning approaches (e.g., VAEs and GANs), address data sparsity and complexity, facilitating visualisation and downstream analyses. These techniques enable the extraction and preservation of critical biological information, forming the basis for deeper biological inferences. scRNA-Seq, single-cell RNA sequencing; PCA, principal component analysis; UMAP, uniform manifold approximation and projection; t-SNE, t-distributed stochastic neighbour embedding; VAEs, variational autoencoders; GANs, generative adversarial networks.

Table I

Comparison of dimensionality reduction techniques for scRNA-Seq data.

Technique	Description	Rationale	Advantages	Disadvantages
PCA	Linear transformation creating new variables (principal components) to retain most variance in the data	Reduces the dimensions of scRNA-Seq data while retaining meaningful variance	• Retains most variability • Simple and widely used • Fast execution	• Limited to linear associations • Sensitive to noise in data
t-SNE	Non-linear method using a Student-t distribution to visualise data in 2D/3D by capturing relationships among data points	Maps scRNA-Seq data into a comprehensible visual format	• Captures non-linear relationships • Effective for visualising clusters	• May be computationally expensive with large datasets • Can fail to preserve global structure without informative (e.g., PCA-based) initialisation
UMAP	Non-linear method that constructs a graph of data points and optimises a low-dimensional representation	Alternative to t-SNE, focusing on speed and better global structure representation	• Faster than t-SNE • Better global structure retention • Flexible parameter tuning	• Requires careful tuning • Interpretation may vary with parameters
scvis	Deep learning model using autoencoders for data visualisation	Deep-learning alternative to t-SNE and UMAP	• Handles both linear and non-linear relationships • Scales well to large datasets	• Requires substantial computational resources • Performance depends on architecture and training
DCA	Application of autoencoders which focuses on denoising scRNA-Seq data	Reduces noise and imputes scRNA-Seq data	• Improves data quality by denoising • Better imputation performance than traditional methods	• Relies heavily on initial parameter selection • Computationally intensive for very large datasets
VAEs	Probabilistic version of autoencoders that maps data to distributions in a latent space and reconstructs data by sampling from these distributions	Generates a low-dimensional dataset that retains the biological information of input scRNA-Seq	• Captures both linear and non-linear patterns • Effective for downstream analyses	• Requires expertise in probabilistic modelling • Models can be complex to train effectively
GANs	Two neural networks (generator and discriminator) adversarially trained to create realistic synthetic data	Generates biologically plausible data by learning input scRNA-Seq data distributions	• Generates realistic synthetic datasets • Useful for data augmentation • Handles complex distributions effectively	• Training is challenging and requires significant computational resources • High risk of generating artefacts or overfitting the discriminator

scRNA-Seq, single-cell RNA sequencing; PCA, principal component analysis; DCA, deep count autoencoder; UMAP, uniform manifold approximation and projection; t-SNE, t-distributed stochastic neighbour embedding; VAEs, variational autoencoders; GANs, generative adversarial networks.
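To illustrate the Student-t distribution mentioned in the t-SNE row of Table I, the following minimal sketch computes the pairwise similarities that t-SNE assigns to points in the low-dimensional map, using a Student-t kernel with one degree of freedom, normalised over all pairs. It covers only this one component of the method and not the full optimisation. The function name `student_t_similarities` and the three example points are hypothetical.

```python
import numpy as np


def student_t_similarities(embedding):
    """Pairwise low-dimensional similarities q_ij used by t-SNE."""
    # Squared Euclidean distances between every pair of embedded points.
    diff = embedding[:, None, :] - embedding[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    # Student-t kernel with one degree of freedom: heavy tails allow
    # dissimilar points to be placed far apart in the map.
    kernel = 1.0 / (1.0 + sq_dist)
    # A point is not considered its own neighbour.
    np.fill_diagonal(kernel, 0.0)
    # Normalise over all pairs so that the similarities sum to 1.
    return kernel / kernel.sum()


# Three hypothetical cells embedded in 2D: two close together, one distant.
points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
q = student_t_similarities(points)
print(q)
```

The two neighbouring points receive the largest similarity, while the distant point retains a small but non-zero similarity to both. During training, t-SNE moves the embedded points so that these similarities match those computed from the high-dimensional data.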