UniQuant π§¬ο
π§° Name: Universal Quantile-Based Transcriptome Integration (R package)
π Note: UniQuant is a dynamic quantile normalization method for cross-platform heterogeneous data that effectively integrates RNA expression data from different platforms, batches, and formats. UniQuant eliminates batch effects and background differences between datasets, thereby generating more accurate and reliable integrated data, significantly improving the accuracy, reliability, and versatility of data analysis. The datasets integrated based on UniQuant can be used for building disease diagnosis models, molecular classification, survival analysis, differential gene expression analysis, and more.
βοΈ Author: Huanhou Su
Quick run πο
##### 1. Quantile score
## Prepare raw data
# ls_datasset: Name of the dataset.
# df_raw: Raw expression data, with genes as rows and samples as columns.
# dataset_type: 'Training' or 'Validation'.
# df_Disease: The first column is the sample name, the second column is the disease status ('No' represents non-diseased samples, 'Yes' represents diseased samples). If survival information is available, the third column is the survival outcome (0 represents survival, 1 represents death), and the fourth column is the follow-up time (in months). Non-diseased samples do not need to have survival information filled in.
Quantile_score(ls_datasset, df_raw, dataset_type, df_Disease)
##### 2. Select the genes of interest
# Identify the top 1000 hypervariable genes from independent datasets and select those expressed across datasets.
rst_Hype = UniQuant_Hypervariable_gene(n_hypervariable_gene = 1000)
interest_gene = rst_Hype$Gene
##### 3. Training the diagnostic model
# Use the interest genes to identify disease-related genes.
rst_dg = UniQuant_Disease_gene(ls_gene = interest_gene)
Disease_gene = rst_dg$Disease_gene
# Train the diagnostic model using the identified disease-related genes.
rst_Training = UniQuant_Model_Training(ls_gene = Disease_gene)
##### 4. Validation of the diagnostic model
rst_Validation = UniQuant_Model_Validation()
##### 5. Molecular classification
# Perform molecular classification based on the selected genes of interest and classify into 3 groups.
rst_Class = UniQuant_Class(ls_gene = interest_gene, n_Class = 3)
##### 6. Subsequent analyses
# Perform additional analyses such as survival analysis, differential gene expression analysis, and enrichment analysis.
# These steps may follow depending on your dataset and goals.
ο
Comprehensive Guide πο
0οΈβ£ Settingο
library(UniQuant);library(ggplot2); library(edgeR); library(pROC); library(stringr)
dir_main = getwd()
dir_in = file.path(dir_main, '0.raw')
dir_dataset = file.path(dir_main, 'UniQuant', 'dataset'); if(!dir.exists(dir_dataset)) dir.create(dir_result, recursive = TRUE)
dir_result = file.path(dir_main, 'UniQuant', 'result'); if(!dir.exists(dir_result)) dir.create(dir_result, recursive = TRUE)
setwd(dir_main); set.seed(100)
ο
1οΈβ£ Quantile scoreο
##### Converts raw expression data of a dataset into quantile-based scores and organizes the disease status of the samples.
#### Training set
ls_datassets = c('GSE6764', 'GSE14520', 'GSE17856', 'GSE57957', 'GSE36376','GSE102079', 'GSE54236', 'GSE36411', 'GSE22058')
for (ls_datasset in ls_datassets){
df_raw = readRDS(file.path(dir_in, paste0(ls_datasset, '_Expression.rds')))
df_Disease = readRDS(file.path(dir_in, paste0(ls_datasset, '_Disease.rds')))
Quantile_score(ls_datasset, df_raw, "Training", df_Disease)
}
#### Validation set
ls_datassets = c('PMID31585088', 'PMID35382356', 'GSE77314', 'GSE76427','GSE25097', 'GSE63898', 'GSE39791', 'GSE144269','GSE114564', 'GSE14811')
for (ls_datasset in ls_datassets){
df_raw = readRDS(file.path(dir_in, paste0(ls_datasset, '_Expression.rds')))
df_Disease = readRDS(file.path(dir_in, paste0(ls_datasset, '_Disease.rds')))
Quantile_score(ls_datasset, df_raw, "Validation", df_Disease)
}
β You can also transform individual datasets.
ls_datasset = 'GSE36376'
df_raw = readRDS(file.path(dir_in, paste0(ls_datasset, '_Expression.rds')))
df_Disease = readRDS(file.path(dir_in, paste0(ls_datasset, '_Disease.rds')))
Quantile_score(ls_datasset, df_raw, "Validation", df_Disease)
β The data format of the raw gene expression data is as follows.
df_raw[1:4, 1:4]
| GSM890128 | GSM890129 | GSM890130 | GSM890131 | |
|---|---|---|---|---|
| EEF1A1 | 14.054179 | 14.395189 | 14.349301 | 14.183636 |
| LOC643334 | 6.543733 | 6.399822 | 6.391479 | 6.190396 |
| SLC35E2 | 6.065169 | 6.172510 | 5.957617 | 5.980207 |
| LOC642820 | 6.862825 | 6.699461 | 6.726557 | 6.664170 |
β The data format of the transformed gene expression data is as follows.
| GSE36376@GSM890128 | GSE36376@GSM890129 | GSE36376@GSM890130 | GSE36376@GSM890131 | |
|---|---|---|---|---|
| EEF1A1 | 4 | 9 | 8 | 6 |
| LOC643334 | 10 | 8 | 8 | 4 |
| SLC35E2 | 4 | 7 | 2 | 2 |
| LOC642820 | 9 | 5 | 6 | 4 |
β The data format for the disease status of the samples is as follows.
print(rbind(head(df_Disease, 4), tail(df_Disease, 4)))
| Sample | Disease |
|---|---|
| GSM890128 | No |
| GSM890129 | No |
| GSM890130 | No |
| GSM890131 | No |
| GSM890557 | Yes |
| GSM890558 | Yes |
| GSM890559 | Yes |
| GSM890560 | Yes |
β
If your samples contain survival information, you can add columns for survival status (Outcome: 0 represents survival, 1 represents death) and survival time (Time: months).Note that only diseased samples require the addition of survival information.
ls_datasset = 'GSE14520'
df_Disease = readRDS(file.path(dir_in, paste0(ls_datasset, '_Disease.rds')))
head(df_Disease, 4)
| Sample | Disease | Outcome | Time |
|---|---|---|---|
| GSM362958 | Yes | 1 | 28.2 |
| GSM362959 | Yes | 1 | 9.5 |
| GSM362960 | Yes | 0 | 66.1 |
| GSM362961 | No |
ο
2οΈβ£ Select the genes of interestο
# Select the hypervariable genes that are expressed across the majority of the datasets.
rst_Hype = UniQuant_Hypervariable_gene(n_hypervariable_gene = 1000, dataset_threshold = 0.7)
interest_gene = rst_Hype$Gene
interest_gene
[1] "ACSL4" "AFP" "AKR1B10" "ALDH3A1" "APOA4" "APOF" "C7" "C9" "CCL19"
[10] "CCL20" "COL1A1" "CRP" "CTHRC1" "CYP1A2" "CYP3A4" "DCN" "DHRS2" "DKK1"
[19] "DLK1" "EPCAM" "FCN3" "FOS" "FOSB" "GPC3" "GPR88" "HAMP" "HSD11B1"
[28] "IFI27" "LCN2" "LUM" "MME" "MMP7" "MT1M" "MUC13" "MYH4" "NQO1"
[37] "NTS" "PAGE4" "PEG10" "PGC" "PLA2G2A" "REG3A" "RELN" "S100P" "SDS"
[46] "SLC22A1" "SLPI" "SPINK1" "SPP1"
β You can enter other genes of interest, such as immune genes, metabolic genes, cell cycle genes, etc.
ο
3οΈβ£ Training the diagnostic modelο
4οΈβ£ Validation of diagnostic modelο
rst_Validation = UniQuant_Model_Validation()
names(rst_Validation)
[1] "AUC" "Cutoff" "ROC_Model_plot"
[4] "ROC_Gene_plot" "ROC_Model_data" "ROC_Gene_data"
# AUC: The AUC value of the diagnostic model in the Validation set.
# Cutoff: The sensitivity, specificity, and accuracy performance of the diagnostic model when the threshold is set to 0.5.
# ROC_Model_plot: The ROC curve plot of the diagnostic model in the Validation set.
# ROC_Gene_plot: The ROC curve plot of the diagnostic model and its genes in the Validation set.
# ROC_Model_data: The data used to plot the ROC curve of the diagnostic model in the Validation set.
# ROC_Gene_data: The data used to plot the ROC curve of the diagnostic model and its genes in the Validation set.

β
You can extract the data from rst_Validation and plot the receiver operating characteristic (ROC) curve of the diagnostic model for the Validation set.
rst_Validation$ROC_Gene_plot

ο
5οΈβ£ Molecular classificationο
5.1 Classificationο
# Perform molecular classification of disease samples based on the integrated dataset.
rst_Class = UniQuant_Class(ls_gene = interest_gene, n_Class = 3)
df_Class = rst_Class$df_Class
# Count the number of samples for each molecular classification
rst_Class$df_Class_table
Class_1 Class_2 Class_3
570 690 708
β View the molecular classification results of all samples.
head(df_Class)
| Sample | Class |
|---|---|
| GSE102079@GSM2723193 | Class_1 |
| GSE102079@GSM2723195 | Class_2 |
| GSE102079@GSM2723197 | Class_2 |
| GSE102079@GSM2723198 | Class_1 |
| GSE102079@GSM2723199 | Class_3 |
| GSE102079@GSM2723200 | Class_2 |
5.2 Gene contributionο
β Using machine learning methods to calculate the contribution of genes to molecular classification.
# num_Top: Number of top genes to display based on their contribution score.
# n_Train: Number of classification training tasks.
rst_Contribution = UniQuant_Class_Contribution(num_Top = 20, n_Train = 2)
rst_Contribution$Acc
Class Accuracy
Class_1 0.984
Class_2 0.990
Class_3 0.983
# Display the contribution of genes to molecular typing in a table format
head(rst_Contribution$Contribution)
| Gene | Class_1 | Class_2 | Class_3 |
|---|---|---|---|
| ACSL4 | 0.0125754340 | 0.0204913958 | 0.0018271964 |
| AFP | 0.0297757726 | 0.0105659208 | 0.0051961789 |
| AKR1B10 | 0.0003743662 | 0.0006829170 | 0.0010541345 |
| ALDH3A1 | 0.0016005614 | 0.0018106727 | 0.0007021926 |
| APOA4 | 0.0003649859 | 0.0011380439 | 0.0006137823 |
| APOF | 0.0021530311 | 0.0007603114 | 0.0023867149 |
# Display the contribution of genes to molecular typing in a bar chart format
rst_Contribution$Plot_bar

# Display the contribution of genes to molecular typing in a heatmap format
rst_Contribution$Plot_heatmap

5.3 Survival analysisο
β Analyze the impact of molecular classification on patient survival outcomes in the specified dataset.
library(survival); library(survminer)
ls_dataset = 'GSE14520'
df_Disease = readRDS(file.path(dir_dataset, paste0(ls_dataset, '@Disease.rds')))
df_Class = readRDS(file.path(dir_result, 'df_Class.rds'))
df_Disease = df_Disease[!is.na(df_Disease$Time),]
df_Class_tmp = df_Class[df_Class$Sample %in% df_Disease$Sample,]
df_Disease_Class = merge(df_Disease, df_Class_tmp, by = 'Sample')
unique_Class = as.character(sort(unique(df_Disease_Class$Class)))
sfit = survfit(Surv(Time, Outcome)~Class, data = df_Disease_Class)
p = ggsurvplot(sfit, pval = TRUE, data = df_Disease_Class, risk.table = FALSE,
legend.title = "Class",
legend.labs = unique_Class,
legend = "right",
pval.size = 10,
pval.coord = c(0.1,0.1),
palette = c("#F8766D", "#00BA38", "#619CFF")
)
p$plot = p$plot + theme(legend.title = element_text(face = "bold", size = 15)); p

ο
6οΈβ£ Differential gene expression (DEG) analysis.ο
6.1 DEGοΌDiseaseο
# Perform DEG analysis based on the disease status of the samples
rst_DEG_Disease = UniQuant_DEG_Disease(dataset_threshold = 0.7)
df_DEG_Disease = rst_DEG_Disease$DEG
head(df_DEG_Disease)
| Gene | LogFC | logCPM | LR | PValue | FDR |
|---|---|---|---|---|---|
| CAP2 | 1.134334 | 6.434386 | 3076.955 | 0 | 0 |
| ASPM | 1.129646 | 6.434048 | 3052.996 | 0 | 0 |
| TOP2A | 1.124791 | 6.434128 | 3028.964 | 0 | 0 |
| PRC1 | 1.116258 | 6.433803 | 2987.175 | 0 | 0 |
| RACGAP1 | 1.112615 | 6.434054 | 2969.149 | 0 | 0 |
| CENPF | 1.111725 | 6.433586 | 2963.983 | 0 | 0 |
6.2 DEGοΌClassο
# Perform DEG analysis based on the molecular classification of the samples
rst_DEG_Class = UniQuant_DEG_Class(
input_Class = rst_Class$df_Class,
Class_test = 'Class_3',
Class_control = c('Class_1', 'Class_2'),
dataset_threshold = 0.7)
head(df_DEG_Class)
| Gene | LogFC | logCPM | LR | PValue | FDR |
|---|---|---|---|---|---|
| FGFR3 | 0.860094 | 6.350244 | 556.9723 | 3.830429e-123 | 1.962663e-120 |
| PDE9A | 0.857775 | 6.367620 | 503.9286 | 1.327970e-111 | 3.766691e-109 |
| AFP | 0.827938 | 6.476032 | 580.9512 | 2.329059e-128 | 1.541449e-125 |
| PEG3 | 0.780960 | 6.214317 | 330.6309 | 7.004662e-74 | 5.705746e-72 |
| PPP1R9A | 0.750667 | 6.433330 | 395.2754 | 5.880676e-88 | 7.983646e-86 |
| EPCAM | 0.746381 | 6.243183 | 300.1526 | 3.051573e-67 | 1.923460e-65 |
β After obtaining the DEGs, further analyses can be conducted, such as protein-protein interaction network, Gene Set Enrichment Analysis (GSEA), transcription factor analysis, etc.
6οΈβ£ 7οΈβ£ 8οΈβ£ 9οΈβ£
ο
All functions π§ο
Quantile_scoreο
π© Converts raw expression data of a dataset into quantile-based scores and organizes the disease status of the samples.
| Parameter | Type | Description |
|---|---|---|
| dataset_name | Character | Name of the dataset. |
| dataset_expression | Data.frame | Raw expression data, with genes as rows and samples as columns. |
| dataset_type | Character | 'Training' or 'Validation'. |
| dataset_disease | Data.frame | The first column is the sample name, the second column is the disease status ('No' represents non-diseased samples, 'Yes' represents diseased samples). If survival information is available, the third column is the survival status (0 represents survival, 1 represents death), and the fourth column is the follow-up time (in months). Non-diseased samples do not need to have survival information filled in. |
UniQuant_Hypervariable_geneο
π© Select the hypervariable genes that are expressed across the majority of the datasets.
| Parameter | Type | Description |
|---|---|---|
| n_hypervariable_gene | Integer | 1-3000. The number of hypervariable genes selected from each dataset. |
| dataset_threshold | Numeric | 0-1.0. The minimum proportion of datasets in which the selected genes must be expressed. |
UniQuant_Disease_geneο
π© Select disease-related genes with high AUC from the Training set.
| Parameter | Type | Description |
|---|---|---|
| ls_gene | Vector | The vector containing multiple genes. |
| dataset_threshold | Numeric | 0-1.0. The minimum proportion of datasets in which the selected genes must be expressed. |
UniQuant_Model_Trainingο
π© Build a disease diagnostic model using the integrated training dataset.
| Parameter | Type | Description |
|---|---|---|
| ls_gene | Vector | The vector containing multiple genes. |
UniQuant_Model_Validationο
π© Analyze the performance of the disease diagnostic model using the integrated validation dataset.
UniQuant_Model_dataset_AUCο
π© Analyze the diagnostic performance of the disease diagnostic model across all datasets.
UniQuant_Remove_datasetο
π© Remove the specified dataset.
| Parameter | Type | Description |
|---|---|---|
| dataset_name | Character | Name of the dataset. |
UniQuant_Classο
π© Perform molecular classification of disease samples based on the integrated dataset.
| Parameter | Type | Description |
|---|---|---|
| ls_gene | Vector | A vector containing a list of genes to be processed. |
| n_Class | Integer | The number of molecular classifications to be used. |
| method | Character | The method for classification, either 'CCP' or 'NMF'. |
UniQuant_Class_Contributionο
π© Identifies the top contributing genes based on their importance in molecular classification.
| Parameter | Type | Description |
|---|---|---|
| num_Top | Integer | Number of top genes to display based on their contribution score. |
| n_Train | Integer | Number of classification training tasks. |
| reTrain | Logical | If TRUE, the tasks will be retrained. |
UniQuant_DEG_Diseaseο
π© Perform differential gene expression analysis between disease and non-disease samples.
| Parameter | Type | Description |
|---|---|---|
| dataset_threshold | Numeric | 0-1.0. The minimum proportion of datasets in which the selected genes must be expressed. |
UniQuant_DEG_Classο
π© Perform differential gene expression analysis between different molecular classification.
| Parameter | Type | Description |
|---|---|---|
| input_Class | Data.frame | The first column represents the samples, and the second column represents the molecular classification. |
| Class_test | Character | The target molecular classification for differential expression analysis. |
| Class_control | Vector | The contrast molecular classification for differential expression analysis. |
| dataset_threshold | Numeric | 0-1.0. The minimum proportion of datasets in which the selected genes must be expressed. |
UniQuant_datasetο
π© View information about the dataset.
| Parameter | Type | Description |
|---|---|---|
| Sample | Logical | If TRUE, counts the number of diseased and non-diseased samples. |
ο
Reference πο
Huanhou Su et al. Deciphering the Oncogenic Landscape of Hepatocytes through Integrated Single-Nucleus and Bulk RNA-Seq of Hepatocellular Carcinoma. Advanced Science. 2025. DOI: [10.1002/advs.202412944]


