UniQuant πŸ§¬οƒ

🧰 Name: Universal Quantile-Based Transcriptome Integration (R package)

πŸ“ Note: UniQuant is a dynamic quantile normalization method for cross-platform heterogeneous data that effectively integrates RNA expression data from different platforms, batches, and formats. UniQuant eliminates batch effects and background differences between datasets, thereby generating more accurate and reliable integrated data, significantly improving the accuracy, reliability, and versatility of data analysis. The datasets integrated based on UniQuant can be used for building disease diagnosis models, molecular classification, survival analysis, differential gene expression analysis, and more.

✍️ Author: Huanhou Su

Quick run πŸš€οƒ

##### 1. Quantile score
## Prepare raw data
# ls_datasset: Name of the dataset.
# df_raw: Raw expression data, with genes as rows and samples as columns.
# dataset_type: 'Training' or 'Validation'.
# df_Disease: The first column is the sample name, the second column is the disease status ('No' represents non-diseased samples, 'Yes' represents diseased samples). If survival information is available, the third column is the survival outcome (0 represents survival, 1 represents death), and the fourth column is the follow-up time (in months). Non-diseased samples do not need to have survival information filled in.
Quantile_score(ls_datasset, df_raw, dataset_type, df_Disease)

##### 2. Select the genes of interest
# Identify the top 1000 hypervariable genes from independent datasets and select those expressed across datasets.
rst_Hype = UniQuant_Hypervariable_gene(n_hypervariable_gene = 1000)  
interest_gene = rst_Hype$Gene

##### 3. Training the diagnostic model
# Use the interest genes to identify disease-related genes.
rst_dg = UniQuant_Disease_gene(ls_gene = interest_gene)  
Disease_gene = rst_dg$Disease_gene

# Train the diagnostic model using the identified disease-related genes.
rst_Training = UniQuant_Model_Training(ls_gene = Disease_gene) 

##### 4. Validation of the diagnostic model
rst_Validation = UniQuant_Model_Validation()

##### 5. Molecular classification
# Perform molecular classification based on the selected genes of interest and classify into 3 groups.
rst_Class = UniQuant_Class(ls_gene = interest_gene, n_Class = 3)  

##### 6. Subsequent analyses
# Perform additional analyses such as survival analysis, differential gene expression analysis, and enrichment analysis.
# These steps may follow depending on your dataset and goals.



Comprehensive Guide πŸ“–οƒ

0️⃣ Setting

library(UniQuant);library(ggplot2); library(edgeR); library(pROC); library(stringr)

dir_main = getwd()
dir_in = file.path(dir_main, '0.raw')
dir_dataset = file.path(dir_main, 'UniQuant', 'dataset'); if(!dir.exists(dir_dataset)) dir.create(dir_result, recursive = TRUE)
dir_result = file.path(dir_main, 'UniQuant', 'result'); if(!dir.exists(dir_result)) dir.create(dir_result, recursive = TRUE)
setwd(dir_main); set.seed(100)



1️⃣ Quantile score

##### Converts raw expression data of a dataset into quantile-based scores and organizes the disease status of the samples.
#### Training set
ls_datassets = c('GSE6764', 'GSE14520', 'GSE17856', 'GSE57957', 'GSE36376','GSE102079', 'GSE54236', 'GSE36411', 'GSE22058')

for (ls_datasset in ls_datassets){
  df_raw = readRDS(file.path(dir_in, paste0(ls_datasset, '_Expression.rds')))
  df_Disease = readRDS(file.path(dir_in, paste0(ls_datasset, '_Disease.rds')))
  Quantile_score(ls_datasset, df_raw, "Training", df_Disease)
}


#### Validation set
ls_datassets = c('PMID31585088', 'PMID35382356', 'GSE77314', 'GSE76427','GSE25097', 'GSE63898', 'GSE39791', 'GSE144269','GSE114564', 'GSE14811')

for (ls_datasset in ls_datassets){
  df_raw = readRDS(file.path(dir_in, paste0(ls_datasset, '_Expression.rds')))
  df_Disease = readRDS(file.path(dir_in, paste0(ls_datasset, '_Disease.rds')))
  Quantile_score(ls_datasset, df_raw, "Validation", df_Disease)
}

β­• You can also transform individual datasets.

ls_datasset = 'GSE36376'
df_raw = readRDS(file.path(dir_in, paste0(ls_datasset, '_Expression.rds')))
df_Disease = readRDS(file.path(dir_in, paste0(ls_datasset, '_Disease.rds')))
Quantile_score(ls_datasset, df_raw, "Validation", df_Disease)

β­• The data format of the raw gene expression data is as follows.

df_raw[1:4, 1:4]
GSM890128 GSM890129 GSM890130 GSM890131
EEF1A1 14.054179 14.395189 14.349301 14.183636
LOC643334 6.543733 6.399822 6.391479 6.190396
SLC35E2 6.065169 6.172510 5.957617 5.980207
LOC642820 6.862825 6.699461 6.726557 6.664170

β­• The data format of the transformed gene expression data is as follows.

GSE36376@GSM890128 GSE36376@GSM890129 GSE36376@GSM890130 GSE36376@GSM890131
EEF1A1 4 9 8 6
LOC643334 10 8 8 4
SLC35E2 4 7 2 2
LOC642820 9 5 6 4

β­• The data format for the disease status of the samples is as follows.

print(rbind(head(df_Disease, 4), tail(df_Disease, 4)))
Sample Disease
GSM890128 No
GSM890129 No
GSM890130 No
GSM890131 No
GSM890557 Yes
GSM890558 Yes
GSM890559 Yes
GSM890560 Yes

β­• If your samples contain survival information, you can add columns for survival status (Outcome: 0 represents survival, 1 represents death) and survival time (Time: months).Note that only diseased samples require the addition of survival information.

ls_datasset = 'GSE14520'
df_Disease = readRDS(file.path(dir_in, paste0(ls_datasset, '_Disease.rds')))
head(df_Disease, 4)
Sample Disease Outcome Time
GSM362958 Yes 1 28.2
GSM362959 Yes 1 9.5
GSM362960 Yes 0 66.1
GSM362961 No



2️⃣ Select the genes of interest

# Select the hypervariable genes that are expressed across the majority of the datasets.
rst_Hype = UniQuant_Hypervariable_gene(n_hypervariable_gene = 1000, dataset_threshold = 0.7)  
interest_gene = rst_Hype$Gene

interest_gene
 [1] "ACSL4"   "AFP"     "AKR1B10" "ALDH3A1" "APOA4"   "APOF"    "C7"      "C9"      "CCL19"  
[10] "CCL20"   "COL1A1"  "CRP"     "CTHRC1"  "CYP1A2"  "CYP3A4"  "DCN"     "DHRS2"   "DKK1"   
[19] "DLK1"    "EPCAM"   "FCN3"    "FOS"     "FOSB"    "GPC3"    "GPR88"   "HAMP"    "HSD11B1"
[28] "IFI27"   "LCN2"    "LUM"     "MME"     "MMP7"    "MT1M"    "MUC13"   "MYH4"    "NQO1"   
[37] "NTS"     "PAGE4"   "PEG10"   "PGC"     "PLA2G2A" "REG3A"   "RELN"    "S100P"   "SDS"    
[46] "SLC22A1" "SLPI"    "SPINK1"  "SPP1"   

β­• You can enter other genes of interest, such as immune genes, metabolic genes, cell cycle genes, etc.



3️⃣ Training the diagnostic model


4️⃣ Validation of diagnostic model

rst_Validation = UniQuant_Model_Validation()

names(rst_Validation)
[1] "AUC"            "Cutoff"         "ROC_Model_plot"
[4] "ROC_Gene_plot"  "ROC_Model_data" "ROC_Gene_data" 

# AUC: The AUC value of the diagnostic model in the Validation set.
# Cutoff: The sensitivity, specificity, and accuracy performance of the diagnostic model when the threshold is set to 0.5.
# ROC_Model_plot: The ROC curve plot of the diagnostic model in the Validation set.
# ROC_Gene_plot: The ROC curve plot of the diagnostic model and its genes in the Validation set.
# ROC_Model_data: The data used to plot the ROC curve of the diagnostic model in the Validation set.
# ROC_Gene_data: The data used to plot the ROC curve of the diagnostic model and its genes in the Validation set.

图片描述

β­• You can extract the data from rst_Validation and plot the receiver operating characteristic (ROC) curve of the diagnostic model for the Validation set.

rst_Validation$ROC_Gene_plot

图片描述



5️⃣ Molecular classification

5.1 Classification

# Perform molecular classification of disease samples based on the integrated dataset.
rst_Class = UniQuant_Class(ls_gene = interest_gene, n_Class = 3)  
df_Class = rst_Class$df_Class

# Count the number of samples for each molecular classification
rst_Class$df_Class_table
Class_1 Class_2 Class_3 
    570     690     708 

β­• View the molecular classification results of all samples.

head(df_Class)
Sample Class
GSE102079@GSM2723193 Class_1
GSE102079@GSM2723195 Class_2
GSE102079@GSM2723197 Class_2
GSE102079@GSM2723198 Class_1
GSE102079@GSM2723199 Class_3
GSE102079@GSM2723200 Class_2

5.2 Gene contribution

β­• Using machine learning methods to calculate the contribution of genes to molecular classification.

# num_Top: Number of top genes to display based on their contribution score.
# n_Train: Number of classification training tasks.
rst_Contribution = UniQuant_Class_Contribution(num_Top = 20, n_Train = 2)

rst_Contribution$Acc
    Class Accuracy
  Class_1    0.984
  Class_2    0.990
  Class_3    0.983
# Display the contribution of genes to molecular typing in a table format
head(rst_Contribution$Contribution)
Gene Class_1 Class_2 Class_3
ACSL4 0.0125754340 0.0204913958 0.0018271964
AFP 0.0297757726 0.0105659208 0.0051961789
AKR1B10 0.0003743662 0.0006829170 0.0010541345
ALDH3A1 0.0016005614 0.0018106727 0.0007021926
APOA4 0.0003649859 0.0011380439 0.0006137823
APOF 0.0021530311 0.0007603114 0.0023867149
# Display the contribution of genes to molecular typing in a bar chart format
rst_Contribution$Plot_bar

图片描述

# Display the contribution of genes to molecular typing in a heatmap format
rst_Contribution$Plot_heatmap

图片描述


5.3 Survival analysis

β­• Analyze the impact of molecular classification on patient survival outcomes in the specified dataset.

library(survival); library(survminer)  

ls_dataset = 'GSE14520'  
df_Disease = readRDS(file.path(dir_dataset, paste0(ls_dataset, '@Disease.rds')))
df_Class = readRDS(file.path(dir_result, 'df_Class.rds'))

df_Disease = df_Disease[!is.na(df_Disease$Time),]
df_Class_tmp = df_Class[df_Class$Sample %in% df_Disease$Sample,]
df_Disease_Class = merge(df_Disease, df_Class_tmp, by = 'Sample')
unique_Class = as.character(sort(unique(df_Disease_Class$Class)))

sfit = survfit(Surv(Time, Outcome)~Class, data = df_Disease_Class)
p = ggsurvplot(sfit, pval = TRUE, data = df_Disease_Class, risk.table = FALSE, 
                legend.title = "Class",
                legend.labs = unique_Class,
                legend = "right",
                pval.size = 10,
                pval.coord = c(0.1,0.1),
                palette = c("#F8766D", "#00BA38", "#619CFF")
)

p$plot = p$plot + theme(legend.title = element_text(face = "bold", size = 15)); p

图片描述



6️⃣ Differential gene expression (DEG) analysis.

6.1 DEG:Disease

# Perform DEG analysis based on the disease status of the samples
rst_DEG_Disease = UniQuant_DEG_Disease(dataset_threshold = 0.7)
df_DEG_Disease = rst_DEG_Disease$DEG

head(df_DEG_Disease)
Gene LogFC logCPM LR PValue FDR
CAP2 1.134334 6.434386 3076.955 0 0
ASPM 1.129646 6.434048 3052.996 0 0
TOP2A 1.124791 6.434128 3028.964 0 0
PRC1 1.116258 6.433803 2987.175 0 0
RACGAP1 1.112615 6.434054 2969.149 0 0
CENPF 1.111725 6.433586 2963.983 0 0

6.2 DEG:Class

# Perform DEG analysis based on the molecular classification of the samples
rst_DEG_Class = UniQuant_DEG_Class(
  input_Class = rst_Class$df_Class,
  Class_test = 'Class_3',  
  Class_control = c('Class_1', 'Class_2'),
  dataset_threshold = 0.7)

head(df_DEG_Class)
Gene LogFC logCPM LR PValue FDR
FGFR3 0.860094 6.350244 556.9723 3.830429e-123 1.962663e-120
PDE9A 0.857775 6.367620 503.9286 1.327970e-111 3.766691e-109
AFP 0.827938 6.476032 580.9512 2.329059e-128 1.541449e-125
PEG3 0.780960 6.214317 330.6309 7.004662e-74 5.705746e-72
PPP1R9A 0.750667 6.433330 395.2754 5.880676e-88 7.983646e-86
EPCAM 0.746381 6.243183 300.1526 3.051573e-67 1.923460e-65

β­• After obtaining the DEGs, further analyses can be conducted, such as protein-protein interaction network, Gene Set Enrichment Analysis (GSEA), transcription factor analysis, etc.

6️⃣ 7️⃣ 8️⃣ 9️⃣



All functions πŸ”§οƒ

Quantile_score

🚩 Converts raw expression data of a dataset into quantile-based scores and organizes the disease status of the samples.

Parameter Type Description
dataset_name Character
Name of the dataset.
dataset_expression Data.frame
Raw expression data, with genes as rows and samples as columns.
dataset_type Character
'Training' or 'Validation'.
dataset_disease Data.frame
The first column is the sample name, the second column is the disease status ('No' represents non-diseased samples, 'Yes' represents diseased samples). If survival information is available, the third column is the survival status (0 represents survival, 1 represents death), and the fourth column is the follow-up time (in months). Non-diseased samples do not need to have survival information filled in.

UniQuant_Hypervariable_gene

🚩 Select the hypervariable genes that are expressed across the majority of the datasets.

Parameter Type Description
n_hypervariable_gene Integer
1-3000. The number of hypervariable genes selected from each dataset.
dataset_threshold Numeric
0-1.0. The minimum proportion of datasets in which the selected genes must be expressed.

UniQuant_Disease_gene

🚩 Select disease-related genes with high AUC from the Training set.

Parameter Type Description
ls_gene Vector
The vector containing multiple genes.
dataset_threshold Numeric
0-1.0. The minimum proportion of datasets in which the selected genes must be expressed.

UniQuant_Model_Training

🚩 Build a disease diagnostic model using the integrated training dataset.

Parameter Type Description
ls_gene Vector
The vector containing multiple genes.

UniQuant_Model_Validation

🚩 Analyze the performance of the disease diagnostic model using the integrated validation dataset.


UniQuant_Model_dataset_AUC

🚩 Analyze the diagnostic performance of the disease diagnostic model across all datasets.


UniQuant_Remove_dataset

🚩 Remove the specified dataset.

Parameter Type Description
dataset_name Character
Name of the dataset.

UniQuant_Class

🚩 Perform molecular classification of disease samples based on the integrated dataset.

Parameter Type Description
ls_gene Vector
A vector containing a list of genes to be processed.
n_Class Integer
The number of molecular classifications to be used.
method Character
The method for classification, either 'CCP' or 'NMF'.

UniQuant_Class_Contribution

🚩 Identifies the top contributing genes based on their importance in molecular classification.

Parameter Type Description
num_Top Integer
Number of top genes to display based on their contribution score.
n_Train Integer
Number of classification training tasks.
reTrain Logical
If TRUE, the tasks will be retrained.

UniQuant_DEG_Disease

🚩 Perform differential gene expression analysis between disease and non-disease samples.

Parameter Type Description
dataset_threshold Numeric
0-1.0. The minimum proportion of datasets in which the selected genes must be expressed.

UniQuant_DEG_Class

🚩 Perform differential gene expression analysis between different molecular classification.

Parameter Type Description
input_Class Data.frame
The first column represents the samples, and the second column represents the molecular classification.
Class_test Character
The target molecular classification for differential expression analysis.
Class_control Vector
The contrast molecular classification for differential expression analysis.
dataset_threshold Numeric
0-1.0. The minimum proportion of datasets in which the selected genes must be expressed.

UniQuant_dataset

🚩 View information about the dataset.

Parameter Type Description
Sample Logical
If TRUE, counts the number of diseased and non-diseased samples.



Reference πŸ”–οƒ

Huanhou Su et al. Deciphering the Oncogenic Landscape of Hepatocytes through Integrated Single-Nucleus and Bulk RNA-Seq of Hepatocellular Carcinoma. Advanced Science. 2025. DOI: [10.1002/advs.202412944]