Author(s): Ying Ge, Taojun Ye
Date: 2025-09-22

Academic Citation

If you use this code in your work or research, we kindly request that you cite our publication:

Xiaofan Lu, et al. (2025). FigureYa: A Standardized Visualization Framework for Enhancing Biomedical Data Interpretation and Research Efficiency. iMetaMed. https://doi.org/10.1002/imm3.70005

需求描述

我要用自己的gene set做富集分析,像paper里那样用ROAST-barcode plot的方法来实现。

**扩展阅读:https://mp.weixin.qq.com/s/O1xd5l5h33fXUWOvX3UBGw

Requirement description

I want to use my own gene set for enrichment analysis, using the ROAST barcode plot method as described in the paper.

**Extended reading: * * https://mp.weixin.qq.com/s/O1xd5l5h33fXUWOvX3UBGw

出自http://genesdev.cshlp.org/content/28/12/1337.long

fromhttp://genesdev.cshlp.org/content/28/12/1337.long

Figure 5. Pax5 restoration causes rapid repression of Myc and DNA replication factors. (F) Gene set analysis barcode plot. The RNA-seq differential gene expression data set upon Pax5 restoration in STAT5-CA;Vav-tTA;TRE-GFP-shPax5 leukemia cells in vivo is shown as a shaded rectangle, with genes horizontally ranked by moderated t statistic; genes up-regulated upon Pax5 restoration are shaded pink (t > 1), and down-regulated genes are shaded blue (t < 1). Overlaid are a set of previously described genes induced (red bars) or repressed (blue bars) upon the transition from large cycling pre-B cells to small resting pre-B cells during normal B-cell development in the bone marrow (Hoffmann et al. 2002). Red/blue traces above/below the bar represent relative enrichment.

Pathway analysis

Gene expression signatures were tested used ROAST rotation gene set testing (Smyth 2004; Wu et al. 2010). ROAST is a hypothesis driven test that takes into account the directionality (up or down) and strength (log2 fold change) of the genes in the gene set. Gene set barcode plots were generated using the barcodeplot function of the limma package as described previously (Lim et al. 2009).

出自https://www.nature.com/articles/leu201527

Figure 2. Inducible Ikaros restoration in T-ALL in vivo. (e) Gene set analysis barcode plot, with RNA-seq differential gene expression from combined analysis of Ikaros restoration in ALL65, ALL101 and ALL211 in vivo shown as a shaded rectangle with genes horizontally ranked by moderated t-statistic. Genes upregulated upon Ikaros restoration are shaded pink (z41) and downregulated genes are shaded blue (zo1). Overlaid are a previously described set of genes induced (red bars) or repressed (blue bars) upon Notch inhibition in a murine T-ALL cell line.7 Red and blue traces above and below the barcode represent relative enrichment. P-value was computed by the roast method54 using both up- and downregulated genes. (f) Gene set analysis barcode plot as for (e) but with blue bars indicating 81 Rbpj-bound, Notch-activated genes recently identified in a murine T-cell leukemia cell line.

应用场景

roast和barcode分别用来做什么?

简单讲就是:用roast来判断上/下调基因是否显著富集目标基因集,用barcode plot来展示上/下调基因在目标基因集中的分布。roast检验一个gene set,mroast做multiple roast tests。原文:

Application scenarios

What are roar and barcode used for respectively?

Simply put, it is to use roar to determine whether upregulated/downregulated genes significantly enrich the target gene set, and use barcode plot to display the distribution of upregulated/downregulated genes in the target gene set. Roast test a gene set, mroast does multiple roar tests. Original text:

适用性:适用于microarray、RNA-seq的表达矩阵,用limma给全部基因做差异表达分析,不需要筛差异表达基因。

Applicability: Suitable for microarray and RNA seq expression matrices, perform differential expression analysis on all genes using limma without screening for differentially expressed genes.

环境设置

Environment settings

source("install_dependencies.R")
## Starting R package installation...
## ===========================================
## 
## Installing Bioconductor packages...
## Package already installed: limma 
## 
## ===========================================
## Package installation completed!
## You can now run your R scripts in this directory.
# 加载limma包,用于基因表达数据分析和差异表达分析
# Load the limma package for gene expression data analysis and differential expression analysis
library(limma)

# 设置系统环境变量,使R显示英文报错信息
# Set the system environment variable to display error messages in English
Sys.setenv(LANGUAGE = "en") 

# 设置选项,禁止将字符串自动转换为因子类型
# Set the option to prevent automatic conversion of strings to factors
options(stringsAsFactors = FALSE) 

输入文件

需要三种输入文件:

  • easy_input_set1.csv和easy_input_set2.csv,一个或两个。自己感兴趣的gene set/signature,例如某个通路、某个GO term、别人文章里发现的某一条件下的应答基因、跟目标蛋白有相互作用的蛋白等等。总之是你能讲出意义的gene set/signature就可以。

  • easy_input_count.csv,基因表达矩阵,已经过均一化等预处理

  • easy_input_sample.txt,样品分组信息

用后两个文件做差异表达分析

Input file

Three types of input files are required:

-EasyInput_Set1.csv and EasyInput_Set2.csv, one or two. The gene set/signature that interests oneself, such as a certain pathway, a certain GO term, a response gene found under a certain condition in someone else’s article, a protein that interacts with the target protein, and so on. In short, as long as you can articulate the meaning of the gene set/signature.

  • easy_input_count.csv, The gene expression matrix has undergone preprocessing such as homogenization

  • easy_input_sample.txt, Sample grouping information

Perform differential expression analysis using the last two files

# 读取表达矩阵数据,第一列作为行名(通常为基因ID)
# Read expression matrix data, using the first column as row names (usually gene IDs)
expr <- read.csv("easy_input_expr.csv", row.names = 1)
head(expr)  # 查看数据前几行
##        GSM431121 GSM431122 GSM431123 GSM431124 GSM431125 GSM431126
## RPL41   40030.68  40292.92  40345.05  40397.75  40719.06  40345.05
## ANXA2   31230.28  35752.89  37174.10  36584.56  37906.75  36292.69
## TPT1    34509.26  33224.73  34509.26  34528.11  35306.72  34509.26
## PLK1    33776.06  33099.00  34648.96  34953.17  34874.99  35081.78
## RPL23A  34933.65  34657.47  33335.98  33957.87  33656.04  33879.19
## RPS2    34138.56  32664.69  32056.11  32664.69  32664.69  32664.69
# 读取样本信息表,使用制表符分隔,第一行为表头
# Read sample information table with tab-separated values and header row
Targets <- read.table("easy_input_pheno.txt", sep = "\t", header = T)
Targets  # 显示样本信息内容
##                      title condition
## 1  HMLER, paclitaxel, rep1         a
## 2  HMLER, paclitaxel, rep2         a
## 3  HMLER, paclitaxel, rep3         a
## 4 HMLER, salinomycin, rep1         b
## 5 HMLER, salinomycin, rep2         b
## 6 HMLER, salinomycin, rep3         b
# 构建设计矩阵,用于线性模型分析,~0表示不包含截距项
# Construct design matrix for linear model analysis, ~0 indicates no intercept term
design <- model.matrix(~ 0 + condition, data = Targets)
design  # 显示设计矩阵结构
##   conditiona conditionb
## 1          1          0
## 2          1          0
## 3          1          0
## 4          0          1
## 5          0          1
## 6          0          1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$condition
## [1] "contr.treatment"
# 使用voom方法对RNA-seq数据进行预处理,使其适合线性模型分析
# Preprocess RNA-seq data using voom to make it suitable for linear modeling
expr <- voom(expr, design, plot = TRUE)

# 读取两个基因集数据,通常包含感兴趣的基因列表
# Read two gene sets, typically containing lists of genes of interest
geneSet1 <- read.csv("easy_input_set1.csv")
head(geneSet1)  # 查看基因集1的前几行
##      Gene
## 1    CD40
## 2    DBF4
## 3 DCLRE1B
## 4    DNTT
## 5   POLA1
## 6   POLD1
geneSet2 <- read.csv("easy_input_set2.csv")

# 创建逻辑索引,标记表达矩阵中属于各基因集的基因行
# Create logical indices to mark rows in the expression matrix belonging to each gene set
Set1index <- rownames(expr) %in% geneSet1$Gene
Set2index <- rownames(expr) %in% geneSet2$Gene

# 使用roast方法对单个基因集进行功能富集分析
# Perform gene set enrichment analysis for a single gene set using roast method
roast(expr, Set1index, design)
##          Active.Prop      P.Value
## Down               0 1.0000000000
## Up                 1 0.0002500625
## UpOrDown           1 0.0005000000
## Mixed              1 0.0005000000
# 使用mroast方法同时对多个基因集进行富集分析,参数2表示置换次数
# Perform multiple gene set enrichment tests simultaneously using mroast, parameter 2 is permutation times
mroast(expr, list(Set1index,Set2index), design, 2)
##      NGenes PropDown PropUp Direction PValue   FDR PValue.Mixed FDR.Mixed
## set1     32        0      1        Up  5e-04 5e-04        5e-04     5e-04
## set2     30        0      1        Up  5e-04 5e-04        5e-04     5e-04

题外话:如果你感兴趣的比较多,还可以写成循环,挑出pvalue较小的gene set用来画图。

**Off topic: * * If you are interested in more, you can also write it as a loop and select the gene set with smaller p-value for drawing.

开始画图

Start drawing

# 读取表达矩阵数据,第一列作为行名(通常为基因ID)
# Read expression matrix data, using the first column as row names (usually gene IDs)
expr <- read.csv("easy_input_expr.csv", row.names = 1)
head(expr)  # 查看数据前几行
##        GSM431121 GSM431122 GSM431123 GSM431124 GSM431125 GSM431126
## RPL41   40030.68  40292.92  40345.05  40397.75  40719.06  40345.05
## ANXA2   31230.28  35752.89  37174.10  36584.56  37906.75  36292.69
## TPT1    34509.26  33224.73  34509.26  34528.11  35306.72  34509.26
## PLK1    33776.06  33099.00  34648.96  34953.17  34874.99  35081.78
## RPL23A  34933.65  34657.47  33335.98  33957.87  33656.04  33879.19
## RPS2    34138.56  32664.69  32056.11  32664.69  32664.69  32664.69
# 读取样本信息表,使用制表符分隔,第一行为表头
# Read sample information table with tab-separated values and header row
Targets <- read.table("easy_input_pheno.txt", sep = "\t", header = T)
Targets  # 显示样本信息内容
##                      title condition
## 1  HMLER, paclitaxel, rep1         a
## 2  HMLER, paclitaxel, rep2         a
## 3  HMLER, paclitaxel, rep3         a
## 4 HMLER, salinomycin, rep1         b
## 5 HMLER, salinomycin, rep2         b
## 6 HMLER, salinomycin, rep3         b
# 构建设计矩阵,用于线性模型分析,~0表示不包含截距项
# Construct design matrix for linear model analysis, ~0 indicates no intercept term
design <- model.matrix(~ 0 + condition, data = Targets)
design  # 显示设计矩阵结构
##   conditiona conditionb
## 1          1          0
## 2          1          0
## 3          1          0
## 4          0          1
## 5          0          1
## 6          0          1
## attr(,"assign")
## [1] 1 1
## attr(,"contrasts")
## attr(,"contrasts")$condition
## [1] "contr.treatment"
# 使用voom方法对RNA-seq数据进行预处理,使其适合线性模型分析
# Preprocess RNA-seq data using voom to make it suitable for linear modeling
expr <- voom(expr, design, plot = TRUE)

# 读取两个基因集数据,通常包含感兴趣的基因列表
# Read two gene sets, typically containing lists of genes of interest
geneSet1 <- read.csv("easy_input_set1.csv")
head(geneSet1)  # 查看基因集1的前几行
##      Gene
## 1    CD40
## 2    DBF4
## 3 DCLRE1B
## 4    DNTT
## 5   POLA1
## 6   POLD1
geneSet2 <- read.csv("easy_input_set2.csv")

# 创建逻辑索引,标记表达矩阵中属于各基因集的基因行
# Create logical indices to mark rows in the expression matrix belonging to each gene set
Set1index <- rownames(expr) %in% geneSet1$Gene
Set2index <- rownames(expr) %in% geneSet2$Gene

# 使用roast方法对单个基因集进行功能富集分析
# Perform gene set enrichment analysis for a single gene set using roast method
roast(expr, Set1index, design)
##          Active.Prop      P.Value
## Down               0 1.0000000000
## Up                 1 0.0002500625
## UpOrDown           1 0.0005000000
## Mixed              1 0.0005000000
# 使用mroast方法同时对多个基因集进行富集分析,参数2表示置换次数
# Perform multiple gene set enrichment tests simultaneously using mroast, parameter 2 is permutation times
mroast(expr, list(Set1index,Set2index), design, 2)
##      NGenes PropDown PropUp Direction PValue   FDR PValue.Mixed FDR.Mixed
## set1     32        0      1        Up  5e-04 5e-04        5e-04     5e-04
## set2     30        0      1        Up  5e-04 5e-04        5e-04     5e-04

小伙伴对rosta富集分析比较陌生,附相关资料:

RNA-seq为例:https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf,看“18.1 Profiles of Yoruba HapMap Individuals”,尤其是“Gene set testing”部分。

microarray为例:http://genomicsclass.github.io/book/pages/gene_set_analysis_in_R.html

limma包里的更多富集分析方法:http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/10GeneSetTests.html

Tips

My friend is relatively unfamiliar with Rosta enrichment analysis. Attached are relevant materials:

For example, RNA seq:< https://bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf >Please refer to “18.1 profiles of Yoruba HapMap Individuals”, especially the “Gene set testing” section.

For example, microarray:< http://genomicsclass.github.io/book/pages/gene_set_analysis_in_R.html >

More enrichment analysis methods in the limma package:< http://web.mit.edu/ ~r/current/arch/i386_linux26/lib/R/library/limma/html/10GeneSetTests.html>

Session Info

sessionInfo()
## R version 4.5.1 (2025-06-13)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] limma_3.64.3
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.6.1          fastmap_1.2.0     xfun_0.53        
##  [5] cachem_1.1.0      knitr_1.50        htmltools_0.5.8.1 rmarkdown_2.29   
##  [9] lifecycle_1.0.4   cli_3.6.5         sass_0.4.10       jquerylib_0.1.4  
## [13] statmod_1.5.0     compiler_4.5.1    tools_4.5.1       evaluate_1.0.5   
## [17] bslib_0.9.0       yaml_2.3.10       rlang_1.1.6       jsonlite_2.0.0