lncRNA pipeline usage guide
USAGE: RNAseq_new <analysis_dir> [species_name: hg19]
Here we create a project named TEST_RNA, using the reference genome test, which is provided for testing:
RNAseq_new TEST_RNA test
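In the current directory you should now see the TEST_RNA project folder. It should contain the following:
- code: main program directory
- meta_data
  - download_list.txt: paths to the raw fastq files, with their IDs and sample names; see the settings section for how to configure it
  - group_info.txt: information used for differential expression analysis; see the settings section for how to configure it
  - signature.txt / target.txt: Gene IDs of interest
  - gdc-user-token.xxxxxx.txt
- params: custom run parameters
- pipeline: directory holding the run results
- raw_data: holds the raw fastq files (optional); the raw data can live anywhere on disk as long as its location is configured correctly
- referenceFiles: reference genome and the corresponding annotation files
- run.bash
- start_rna.bash: job submission script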
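- meta_data/download_list.txt: column 1 is the sample name, column 2 is the ID, and column 3 is the path where the raw fastq.gz is located. The fastq.gz files must be named as follows:
  - ID_L001_R1.fastq.gz
- meta_data/group_info.txt: the first line is a header. Column 1 is the sample name, which must match the sample name already configured for that sample; column 2 is the group.
- params: the alignment step needs the matching genome annotation file (GTF format). Parameters can be customized:
  - Point the annotation entries in params (there is more than one) to the corresponding GTF file
  - To skip a step, comment out its line
  - Edit the other entries in params as needed
- signature.txt / target.txt: Gene IDs of interest. The Gene IDs must be consistent with the genome annotation file set in params.
- [Advanced] If no index has been built yet for the species being analyzed, referenceFiles under the project will be a set of broken links. Delete it, create a folder named referenceFiles, copy the corresponding file from `code` into it, and edit files_needed.bash as needed.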
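The settings above can be sanity-checked before submitting a job. The following is only a minimal sketch: `check_meta` is a hypothetical helper, not part of the pipeline, and it assumes that both files are tab-separated, that column 3 of download_list.txt is the directory holding the fastq.gz file, and that sample names in group_info.txt are meant to match column 1 of download_list.txt.

```shell
#!/usr/bin/env bash
# check_meta META_DIR
# Hypothetical helper (not shipped with the pipeline).
# Assumptions: tab-separated files; column 3 of download_list.txt is a directory.
check_meta() {
    local meta_dir="$1" status=0 first=1
    local list="$meta_dir/download_list.txt"
    local groups="$meta_dir/group_info.txt"
    local tab sample id path group
    tab=$(printf '\t')

    # 1. Each listed ID should have <ID>_L001_R1.fastq.gz at the given path.
    while IFS="$tab" read -r sample id path || [ -n "$sample" ]; do
        [ -z "$sample" ] && continue
        if [ ! -e "$path/${id}_L001_R1.fastq.gz" ]; then
            echo "missing fastq for $sample: $path/${id}_L001_R1.fastq.gz"
            status=1
        fi
    done < "$list"

    # 2. Each sample in group_info.txt (header skipped) should be listed
    #    in column 1 of download_list.txt.
    while IFS="$tab" read -r sample group || [ -n "$sample" ]; do
        if [ "$first" -eq 1 ]; then first=0; continue; fi
        [ -z "$sample" ] && continue
        if ! cut -f1 "$list" | grep -qxF -- "$sample"; then
            echo "sample $sample in group_info.txt is not in download_list.txt"
            status=1
        fi
    done < "$groups"

    return "$status"
}

# Usage, from the project folder:
#   check_meta meta_data && echo "meta_data looks consistent"
```

The function prints one line per problem and returns a non-zero status if any were found, so it can gate the submission command.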
yhbatch -N 1 start_rna.bash
In the script file, edit the NSLOTS variable to set the number of threads required.
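Editing the thread count can also be scripted. This is a sketch under one assumption that the source does not confirm: that the submission script assigns the variable on a line of its own in the form `NSLOTS=<number>`. `set_nslots` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# set_nslots SCRIPT THREADS
# Hypothetical helper. Assumes SCRIPT contains a line of the form NSLOTS=<number>.
set_nslots() {
    local script="$1" threads="$2"
    # Write to a temporary file, then copy the content back so that the
    # permissions of the original script are preserved.
    sed "s/^NSLOTS=.*/NSLOTS=${threads}/" "$script" > "$script.tmp" &&
        cat "$script.tmp" > "$script" &&
        rm -f "$script.tmp"
}

# Usage, from the project folder:
#   set_nslots start_rna.bash 8
#   yhbatch -N 1 start_rna.bash
```

If the script declares the variable differently (for example with `export`), the `sed` pattern would need to be adjusted accordingly.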
The pipeline directory has the following structure:
|-- fastq
|-- alignment
    |-- <sample>: folder holding the STAR and <Markduplicate> results. Main files:
        *.Log.final.out, *count.txt, *.bw (track file), accepted.bam, sequence_info.txt, run.bash
|-- cufflinks
    |-- <sample>: folder holding the assembly results. Main files: *.fpkm_tracking, transcript.gtf, run.bash
|-- Fastqc
|-- summarize
    |-- <group>: align_summary.txt, raw_count.txt
|-- report & report.tar.gz: you can download this file and build the HTML report on your own machine.
    If you wish to build the HTML report on Tianhe2, do not forget to uncomment the necessary line in `report/run.sh`
|-- <DATE>_<TIME>: the log files; check these if an expected output file is missing
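Given the layout above, a quick way to spot samples whose alignment did not finish is to look for the files the listing names. The sketch below checks only for accepted.bam; `missing_bam` is a hypothetical helper, and the choice of accepted.bam as the completion marker is an assumption.

```shell
#!/usr/bin/env bash
# missing_bam ALIGNMENT_DIR
# Hypothetical helper: prints the name of every sample folder under
# ALIGNMENT_DIR that has no accepted.bam.
missing_bam() {
    local align_dir="$1" d
    for d in "$align_dir"/*/; do
        [ -d "$d" ] || continue              # no sample folders at all
        [ -e "${d}accepted.bam" ] || basename "$d"
    done
}

# Usage, from the project folder:
#   missing_bam pipeline/alignment
```

An empty output means every sample folder contains accepted.bam; for any sample that is printed, the log files in the `<DATE>_<TIME>` folder are the place to look next.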
USAGE: lncRNA_new <analysis_dir> <RNA_prj> <ChIP_prj> [species_name: hg19]
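Here we create a project named TEST_lncRNA, using the reference genome test, which is provided for testing. TEST_RNA and TEST_ChIP are your RNA-Seq project and ChIP-Seq project, respectively, and must sit in the same directory as TEST_lncRNA.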
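As an illustration, the invocation implied by the USAGE line and the project names above would look like the commented command at the end of this sketch. It is inferred by analogy with the RNA-Seq example rather than quoted from the pipeline, and `check_siblings` is a hypothetical helper that only verifies the two input projects exist next to each other.

```shell
#!/usr/bin/env bash
# check_siblings DIR...
# Hypothetical helper: fails if any of the given project directories is missing.
check_siblings() {
    local name
    for name in "$@"; do
        if [ ! -d "$name" ]; then
            echo "missing project directory: $name"
            return 1
        fi
    done
}

# Usage, from the directory that holds TEST_RNA and TEST_ChIP
# (argument order follows the USAGE line; the exact command is an assumption):
#   check_siblings TEST_RNA TEST_ChIP && lncRNA_new TEST_lncRNA TEST_RNA TEST_ChIP test
```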
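In the current directory you should now see the TEST_lncRNA project folder. It should contain the following:
- Numbered folders (## stands for a number): these hold the intermediate results of the run; _11_report contains the HTML report
- code: main program directory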
- inputs:
- pipeline: directory holding the run results
- referenceFiles: reference genome and the corresponding annotation files
- run.bash
- start_lncRNA.bash: job submission script
- inputs/<ChIP_prj>: your ChIP-Seq analysis results. Make sure this directory contains peak files in BED format; you can generate them by running your own ChIP-Seq analysis.
- inputs/<RNA_prj>: make sure this link points to the directory of the RNA-Seq project you set up in STEP 1.
- inputs/group_info.txt: follow the example file to set up your sample sheet. Make sure each sample name matches the one used in the RNA-Seq pipeline (it should match the directory name under `RNA-Seq/pipeline/alignment/` and `RNA-Seq/pipeline/cufflinks`).
- inputs/params.bash: change the parameters in this file as needed.
  - [description of setting parameters] (under construction)
- inputs/custom-bashrc: this file sets up the paths to all necessary dependencies. Please check the dependency requirements before running.
- [Advanced]: If you are working with a different reference version or species, you may need to set up the referenceFiles directory yourself:
  - If referenceFiles is a broken link, you may have provided a wrong ID for the species. If the ID is correct, set up referenceFiles yourself: delete referenceFiles, create a new referenceFiles directory, then follow the `code/build_referenceFiles.bash` script (which sets up the necessary files in referenceFiles) and run it from your project folder <analysis_dir>.
  - The required files can be obtained from well-known projects, for example the GENCODE project.
  - Check that every file in this directory has been replaced with the one you intend to use, and make sure the same reference version and species are used throughout.
  - Once referenceFiles is set up for a different reference version or species, remember to update the corresponding settings file as well.
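Two of the checks described in this list are easy to automate: detecting a broken referenceFiles link, and confirming that each sample in the sample sheet has matching RNA-Seq result folders. The sketch below uses two hypothetical helpers. It assumes inputs/group_info.txt is tab-separated with a header line and the sample name in column 1, which the source does not state explicitly.

```shell
#!/usr/bin/env bash
# is_broken_link PATH
# True when PATH is a symbolic link whose target does not exist.
is_broken_link() {
    [ -L "$1" ] && [ ! -e "$1" ]
}

# check_samples SHEET RNA_PIPELINE_DIR
# Hypothetical helper. Assumes SHEET is tab-separated, has one header line,
# and lists the sample name in column 1.
check_samples() {
    local sheet="$1" rna_pipeline="$2" status=0 first=1
    local tab sample rest sub
    tab=$(printf '\t')
    while IFS="$tab" read -r sample rest || [ -n "$sample" ]; do
        if [ "$first" -eq 1 ]; then first=0; continue; fi   # skip header
        [ -z "$sample" ] && continue
        for sub in alignment cufflinks; do
            if [ ! -d "$rna_pipeline/$sub/$sample" ]; then
                echo "no $sub directory for sample: $sample"
                status=1
            fi
        done
    done < "$sheet"
    return "$status"
}

# Usage, from the lncRNA project folder (replace <RNA_prj> with your project name):
#   is_broken_link referenceFiles && echo "referenceFiles is a broken link"
#   check_samples inputs/group_info.txt "inputs/<RNA_prj>/pipeline"
```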
yhbatch -N 1 start_lncRNA.bash
In the script file, edit the NSLOTS variable to set the number of threads required.
|-- _01_ChIP-Seq
|-- _01_RNA-Seq
|-- _02_cuffmerge
    |-- merged.gtf, assemblies_list.txt, run.bash
|-- _03_identify: Identify lncRNA
    |-- 4_novel_nc.gtf
|-- _04_whole_assembly
    |-- all.gtf
|-- _05_coding_potential: Assessment of coding potential using the CPAT tool
    |-- gencode-lncRNA*
    |-- gencode-mRNA*
    |-- novel-lncRNA*
    |-- lncRNA.bed, lncRNA_for_table.txt
    |-- mRNA.bed, mRNA_for_table.txt
|-- _05_featureCounts
    |-- <sample>: counts.txt, featureCounts.txt, featureCounts.txt.summary, run.bash
    |-- *out
    |-- raw_count.txt
|-- _06_annotate: annotate lncRNA from other annotation files
|-- _06_summarize: run DESeq2 by user-defined group
    |-- fpkm_table.txt
    |-- norm_count.txt
    |-- lncRNA_fpkm_table.txt
    |-- mRNA_fpkm_table.txt
|-- _07_fpkm_cutoff: Calculating FPKM cutoff
    |-- Group
|-- _07_peak_overlap: Overlapping ChIP-Seq peaks, grouped by user-defined group
    |-- Group
|-- _08_integrate_DE: Integrating FPKM cutoff and differential expression analysis results, grouped by user-defined group
    |-- Group
|-- _08_integrate_HistoneCombine: Integrating FPKM cutoff and ChIP-Seq peak overlap, grouped by user-defined group
    |-- Group
|-- _09_pie_matrix_DE: Making pie matrix
|-- _09_pie_matrix_peak: Making pie matrix
|-- _10_figures
|-- _10_snapshot
|-- _10_tracks
|-- _11_report: an HTML report
Besides the _##_<NAME> folders above, the pipeline directory also contains:
|-- <DATE>_<TIME>: the log files; check these if an expected output file is missing
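To see how far a run has progressed, the step folders listed above can be checked one by one. This is a minimal sketch: the folder names are copied from the listing, `missing_steps` is a hypothetical helper, and the directory to inspect is passed as an argument because the exact location of the step folders is an assumption here.

```shell
#!/usr/bin/env bash
# Step folders, copied from the listing above.
steps="_01_ChIP-Seq _01_RNA-Seq _02_cuffmerge _03_identify _04_whole_assembly
_05_coding_potential _05_featureCounts _06_annotate _06_summarize
_07_fpkm_cutoff _07_peak_overlap _08_integrate_DE _08_integrate_HistoneCombine
_09_pie_matrix_DE _09_pie_matrix_peak _10_figures _10_snapshot _10_tracks
_11_report"

# missing_steps DIR
# Hypothetical helper: prints every step folder that does not exist under DIR.
missing_steps() {
    local pipeline_dir="$1" step
    for step in $steps; do       # unquoted on purpose: split the list on whitespace
        [ -d "$pipeline_dir/$step" ] || echo "$step"
    done
}

# Usage, from the lncRNA project folder:
#   missing_steps pipeline
```

The first folder printed is usually the step to investigate in the `<DATE>_<TIME>` log files.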