测序数据的深度、覆盖度等计算

sequencing depth、coverage

Posted by dulunar on October 14, 2020 | 访问量:

Coverage Depth

coverage_depth

覆盖深度 mapping depth

基因组被测序片段(短读 short reads)“覆盖”的强度有多大?

每一碱基的覆盖率是基因组碱基被测序的平均次数。 基因组的覆盖深度是通过与基因组匹配的所有短读的碱基数目除以该基因组的长度来计算的。 它通常表示为1X、2X、3X、…(1、2或3倍覆盖)。

此处通常被称为测序深度(sequencing depth)或者覆盖深度(depth of coverage)。

Depth 计算公式

原始测序数据的深度,Depth = (# sequenced bases) / (# bases of reference) 或者

比对上的数据的深度,Depth= (# bases of all mapped reads) / (# bases of reference).

覆盖范围 covered length

短读“覆盖”了基因组的多少区域?是否有些区域没有任何的短读覆盖?

覆盖范围是参考基因组被一定深度覆盖的碱基的百分比。例如,一个基因组的90%区域在1X深度覆盖,另外仍然有70%的区域被覆盖了5X深度。

此处通常称为覆盖度(genome covarage)或者覆盖范围(breadth of coverage)。

Coverage 计算公式

Coverage = (# area covered by mapped reads) / (# area of reference).

例子

example

对于全基因组

Depth = (6 * 28nt) / 112nt = 1.2 fold

Coverage = (46nt - 5nt) / 112nt = 36.6%

对于target区域

Depth = (6 * 28nt) / 46nt = 3.7 fold

Coverage = (46nt - 5nt) / 46nt = 89.1%

对于position

Depth = 6 fold

reads数

下机的fq文件中到底有多少条reads?可以使用软件readfq来统计:

git clone https://github.com/billzt/readfq
gcc -o kseq_fastq_base kseq_fastq_base.c -lz
kseq_fastq_base samp.fq
#FASTQ files could be gzipped.
#The statistic results would be printed on STDOUT.

功能单一,优化了速度: It could deal with ~4M reads (1G base) in less than 40 seconds, ~50M reads (14G base) in less than 5 minutes!!!

现在已有很多工具都会提供这个功能,间接或者直接计算测序深度和覆盖度,比如:samtools、sambamba、bedtools、qualimap、gatk等等,下面对各个工具的使用和结果做一点点汇整。

计算测序深度和覆盖度

samtools or sambamba

  1. depth可以得到每一个碱基位点上的测序深度:
    samtools depth samp.bam > base_depth.txt
    sambamba depth -t 10 -w 500 samp.bam > 500base_depth.txt
    # -w breadth of the window, in bp, 计算窗口的平均深度
    
  2. stats对比对后的bam文件进行全面的统计
    samtools stats --threads 10 samp.bam > samp_bamstat.txt
    grep "^SN" samp_bamstat.txt
    SN      raw total sequences:    17870982
    SN      filtered sequences:     0
    SN      sequences:      17870982
    SN      is sorted:      0
    SN      1st fragments:  8935491
    SN      last fragments: 8935491
    SN      reads mapped:   17870982
    SN      reads mapped and paired:        17870982        # paired-end technology bit set + both mates mapped
    SN      reads unmapped: 0
    SN      reads properly paired:  5208824 # proper-pair bit set
    SN      reads paired:   17870982        # paired-end technology bit set
    SN      reads duplicated:       0       # PCR or optical duplicate bit set
    SN      reads MQ0:      556465  # mapped and MQ=0
    SN      reads QC failed:        0
    SN      non-primary alignments: 0
    SN      total length:   1804969182      # ignores clipping
    SN      bases mapped:   1804969182      # ignores clipping
    SN      bases mapped (cigar):   1804969182      # more accurate
    SN      bases trimmed:  0
    SN      bases duplicated:       0
    SN      mismatches:     9941932 # from NM fields
    SN      error rate:     5.508089e-03    # mismatches / bases mapped (cigar)
    SN      average length: 101
    SN      maximum length: 101
    SN      average quality:        35.3
    SN      insert size average:    1330.1
    SN      insert size standard deviation: 1341.6
    SN      inward oriented pairs:  1418095
    SN      outward oriented pairs: 564408
    SN      pairs with other orientation:   6745980
    SN      pairs on different chromosomes: 207008
    
  3. 接下来可以使用plot-bamstats进行结果的可视化,plot-bamstats是samtools中包含的一个软件,可以在samtools的根目录找到,查看使用:
plot-bamstats --help
About: Parses output of samtools stats (former bamcheck) and calls gnuplot to create graphs.
Usage: plot-bamstats [OPTIONS] file.bam.bc
       plot-bamstats -p outdir/ file.bam.bc
Options:
   -k, --keep-files                    Do not remove temporary files.
   -l, --log-y                         Set the Y axis scale of the Insert Size plot to log 10.
   -m, --merge                         Merge multiple bamstats files and output to STDOUT.
   -p, --prefix <path>                 The output files prefix, add a slash to create new directory.
   -r, --ref-stats <file.fa.gc>        Optional reference stats file with expected GC content (created with -s).
   -s, --do-ref-stats <file.fa>        Calculate reference sequence GC for later use with -r
   -t, --targets <file.tab>            Restrict -s to the listed regions (tab-delimited chr,from,to. 1-based, inclusive)
   -h, -?, --help                      This help message.

例子如下:

plot-bamstats -s /home/luna/Desktop/database/homo_bwa/hsa.fa > gc 
plot-bamstats -r gc -p ./test samp_bamstat.txt

先计算基因组的gc含量分布,再给出一系列文件和图片,还有一个html文件:

ll test* gc
-rw-rw-r-- 1 luna luna   1910 Oct 14 16:04 gc
-rw-rw-r-- 1 luna luna   4266 Oct 14 16:04 test-acgt-cycles.gp
-rw-rw-r-- 1 luna luna  16707 Oct 14 16:04 test-acgt-cycles.png
-rw-rw-r-- 1 luna luna    429 Oct 14 16:04 test-coverage.gp
-rw-rw-r-- 1 luna luna  10497 Oct 14 16:04 test-coverage.png
-rw-rw-r-- 1 luna luna   4736 Oct 14 16:04 test-gc-content.gp
-rw-rw-r-- 1 luna luna  21895 Oct 14 16:04 test-gc-content.png
-rw-rw-r-- 1 luna luna   1026 Oct 14 16:04 test-gc-depth.gp
-rw-rw-r-- 1 luna luna  23312 Oct 14 16:04 test-gc-depth.png
-rw-rw-r-- 1 luna luna   4606 Oct 14 16:04 test.html
-rw-rw-r-- 1 luna luna   3528 Oct 14 16:04 test-indel-cycles.gp
-rw-rw-r-- 1 luna luna  28786 Oct 14 16:04 test-indel-cycles.png
-rw-rw-r-- 1 luna luna   1739 Oct 14 16:04 test-indel-dist.gp
-rw-rw-r-- 1 luna luna  29595 Oct 14 16:04 test-indel-dist.png
-rw-rw-r-- 1 luna luna 261710 Oct 14 16:04 test-insert-size.gp
-rw-rw-r-- 1 luna luna  16580 Oct 14 16:04 test-insert-size.png
-rw-rw-r-- 1 luna luna   5939 Oct 14 16:04 test-quals2.gp
-rw-rw-r-- 1 luna luna  21727 Oct 14 16:04 test-quals2.png
-rw-rw-r-- 1 luna luna  76931 Oct 14 16:04 test-quals3.gp
-rw-rw-r-- 1 luna luna  84648 Oct 14 16:04 test-quals3.png
-rw-rw-r-- 1 luna luna   2228 Oct 14 16:04 test-quals.gp
-rw-rw-r-- 1 luna luna  47463 Oct 14 16:04 test-quals-hm.gp
-rw-rw-r-- 1 luna luna  34730 Oct 14 16:04 test-quals-hm.png
-rw-rw-r-- 1 luna luna  15932 Oct 14 16:04 test-quals.png

可以打开html文件展示如下: test

点击小图,也会在旁边展示大图,可以更清晰查看。

BAMStats

BAMStats是建立在Picard Java API上的简单软件工具,可以计算并以图形方式显示从SAM / BAM文件进行质量控制评估时得到的各种指标。 BAMStats为覆盖范围,起始位置,MAPQ值,映射的读取长度和编辑距离提供描述性统计信息。 如果提供了合适的bed或gtf文件,还可以为单个特征(例如外显子或诱饵区域)生成统计指标。 BAMStats需要分配2至6GB的内存(Java程序可以使用-Xmx来设置);另外SAM / BAM文件都可以作为输入,但是必须是排序后的文件。

  1. 下载
wget https://jaist.dl.sourceforge.net/project/bamstats/BAMStats-1.25.zip
unzip BAMStats-1.25.zip
rm BAMStats-1.25.zip
cd BAMStats-1.25 && ls
## 两个jar分别是命令行(CLI)和图形用户(GUI)
java -jar BAMStats-1.25/BAMStats-1.25.jar --help
USAGE:
======

Option                                  Description
------                                  -----------
-d, --distances                         If selected, edit distance statistics
                                          will also be displayed as a separate
                                          table (optional).
-f, --features                          Feature file specifying specific
                                          locations to calculate coverage
                                          metrics for (optional)
-h, --help                              Generate help.
-i, --infile                            SAM or BAM infile (must be sorted).
-l, --lengths                           If selected, mapped read length
                                          statistics will also be displayed as
                                          a separate table (optional).
-m, --mapped                            If selected, coverage statistics, for
                                          mapped regions only, will also be
                                          displayed as a separate table
                                          (optional).
-o, --outfile                           Outfile (optional).
-q, --qualities                         If selected, mapping quality (MAPQ)
                                          statistics will also be displayed as
                                          a separate table (optional).
-s, --starts                            If selected, start position statistics
                                          will also be displayed as a separate
                                          table (optional).
-v, --view                              View option for output format
                                          (currently accepts 'simple' or
                                          'html'; default, simple).
  1. 使用
sambamba sort -o smap.sort.bam -t 20 -p samp.bam
java -jar -Xmx30g BAMStats-1.25.jar -i samp.sort.bam -o BAMStats --view simple
cat BAMStats
ref N mean median sd q1 q3 2.50% 97.50% min max
chr1 249,250,621 0.63 0 11.29 0 1 0 3 0 15051
chr2 243,199,373 0.63 0 1.43 0 1 0 3 0 706
chr3 198,022,430 0.64 0 1 0 1 0 3 0 102
chr4 191,154,276 0.58 0 1.13 0 1 0 3 0 527
                   
chr22 51,304,566 0.36 0 0.91 0 0 0 3 0 207
chrX 155,270,560 0.59 0 1.29 0 1 0 3 0 832
chrY 59,373,566 0.07 0 4.6 0 0 0 0 0 1859
chrMT 16,571 26.14 22 15.47 16 32 7 69.7 0 98

改为html输出,可以看看:

java -jar -Xmx30g BAMStats-1.25.jar -d -l -m -q -s -i samp.sort.bam -o BAMStats2 --view html

会产生一个文件夹BAMStats2.data和一个html文件BAMStats2,进入查看:

cd BAMStats2.data && ls | wc -l
# 531
ll chr1_*
-rw-rw-r-- 1 luna luna  5056 Oct 14 16:43 chr1_Coverage_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 11480 Oct 14 16:43 chr1_Coverage_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 12513 Oct 14 16:43 chr1_Coverage_histogram.png
-rw-rw-r-- 1 luna luna   753 Oct 14 16:43 chr1_Coverage.html
-rw-rw-r-- 1 luna luna  4528 Oct 14 16:44 chr1_EditDistances_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 10601 Oct 14 16:44 chr1_EditDistances_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 11156 Oct 14 16:44 chr1_EditDistances_histogram.png
-rw-rw-r-- 1 luna luna   770 Oct 14 16:44 chr1_EditDistances.html
-rw-rw-r-- 1 luna luna  5852 Oct 14 16:43 chr1_MappedCoverage_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 12311 Oct 14 16:43 chr1_MappedCoverage_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 12688 Oct 14 16:43 chr1_MappedCoverage_histogram.png
-rw-rw-r-- 1 luna luna   784 Oct 14 16:43 chr1_MappedCoverage.html
-rw-rw-r-- 1 luna luna  5107 Oct 14 16:44 chr1_MappingQuality_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 11297 Oct 14 16:44 chr1_MappingQuality_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 11698 Oct 14 16:44 chr1_MappingQuality_histogram.png
-rw-rw-r-- 1 luna luna   771 Oct 14 16:44 chr1_MappingQuality.html
-rw-rw-r-- 1 luna luna  4979 Oct 14 16:44 chr1_ReadLengths_boxAndWhisker.png
-rw-rw-r-- 1 luna luna 10979 Oct 14 16:44 chr1_ReadLengths_cumulativeHistogram.png
-rw-rw-r-- 1 luna luna 10979 Oct 14 16:44 chr1_ReadLengths_histogram.png
-rw-rw-r-- 1 luna luna   773 Oct 14 16:44 chr1_ReadLengths.html
-rw-rw-r-- 1 luna luna   463 Oct 14 16:43 chr1_StartPositions.html

查看html文件BAMStats2,重命名,同时下载html和文件夹BAMStats2.data到本地,放在同一个目录下,以后使用浏览器打开:

mv BAMStats2 BAMStats2.html
tar -cf Stats2.tar BAMStats2*
sz Stats2.tar
# 本地解压

打开BAMStats2.html,部分结果展示: test2

也可以到文件夹BAMStats2.data中查看每一条染色体: chr1raw

chr1mapped

GUI界面也可以看下图: GUI

GATK DepthOfCoverage

gatk3和gatk4中都有这个工具,DepthOfCoverage,本处使用gatk 4.1.8.1:

which gatk
/home/luna/Desktop/Software/gatk/gatk-4.1.8.1/gatk
# gatk4
gatk DepthOfCoverage -R /home/luna/Desktop/database/homo_bwa/hg19.fa -I samp.sort.bam -O gatk -L chr1 --summary-coverage-threshold 1 --summary-coverage-threshold 4

# gatk3
java -jar /home/luna/Desktop/Software/gatk/GenomeAnalysisTK.jar -T DepthOfCoverage -I samp.sort.bam -R /home/luna/Desktop/database/homo_bwa/hg19.fa --out gatk3 --summaryCoverageThreshold 1 --summaryCoverageThreshold 4 -L chr1

# -L 后面可以接单个染色体的名称,也可以多次使用,或者bed文件包含基因组的多个区域作为限定
# summary Coverage Threshold,可以随意设置数值(N),表示%_above_N,即深度超过N的coverage of breadth.

运行结束会生成多个文件,主要是查看一下summary文件:

no suffix: per locus coverage
_summary: total, mean, median, quartiles, and threshold proportions, aggregated over all bases
_statistics: coverage histograms (# locus with X coverage), aggregated over all bases
_interval_summary: total, mean, median, quartiles, and threshold proportions, aggregated per interval
_interval_statistics: 2x2 table of # of intervals covered to >= X depth in >=Y samples
_gene_summary: total, mean, median, quartiles, and threshold proportions, aggregated per gene
_gene_statistics: 2x2 table of # of genes covered to >= X depth in >= Y samples
_cumulative_coverage_counts: coverage histograms (# locus with >= X coverage), aggregated over all bases
_cumulative_coverage_proportions: proprotions of loci with >= X coverage, aggregated over all bases

## 查看
cat gatk.sample_summary
sample_id,total,mean,granular_third_quartile,granular_median,granular_first_quartile,%_bases_above_1,%_bases_above_4
SRX247249,155122543,0.69,2,1,1,41.8,2.2
Total,155122543,0.69,N/A,N/A,N/A,N/A,N/A

BamStats

此处有多个bamstats的程序,所以作为区分,列出来。

Biopet BamStats

Biopet BamStats is a package that contains tools to generate stats from a BAM file, merge those stats for multiple samples, and validate the generated stats files.

Mode - Generate
Generate reports clipping stats, flag stats, insert size and mapping quality on a BAM file. 

The output of the JSON file is organized in a sample - library - readgroup tree structure. If readgroups in the BAM file are not annotated with sample (SM) and library (LB) tags an error will be thrown. This can be fixed by using samtools addreplacerg or picard AddOrReplaceReadGroups.

Mode - Merge
This module will merge bamstats files together and keep the sample/library/readgroup structure. Values for the same readgroups will be added. It will also validate the resulting file.

Mode - Validate
Validates a BamStats file. If aggregation values can not be regenerated the file is considered corrupt. This should only happen when the file has been manually edited.

website: https://biopet.github.io/bamstats/1.0.1/index.html
wget https://github.com/biopet/bamstats/releases/download/v1.0.1/BamStats-assembly-1.0.1.jar
mv BamStats-assembly-1.0.1.jar BamStats-biopet.jar
mkdir biopet
java -jar /home/luna/Desktop/Software/StatsBAM/BamStats-biopet.jar generate -R /home/luna/Desktop/database/homo_bwa/hg19.fa -o biopet -b samp.sort.bam --tsvOutputs
# 会在biopet生成文件
3prime_clipping.histogram.tsv
5prime_clipping.stats.tsv
clipping.histogram.tsv
insertsize.histogram.tsv
left_clipping.stats.tsv
right_clipping.histogram.tsv
3prime_clipping.stats.tsv
bamstats.json
clipping.stats.tsv
insertsize.stats.tsv
mappingQuality.histogram.tsv
right_clipping.stats.tsv
5prime_clipping.histogram.tsv
bamstats.summary.json
flagstats.tsv
left_clipping.histogram.tsv
mappingQuality.stats.tsv

结果很详细,但是对于json文件我没有找到合适的后续处理,这里仅记录。

guigolab bamstats

wget -O - https://github.com/guigolab/bamstats/releases/download/v0.3.5/bamstats-v0.3.5-linux-x86_64.tar.gz | tar xz -C ./ bamstats

/home/luna/Desktop/Software/StatsBAM/bamstats/bamstats -h
bamstats - compute mapping statistics

Usage:
  bamstats [flags]

Flags:
  -a, --annotaion string   element annotation file
  -c, --cpu int            number of cpus to be used (default 80)
  -h, --help               help for bamstats
  -i, --input string       input file (required)
      --loglevel string    logging level (default "warn")
      --max-buf int        maximum number of buffered records (default 1000000)
  -o, --output string      output file (default "-")
  -n, --reads int          number of records to process (default -1)
  -u, --uniq               output genomic coverage statistics for uniqely mapped reads too
      --version            version for bamstats

Bamstats can currently compute the following mapping statistics: general genome coverage RNA-seq

使用,具体上github查看,https://github.com/guigolab/bamstats/blob/master/scripts/

/home/luna/Desktop/Software/StatsBAM/bamstats/bamstats -i samp.sort.bam --cpu 20 --uniq --output guigolab.bamstats

bamdst

Bamdst is a lightweight tool to stat the depth coverage of target regions of bam file(s). Bam file(s) should be properly sorted, and the probe file (bed file) and the output dir must be specified in the first place.

下载安装

git clone https://github.com/shiquan/bamdst.git
cd bamdst/
make

使用

在使用该软件之前,我们首先制作一个bed文件:

cat chr1.region.bed
chr1	64909787	160398769
## PS中间空白是TAB,不是空格

查看使用说明

/home/luna/Desktop/Software/StatsBAM/bamdst/bamdst -h

bamdst version: 1.0.9
USAGE : bamdst [OPTION] -p <probe.bed> -o <output_dir> [in1.bam [in2.bam ... ]]
   or : bamdst [OPTION] -p <probe.bed> -o <output_dir> -

Option -o and -p are mandatory:
  -o, --outdir         output dir
  -p, --bed            probe or target regions file, the region file will
                       be merged before calculate depths

测试

mkdir ./bamdst
/home/luna/Desktop/Software/StatsBAM/bamdst/bamdst -p chr1.region.bed -o ./bamdst samp.sort.bam
cd ./bamdst
ls
# 列出生成的文件
-coverage.report This file contains all the coverage information of target and flank region, and reads stat information of the input file(s).
-cumu.plot Depth distrbution for plot.
-insert.plot Inferred insert size distribution for plot.
-chromosome.report Depth and coverage information of each chromosome.
-region.tsv.gz For each region in probe file (in.bed), average depth, median depth and coverage of these regions will be listed in the file.
-depth.tsv.gz For each position in the probe file(in.bed), three kinds depth value will be calculated, including raw depth, rmdup depth and the coverage depth. 
-uncover.bed This bed file contains the bad covered or uncovered region in the input bam file(s) against the probe file.

查看coverage.report 和 chromosome.report

cat coverage.report
## The file was created by bamdst
## Version : 1.0.9
## Files : samp.sort.bam
                               [Total] Raw Reads (All reads)    17870982
                                       [Total] QC Fail reads    0
                                        [Total] Raw Data(Mb)    1804.97
                                        [Total] Paired Reads    17870982
                                        [Total] Mapped Reads    17870982
                            [Total] Fraction of Mapped Reads    100.00%
                                     [Total] Mapped Data(Mb)    1804.97
                         [Total] Fraction of Mapped Data(Mb)    100.00%
                                     [Total] Properly paired    5208824
                         [Total] Fraction of Properly paired    29.15%
                                [Total] Read and mate paired    17870982
                    [Total] Fraction of Read and mate paired    100.00%
                                          [Total] Singletons    0
                       [Total] Read and mate map to diff chr    414016
                                               [Total] Read1    8935491
                                               [Total] Read2    8935491
                                        [Total] Read1(rmdup)    8935491
                                        [Total] Read2(rmdup)    8935491
                                [Total] forward strand reads    8940264
                               [Total] backward strand reads    8930718
                                 [Total] PCR duplicate reads    0
                     [Total] Fraction of PCR duplicate reads    0.00%
                            [Total] Map quality cutoff value    20
                       [Total] MapQuality above cutoff reads    17045397
                 [Total] Fraction of MapQ reads in all reads    95.38%
              [Total] Fraction of MapQ reads in mapped reads    95.38%
                                       [Target] Target Reads    548096
              [Target] Fraction of Target Reads in all reads    3.07%
           [Target] Fraction of Target Reads in mapped reads    3.07%
                                    [Target] Target Data(Mb)    55.36
                              [Target] Target Data Rmdup(Mb)    48.10
                [Target] Fraction of Target Data in all data    3.07%
             [Target] Fraction of Target Data in mapped data    3.07%
                                      [Target] Len of region    95488982
                                      [Target] Average depth    0.58
                               [Target] Average depth(rmdup)    0.50
                                     [Target] Coverage (>0x)    32.72%
                                    [Target] Coverage (>=4x)    1.87%
                                   [Target] Coverage (>=10x)    0.09%
                                   [Target] Coverage (>=30x)    0.01%
                                  [Target] Coverage (>=100x)    0.00%
                                [Target] Target Region Count    1
                                [Target] Region covered > 0x    0
                       [Target] Fraction Region covered > 0x    0.00%
                      [Target] Fraction Region covered >= 4x    0.00%
                     [Target] Fraction Region covered >= 10x    0.00%
                     [Target] Fraction Region covered >= 30x    0.00%
                    [Target] Fraction Region covered >= 100x    0.00%
                                          [flank] flank size    200
           [flank] Len of region (not include target region)    200
                                       [flank] Average depth    1.09
                                         [flank] flank Reads    4
                [flank] Fraction of flank Reads in all reads    0.00%
             [flank] Fraction of flank Reads in mapped reads    0.00%
                                      [flank] flank Data(Mb)    0.00
                  [flank] Fraction of flank Data in all data    0.00%
               [flank] Fraction of flank Data in mapped data    0.00%
                                      [flank] Coverage (>0x)    55.50%
                                     [flank] Coverage (>=4x)    2.00%
                                    [flank] Coverage (>=10x)    0.00%
                                    [flank] Coverage (>=30x)    0.00%
                                   [flank] Coverage (>=100x)    0.00%

cat chromosomes.report
#Chromosome DATA(%) Avg depth Median Coverage% Cov 4x % Cov 10x % Cov 30x % Cov 100x %
chr1 100 0.58 0 32.72 1.87 0.09 0.01 0

Qualimap

Qualimap是功能比较全的一款质控软件,提供GUI界面和命令行界面,可以对bam文件,RNA-seq,Counts数据质控,也支持比对数据,counts数据和表观数据的比较和聚类。

Qualimap:http://qualimap.bioinfo.cipf.es/doc_html/index.html

命令行文档:http://qualimap.bioinfo.cipf.es/doc_html/command_line.html#command-line

下载安装

wget https://bitbucket.org/kokonech/qualimap/downloads/qualimap_v2.2.1.zip
unzip qualimap_v2.2.1.zip

正常使用qualimap,需要安装部分R包:

# 安装相关的R包
# NOISeq,Repitools, Rsamtools, GenomicFeatures, rtracklayer
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("NOISeq", version = "3.8")
BiocManager::install("Repitools", version = "3.8")
BiocManager::install("Rsamtools", version = "3.8")
BiocManager::install("GenomicFeatures", version = "3.8")
BiocManager::install("rtracklayer", version = "3.8")

使用

/home/luna/Desktop/Software/qualimap/qualimap bamqc --java-mem-size=40G -c -nt 30 -outdir ./ -outformat PDF -os -outfile samp.bamqc.pdf -bam samp.sort.bam -gd hg19 -nr 2000

mosdepth

Mosdepth是一种新的命令行工具,可以快速计算全基因组测序覆盖率。输入BAM或CRAM文件,计算在基因组中每个核苷酸位置或基因组区域的深度;也可将基因组区域指定为被BED文件覆盖的指定区域,或者指定为CNV calling所需的固定大小窗口。测试来看,Mosdepth比大多数现有工具更快。

下载安装

## github上有编译好的最新版的二进制文件,可以下载使用
wget https://github.com/brentp/mosdepth/releases/download/v0.3.1/mosdepth
chmod +x mosdepth
./mosdepth --help
mosdepth 0.3.1

  Usage: mosdepth [options] <prefix> <BAM-or-CRAM>

Arguments:

  <prefix>       outputs: `{prefix}.mosdepth.dist.txt`
                          `{prefix}.mosdepth.summary.txt`
                          `{prefix}.per-base.bed.gz` (unless -n/--no-per-base is specified)
                          `{prefix}.regions.bed.gz` (if --by is specified)
                          `{prefix}.quantized.bed.gz` (if --quantize is specified)
                          `{prefix}.thresholds.bed.gz` (if --thresholds is specified)

  <BAM-or-CRAM>  the alignment file for which to calculate depth.

Common Options:

  -t --threads <threads>     number of BAM decompression threads [default: 0]
  -c --chrom <chrom>         chromosome to restrict depth calculation.
  -b --by <bed|window>       optional BED file or (integer) window-sizes.
  -n --no-per-base           dont output per-base depth. skipping this output will speed execution
                             substantially. prefer quantized or thresholded values if possible.
  -f --fasta <fasta>         fasta file for use with CRAM files [default: ].
  --d4                       output per-base depth in d4 format.
  
Other options:

  -F --flag <FLAG>              exclude reads with any of the bits in FLAG set [default: 1796]
  -i --include-flag <FLAG>      only include reads with any of the bits in FLAG set. default is unset. [default: 0]
  -x --fast-mode                dont look at internal cigar operations or correct mate overlaps (recommended for most use-cases).
  -q --quantize <segments>      write quantized output see docs for description.
  -Q --mapq <mapq>              mapping quality threshold. reads with a quality less than this value are ignored [default: 0]
  -T --thresholds <thresholds>  for each interval in --by, write number of bases covered by at
                                least threshold bases. Specify multiple integer values separated
                                by ','.
  -m --use-median               output median of each region (in --by) instead of mean.
  -R --read-groups <string>     only calculate depth for these comma-separated read groups IDs.
  -h --help                     show help

参数: -t 线程数,推荐4个,After about 4 threads, there is no benefit for additional threads -c 指定染色体去计算深度 -b 指定基因组区域,比如基因的位置,就能计算基因的覆盖度 -F 排除含有该FLAG的reads。默认是1796,代表着: read unmapped (0x4) not primary alignment (0x100) read fails platform/vendor quality checks (0x200) read is PCR or optical duplicate (0x400) -Q mapping quality,默认是0,一般可以按照自己的需要进行调整为5或者10 -x 快速模式 -T 按照阀值(比如1X,2X,10X)输出达到指定阀值的核苷酸数目

使用

# To calculate the coverage in each exome capture region
mosdepth --by capture.bed sample-output sample.exome.bam

# coverage thresholds 
mosdepth --by capture.bed --thresholds 1,3,6,10 sample-output sample.exome.bam

# WGS example For 500-base windows
mosdepth -n --fast-mode --by 500 sample.wgs $sample.wgs.cram

indexcov

indexcov是goleft内的一个小工具,goleft is a collection of bioinformatics tools distributed under MIT license in a single static binary。

注意到这个软件,是因为mosdepth和indexcov的作图,可以上github上查看,https://github.com/brentp/goleft/tree/master/indexcov

Quickly estimate coverage from a whole-genome bam or cram index. A bam index has 16KB resolution so that’s what this gives, but it provides what appears to be a high-quality coverage estimate in seconds per genome.

The output is scaled to around 1. So a long stretch with values of 1.5 would be a heterozygous duplication. This is useful as a quick QC to get coverage values across the genome.

In our tests, we can estimate depth across 60X genomes for 30 samples in 30 seconds.

测序深度和测序覆盖度,很多时候对于不同的研究有着不同的要求,下面给出网上的一个表格:

Coverage depth recommendations, for human ~3Gb

Category Detection or Application Recommended Depth (x) or Reads (millions) References
Whole genome sequencing Homozygous SNVs 15x Bentley et al., 2008
  Heterozygous SNVs 33x Bentley et al., 2008
  INDELs 60x Feng et al., 2014
  Genotype calls 35x Ajay et al., 2011
  CNV 1-8x Xie et al., 2009; Medvedev at al., 2010
Whole exome sequencing Homozygous SNVs 100x (3x local depth) Clark et al., 2011; Meynert et al., 2013
  Heterozygous SNVs 100x (13x local depth) Clark et al., 2011; Meynert et al., 2013
  INDELs not recommended Feng et al., 2014
Transcriptome Sequencing Differential expression profiling 10-25M Liu Y. et al., 2014; ENCODE 2011 RNA-Seq
  Alternative splicing 50-100M Liu Y. et al., 2013; ENCODE 2011 RNA-Seq
  Allele specific expression 50-100M Liu Y. et al., 2013; ENCODE 2011 RNA-Seq
  De novo assembly >100M Liu Y. et al., 2013; ENCODE 2011 RNA-Seq
DNA Target-Based Sequencing ChIP-Seq 10-14M (sharp peaks); 20-40M (broad marks) Rozowsky et al., 2009; ENCODE 2011 Genome; Landt et al., 2012
  Hi-C 100M Belton, J.M et al., 2012
  4C (Circularized Chromosome Confirmation Capture) 1-5M van de Weken, H.J.G. et al., 2012
  5C (Chromosome Carbon Capture Carbon Copy) 15-25M Sanyal A. et al., 2012
  ChIA-PET (Chromatin Interaction Analysis by Paired-End Tag Sequencing) 15-20M Zhang, J. et al., 2012
  FAIRE-Seq 25-55M ENCODE 2011 Genome; Landt et al., 2012
  DNAse 1-Seq 25-55M Landt et al., 2012
DNA Methylation Sequencing CAP-Seq >20M Long, H.K. et al., 2013
  MeDIP-Seq 60M Taiwo, O. et al., 2012
  RRBS (Reduced Representation Bisulfite Sequencing) 10X ENCODE 2011 Genome
  Bisulfite-Seq 5-15X; 30X Ziller, M.J et al., 2015; Epigenomics Road Map
RNA-Target-Based Sequencing CLIP-Seq 10-40M Cho J. et al., 2012; Eom T. et al., 2013; Sugimoto Y. et al., 2012
  iCLIP 5-15M Sugimoto Y. et al., 2012; Rogelj B. et al., 2012
  PAR-CLIP 5-15M Rogelj B. et al., 2012
  RIP-Seq 5-20M Lu Z. et al., 2014
Small RNA (microRNA) Sequencing Differential Expression ~1-2M Metpally RPR et al., 2013; Campbell et al., 2015
  Discovery ~5-8M Metpally RPR et al., 2013; Campbell et al., 2015

References

Recommended Coverage and Read Depth

bamdst

						—— dulunar 后记于 2020.10