How I started NGS data analysis at home

↓
Books

I bought 3 books on NGS data analysis.

book1: "次世代シーケンサーDRY解析" (Kindle); star ***
book2: "Bioinformatics Data Skills" (O'Reilly PDF); star *****
book3: "" (Kindle) - I removed the title because this book was no use and completely a mess.; star *

↓
Server

I rented Sakura-VPS (Virtual Private Server). I choosed 8G-byte memory plan (8000 yen/month) because, in [book1], an author says bwa (aligner) program requres 8G byte or less memory to run. bwa actually consumed maximally 7G-byte memory when I aligned FES1 NGS data on the mouse genome. The storage size of 200G byte for this plan was very strict to align FES1 NGS data. I had to delete as much as possible data after they were processed.

without compression

read_1.fastq (67G)
read_2.fastq (67G)
reference-genome.fa (4.5G)
reference-genome_index (total 7.5G)
read_1.sai (6G)
read_2.sai (6G)
read.sam (134G)
-----------------------------------
total 292G > 200G

with compression

read_1.fastq.gz (15G)
read_2.fastq.gz (15G)
reference-genome.fa.gz (0.8G)
reference-genome_index (total 7.5G)
read_1.sai (6G)
read_2.sai (6G)
read.bam (44G)
-----------------------------------
total 94G < 200G

I setup Ubuntu OS (amd64) on the server. I also setup sshd and registered my SSH public key generated by PuTTY.exe on Windows. I also bought Metro PuTTY for SSH login from my Surface 2 laptop, which shares my SSH private key with my Windows environment.

↓
Analysis

- Tips

Useful command line tips for NGS data analysis

format raw sequence data

$ cat foo.seq | fold > foo.set.txt

add FASTA header

$ echo ">foo" | cat - foo.seq > foo.fasta

extract FASTA headers

$ grep "^>" foo.fasta

replace FASTA header

$ echo ">foo" | cat - <(tail -n +2 bar.fasta) > foo.fasta

extract bases in a specified range from raw sequence data

$ cat foo.seq.txt | tr -d "\r\n" | cut -c 123-456 > foo_123-456.seq

get reverse complement of a DNA sequence data (simple version)

$ cat foo.seq | rev | tr "ATGCatgc" "TACGtacg" > cfoo.seq

run a process that should require long time without termination

$ nohup bwa mem test_index test.fastq.gz > test.sam 2> stderr.log &

Note: It is important to redirect stderr to a file because without it nohup redirect stderr to stdout that will be saved in the output sam file in this case.

If you want to use pipe ...

$ nohup bwa mem test_index test.fastq.gz 2>stderr.log | grep "test" > test.sam &

calculate average coverage for a .bam file

$ samtools depth foo.bam | awk '{sum+=$3} END {print sum/NR}'

calculate total length of reference sequence for a .bam file

$ samtools view -H foo.bam | grep -P "^@SQ" | cut -f 3 -d ":" | awk '{sum+=$1} END {print sum}'

count the numbers of reads in a .bam file

$ samtools view -c foo.bam

extract read_ids from a .bam file

$ samtools view -h foo.bam | grep -v "^@" | cut -f 1

extract reads from a .fastq file based on an id list

$ sed -e "s/^/^@/" -e "s/$/\\\s" foo.read_id > foo.read_id.pattern

$ zgrep -A3 -f foo.read_id.pattern bar.fastq.gz > extracted.fastq

remove FASTA header

$ tail -n +2 foo.fasta

extract specified region of refarence genome sequence

$ samtools faidx genome.fa

$ samtools faidx genome.fa chromosomeName:123-456

inspect .gz data

use zmore, zless, zgrep, gzip, and gunzip

want to know what file format or what compression (bgzip?) is used for a specific file

use htsfile command included in htslib

convert PHYLIP style sequence list to FASTA format

$ cat woodmouse.phylip | awk 'NR>1 && length($0)>10 { print ">" substr($0,1,10) "\n" substr($0,11) }' > woodmouse.fa

extract only SNPs from .vcf file

$ bcftools view -v snps -Oz DRR028646.vcf.gz > DRR028646.snps.vcf.gz

extract only files in specific directories from a .tar.gz file

$ tar xzvf Mus_musculus_NCBI_GRCm38.tar.gz --wildcards "*WholeGenomeFasta"

write robust bash scripts

start with these lines [book2]:

#!/bin/bash
set -e
set -u
set -o pipefail

Note: this technique can not be used with diff command.

archive the resultant files of `find`

$ find -name "snpexp.129B6.out" -type f -print0 | tar -czvf snpexp.tar.gz --null -T -

ref: http://stackoverflow.com/questions/5891866/find-files-and-tar-them-with-spaces

count uniq items in a list

$ sort -u A.list | wc -l

find common items in 2 lists

$ comm -1 -2 <(sort -u A.list) <(sort -u B.list)

sort .vcf lines lexicographcally by chromosome positions

$ grep '^#' input.vcf > output.vcf && grep -v '^#' | LC_ALL=C sort -t '	' -k1,1 -k2,2n >> output.vcf

Note: input -t ' ' as single quote, CTRL-V and then TAB, single quote

ref: https://www.biostars.org/p/133487/

count each unique item in a list

use cut -f and uniq -c

backup script files only in the current directory

$ find -maxdepth 1 -name "*.sh" -print0 -o -name "*.py" -print0 | tar -czvf scripts.tar.gz --null -T -

extract only chromosome 8 (e.g.) from a BAM file

Note: the name of the chromosome may be "chr8" or something depending on what kind of reference genome you used.

$ samtools view -bS input.bam 8 > input-chr8.bam

view the header of a BAM file

$ samtools view -H input.bam

- Trouble shootings

Metro PuTTY could not connect to the server

This was due to the older implementation of key exchange algorithm of older PuTTY that Metro PuTTY seemed to be based on. I fix this problem by removing key exchange algorithms that contain "group-exhange" by modifying KexAlgorithms setting in /etc/ssh/sshd_conf file.

There are no scp implementations for Windows RT

I had to install vsftpd in the server instead of secure copy.

`samtools flags` didn't work

The version 0.1.19 of samtools which was installed by apt-get didn't support this command

The connection from Metro PuTTY was not kept alive when it was in background for several minutes. This might dependent on the use of the mobile router.

This problem was improved by adding two settings below to /etc/ssh/sshd_config file and doing /etc/init.d/ssh restart (it works as long as for 10 min for the settings below).

ClientAliveInterval 60
ClientAliveMaxCount 10

ref: http://rcmdnk.github.io/blog/2014/08/23/computer-linux-putty/

Chainging settings on the client (Metro PuTTY) side was no use.

Command `:syntax on` didn't work in vi (vim).

$ sudo apt-get install vim-nox

SAMtools/BCFtools didn't work and I got unexpected errors when I copy and paste commands and options found in a web site

The usage of the latest version of SAMtools/BCF tools is significantly different from that of 0.19 or earlier versions. I checked the latest command help and used appropriate commands and options.

I got an error when I used matplotlib.pyplot in non-GUI environment.

If you get "_tkinter.TclError: no display name and no $DISPLAY environment variable", add "matplotlib.use('Agg')" before importing pyplot.
ref: http://yubais.net/doc/matplotlib/introduction.html