I bought 3 books on NGS data analysis.
I rented a Sakura-VPS (Virtual Private Server). I chose the 8 GB memory plan (8000 yen/month) because, in [book1], the author says the bwa (aligner) program requires 8 GB of memory or less to run. bwa actually consumed at most 7 GB of memory when I aligned the FES1 NGS data to the mouse genome. The 200 GB of storage in this plan was very tight for aligning the FES1 NGS data. I had to delete as much data as possible once it had been processed.
read_1.fastq              (67G)
read_2.fastq              (67G)
reference-genome.fa       (4.5G)
reference-genome_index    (total 7.5G)
read_1.sai                (6G)
read_2.sai                (6G)
read.sam                  (134G)
-----------------------------------
total 292G > 200G

read_1.fastq.gz           (15G)
read_2.fastq.gz           (15G)
reference-genome.fa.gz    (0.8G)
reference-genome_index    (total 7.5G)
read_1.sai                (6G)
read_2.sai                (6G)
read.bam                  (44G)
-----------------------------------
total 94G < 200G
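The two totals can be re-checked with a few lines of shell. This is only a sketch of the arithmetic: the sizes are copied from the listings, and the helper name sum_gb is made up for this example.

```shell
# Sizes in GB, copied from the two file listings in the text.
uncompressed="67 67 4.5 7.5 6 6 134"
compressed="15 15 0.8 7.5 6 6 44"
disk_gb=200

# sum_gb is a hypothetical helper: it sums its arguments to one decimal place.
sum_gb() {
    printf '%s\n' "$@" | LC_ALL=C awk '{ s += $1 } END { printf "%.1f\n", s }'
}

# Word splitting on the unquoted variables is intentional here.
total_uncompressed=$(sum_gb $uncompressed)
total_compressed=$(sum_gb $compressed)

echo "uncompressed: ${total_uncompressed}G of ${disk_gb}G"
echo "compressed:   ${total_compressed}G of ${disk_gb}G"
```

Only the compressed set fits on the 200G disk, which is why keeping the FASTQ files gzipped and writing BAM instead of SAM mattered.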
I set up Ubuntu OS (amd64) on the server. I also set up sshd and registered my SSH public key generated by PuTTY.exe on Windows. I also bought Metro PuTTY for SSH login from my Surface 2 laptop, which shares my SSH private key with my Windows environment.
Useful command line tips for NGS data analysis
$ cat foo.seq | fold > foo.seq.txt
$ echo ">foo" | cat - foo.seq > foo.fasta
$ grep "^>" foo.fasta
$ echo ">foo" | cat - <(tail -n +2 bar.fasta) > foo.fasta
$ cat foo.seq.txt | tr -d "\r\n" | cut -c 123-456 > foo_123-456.seq
$ cat foo.seq | rev | tr "ATGCatgc" "TACGtacg" > cfoo.seq
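To convince myself that the rev + tr pipeline really gives the reverse complement, it can be tried on a toy sequence. This is a minimal sketch; the 8-base sequence and the temporary directory are made up for the example.

```shell
workdir=$(mktemp -d)

# Hypothetical toy input; foo.seq normally holds a real sequence on one line.
printf 'AACGTTTG\n' > "$workdir/foo.seq"

# rev reverses the line, tr swaps each base for its complement.
revcomp=$(rev "$workdir/foo.seq" | tr "ATGCatgc" "TACGtacg")
echo "$revcomp"
```

Note that rev works line by line, so a sequence wrapped over several lines has to be joined into one line first.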
$ nohup bwa mem test_index test.fastq.gz > test.sam 2> stderr.log &
Note: It is important to redirect stderr to a file. Otherwise nohup redirects stderr to stdout, so the log messages end up in the output SAM file in this case.
If you want to use a pipe ...
$ nohup bwa mem test_index test.fastq.gz 2>stderr.log | grep "test" > test.sam &
$ samtools depth foo.bam | awk '{sum+=$3} END {print sum/NR}'
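What the awk part of the mean depth command does can be seen without a BAM file. This is a small sketch that assumes the usual three-column samtools depth output (reference name, position, depth); the three lines of input are made up.

```shell
# Hypothetical stand-in for the output of `samtools depth foo.bam`:
# reference name, 1-based position, depth, separated by tabs.
depth_output=$(printf 'chr1\t1\t10\nchr1\t2\t20\nchr1\t3\t30\n')

# Sum column 3 and divide by the number of lines read (NR).
mean_depth=$(printf '%s\n' "$depth_output" | awk '{ sum += $3 } END { print sum / NR }')
echo "$mean_depth"
```

NR counts only the positions that appear in the output. As far as I know, samtools depth skips zero-coverage positions unless -a is given, so the result is the mean over covered positions rather than over the whole reference.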
$ samtools view -H foo.bam | grep -P "^@SQ" | cut -f 3 -d ":" | awk '{sum+=$1} END {print sum}'
$ samtools view -c foo.bam
$ samtools view -h foo.bam | grep -v "^@" | cut -f 1
$ sed -e "s/^/^@/" -e "s/$/\\\s/" foo.read_id > foo.read_id.pattern
$ zgrep -A3 -f foo.read_id.pattern bar.fastq.gz > extracted.fastq
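The two steps above (build anchored patterns, then pull four lines per hit) can be tried on a toy file. This is a sketch that assumes GNU grep (for \s) and the zgrep shipped with gzip; the reads and IDs are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Toy FASTQ with three reads (hypothetical data), gzip-compressed.
printf '%s\n' \
    '@read1 len=4' 'ACGT' '+' 'IIII' \
    '@read2 len=4' 'TTTT' '+' 'IIII' \
    '@read10 len=4' 'GGGG' '+' 'IIII' | gzip > bar.fastq.gz

# One read ID per line, without the leading "@".
printf 'read1\n' > foo.read_id

# Anchor each ID: "^@" in front and "\s" behind, so read1 does not match read10.
sed -e 's/^/^@/' -e 's/$/\\s/' foo.read_id > foo.read_id.pattern

# -A3 prints the three lines after each matching header line.
zgrep -A3 -f foo.read_id.pattern bar.fastq.gz > extracted.fastq
cat extracted.fastq
```

With several IDs whose records are not adjacent, GNU grep prints a "--" line between the groups, which would have to be removed before using the result as FASTQ.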
$ tail -n +2 foo.fasta
$ samtools faidx genome.fa
$ samtools faidx genome.fa chromosomeName:123-456
use zmore, zless, zgrep, gzip, and gunzip
use htsfile command included in htslib
$ cat woodmouse.phylip | awk 'NR>1 && length($0)>10 { print ">" substr($0,1,10) "\n" substr($0,11) }' > woodmouse.fa
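The PHYLIP to FASTA one-liner relies on the name filling exactly the first 10 columns of each record. This is a minimal sketch on a made-up two-sequence file in sequential PHYLIP layout; the file names are hypothetical.

```shell
workdir=$(mktemp -d)

# Toy sequential PHYLIP file (hypothetical data): a header line, then one
# record per line with the name padded to exactly 10 characters.
printf '%s\n' ' 2 8' 'No305     ACGTACGT' 'No304     ACGTTCGT' > "$workdir/toy.phylip"

# Skip the header (NR>1) and short lines, then split each record at column 10.
awk 'NR>1 && length($0)>10 { print ">" substr($0,1,10) "\n" substr($0,11) }' \
    "$workdir/toy.phylip" > "$workdir/toy.fa"
cat "$workdir/toy.fa"
```

The FASTA headers keep the padding spaces of the 10-character name field. An interleaved PHYLIP file, where later blocks carry no names, would probably not convert correctly this way.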
$ bcftools view -v snps -Oz DRR028646.vcf.gz > DRR028646.snps.vcf.gz
$ tar xzvf Mus_musculus_NCBI_GRCm38.tar.gz --wildcards "*WholeGenomeFasta"
#!/bin/bash
set -e
set -u
set -o pipefail
Note: this technique cannot be used with the diff command.
find
$ find -name "snpexp.129B6.out" -type f -print0 | tar -czvf snpexp.tar.gz --null -T -
ref: http://stackoverflow.com/questions/5891866/find-files-and-tar-them-with-spaces
$ sort -u A.list | wc -l
$ comm -1 -2 <(sort -u A.list) <(sort -u B.list)
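The counting and intersection commands can be tried on two toy lists. This sketch writes the sorted lists to temporary files instead of using <( ), which should give the same result in shells without process substitution; the gene names are made up.

```shell
workdir=$(mktemp -d)

# Hypothetical input lists, unsorted and with one duplicate in A.list.
printf '%s\n' geneB geneA geneC geneB > "$workdir/A.list"
printf '%s\n' geneC geneD geneA > "$workdir/B.list"

# Number of distinct entries in A.list.
unique_in_a=$(LC_ALL=C sort -u "$workdir/A.list" | wc -l | tr -d ' ')

# comm needs sorted input; -1 -2 hides the lines unique to either file,
# which leaves only the lines common to both.
LC_ALL=C sort -u "$workdir/A.list" > "$workdir/A.sorted"
LC_ALL=C sort -u "$workdir/B.list" > "$workdir/B.sorted"
common=$(LC_ALL=C comm -1 -2 "$workdir/A.sorted" "$workdir/B.sorted")

echo "distinct entries in A: $unique_in_a"
echo "$common"
```

Using the same LC_ALL for sort and comm avoids complaints from comm about input that is not in sorted order.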
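$ grep '^#' input.vcf > output.vcf && grep -v '^#' input.vcf | LC_ALL=C sort -t ' ' -k1,1 -k2,2n >> output.vcf
Note: the character between the single quotes after -t must be a literal tab, not a space. Type it as single quote, CTRL-V, TAB, single quote. In bash, -t $'\t' is equivalent.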
ref: https://www.biostars.org/p/133487/
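The effect of the VCF sort (header lines first, then records ordered by chromosome and numeric position) can be checked on a toy file. This is a sketch with made-up records; the tab is stored in a variable so it does not have to be typed literally.

```shell
workdir=$(mktemp -d)
tab=$(printf '\t')

# Toy VCF (hypothetical records), deliberately out of order.
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf '2\t300\t.\tA\tG\t.\t.\t.\n'
    printf '1\t2000\t.\tC\tT\t.\t.\t.\n'
    printf '1\t150\t.\tG\tA\t.\t.\t.\n'
} > "$workdir/input.vcf"

# Copy the header lines first, then append the sorted records.
grep '^#' "$workdir/input.vcf" > "$workdir/output.vcf" &&
    grep -v '^#' "$workdir/input.vcf" |
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n >> "$workdir/output.vcf"

# Positions in output order, joined with commas.
positions=$(grep -v '^#' "$workdir/output.vcf" | cut -f 2 | paste -sd, -)
echo "$positions"
```

Position 150 comes before 2000 because -k2,2n compares numerically. The chromosome column is compared as text, so a chromosome named 10 sorts before 2.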
use cut -f and uniq -c
$ find -maxdepth 1 -name "*.sh" -print0 -o -name "*.py" -print0 | tar -czvf scripts.tar.gz --null -T -
$ samtools view -b input.bam 8 > input-chr8.bam
$ samtools view -H input.bam
This was due to the older key exchange implementation in the old PuTTY version that Metro PuTTY seemed to be based on. I fixed this problem by removing the key exchange algorithms that contain "group-exchange" from the KexAlgorithms setting in the /etc/ssh/sshd_config file.
I had to install vsftpd on the server instead of using secure copy.
samtools flags didn't work. Version 0.1.19 of samtools, which was installed by apt-get, didn't support this command.
This problem was mitigated by adding the two settings below to the /etc/ssh/sshd_config file and running /etc/init.d/ssh restart (with these values the connection survives for as long as 10 min).
ClientAliveInterval 60
ClientAliveCountMax 10
ref: http://rcmdnk.github.io/blog/2014/08/23/computer-linux-putty/
Changing settings on the client (Metro PuTTY) side was no use.
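The 10 min figure comes from multiplying the two server-side values. As I understand the sshd_config manual, sshd probes the client every ClientAliveInterval seconds and disconnects after ClientAliveCountMax probes go unanswered (ClientAliveCountMax is how sshd_config spells the option). A small sketch of that arithmetic, with the settings held in a shell variable instead of the real config file:

```shell
# The two settings as they would appear in /etc/ssh/sshd_config
# (values taken from the text; kept in a variable for this example).
config='ClientAliveInterval 60
ClientAliveCountMax 10'

interval=$(printf '%s\n' "$config" | awk '$1 == "ClientAliveInterval" { print $2 }')
count=$(printf '%s\n' "$config" | awk '$1 == "ClientAliveCountMax" { print $2 }')

# Seconds an unresponsive client is tolerated before sshd drops it.
timeout_sec=$((interval * count))
echo "unresponsive client is dropped after ${timeout_sec}s ($((timeout_sec / 60)) min)"
```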
:syntax on didn't work in vi (vim).
$ sudo apt-get install vim-nox
The usage of the latest versions of SAMtools/BCFtools is significantly different from that of 0.1.19 or earlier versions. I checked the latest command help and used the appropriate commands and options.
If you get "_tkinter.TclError: no display name and no $DISPLAY environment variable", add "matplotlib.use('Agg')" before importing pyplot.
ref: http://yubais.net/doc/matplotlib/introduction.html