De novo assembly of Illumina reads using ABySS and alignment using BWA

table of contents

expected learning outcome

The learning objective of this activity is to understand the basics of the sequence assembly process and some related analysis steps, as well as gain proficiency in some specific software for these analyses. After completion of this activity, you will have learned how to perform basic assembly steps using ABySS, use BWA and BWA-SW to align reads and contigs to a reference genome, use IGV to visualize these alignments, and use bcftools and snpEff to call variants and determine their effect.

The working example is data for a ~200 kbp bacterial artificial chromosome (BAC). The data are paired-end reads from one lane of an Illumina sequencer.

getting started

System requirements
Total run time of the tools alone is ~70 minutes on a 2-core 2 GHz system. 4 GB of RAM and 5 GB of disk space is required. The following software is required:

ABySS 1.3.0: assemble short reads de novo
BWA 0.5.9: align short reads
IGV 2.0.11: visualize a genome
Java 6: execute Java programs
samtools 0.1.16: manipulate SAM/BAM files
snpEff 1.9.5: call the effect of SNPs
tabix 0.2.5: index tab-delimited files

Set up the environment
Each time you open a new terminal, you will need to run the following two commands.
Change to the activity working directory and source the environment script:

    cd ~/wcg_oct2011/activities/abyss
    source environment

Check that the learning activity scripts are in the PATH:

    which run-abyss run-bwa run-bwasw run-snpeff run-bcftools run-bcftools-assembly

The FASTQ files are located in the activities folder:

    ls 30CJCAAXX_4_[12].fq.gz

exercise 0: index the reference using BWA

Index the reference file:

    cd $top
    bwa index $ref

8 min, 1 GB RAM, 260 MB disk space

exercise 1: align the reads to the reference using BWA (optional)


    cd $top
    mkdir bwa
    cd bwa
    run-bwa 2>&1 |tee bwa.log

40 min, 250 MB RAM, 3 GB disk space

While this job is running in the background, open a new terminal and continue with the next section. Remember you have to set the environment, in the new terminal window type:

    cd ~/wcg_oct2011/activities/abyss
    source environment

exercise 2: inspect the reads

There are two FASTQ files for this one lane of paired-end Illumina data: one file for the forward reads and one file for the reverse reads.
Look at the first few reads.

    cd $top
    gunzip -c 30CJCAAXX_4_1.fq.gz |head
  1. How long are the reads?
    [show answer]
  2. How many lines are there in both files? (hint: use wc -l)
    [show answer]
  3. How many lines per read?
    [show answer]
  4. How many reads are there in both files?
    [show answer]
  5. How many bases are sequenced?
    [show answer]
  6. Assuming the BAC is 200 kbp, what is the depth of coverage?
    [show answer]

exercise 3: assemble the reads into contigs using ABySS

Run the assembly.

    cd $top

On Mac OS X uncompress the read files and use the run-abyss-fq script:

    gunzip -c 30CJCAAXX_4_1.fq.gz >30CJCAAXX_4_1.fq
    gunzip -c 30CJCAAXX_4_2.fq.gz >30CJCAAXX_4_2.fq
    k=48 run-abyss-fq 2>&1 |tee abyss.log

On Linux use the run-abyss script:

    k=48 run-abyss 2>&1 |tee abyss.log

5 min, 200 MB RAM, 2 MB disk space

While the assembly is running, view the script in a text editor.

    <text editor command> $top/bin/run-abyss

Look at the option -n,--dry-run of abyss-pe. Its output is the
commands that ABySS will run for the assembly.

    k=32 run-abyss -n

The assembly runs in three stages: assemble contigs without paired-end information, align the paired-end reads to the initial assembly, and merge contigs joined by paired-end information. You can instruct ABySS to stop after any of these stages. Use the -n option to see the commands for each stage.

    k=32 run-abyss se-contigs -n
    k=32 run-abyss pe-sam -n
    k=32 run-abyss pe-contigs -n

Once the assembly has completed, view the contigs in a text editor.

    <text editor command> k48/HS0674-contigs.fa

Disabling line wrap makes it easier to browse the file.

  1. How many contigs are longer than 100 bp?
    [show answer]
  2. What is the length of the longest contig (hint: use awk)?
    [show answer]
  3. What is the N50 of the assembly?
    [show answer]
  4. View the assembly log in a text editor.
      <text editor command> abyss.log

    What portion of the reads align to the assembly? (hint: search for "Aligned")
    [show answer]

  5. What is the median fragment size and standard deviation of this library? (hint: search for "median")
    [show answer]

exercise 4: align the contigs to the reference using web BLAT

Open BLAT in a web browser.
View the assembled contigs in a text editor.

    <text editor command> k48/HS0674-contigs.fa

Disabling line wrap makes it easier to select the full sequence.
Select the two contigs whose lengths are ~8 and ~16.5 kbp and copy-and-paste their sequence into BLAT.

  1. What is the exact length of these two contigs?
    [show answer]
  2. Click "browser" for the best alignment and then zoom out 10x. To which chromosome and band do these contigs align?
    [show answer]
  3. What are the nearest two genes?
    [show answer]

Set the "Common SNPs" track (in Variation and Repeats) to "pack". A SNV is displayed with a red line. Zoom in on a SNV. Is it in dbSNP?

Zoom in on the gap between the two contigs.
Set the "RepeatMasker" track (in Variation and Repeats) to "full".

  1. What feature overlaps the gap that likely caused the assembly gap?
    [show answer]
  2. Zoom in to see the sequence of the feature.
    Zoom out to see the alignment of both contigs.
    Unaligned query sequence is shown with a thin purple line at the end of the alignment. The thin purple line can be difficult to see.
    Which contig has unaligned sequence at one end?
    [show answer]
  3. Select that contig, and copy the unaligned sequence to the clipboard. Aligned sequence is shown in blue upper-case characters, and unaligned sequence is shown in black lower-case characters.
    Open BLAST in a web browser. http://blast.ncbi.nlm.nih.gov
    Select "nucleotide blast".
    Select the database "Nucleotide collection (nr/nt)"
    Paste the sequence into the query box. Click BLAST.
    To what sequence is the best BLAST hit?
    [show answer]
  4. What is the cause of this chimeric contig?
    [show answer]

exercise 5: align the contigs to the reference using BWA-SW

Warning: do not start BWA-SW until BWA has completed unless your machine has at least 2 GB of RAM.

    cd $top
    mkdir k48/bwasw
    cd k48/bwasw
    ln -s ../HS0674-contigs.fa .

1 min, 800 MB RAM, 1 MB disk space
While the alignment is running, view the script in a text editor.

    <text editor command> $top/bin/run-bwasw

exercise 6: Browse the contig to reference alignments using samtools tview

Run samtools tview.

    cd $top/k48/bwasw
    samtools tview HS0674-contigs.bam $ref

Go to the region chr3:186,648,940

    g chr3:186,648,940
  1. What variants do you see?
    [show answer]
  2. Go to the region chr3:186,676,730
      g chr3:186,676,730

    Notice the Ns in the contig sequence, indicating a scaffold gap.
    How many Ns are in the contig?
    [show answer]

  3. How many Ns should there be for the size of the gap to agree with the reference?
    [show answer]

exercise 7: browse the contig to reference alignments using IGV

Start IGV:


Select View -> Preferences... -> Alignments and change "Visibility range threshold (kb)" to 1000. Close and restart IGV to make this configuration take effect.
Select the reference "Human hg19".
Then select File->Load from File...
Go to the region chr3:186,600,000-187,600,000 by entering it into the box labeled "Go".
IGV may take up to a minute to load.

  1. What genes overlap the contigs?
    [show answer]
  2. Add the dbSNP track.
    File->Load from Server... then expand "hg19", "Variation and Repeats"
    and select "dbSNP 1.3.1".
    Zoom in on a SNV. Is it in dbSNP? Is it coding?

  3. Bonus: Find a coding SNV. What is its dbSNP rs ID?
    [show answer]

exercise 8: view the contig to reference alignments SAM file

View the SAM file in a text editor.

    cd $top/k48/bwasw
    <text editor command> HS0674-contigs.sam

Disable line wrap.

    The contig ID is given in the first column, and the position of the contig on the reference is given in the third and fourth columns. Which contig has two alignments, and what are the positions of these two alignments?
    [show answer]

    The orientation is given in the second column. The numbers 0 and 16 indicate positive and negative orientation respectively. What is the position and orientation of these two alignments?
    [show answer]

    In IGV, go to the region chr3:186,600,000-187,600,000 and find these two alignments. What large-scale structural rearrangement has occurred, and what is its approximate size?
    [show answer]

    Which two genes are fused as a result of this rearrangement?
    [show answer]

exercise 9: call variants of the reads-to-reference alignments using bcftools (optional)

Check that BWA has completed aligning the reads to the reference.
Run bcftools in this terminal.

    cd $top/bwa
    run-bcftools 2>&1 |tee bcftools.log

20 min, 250 MB RAM, 120 MB disk space
While bcftools is running, view the script in a text editor.

    <text editor command> $top/bin/run-bcftools

exercise 10: call variants of the contigs-to-reference alignments using bcftools

Run bcftools.

    cd $top/k48/bwasw
    run-bcftools-assembly 2>&1 |tee bcftools.log

Browse the variants using IGV.
File->Load from File... k48/bwasw/HS0674-contigs.var.vcf.gz
Right-click on the VCF track and select "Color by Allele".

exercise 11: call the effects of the SNVs

Run snpEff.

    cd $top/k48/bwasw

On Mac OS X uncompress the read files and run the run-snpeff script:

    gunzip -c HS0674-contigs.var.vcf.gz >HS0674-contigs.var.vcf
    run-snpeff HS0674-contigs.var.vcf >HS0674-contigs.var.snpeff

On Linux download the snpEff database and run the run-snpeff script:

    wget sf.net/projects/snpeff/files/databases/v1_9_5/snpEff_v1_9_5_hg37.zip
    unzip snpEff_v1_9_5_hg37.zip
    mv data ~/wcg_oct2011/software/snpEff_v1_9_5/
    run-snpeff HS0674-contigs.var.vcf.gz >HS0674-contigs.var.snpeff

1 min, 1.0 GB RAM

View the output of snpEff in a text editor.

    <text editor command> HS0674-contigs.var.snpeff

Count the number of SNVs in each category of effect.

    sort -uk1,2 HS0674-contigs.var.snpeff |cut -f16 |cut -d: -f1 |sort |uniq -c |sort -rn

Find all the coding SNVs.

    grep CODING HS0674-contigs.var.snpeff |cut -f1-4,16-18 |uniq
    Find the non-synonymous SNV. What is its location?
    [show answer]
  1. What is its dbSNP rs ID? (hint: use IGV or the UCSC genome browser)
    [show answer]
  2. Open dbSNP in a web browser. http://www.ncbi.nlm.nih.gov/projects/SNP/
    What is the minor allele frequency (MAF) of this SNP?
    [show answer]

exercise 12: compare the assembly variants to the read-alignment variants (optional)

Browse both variants using IGV.
Start IGV.
File->Load from File... k48/bwasw/HS0674-contigs.bam
File->Load from File... k48/bwasw/HS0674-contigs.var.vcf.gz
File->Load from File... bwa/HS0674.var.vcf.gz
Go to the region chr3:186,648,960

  1. Is the SNV called by both methods?
    [show answer]
  2. Is the insertion called by both methods?
    [show answer]
  3. Hover the mouse cursor over the insertion to see the inserted sequence. What is the inserted sequence?
    [show answer]
  4. Compare the inserted sequence to the reference sequence at this location. What sequence do the inserted sequence and the reference sequence have in common?
    [show answer]
  5. Load the alignments of the reads to the reference.
    File->Load from File... bwa/30CJCAAXX_4.bam
    The aligned reads do not show the insertion, but the alignments that span the insertion have mismatches at the end of the alignment. Do those mismatched bases agree with the inserted sequence?
    [show answer]

supplementary exercise 13: align the reads to the contigs using BWA

Run BWA.

    cd $top
    mkdir k48/bwa
    cd k48/bwa
    bwa index ../HS0674-contigs.fa
    ref=../HS0674-contigs.fa run-bwa 2>&1 |tee bwa.log

12 min, 200 MB RAM, 3 GB disk space

Index the assembly FASTA file.

    samtools faidx ../HS0674-contigs.fa

This command may fail with the following error:

    [fai_build_core] line length exceeds 65535 in sequence '94'.
  1. How long is the longest line? (hint: use awk)
    [show answer]

Break the long lines into short lines.

    fold ../HS0674-contigs.fa >HS0674-contigs.fa

Index the FASTA file.

    samtools faidx HS0674-contigs.fa

Browse the BAM file using samtools tview.

    samtools tview 30CJCAAXX_4.bam HS0674-contigs.fa

Browse the BAM file using IGV.
File->Import Genome... k48/HS0674-contigs.fa
File->Load from File... k48/bwa/30CJCAAXX_4.bam

Find a scaffold gap on the largest contig.
Find the two contigs that have consistent mate pairs joining them.