Working with High-Quality Reference Genomes

Learn how to work with high-quality reference genomes in this article by Tiago Antao, a bioinformatician currently working in the field of genomics. Originally a computer scientist, he crossed over to computational biology with an MSc in Bioinformatics from the Faculty of Sciences at the University of Porto, Portugal, and a Ph.D. on the spread of drug-resistant malaria from the Liverpool School of Tropical Medicine in the UK.

Many tasks in computational biology are dependent on the existence of reference genomes. If you are performing a sequence alignment, finding genes, or studying the genetics of populations at several points of your work, you will be directly or indirectly using a genome reference. In this article, you will learn a few general techniques to manipulate reference genomes. As an illustrative example, you’ll study the GC content (the fraction of the genome that is based on Guanine-Cytosine). Reference genomes are normally made available as FASTA files.

Getting ready

Genomes come in widely different sizes, ranging from viruses such as HIV (which is 9.7 kbp) to bacteria such as E. coli, to protozoans such as Plasmodium falciparum (the most important parasite species causing malaria) with its 14 chromosomes, mitochondrion, and apicoplast, to the fruit fly with three autosomes, a mitochondrion, and X/Y sex chromosomes, to humans with its three Gbp pairs spread across 22 autosomes, X/Y chromosomes, and mitochondria, all the way up to Paris japonica, a plant with 150 Gbp of genome. Along the way, you have different ploidy and different sex chromosome organizations.

As you can see, different organisms have very different genome sizes. This difference can be of several orders of magnitude. This can have significant implications for your programming style. Working with a large genome will require you to be more conservative with the use of memory. Unfortunately, larger genomes also benefit from more speed-efficient programming techniques (as you have much more data to analyze); these are conflicting requirements. The general rule is that you have to be much more careful with efficiency (both speed and memory) with larger genomes.

In order to make this article less burdensome, you’ll use a small eukaryotic genome from Plasmodium falciparum. This genome still has many typical features of larger genomes (for example, multiple chromosomes). So, it’s a good compromise between complexity and size. Note that with a genome of the size of P. falciparum, it will be possible to perform many operations by loading the whole genome in-memory. However, you can opt for a programming style that can be used with bigger genomes (for example, mammals) so that you can use this article in a more general way, but feel free to use more memory-intensive approaches with small genomes like this.

You can use Biopython, which is available in the IPython Notebook at https://github.com/tiagoantao/bioinf-python/tree/master/notebooks/02_Genomes. If you are not using notebooks, download the P. falciparum genome from the datasets page at https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/Datasets.ipynb (file pfalciparum.fasta).

How to do it…

  1. Start by inspecting the description of all the sequences on the reference genome FASTA file:
from Bio import SeqIO
genome_name = 'PlasmoDB-9.3_Pfalciparum3D7_Genome.fasta'
recs = SeqIO.parse(genome_name, 'fasta')
for rec in recs:
    print(rec.description)
  • Take a look at a part of the output:
  • Different genome references will have different description lines, but they will generally have important information over there. In this example, you have chromosomes, mitochondria, and apicoplast. You can also view chromosome sizes but take the value from the sequence length instead.

2. Parse the description line to extract the chromosome number. You can retrieve the chromosome size from the sequence and compute the GC content across chromosomes on a window basis:

from __future__ import print_function
from Bio import SeqUtils

recs = SeqIO.parse(genome_name, 'fasta')
chrom_sizes = {}
chrom_GC = {}
block_size = 50000
min_GC = 100.0
max_GC = 0.0
for rec in recs:
    if rec.description.find('SO=chromosome') == -1:
        continue
    chrom = int(rec.description.split('_')[1])
    chrom_GC[chrom] = []
    size = len(rec.seq)
    chrom_sizes[chrom] = size
    num_blocks = size // block_size + 1
    for block in range(num_blocks):
        start = block_size * block
        if block == num_blocks - 1:
            end = size
        else:
            end = block_size + start + 1
        block_seq = rec.seq[start:end]
        block_GC = SeqUtils.GC(block_seq)
        if block_GC < min_GC:
            min_GC = block_GC
        if block_GC > max_GC:
            max_GC = block_GC
        chrom_GC[chrom].append(block_GC)
print(min_GC, max_GC)
  • Perform a windowed analysis of all chromosomes. Start by defining a window size of 50 kbp. This is appropriate for P. falciparum (feel free to vary its size), but you will want to consider other values for genomes with chromosomes that are orders of magnitude different from this.
  • Note that you are re-reading the file. With such a small genome, it would have been feasible (in step one) to do an in-memory load of the whole genome. By all means, feel free to try this programming style for small genomes—it’s faster! However, this code is more generalized for larger genomes.
  • Note that in the for loop, you ignore the mitochondrion and apicoplast by parsing the SO entry to the description. The chrom_sizes dictionary will maintain the size of chromosomes.
  • The chrom_GC dictionary is your most interesting data structure and will have a list of the faction of the GC content for each 50 kbp window. So, for chromosome 1, which has a size of 640,851 bp, there will be 14 entries because this chromosome size has 14 blocks of 50 kbp.

Be aware of two unusual features of the P. falciparum genome: the genome is very AT-rich, that is, GC-poor. So, the numbers that you will get will be very low. Also, chromosomes are ordered based on size (as it’s common), but starting with the smallest size. The usual convention is to start with the largest size (for example, like genomes in humans).

3. Now, perform a genome plot of the GC distribution. Use shades of blue for the GC content. However, for high outliers, use shades of red. For low outliers, use shades of yellow:

from __future__ import division
from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import BasicChromosome
  • Use float division and import functions required by Biopython from the reportlab library:

The Biopython code has evolved over time, before Python was such a fashionable language. In the past, availability of libraries was quite limited. The usage of reportlab can be seen mostly as a legacy issue. It is suggested that you learn just enough from it to use it with Biopython. If you are planning on learning a modern plotting library in Python, you will probably want to consider matplotlib, Bokeh, or Python’s version of ggplot (or other visualization alternatives, such as Mayavi, VTK, or even Blender’s API).

chroms = list(chrom_sizes.keys())
chroms.sort()
biggest_chrom = max(chrom_sizes.values())
my_genome = BasicChromosome.Organism(output_format='png')
my_genome.page_size = (29.7*cm, 21*cm)
telomere_length = 10
bottom_GC = 17.5
top_GC = 22.0
for chrom in chroms:
    chrom_size = chrom_sizes[chrom]
    chrom_representation = BasicChromosome.Chromosome \('Cr %d' % chrom)
    chrom_representation.scale_num = biggest_chrom
    tel = BasicChromosome.TelomereSegment()
    tel.scale = telomere_length
    chrom_representation.add(tel)
    num_blocks = len(chrom_GC[chrom])
    for block, gc in enumerate(chrom_GC[chrom]):
        my_GC = chrom_GC[chrom][block]
        body = BasicChromosome.ChromosomeSegment()
        if my_GC > top_GC:
            body.fill_color = colors.Color(1, 0, 0)
        elif my_GC < bottom_GC:
            body.fill_color = colors.Color(1, 1, 0)
        else:
            my_color = (my_GC - bottom_GC) / (top_GC - bottom_GC)
            body.fill_color = colors.Color(my_color, my_color, 1)
        if block < num_blocks - 1:
            body.scale = block_size
        else:
            body.scale = chrom_size % block_size
        chrom_representation.add(body)
    tel = BasicChromosome.TelomereSegment(inverted=True)
    tel.scale = telomere_length
    chrom_representation.add(tel)
    my_genome.add(chrom_representation)
my_genome.draw('falciparum.png', 'Plasmodium falciparum')
  • The first line converts the return of the keys method to a list. This is redundant in Python 2, but not in Python 3, where the keys method has a specific return type: dict_keys.
  • Now, draw the chromosomes in order (hence the sort). You’ll need the size of the biggest chromosome (14 in P. falciparum) in order to assure that the size of chromosomes is printed with the correct scale (the biggest_chrom variable).
  • Create an A4-sized representation of an organism with a PNG output. Note that you will have to draw very small telomeres of 10 bp. This will produce a rectangular-like chromosome. You can make the telomeres bigger, giving it a roundish representation (or you may have a better idea of the correct telomere size for your species).
  • Now, declare that anything with GC content below 17.5 percent or above 22.0 percent will be considered an outlier. Remember that for most other species, this will be much higher.
  • Then, print these chromosomes proper. They are bounded by telomeres and composed of 50 kbp chromosome segments (the last segment is sized with the remainder). Each segment will be colored in blue with a red-green component based on the linear normalization between two outlier values. Each chromosome segment will either be 50 kbp or potentially smaller if the last one of the chromosome. The output is shown in the following figure:

The 14 chromosomes of Plasmodium falciparum color-coded with the GC content (red is more than 22 percent, yellow less than 17 percent, and the blue shades represent a linear gradient between both numbers.

4. Finally, if you are on the IPython Notebook (that is, do not perform this outside IPython), you can print the image inline:

from IPython.core.display import Image
Image("falciparum.png")

There’s more…

P. falciparum is a reasonable example for a eukaryote with a small genome that allows you to perform a small data exercise with enough features that make it still useful for most eukaryotes. Of course, there are no sex chromosomes (such as X/Y in humans), but these should be easy to process because reference genomes do not deal with ploidy issues.

P. falciparum does have a mitochondrion, but it is not dealt with here due to space issues. Biopython does have the functionality to print circular genomes, which you can also use with bacteria. With regards to bacteria and viruses, these genomes are much easier to process because their size is very small.

There are plenty of websites dedicated to a single organism (or a set of related organisms). Apart from PlasmoDB (http://plasmodb.org/plasmo/), from which you downloaded the P. falciparum genome, you can find VectorBase (https://www.vectorbase.org/). Flybase () for Drosophila is also worth mentioning, but do not forget to search for your organism of interest.

If you found this article interesting, you can explore to learn how to use modern Python bioinformatics libraries and applications to do cutting-edge research in computational biology. The main focus of Bioinformatics with Python Cookbook is the practical application of bioinformatics along with modern programming techniques and frameworks to deal with the ever-increasing deluge of bioinformatics data.

Leave a Reply