Contents
sashimi_plot is a utility for automatically producing publication-quality plots for RNA-Seq analyses of isoform expression. It is part of the MISO framework. In particular, sashimi_plot can: (1) plot raw RNA-Seq densities along exons and junctions for multiple samples, while simultaneously visualizing the gene model/isoforms to which reads map, and (2) plot MISO output alongside the raw data or separately.
The MISO framework is described in Katz et. al., Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nature Methods (2010).
We chose “Sashimi” because our tool plots the raw RNA-Seq data in addition to inferences made about the RNA-Seq reads (hat tip to Vincent Butty.) Also, the variations and various “bumps” in read densities that one commonly observes also look a bit like rolls of Sashimi. Besides, we thought Sashimi would go well with MISO.
2012
- Dynamic but consistent y-axis scaling: if ymax is omitted, the same yscale will be chosen for all samples, whose maximum value is determined by the maximum y-value across all samples being plotted.
- x-axis for bar_posterior feature is now thinner and has smaller ticks, which results in much better visualization (especially if you’re plotting many samples.)
- Bugfix for scenario that caused plots to fail with KeyError on some events
- Now skipping reads with insertions or deletions in their CIGAR strings
- Figure height/width now correctly read from settings file
- Optional sample_labels argument for labeling each sample’s track in --plot-event
Thanks to Sol Katzman, Michael Lovci, Sean O’Keeffe and Vincent Butty for their contributions and suggestions.
2011
sashimi_plot comes packaged with MISO. If you have MISO installed, you already have sashimi_plot. sashimi_plot itself only requires the Python package matplotlib (version 1.1.0 or higher) as well as samtools.
To test that sashimi_plot is working, we first need to get a GFF annotation of the alternative events to be visualized. An example GFF annotation file of events is provided in sashimi_plot/test-data/events.gff. This GFF file has to be indexed in order to be used with MISO, with the index_gff.py script:
cd sashimi_plot/
python ../index_gff.py --index test-data/events.gff test-data/event-data/
We can now plot this event by running the following from within the sashimi_plot directory in MISO:
python plot.py --plot-event "chr17:45816186:45816265:-@chr17:45815912:45815950:-@chr17:45814875:45814965:-" test-data/event-data/ settings/sashimi_plot_settings.txt --output-dir test-plot
If successful, you should get a plot in the directory test-plot/ called chr17:45816186:45816265:-@chr17:45815912:45815950:-@chr17:45814875:45814965:-.pdf. An annotated graphical explanation of the main features of the output is shown below.
Key items to notice:
We return to our main test example of the --plot-event feature. The call:
python plot.py --plot-event "chr17:45816186:45816265:-@chr17:45815912:45815950:-@chr17:45814875:45814965:-" test-data/event-data/ settings/sashimi_plot_settings.txt --output-dir test-plot
Plots the event called chr17:45816186:45816265:-@chr17:45815912:45815950:-@chr17:45814875:45814965:-, using the directory pickled event test-data/event-data/ and plotting according to the information provided in the settings file settings/sashimi_plot_settings.txt. The name of this event in this case is simply the ID given to this skipped exon in the GFF annotations provided with MISO (see Mouse genome (mm9) alternative events). The name is arbitrary, and sashimi_plot will visualize whatever events you give it as long as they have a corresponding indexed GFF file.
The directory containing the event/gene isoform information (in the above example, test-data/event-data) can be any directory generated by indexing a GFF3 file, using the index_gff.py script that is part of MISO. For more information on indexing, see Preparing the alternative isoforms annotation.
The settings file for sashimi_plot specifies the name of each of the samples to be plotted, the directory containing their corresponding BAM files and MISO output, and a variety of plotting parameters, such as the figure colors and dimensions. The example settings file settings/sashimi_plot_settings.txt is:
[data]
# directory where BAM files are
bam_prefix = ./test-data/bam-data/
# directory where MISO output is
miso_prefix = ./test-data/miso-data/
bam_files = [
"heartWT1.sorted.bam",
"heartWT2.sorted.bam",
"heartKOa.sorted.bam",
"heartKOb.sorted.bam"]
miso_files = [
"heartWT1",
"heartWT2",
"heartKOa",
"heartKOb"]
[plotting]
# Dimensions of figure to be plotted (in inches)
fig_width = 7
fig_height = 5
# Factor to scale down introns and exons by
intron_scale = 30
exon_scale = 4
# Whether to use a log scale or not when plotting
logged = False
font_size = 6
# Max y-axis
ymax = 150
# Whether to plot posterior distributions inferred by MISO
show_posteriors = True
# Whether to show posterior distributions as bar summaries
bar_posteriors = False
# Whether to plot the number of reads in each junction
number_junctions = True
resolution = .5
posterior_bins = 40
gene_posterior_ratio = 5
# List of colors for read denisites of each sample
colors = [
"#CC0011",
"#CC0011",
"#FF8800",
"#FF8800"]
# Number of mapped reads in each sample
# (Used to normalize the read density for RPKM calculation)
coverages = [
6830944,
14039751,
4449737,
6720151]
# Bar color for Bayes factor distribution
# plots (--plot-bf-dist)
# Paint them blue
bar_color = "b"
# Bayes factors thresholds to use for --plot-bf-dist
bf_thresholds = [0, 1, 2, 5, 10, 20]
The above settings file specifies where the BAM files for each sample are (and their corresponding MISO output files) and also controls several useful plotting parameters. The parameters are:
- bam_prefix: directory where BAM files for the samples to plot are. These BAM files should be coordinate-sorted and indexed.
- miso_prefix: directory where MISO output directories are for the events to be plotted. For example, if plotting a skipped exon event for which the MISO output lives in /data/miso_output/SE/, then miso_prefix should be set to /data/miso_output/SE.
- bam_files: list of BAM files for RNA-Seq samples in the order in which you’d like them to be plotted. Each value in the list should be a filename that resides in the directory specified by bam_prefix.
- miso_files: list of MISO output directories for each sample. Should follow same order of samples as bam_files. Each value in the list should be a MISO output directory that resides in the directory specified by miso_prefix.
Note
sashimi_plot will look recursively in paths of miso_files to find the MISO output file (ending in .miso) associated with the event that is being plotted. For example, if we have these settings:
miso_prefix = /miso/output/ miso_files = ['control']If our event is on a chromosome called chr7 in the annotation, then the program will check every subdirectory of /miso/output/control for a directory called chr7, and look for a file that has the form event_name.miso in that directory. If it cannot find such a directory in the first-level subdirectories, it will recurse into the subdirectories until it can find the file or until there are no more subdirectories to search.
- fig_height: output figure’s height (in inches.)
- exon_scale / intron_scale: factor by which to scale down exons and introns, respectively.
- logged: whether to log the RNA-Seq read densities (set to False for linear scaling)
- ymax: maximum value of y-axis for RNA-Seq read densities. If not given, then the highest y-axis value across all samples will be set for each, resulting in comparable y-scaling.
- show_posteriors: plot MISO posterior distributions if True, do not if False
- bar_posteriors: plot MISO posterior distributions not as histograms, but as a horizontal bar that simply shows the mean and confidence intervals of the distribution in each sample.
- colors: Colors to use for each sample. Colors should be listed in same order as bam_files and miso_files lists.
- coverages: Number of mapping reads in each sample, for use when when computing normalized (i.e. RPKM) RNA-Seq read densities. Should be listed in same order as bam_files and miso_files. These numbers correspond to the “per million” denominators used for calculating RPKM.
Additional parameters (all optional):
- sample_labels: a list of string labels for each sample. By default, sashimi_plot will use the BAM filename from bam_files as the label for the sample. This option provides alternative labels. Note that sample_labels must have the same number of entries as bam_files.
- reverse_minus: specifies whether minus strand (-) event isoforms are to be plotted in same direction as plus strand events. By default, set to False, meaning minus strand events will be plotted in direction opposite to plus strand events.
- nxticks: number of x-axis ticks to plot
- nyticks: number of y-axis ticks to plot
Note
For junction visualization, sashimi_plot currently uses only reads that cross a single junction. If a read crosses multiple exon-exon junctions, it is currently skipped, although MISO will use such a read in isoform estimation if it consistent with the given isoform annotation. Also, sashimi_plot currently ignores reads containing insertions or deletions and does not visualize sequence mismatches.
sashimi_plot/plot.py takes the following arguments:
--plot-posterior
Plot the posterior distribution. Takes as input a raw
MISO output file (.miso)
--plot-insert-len
Plot the insert length distribution from a given
insert length (*.insert_len) filename. Second
argument is a settings filename.
--plot-bf-dist
Plot Bayes factor distributon. Takes the arguments:
(1) Bayes factor filename (*.miso_bf) settings
filename, (2) a settings filename.
--plot-event
Plot read densities and MISO inferences for a given
alternative event. Takes the arguments: (1) event name
(i.e. the ID= of the event based on MISO gff3
annotation file, (2) directory where MISO output is
for that event type (e.g. if event is a skipped exon,
provide the directory where the output for all SE
events are), (3) path to plotting settings file.
--output-dir
Output directory.
MISO comes with several built-in utilities for plotting its output, which all make use of the Python matplotlib environment package. These can be accessed through the plot.py utility.
In the main example of --plot-event shown above, the MISO posterior distributions are shown fully as a histogram. Sometimes it’s easier to compare a group of samples by just comparing the mean expression level (along with confidence intervals) in each sample, without plotting the entire distribution. Using the bar_posteriors option in the settings file, this can be done. Setting:
bar_posteriors = True
yields the plot below:
The mean of each sample’s posterior distribution over Ψ is shown as a circle, with horizontal error bars extending to the upper and lower bounds of the confidence interval in each sample. Since the x-axis remains fixed in all samples, this makes it easy to visually compare the means of all samples and the overlap between their confidence intervals.
It is often useful to plot the distribution of events that meet various Bayes factor thresholds. For any Bayes factor threshold, we can compute the number of events that meet that threshold in a given comparison file and visualize this as a distribution. The option --plot-bf-dist does this, as follows:
plot.py --plot-bf-dist control_vs_knockdown.miso_bf settings.txt --output-dir plots/
This will plot the distribution of events meeting various Bayes factors thresholds in the file control_vs_knockdown.miso_bf (outputted by calling --compare-samples in MISO) using the plotting settings file settings.txt, and output the resulting plot to plots/. The resulting plot will look like:
This figure shows the number of events (in logarithmic scale) in the .miso_bf file that have Bayes factor greater than or equal to 0, greater than or equal to 1, greater than or equal to 2, etc. all the way to events with Bayes factor greater than or equal to 20.
The title of the plot says how many of the events in the input .miso_bf file were used in plotting the distribution. In the above example all 5231 entries in the file were used, but if the lowest Bayes factor threshold for the x-axis was set to be 2, for example, then only a subset of the entries would be plotted since there are events with Bayes factor less than 2.
The color of the bars used in the plot and Bayes factor thresholds for the x-axis can be customized through the setting file options bar_color and bf_thresholds, respectively. The default settings are:
# Color of bar for --plot-bf-dist
bar_color = "b"
# Bayes factor thresholds for --plot-bf-dist
bf_thresholds = [0, 1, 2, 5, 10, 20]
We can visualize posterior distributions as histograms using the --plot-posterior argument to plot.py. For example, to plot the MISO posterior distribution of our example event in the heartKOa sample, run:
python plot.py --plot-posterior "test-data/miso-data/heartKOa/chr17/chr17:45816186:45816265:-@chr17:45815912:45815950:-@chr17:45814875:45814965:-.miso" --output-dir test-plot
This will produce a PDF plot that looks like this:
For paired-end RNA-Seq samples, we can visualize the insert length distribution. This distribution is informative about the quality of the RNA-Seq sample, since it can tell us how precisely or cleanly the insert length of interest was selected during the RNA-Seq library preparation. This distribution is also used by MISO in order to assign read pairs to isoforms, and so the tighter this distribution is, the more confident we can be in assigning read pairs to isoforms based on their insert length.
The distribution can be plotted using the --plot-insert-len option, which takes as input: (1) an insert length file (ending in .insert_len) produced by MISO and (2) a plotting settings filename. For example:
python plot.py --plot-insert-len sample.insert_len settings.txt --output-dir plots/
will produce a histogram of the insert length in sample.insert_len and place it in the plots directory. The histogram might look like this:
1. I’d like to plot RNA-Seq data for my own annotations, which are not part of the MISO events. Can this be done? Yes. sashimi_plot can plot any event, as long as it is specified in the GFF3 format and indexed by the index_gff.py script that we provide. See Preparing the alternative isoforms annotation.
2. I get the error that the .positions field is undefined. This is caused by using an older version of the pysam module. Upgrading to version 0.6 or higher fixes the issue.
Note
Section under construction
Thanks to: