Comparing the viral quality of resolved genomes and their constituent unitigs

You can combine the resolved genomes (resolved_paths.fasta) and their constituent unitigs (resolved_edges.fasta), and compare the viral quality.

Run CheckV

You can combine the resolved genome sequences and unitig sequences and run CheckV as follows.

# Combine resolved_paths.fasta and resolved_edges.fasta
cat resolved_paths.fasta resolved_edges.fasta > all_sequences.fasta

# Run CheckV
checkv end_to_end all_sequences.fasta checkv_result

Now you can compare and visualise the quality of the resolved genomes and their constituent unitigs. The following example code shows how to visualise the results using Python.

Importing Python packages

Assuming you have installed Python and the packages matplotlib, pandas and seaborn, let's import the following.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load the data

Now we will load the quality_summary.tsv file into a dataframe called checkv_res.

# Load the quality_summary.tsv from the CheckV results
checkv_res = pd.read_csv("checkv_resolved_pathsquality_summary.tsv", delimiter="\t", header=0)

Format the data

Now we will convert the sequence lengths into kilobases by dividing the lengths by 1000.

# Format the genome length to kb
checkv_res['contig_length'] = checkv_res['contig_length'].div(1000)

Then we will add a new column to our dataframe called Sequence type to denote whether the sequence is a resolved genome or a unitig.

# Add a new column as "Sequence type"
seq_type = []

for index, row in checkv_res.iterrows():
    if row['contig_id'].startswith("phage"):
        seq_type.append("Resolved genomes")
    else:
        seq_type.append("Individual unitigs")

checkv_res.insert(2, "Sequence type", seq_type, True)

Plot the data

Now we can plot the viral quality (Complete, High-quality, Medium-quality or Low-quality) of the resolved genomes and their constituent unitigs using boxen plots and the save the figure as follows.

# Set the order of viral quality
myorder=["Complete", "High-quality", "Medium-quality", "Low-quality"]

# Plot using catplot
ax = sns.catplot(y="checkv_quality", x="contig_length", hue="Sequence type", kind="boxen", data=checkv_res, height=5, aspect=1.5, order=myorder, showfliers=False)

# Set axis titles
ax.set(xlabel='Viral genome length (kbp)', ylabel='CheckV quality')

# Save figure
plt.savefig("checkv_qual_boxen.pdf", dpi=300, bbox_inches='tight', format='pdf') 

You can change the kind of the plot as you wish. For example, you can draw a violin plot by changing kind="violin" as follows.

ax = sns.catplot(y="checkv_quality", x="contig_length", hue="Sequence type", kind="violin", data=checkv_res, height=5, aspect=1.5, order=myorder, showfliers=False)