Graph Statistics

Phables outputs a file named resolved_component_info.txt that contains the following statistics for each phage bubble (component) it resolves.

  • Number of nodes
  • Number of paths resolved
  • Fraction of unitigs recovered in the paths
  • Maximum degree of the graph
  • Minimum degree of the graph
  • Maximum in degree of the graph
  • Maximum out degree of the graph
  • Average degree of the graph
  • Average in degree of the graph
  • Average out degree of the graph
  • Density of the graph
  • Maximum path length: length of the longest path
  • Minimum path length: length of the shortest path
  • Length ratio (long/short): (Maximum path length / Minimum path length)
  • Maximum coverage path length: length of the path with the highest coverage
  • Minimum coverage path length: length of the path with the lowest coverage
  • Length ratio (highest cov/lowest cov): (Maximum coverage path length / Minimum coverage path length)
  • Maximum coverage
  • Minimum coverage
  • Coverage ratio (highest/lowest): (Maximum coverage / Minimum coverage)
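The three ratio statistics above can be illustrated with a small worked example. This is a minimal sketch that follows the definitions in the list, not the code Phables itself runs; the function name ratio_stats and the input values are hypothetical.

```python
def ratio_stats(path_lengths, path_coverages):
    """Compute the ratio statistics listed above for one component.

    Hypothetical inputs: one length and one coverage value per resolved
    path, given in the same order.
    """
    # Pair each path's coverage with its length
    paths = list(zip(path_coverages, path_lengths))

    # Longest and shortest path
    max_len, min_len = max(path_lengths), min(path_lengths)

    # Paths with the highest and lowest coverage
    max_cov, max_cov_len = max(paths, key=lambda p: p[0])
    min_cov, min_cov_len = min(paths, key=lambda p: p[0])

    return {
        "Length ratio (long/short)": max_len / min_len,
        "Maximum coverage path length": max_cov_len,
        "Minimum coverage path length": min_cov_len,
        "Length ratio (highest cov/lowest cov)": max_cov_len / min_cov_len,
        "Coverage ratio (highest/lowest)": max_cov / min_cov,
    }


# Three made-up paths: lengths in bp and their coverages
stats = ratio_stats([40000, 38500, 41200], [12.5, 30.0, 7.5])
for name, value in stats.items():
    print(f"{name}: {value}")
```

Here the highest-coverage path (coverage 30.0) is 38500 bp long and the lowest-coverage path (coverage 7.5) is 41200 bp long, so the two length ratios differ even though they describe the same component.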

You can compare and visualise the graph statistics of the resolved components using this information. The following example code shows how to visualise the results using Python.

Importing Python packages

Assuming you have installed Python and the packages matplotlib, pandas and seaborn, let's import the following.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Load the data

Now we will load the resolved_component_info.txt file into a dataframe called component_stats.

# Load the resolved_component_info.txt from Phables results
component_stats = pd.read_csv("resolved_component_info.txt", delimiter="\t", header=0)

You can list the columns using component_stats.columns. The following columns will be listed.

Index(['Component', 'Number of nodes', 'Number of paths',
       'Fraction of unitigs recovered', 'Maximum degree', 'Maximum in degree',
       'Maximum out degree', 'Average degree', 'Average in degree',
       'Average out degree', 'Density', 'Maximum path length',
       'Minimum path length', 'Length ratio (long/short)',
       'Maximum coverage path length', 'Minimum coverage path length',
       'Length ratio (highest cov/lowest cov)', 'Maximum coverage',
       'Minimum coverage', 'Coverage ratio (highest/lowest)'],
      dtype='object')
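Before plotting, it can help to look at summary statistics and at the largest components. The sketch below assumes a small made-up table in place of the real file (the component names and values are hypothetical, and only a few of the columns are included) so that it runs on its own.

```python
import io

import pandas as pd

# Hypothetical tab-separated rows standing in for resolved_component_info.txt
toy_table = (
    "Component\tNumber of nodes\tNumber of paths\tDensity\n"
    "comp_1\t3\t2\t0.50\n"
    "comp_2\t12\t6\t0.10\n"
    "comp_3\t5\t3\t0.30\n"
)
component_stats = pd.read_csv(io.StringIO(toy_table), delimiter="\t", header=0)

# Count, mean, standard deviation, min, quartiles and max of the numeric columns
summary = component_stats.describe()
print(summary)

# The two components with the most nodes
largest = component_stats.nlargest(2, "Number of nodes")
print(largest)
```

With the real file, replace the io.StringIO(...) argument with the path to resolved_component_info.txt.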

Plot histograms

You can plot histograms of the different columns. The following code plots a histogram of the Number of nodes column.

# Get the column
df = component_stats["Number of nodes"]

# Plot the histogram
ax = df.plot.hist(bins=100, alpha=0.5, figsize=(12, 8))

# Set axis titles
ax.set(xlabel='Number of nodes', ylabel='Frequency')

# Save figure
plt.savefig("histogram_n_nodes.png", format='png', dpi=300, bbox_inches='tight')
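To compare several statistics at once, the same histogram call can be repeated over a row of subplots. This is a sketch under stated assumptions: the dataframe is a small made-up stand-in for component_stats, the choice of columns is arbitrary, and the output file name is hypothetical.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend, so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the loaded component_stats dataframe
component_stats = pd.DataFrame({
    "Number of nodes": [3, 5, 8, 12, 20, 5],
    "Number of paths": [2, 3, 4, 6, 10, 2],
    "Density": [0.50, 0.30, 0.20, 0.10, 0.05, 0.30],
})

columns = ["Number of nodes", "Number of paths", "Density"]

# One subplot per column, side by side
fig, axes = plt.subplots(1, len(columns), figsize=(15, 4))
for ax, column in zip(axes, columns):
    component_stats[column].plot.hist(bins=20, alpha=0.5, ax=ax)
    ax.set(xlabel=column, ylabel="Frequency")

# Save figure
fig.savefig("histograms_selected.png", format="png", dpi=300, bbox_inches="tight")
```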

Plot heatmaps

You can plot a heatmap of the pairwise correlations between the graph statistics as follows.

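# Use Pearson correlation on the numeric columns only
# (the Component column is an identifier, not a statistic)
df_cor = component_stats.select_dtypes(include="number").corr(method='pearson')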

# Start a new figure so the heatmap is not drawn onto an earlier plot
plt.figure(figsize=(12, 10))

# Plot heatmap
sns.heatmap(df_cor, cmap="Blues")

# Save figure
plt.savefig("pearson_heatmap.png", format='png', dpi=300, bbox_inches='tight')
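A heatmap shows the overall pattern, but the strongest relationships can be easier to read as a ranked list. The following sketch ranks each pair of statistics by the absolute value of its Pearson correlation; the dataframe is a small made-up stand-in for component_stats and its values are hypothetical.

```python
from itertools import combinations

import pandas as pd

# Made-up values standing in for the loaded component_stats dataframe
component_stats = pd.DataFrame({
    "Component": ["comp_1", "comp_2", "comp_3", "comp_4", "comp_5"],
    "Number of nodes": [3, 5, 8, 12, 20],
    "Number of paths": [2, 3, 4, 6, 10],
    "Density": [0.50, 0.30, 0.20, 0.10, 0.05],
})

# Pearson correlation of the numeric columns
df_cor = component_stats.select_dtypes(include="number").corr(method="pearson")

# Each unordered pair of statistics once, without the diagonal
pairs = [(a, b, df_cor.loc[a, b]) for a, b in combinations(df_cor.columns, 2)]

# Strongest correlations first, whether positive or negative
pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)

for first, second, value in pairs:
    print(f"{first} vs {second}: {value:.3f}")
```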

Plot hierarchically-clustered heatmaps

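The heatmap above can be hard to interpret because related statistics are scattered across its rows and columns. Clustering reorders the rows and columns so that statistics with similar correlation profiles sit next to each other, which makes patterns easier to see. For this we can use the clustermap function from seaborn, which produces a hierarchically-clustered heatmap.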

# Plot the hierarchically-clustered heatmap
pearson_clustermap = sns.clustermap(df_cor, cmap="Blues", method="ward")

# Save figure
pearson_clustermap.savefig("pearson_clustermap.png", format='png', dpi=300, bbox_inches='tight')
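Hierarchical clustering needs finite values, so a statistic that is constant across all components (its correlations are undefined and appear as NaN) has to be removed before calling clustermap. The sketch below shows one way to do that; the dataframe is a made-up stand-in for component_stats, and the constant "Maximum in degree" column is included only to trigger the problem.

```python
import matplotlib

matplotlib.use("Agg")  # Non-interactive backend, so the script runs without a display

import pandas as pd
import seaborn as sns

# Made-up values standing in for the loaded component_stats dataframe
component_stats = pd.DataFrame({
    "Component": ["comp_1", "comp_2", "comp_3", "comp_4", "comp_5"],
    "Number of nodes": [3, 5, 8, 12, 20],
    "Number of paths": [2, 3, 4, 6, 10],
    "Density": [0.50, 0.30, 0.20, 0.10, 0.05],
    "Maximum in degree": [2, 2, 2, 2, 2],  # Constant, so its correlations are NaN
})

# Pearson correlation of the numeric columns
df_cor = component_stats.select_dtypes(include="number").corr(method="pearson")

# Drop rows and columns that contain only NaN
df_cor_clean = df_cor.dropna(axis=0, how="all").dropna(axis=1, how="all")

# Plot the hierarchically-clustered heatmap on the cleaned matrix
pearson_clustermap = sns.clustermap(df_cor_clean, cmap="Blues", method="ward")
```

If NaN values remain after this step (for example, when only some pairs are undefined), they would need separate handling before clustering.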