Substructure Index-based Approximate Graph Alignment (SAGA)

Category Cross-Omics>Pathway Analysis/Tools

Abstract Substructure Index-based Approximate Graph Alignment (SAGA) is a tool for querying a biological graph database to retrieve matches between sub-graphs of molecular interactions that scientists select and biological networks.

SAGA implements an efficient approximate sub-graph matching algorithm that can be used for a variety of biological graph matching problems.

One application is: Pathway Matching - In which SAGA is used to compare pathways in KEGG and Reactome (see G6G Abstract Number 20267).

You can also use SAGA to find matches in literature databases that have been parsed into semantic graphs.

In this use of SAGA, portions of PubMed have been parsed into graphs that have nodes representing gene names.

A link is drawn between two genes if they are discussed in the same sentence (indicating there is potential association between the two genes).

SAGA let you match graphs to these two (2) different databases even though the content is distinct and the databases organize pathways in different ways.

This is achieved by SAGA’s flexible approximate sub-graph matching model which computes graph similarity, and allows for node gaps, node mismatches, and graph structural differences.

Note: Comparing pathways from different databases can be a precursor to pathway data integration.

SAGA employs a match model that can be used to accurately incorporate ‘domain knowledge’ for capturing the domain-specific notion of graph similarity.

An index-based algorithm makes approximate sub-graph matching queries very efficient.

The manufacturer’s evaluations using a number of actual biomedical applications show that SAGA can produce biologically relevant matches on actual examples, whereas existing tools fail.

In addition, the manufacturer has demonstrated the efficiency of the SAGA approach.

SAGA is very effective and efficient for querying relatively small graphs (ideally sparse graphs with less than 100 nodes) against very large databases, and there are many compelling applications in this setting.

However, the manufacturer does Not recommend using the existing tool when the query graph is very dense and/or has a large number of nodes.

For such large query graphs, the performance of the existing SAGA method degrades since potentially a large number of small hits can be produced by Step 1 of the matching algorithm.

Assembling these hits is computationally expensive with the existing SAGA algorithm.

To improve the performance of SAGA for large query graphs, one can leverage the observation that biological graphs have very strong modular structures (Tornow and Mewes, 2003).

In other words, these graphs can be naturally divided into groups with dense intra-group connections and very sparse inter-group connections.

As part of future work, the manufacturer plans on making use of the modular property and employ a divide and conquer strategy to handle large queries.

More specifically, the manufacturer can divide the queries into several sub-queries (the union of the sub-queries should cover the original query), then use SAGA to match each sub-query.

Finally, the manufacturer can assemble the results for the sub-queries to produce the final matches.

In the current version of SAGA, the Monte Carlo simulation approach is employed to evaluate the statistical significance of the matching results.

As the simulation is applied to a large number of random graphs, the cost of computing this statistical significance can be much higher than the query execution cost itself.

However, due to the efficiency of the SAGA indexing and matching mechanism, in many cases, the significance test can still be computed quickly.

For example, the manufacturer generated 654 × 100 = 65,400 random graphs for the manufacturer's largest database d5.

The p-value evaluation for the 10 disease associated queries, on these random graphs, ranges from 1 millisecond for query ‘hsa04940’ to 19 seconds for query ‘hsa04930’.

However, as the number of random graphs used for significance test increases, the overhead will become more significant.

In the future, the manufacturer plans on developing efficient analytical methods for assessing significance of the SAGA matching results.

System Requirements

Contact manufacturer.


The National Center for Integrative Biomedical Informatics (NCIBI). NCIBI is based at the University of Michigan as a part of The Center for Computational Medicine and Biology (CCMB).

For SAGA/TALE support and/or any questions regarding SAGA/TALE email

Manufacturer Web Site SAGA

Price Contact manufacturer.

G6G Abstract Number 20312

G6G Manufacturer Number 101826