Category Cross-Omics>Workflow Knowledge Bases/Systems/Tools and Cross-Omics>Knowledge Bases/Databases/Tools

Abstract BioMAJ (BIOlogie Mises A Jour) is a workflow engine for biological databank (database) management, dedicated to data synchronization and processing.

It is designed to automate and manage data workflows associated with updating and processing local mirrors of large biological databases.

This software can be used both by large-scale bioinformatics projects and by administrators of large computational infrastructures that provide services based on well-known bioinformatics suites such as the European Molecular Biology Open Software Suite (EMBOSS), the Sequence Retrieval System (SRS) and the Genetics Computer Group (GCG) suite.

Why BioMAJ?

‘Biological knowledge’ in a genomic or post-genomic context is mainly built through transitive bioinformatics analysis, consisting of iterative and periodic comparison of newly produced data against a corpus of known information.

In large-scale projects, this approach requires accurate bioinformatics software, pipelines (workflows), interfaces and numerous heterogeneous biological banks (databases), which are distributed around the world.

An integration process that consists of mirroring and indexing these data is an essential preliminary step, and it represents a major challenge and bottleneck in most bioinformatics projects. BioMAJ aims to resolve this problem by providing a flexible, robust and fully automated environment.

BioMAJ Engine Behavior --

BioMAJ has been specifically designed to manage databank update cycles. It permits flexible data synchronization, controls the execution of local post-download processing tasks and logs all activity for later use. All processing tasks are highly configurable and can be executed serially or in parallel. The engine supervises the execution of all tasks declared within each processing stage.

In case of an error, only the faulty sub-parts of a treatment are re-executed, which is extremely useful when a treatment requires extensive computational resources. BioMAJ's features have been developed to iteratively ‘execute workflows’ in order to routinely update huge and/or numerous databanks in batch mode.

The engine follows a predefined template mapped onto the processes of updating and indexing. Some parts of the template are static and only need ‘custom properties’ to define the remote server address, the file transfer protocol, regular expressions that select remote files and whether or not downloaded files should be uncompressed.
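As an illustration, a minimal properties file for mirroring a databank might look like the fragment below. The property names are modeled on the samples distributed with the package and may differ between BioMAJ versions; the server, paths and regular expressions are placeholders, so check the application manual before reuse.

```properties
# Illustrative databank definition (property names follow the
# samples shipped with BioMAJ; verify against your version's manual)
db.fullname="GenBank mirror"
db.name=genbank
protocol=ftp
server=ftp.ncbi.nih.gov
remote.dir=/genbank/

# Regular expression selecting the remote files to download
remote.files=^gb.*\.seq\.gz$
# Files to keep in the local release after decompression
local.files=^gb.*\.seq$

# Remote file and regular expression used to extract the release number
release.file=GB_Release_Number
release.regexp=([0-9]+)
```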

Other stages of BioMAJ templates are more open and can utilize a meta-scheduler for deferred program execution and a basic ‘description language’ that enables one to implement personalized databank processing.

Each personalized databank update cycle is divided into five (5) stages:

1) Initialization,

2) Preprocessing,

3) Synchronization,

4) Post-processing, and

5) Deployment.

The exact steps to be executed are described in a text file, referred to as the ‘properties file’. The standard behavior of each stage is as follows:

1) The initialization stage consists of setting up the session, loading the workflow and checking the state of both the current and previous databank releases.

2) The preprocessing stage handles various actions that must be performed prior to synchronization, such as sending e-mail alert messages and checking that sufficient disk space is available. This stage is customizable and has the same features as the post-processing stage (see below).

3) During synchronization, the version of the latest databank release is extracted from the remote server, either by applying regular expressions to a specified remote file or by interrogating the remote file's timestamp. Selected files are fetched using various protocols (ftp, http, rsync, local copy) and transfer integrity checks are performed to ensure that valid local copies have been made.
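The release-detection idea can be sketched in a few lines. This is a minimal standalone illustration of applying a configurable regular expression to the contents of a downloaded release file, not BioMAJ's actual implementation; the file contents and pattern below are hypothetical.

```python
import re

def extract_release(release_text: str, pattern: str) -> str:
    """Extract a release identifier from the contents of a remote
    release file, using a configurable regular expression (the same
    idea BioMAJ applies during its synchronization stage)."""
    match = re.search(pattern, release_text)
    if match is None:
        raise ValueError("no release identifier found")
    return match.group(1)

# Hypothetical contents of a GenBank-style release file
text = "GenBank Release 260.0  February 2024"
print(extract_release(text, r"Release\s+(\d+\.\d+)"))  # → 260.0
```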

Remote databank tree hierarchies can be preserved and the organization of both local and global repository contents is managed by the application. BioMAJ can perform multiple downloads or updates simultaneously.

After the files are downloaded, BioMAJ can automatically uncompress them and reconstruct the desired local release. All file attributes, as well as history and ‘provenance information’, are stored in log files, which can also be used to trace local files back to their origin and determine which files require resynchronization.

4) The post-processing stage consists of performing various tasks on synchronized data. Integration of processing programs is easy and flexible, as BioMAJ relies on system calls to execute shell scripts.

Information about the context of the databank update cycle, such as input files, parameters and output locations, is transferred to each processing task using parameters declared in the template or shell variables, which are automatically set during system calls. Thus, generic wrappers for bioinformatics programs can be easily developed and reused by ‘multiple workflows’.
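A generic wrapper of this kind can be sketched as follows: the workflow context travels to the processing task through environment variables set at system-call time. The variable names used here (DATA_DIR, RELEASE) are purely hypothetical stand-ins for whatever the engine actually exports, which should be checked in the application manual.

```python
import os
import subprocess

def run_wrapper(tool: str, context: dict) -> int:
    """Launch a processing tool as a system call, passing workflow
    context (input files, release number, output location) through
    environment variables, as BioMAJ does for its post-processes.
    The variable names in 'context' are illustrative, not BioMAJ's own."""
    env = dict(os.environ)  # keep PATH etc. so the tool can be found
    env.update(context)
    return subprocess.call([tool], env=env)

# Hypothetical context for one post-processing task
context = {"DATA_DIR": "/db/genbank/260.0", "RELEASE": "260.0"}
```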

In addition, BioMAJ provides a facility whereby task execution can be organized on a local machine or on a cluster using an external scheduler system. Description of the post-processing stage is handled by three (3) hierarchical elements.

The most basic unit of processing is a task, usually a wrapper script containing a set of serial processing commands. Tasks are grouped into meta-processes, which can be further organized into blocks. Blocks, meta-processes and processing tasks allow one to describe ‘customized topologies’ for data processing.

It also permits one to control the order in which the data processing tasks are executed. Blocks are launched serially following their declaration order. In a given block, each meta-process is associated with a specific thread so that individual processing tasks can be run in parallel.

This allows one to easily design a ‘directed acyclic graph’ (DAG) in which each vertex is a processing task with specific attributes and edges are the chronology of execution.
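The block/meta-process topology described above can be sketched with standard threads: blocks run serially in declaration order, and within a block each meta-process (an ordered list of tasks) runs in its own thread. This is a minimal conceptual model of that topology, not BioMAJ's actual scheduler.

```python
import threading

def run_blocks(blocks):
    """Execute blocks serially; inside each block, run every
    meta-process (a list of tasks executed in order) in its own
    thread, mirroring the topology described above."""
    for block in blocks:
        threads = []
        for meta_process in block:
            def run(tasks=meta_process):  # bind the current meta-process
                for task in tasks:
                    task()
            t = threading.Thread(target=run)
            t.start()
            threads.append(t)
        for t in threads:
            t.join()  # a block finishes before the next one starts

results = []
blocks = [
    # Block 1: two meta-processes run in parallel
    [[lambda: results.append("index")], [lambda: results.append("stats")]],
    # Block 2: starts only after block 1 has completed
    [[lambda: results.append("report")]],
]
run_blocks(blocks)
```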

Unlike more sophisticated workflow engines such as Taverna (see G6G Abstract Number 20514), neither explicit dependencies of data nor specific semantics have been formalized for the input and output channels of treatments.

Users need a priori knowledge of the location of produced data, but this job is greatly facilitated by the default tree directory provided by the application.

Thus, BioMAJ makes it possible to define dependencies between different stages of data processing and to take into account relationships and inter-dependencies between treatments. Tasks can be executed either sequentially or in parallel to optimize execution time.

5) The deployment stage makes the new release available and removes all temporary files and obsolete releases, based upon specified retention/release parameters. Deployment concludes a successful update cycle but BioMAJ can re-execute any faulty steps through its exception handling facilities.

BioMAJ Databank Administration and Monitoring --

BioMAJ also has many administrative functions such as online querying tools that interrogate repository contents and management commands that import, delete, rename and move databank releases.

Therefore, it is possible to manage the local repository using mainly BioMAJ administrative functions.

Each session for a specific databank is recorded in an XML state file, which can be exploited at different time scales for monitoring and querying/updating.

BioMAJ Results --

The BioMAJ package currently provides the required functionality to mirror over 100 public-domain databases [from servers such as NCBI, EBI, ExPASy, TiGER (see G6G Abstract Number 20196), etc.]. New databases can easily be added through the configuration of a single properties file.

Samples for the most common bioinformatics databases are available in the package. Each sample describes a dedicated workflow of databank synchronization, including in some cases data post-processing.

The manufacturer’s website has been specifically designed to share properties files and post-processing scripts between BioMAJ users.

Concerning data processing, multiple indexing post-processes are supported for various applications: NCBI BLAST, SRS, EMBOSS and GCG. Furthermore, post-processing scripts for format conversion and for testing index integrity after data processing are also available.
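As an example of such an indexing post-process, one might wrap NCBI BLAST database formatting. The sketch below only assembles the command line for makeblastdb (from the BLAST+ suite; the legacy equivalent was formatdb) rather than running it, and the FASTA path is a placeholder; an actual BioMAJ post-process would invoke such a command from a shell wrapper.

```python
def blast_index_command(fasta_path: str, db_type: str = "nucl"):
    """Build the command line for formatting a FASTA file into a
    BLAST database with BLAST+'s makeblastdb; a BioMAJ indexing
    post-process would execute such a command via a system call."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", db_type]

# Placeholder path to a downloaded flat file in the local release
cmd = blast_index_command("/db/genbank/260.0/flat/gbpri1.seq")
```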

The BioMAJ architecture is open, so users can also easily integrate their own processing scripts (independent of programming language). Full guidelines on how to develop and integrate scripts can be found in the application manual.

System Requirements



BioMAJ is a collaborative effort between two French Research Institutes - INRIA (The French National Institute for Research in Computer Science and Control) & INRA (The French National Institute for Agricultural Research).

Manufacturer Web Site BioMAJ

Price Contact manufacturer.

G6G Abstract Number 20529

G6G Manufacturer Number 104146