A more complex task was faced by the cancer Biomedical Informatics Grid (caBIG) project, which was launched in 2003 with the ambitious goal of providing a common information platform to support the diverse clinical and basic research programmes of the US National Cancer Institute’s 87 cancer centers. This project had to integrate a highly heterogeneous set of databases and software tools, ranging from workflow systems for managing clinical trials to research tools for genome analysis and annotation.
Like BIRN, caBIG chose the Globus Toolkit as its underlying grid technology, creating a web services network called “caGrid.” In order to handle the high degree of heterogeneity among the cancer research services they wished to interconnect, the developers of caBIG then had to undertake an extended and painstaking process of unifying the data models used by each of the services. For example, the concept of “blood pressure” appears in dozens of subtly different ways in the various clinical databases used by the cancer centers. One of the earliest tasks that caBIG took on was to unify all key concepts into a reference vocabulary and set of common data elements (the “VCDE”), using existing ontologies whenever possible, and creating new ones when necessary. To add a new resource to caGrid, its developers must ensure that their tool reads and writes data types that are already described by the VCDE; if the VCDE is missing a concept that they need, there is a standard submission and approval process for getting the new concept incorporated into the VCDE.
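The admission rule described above (a resource may join only if every data type it reads or writes is already in the shared vocabulary, with missing concepts sent for review) can be illustrated with a small sketch. This is a toy model under stated assumptions, not caGrid code: the `Vocabulary` class, the `can_join_grid` function, and the concept names are all hypothetical, and the real VCDE schema and approval workflow are considerably richer.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CommonDataElement:
    """One entry in a hypothetical shared vocabulary (not the real VCDE schema)."""
    name: str
    unit: str


@dataclass
class Vocabulary:
    """Toy stand-in for a reference vocabulary of approved concepts."""
    approved: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)

    def approve(self, element):
        self.approved[element.name] = element

    def missing(self, required_names):
        """Return the required concept names that are not yet approved, sorted."""
        return sorted(set(required_names) - set(self.approved))

    def submit(self, name):
        """Queue a missing concept for review; approval is a separate step."""
        if name not in self.approved and name not in self.pending:
            self.pending.append(name)


def can_join_grid(vocabulary, reads, writes):
    """A resource qualifies only if every type it reads or writes is approved.

    Any concept that is not yet approved is submitted for review, and the
    list of gaps is returned alongside the yes/no answer.
    """
    gaps = vocabulary.missing(list(reads) + list(writes))
    for name in gaps:
        vocabulary.submit(name)
    return (not gaps, gaps)


# Hypothetical concepts, chosen only for illustration.
vocab = Vocabulary()
vocab.approve(CommonDataElement("systolic_blood_pressure", "mmHg"))
vocab.approve(CommonDataElement("tumor_grade", "ordinal"))

ok, gaps = can_join_grid(
    vocab,
    reads=["systolic_blood_pressure"],
    writes=["tumor_grade", "tumor_stage"],
)
print(ok, gaps)
```

In this sketch the hypothetical tool is refused because `tumor_stage` has not been approved, and that concept is placed in the pending queue, mirroring the submission step described in the text.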
Currently, caBIG supports over 40 software tools, most of which interoperate with each other at some level. For example, the clinical trial data collection system, called C3D, stores surgical pathology reports on tumor specimens. These reports can then be read by a text information extraction system called caTIES and converted into a standardized format that describes the type and characteristics of the tumor. It is then possible to associate this histopathology information with gene expression profiles that are captured and stored in caBIG’s microarray database, caArray, and finally to analyze the combined data for expression signatures that correlate with tumor type or grade using the GenePattern tool mentioned earlier. The Taverna workflow management tool described earlier has also recently been ported to work with caBIG, allowing researchers to discover and interconnect caBIG data and compute services using an intuitive graphical user interface.
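The data flow in this example (free-text reports turned into structured tumor attributes, joined to expression profiles by specimen, then summarized by grade) can be sketched in a few lines. The code below is only an illustration of that flow: it does not call C3D, caTIES, caArray, or GenePattern, the specimen IDs, gene names, and values are invented, and the keyword-based `extract_grade` function is a deliberately naive stand-in for real text extraction.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical free-text pathology reports keyed by specimen ID.
reports = {
    "S1": "Invasive ductal carcinoma, grade 3.",
    "S2": "Invasive ductal carcinoma, grade 1.",
    "S3": "Lobular carcinoma, grade 3.",
}

# Hypothetical expression values (gene -> level) keyed by the same specimen IDs.
expression = {
    "S1": {"GENE_A": 8.0, "GENE_B": 2.0},
    "S2": {"GENE_A": 3.0, "GENE_B": 2.5},
    "S3": {"GENE_A": 9.0, "GENE_B": 1.5},
}


def extract_grade(report_text):
    """Naive stand-in for text extraction: the integer after the word 'grade'."""
    words = report_text.lower().replace(",", " ").replace(".", " ").split()
    if "grade" in words:
        idx = words.index("grade")
        if idx + 1 < len(words) and words[idx + 1].isdigit():
            return int(words[idx + 1])
    return None


def mean_expression_by_grade(reports, expression):
    """Join extracted grades with expression profiles; average each gene per grade."""
    grouped = defaultdict(lambda: defaultdict(list))
    for specimen_id, text in reports.items():
        grade = extract_grade(text)
        # Skip specimens with no recoverable grade or no expression profile.
        if grade is None or specimen_id not in expression:
            continue
        for gene, level in expression[specimen_id].items():
            grouped[grade][gene].append(level)
    return {
        grade: {gene: mean(levels) for gene, levels in genes.items()}
        for grade, genes in grouped.items()
    }


summary = mean_expression_by_grade(reports, expression)
for grade in sorted(summary):
    print(grade, summary[grade])
```

A real signature analysis would use proper statistical tests over many specimens; the per-grade mean here only shows where the structured pathology data and the expression data meet.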
Although the caBIG and BIRN grids both use the Globus Toolkit and share many ontologies, they cannot yet interoperate with each other because of different design decisions, particularly with regard to how services are registered and discovered. In principle, this obstacle is technical rather than fundamental and could be overcome with additional engineering effort.