RFC Name : Advanced Network Merge and ID Mapping API

Editor(s): Jianjiong Gao, Keiichiro Ono, Thomas Kelder, Martijn van Iersel, Gary Bader

Date: 2008-05-02

Status: Under Construction

Proposal

Cytoscape includes a Merge Networks plugin, which lets the user compute the union, intersection and difference of networks based on node identifiers (IDs). However, nodes with different IDs may represent the same biological entity (e.g. the same protein or gene), especially when the networks come from different sources.

This project will enhance the Merge Networks feature. Nodes will be matched according to ID mappings, i.e. nodes whose IDs map to each other are treated as the same node. The basic idea is as follows:

  1. Get ID mappings from a user-supplied file, a local database or a web service;
  2. Merge the nodes according to the ID mappings;
  3. Let the user resolve attribute conflicts if any occur.
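
The three steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the plugin's design: the class `IdMappingMergeSketch`, its method names and the node IDs are all hypothetical. It uses a union-find structure so that IDs linked through a chain of mappings end up in one merged node, and it flags an attribute conflict when merged nodes carry different values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: groups node IDs that are linked by ID mappings. */
final class IdMappingMergeSketch {

    // Union-find forest: each ID points to its parent; a root points to itself.
    private final Map<String, String> parent = new HashMap<>();

    /** Returns the representative ID of the group that contains id. */
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        // Path compression: point every visited ID straight at the root.
        String cur = id;
        while (!cur.equals(root)) {
            String next = parent.get(cur);
            parent.put(cur, root);
            cur = next;
        }
        return root;
    }

    /** Step 1: record that two IDs map to each other. */
    void addMapping(String idA, String idB) {
        String rootA = find(idA);
        String rootB = find(idB);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    /** Step 2: group the given node IDs into sets of equivalent nodes. */
    List<Set<String>> mergeNodes(List<String> nodeIds) {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (String id : nodeIds) {
            groups.computeIfAbsent(find(id), k -> new TreeSet<>()).add(id);
        }
        return new ArrayList<>(groups.values());
    }

    /** Step 3: an attribute conflicts when merged nodes carry different values. */
    static boolean hasConflict(List<String> attributeValues) {
        return new HashSet<>(attributeValues).size() > 1;
    }

    public static void main(String[] args) {
        IdMappingMergeSketch merge = new IdMappingMergeSketch();
        // Made-up IDs: "a:1" in one network maps to "b:7" in another.
        merge.addMapping("a:1", "b:7");
        System.out.println(merge.mergeNodes(List.of("a:1", "b:7", "a:2")));
        System.out.println(hasConflict(List.of("kinase", "Kinase")));
    }
}
```

A real implementation would let the user choose which value to keep when `hasConflict` reports a conflict; how to present that choice is left open here.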

Use Cases

1. Get ID mappings

2. Merge the networks according to the ID mappings

3. Deal with attribute conflicts

Implementation Plan

Modules

This project can be divided into the following parts: (Ref: [http://web.missouri.edu/~jg722/GSoC/networkmerge.pdf diagram] by KeiOno)

  1. General-purpose embedded DB plugin
    • Wrap Derby with an interface for Cytoscape. The interface should be generic rather than specific to ID mapping, because the embedded DB may be used by other plugins or by the core in the future.
  2. Web Service Clients development
    • KeiOno has implemented a framework to manage web service clients in Cytoscape 2.6

      • Some of the web service clients (e.g. NCBI, BioMart, and PICR) could be used in this project (Ref: SampleWebServiceClients)

      • KeiOno has also prototyped a client that uses the UniProt API

      • Add additional clients (e.g. Ensembl web service) to the framework
  3. ID Mapper Core
    • Implement a generic interface to provide ID mappings, which integrates
      • Custom ID mapping files
      • Embedded DB
      • Web services
  4. Build a new Network Merge plugin
    • New GUI for the plugin
    • Utilize the old code for merging nodes
    • Deal with the conflicts
  5. (Optional) ID mapping plugin
    • Implement a UI to use the ID Mapper module on its own
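
To make the ID Mapper Core module (item 3) more concrete, here is a minimal sketch of one possible shape for a generic interface that integrates several sources. Everything in it is an assumption made for illustration: the names `IdMappingSource`, `TableMappingSource` and `CompositeIdMapper` are hypothetical, the custom mapping file is assumed to be tab-separated with four columns, and the web service is replaced by a stub lambda.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical generic interface; the RFC does not fix its shape. */
interface IdMappingSource {
    /** Returns the tgtType ID that srcId (of type srcType) maps to, if known. */
    Optional<String> map(String srcId, String srcType, String tgtType);
}

/** Source backed by the lines of an assumed tab-separated mapping file. */
final class TableMappingSource implements IdMappingSource {
    private final Map<String, String> table = new HashMap<>();

    /** Each line is assumed to be: srcType TAB srcId TAB tgtType TAB tgtId. */
    TableMappingSource(List<String> lines) {
        for (String line : lines) {
            String[] fields = line.split("\t");
            if (fields.length == 4) {
                table.put(key(fields[0], fields[1], fields[2]), fields[3]);
            }
        }
    }

    private static String key(String srcType, String srcId, String tgtType) {
        return srcType + "\t" + srcId + "\t" + tgtType;
    }

    @Override
    public Optional<String> map(String srcId, String srcType, String tgtType) {
        return Optional.ofNullable(table.get(key(srcType, srcId, tgtType)));
    }
}

/** Asks each registered source in order until one of them answers. */
final class CompositeIdMapper implements IdMappingSource {
    private final List<IdMappingSource> sources = new ArrayList<>();

    CompositeIdMapper(IdMappingSource... sources) {
        this.sources.addAll(Arrays.asList(sources));
    }

    @Override
    public Optional<String> map(String srcId, String srcType, String tgtType) {
        for (IdMappingSource source : sources) {
            Optional<String> hit = source.map(srcId, srcType, tgtType);
            if (hit.isPresent()) {
                return hit;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up ID types and IDs, for illustration only.
        IdMappingSource file = new TableMappingSource(
                Arrays.asList("TypeA\ta1\tTypeB\tb1"));
        // Stub standing in for a web service client.
        IdMappingSource web = (id, src, tgt) ->
                id.equals("a2") ? Optional.of("b2") : Optional.<String>empty();
        IdMappingSource mapper = new CompositeIdMapper(file, web);
        System.out.println(mapper.map("a1", "TypeA", "TypeB"));
        System.out.println(mapper.map("a2", "TypeA", "TypeB"));
    }
}
```

Because every source implements the same interface, the merge plugin would not need to know whether a mapping came from a file, the embedded DB or a web service. Returning a single ID is a simplification; a real ID may map to several destination IDs.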

Discussion/Further investigation

  1. How should the public interface for the ID mapping function be designed? How should the Derby DB, the web service clients and the ID mapper interact with each other?
    • It is better to keep the ID mapping interface simple: it takes the original ID, the original ID type and the destination ID type as parameters, and returns the destination ID.
    • The implementation of the ID mapping interface should take the ID mapping sources (Derby DB, custom file and web service) into consideration:
      • If the user chooses the Derby DB as the source, just search the local Derby DB and return the result.
      • If the user chooses a custom file or a web service:
        • Search for the data in Derby first.
        • For IDs not found in Derby, try the custom file or web service.
        • Cache the ID mappings retrieved from the web service in Derby (merging them with data already in Derby may be needed).
  2. How should the DB tables for ID mapping be designed? How should cached data be stored?
    • Three tables similar to those in PathVisio: DataNode(ID, Type), Link(ID1, ID2), Info(Version).

    • Should cached data go into separate tables, or into additional fields in DataNode and Link?

  3. Should we implement a number of validations to detect mistakes in the ID mappings (raised by GaryBader)? If so, how? This needs further investigation.

    • (According to GaryBader) It would be good to have a validation and cleaning system that implements a number of validation rules to detect mistakes in the input ID mapping files. Bad ID mappings downloaded from external sources can cause errors in the network merge, which will degrade the data. For instance, if you download UniProt and extract all of the ID mappings, you will find that some of the supposedly equivalent IDs actually point to proteins from different species. So organism, molecule type (RNA, DNA, protein, gene), etc. should be part of the system and associated with each ID to enable better validation.

  4. For ID mappings, should we limit the types of IDs supported?
    • According to GaryBader, yes, in order to minimize the chance of errors. We could start with the most common ID types (e.g. Entrez Gene, RefSeq, UniProt) and then extend the list gradually.

  5. Should each network be required to contain only one type of ID?
    • (According to ThomasKelder) There may be different options here, depending on the network type. Sometimes the ID type is fixed for an attribute (e.g. an attribute called 'uniprot' typically contains only UniProt IDs), but sometimes the type is defined in another attribute (e.g. an attribute 'ID' and 'system') or concatenated to the ID (e.g. 'Uniprot:U1234'). It will be hard to cover all of these options; some investigation is needed into which situations occur most often.

    • KeiOno: Ideally, each network contains only one ID set. However, in some cases this is impossible, since one ID set does not cover all macromolecules and small molecules (see this example from IntAct: node IDs are UniProt unified accession numbers, but these are not available for small molecules). As in IntAct, it is probably OK to use a DBName:ID style ID if the object does not exist in the target ID set. In that case, the DBName:ID should be taken from the primary source database, i.e. if the object exists only in KEGG and not in UniProt, it should be KEGG:xxxxx
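
The lookup order proposed in discussion item 1 (search Derby first, fall back to the custom file or web service, then cache the result) can be sketched as follows. This is a hedged illustration only: the names `IdLookup` and `CachingIdLookup` are hypothetical, and an in-memory map stands in for the embedded Derby store so that the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical lookup signature following discussion item 1. */
interface IdLookup {
    Optional<String> lookup(String srcId, String srcType, String tgtType);
}

/** Checks a local cache first and only then asks the slower source. */
final class CachingIdLookup implements IdLookup {
    // Assumption: an in-memory map stands in for the embedded Derby DB.
    private final Map<String, String> cache = new HashMap<>();
    private final IdLookup fallback;
    private int fallbackCalls = 0;

    CachingIdLookup(IdLookup fallback) {
        this.fallback = fallback;
    }

    @Override
    public Optional<String> lookup(String srcId, String srcType, String tgtType) {
        String key = srcType + "\t" + srcId + "\t" + tgtType;
        String cached = cache.get(key);
        if (cached != null) {
            return Optional.of(cached); // found locally, no remote call
        }
        fallbackCalls++;
        Optional<String> found = fallback.lookup(srcId, srcType, tgtType);
        // Store successful answers so the next lookup is served locally.
        found.ifPresent(value -> cache.put(key, value));
        return found;
    }

    /** Number of times the fallback source was consulted. */
    int fallbackCalls() {
        return fallbackCalls;
    }

    public static void main(String[] args) {
        // Stub standing in for a custom file or web service; IDs are made up.
        IdLookup remote = (id, src, tgt) ->
                id.equals("a1") ? Optional.of("b1") : Optional.<String>empty();
        CachingIdLookup mapper = new CachingIdLookup(remote);
        mapper.lookup("a1", "TypeA", "TypeB");
        mapper.lookup("a1", "TypeA", "TypeB"); // served from the cache
        System.out.println("fallback calls: " + mapper.fallbackCalls());
    }
}
```

In this sketch misses are not cached, so an unknown ID is looked up remotely every time. Whether to cache misses, and how to merge cached entries with data already in Derby, remain open design questions.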

Resource

Project Management

Project Timeline

Provide a timeline for implementation. Insert a graphic if you can. Try this free online tool for making project timelines -> [http://www.helpuplan.com/index.asp Help-u-Plan] (create a new chart; modify; right-click to save gif; then attach to this page)

Tasks and Milestones

Outline the major milestones and tasks involved in implementation.

  1. Milestone 1: …

    1. Task 1: ...
    2. Task 2: ...
  2. Milestone 2: …

Project Dependencies

Another GSoC project, undertaken by ArmanAksoy, also addresses Advanced Network Merge. His approach is different: he proposes to merge nodes according to a similarity score rather than ID mappings. However, the two projects can share a common mechanism for the GUI and for resolving attribute conflicts.

Link to other related RFCs

Issues

List any issues, conflicts, or dependencies raised by this proposal

Comments

How to Comment

Edit the page and add your comments under the provided header. By adding your ideas to the Wiki directly, we can more easily organize everyone's ideas, and keep clear records. Be sure to include today's date and your name for each comment. Try to keep your comments as concrete and constructive as possible. For example, if you find a part of the RFC makes no sense, please say so, but don't stop there. Take the extra step and propose alternatives.

Funding for Cytoscape is provided by a federal grant from the U.S. National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH) under award number GM070743-01. Corporate funding is provided through a contract from Unilever PLC.
