RFC Name : Advanced Network Merge and ID Mapping API

Editor(s): Jianjiong Gao, Keiichiro Ono, Thomas Kelder, Martijn van Iersel, Gary Bader

Date: 2008-05-02

Status: open for comment

Updates

Proposal

Cytoscape already has a Merge Networks plugin, which lets the user compute the union, intersection and difference of networks based on node identifiers (IDs). However, nodes with different IDs may have the same biological meaning (e.g. the same protein or gene), especially when they come from different sources. This project will enhance the Merge Networks feature so that nodes are matched according to ID mappings, i.e. nodes whose IDs map to each other are considered the same node.

Use Cases

Five related use cases have been identified on the Bader Lab ID Mapping page. Two of them are closely related to this project:

Workflows

Main procedure

Given a set of source networks, each of which has a set of attributes, the Network Merge procedure should consist of the following three steps:

  1. Node matching: select one or more attributes of each source network to identify its nodes. Two nodes in two networks match when the values of their selected attributes match (i.e. the values map to each other if ID mapping is used, or are identical if not). Note: the main purpose of this step is to match nodes among the source networks; ID mapping is used here only for that purpose.
  2. Attribute merging: merge the attributes of the source networks into attributes of the resulting network. The user defines the attributes of the resulting network: their names and which attribute in which source network each one comes from. The user can define as many or as few attributes as he/she likes. If the attribute values of the matched nodes are the same, they merge without problem; otherwise, a conflict occurs. Note: ID mapping can also be used in this step to match attribute values across the source networks.
  3. Conflict handling: let the user define rules based on priorities of networks, priorities of ID types, etc., or let the user choose which attribute value should be used for each node. Note: using IDs of a destination ID type to assign the IDs in the resulting network is one strategy for resolving ID conflicts among different ID types.
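
The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the plugin's design: it assumes each node is a map from attribute name to value, that nodes are matched on exact equality of a single key attribute (matching through an ID mapper would replace that equality test), and that conflicts are resolved by network priority, with the earlier network winning. The names `MergeSketch` and `merge` are hypothetical and not part of the Cytoscape API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names belong to the Cytoscape API.
final class MergeSketch {

    /**
     * Merges the nodes of several networks.
     *
     * @param networks  source networks, ordered from highest to lowest priority;
     *                  each network is a list of nodes, each node a map
     *                  from attribute name to attribute value
     * @param keyAttr   attribute whose value identifies a node (step 1)
     * @param conflicts output list receiving a description of every conflict
     * @return merged nodes, keyed by the value of keyAttr
     */
    static Map<String, Map<String, String>> merge(
            List<List<Map<String, String>>> networks,
            String keyAttr,
            List<String> conflicts) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        for (List<Map<String, String>> network : networks) {
            for (Map<String, String> node : network) {
                String key = node.get(keyAttr);
                if (key == null) {
                    continue; // no value for the key attribute: skipped in this sketch
                }
                // Step 1: nodes match when their key attribute values are equal.
                Map<String, String> target =
                        merged.computeIfAbsent(key, k -> new LinkedHashMap<>());
                for (Map.Entry<String, String> attr : node.entrySet()) {
                    // Step 2: copy the attribute unless a value is already present.
                    String existing = target.putIfAbsent(attr.getKey(), attr.getValue());
                    // Step 3: on conflict, the value from the higher-priority network stays.
                    if (existing != null && !existing.equals(attr.getValue())) {
                        conflicts.add(key + "." + attr.getKey() + ": kept '" + existing
                                + "', dropped '" + attr.getValue() + "'");
                    }
                }
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> conflicts = new ArrayList<>();
        Map<String, Map<String, String>> merged = merge(
                List.of(
                        List.of(Map.of("id", "P1", "name", "TP53")),
                        List.of(Map.of("id", "P1", "name", "p53"))),
                "id", conflicts);
        System.out.println(merged);
        System.out.println(conflicts);
    }
}
```

Collecting the conflicts in a list, rather than resolving them silently, leaves room for the alternative in step 3 where the user picks the value for each node.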

ID mapping procedure

  1. Selecting which ID mapping source to use. Below are the possible options from which the user can choose.
    • Provide custom mapping with a local file
    • Get ID mappings from a web service
    • Get ID mappings from a relational database (e.g. an embedded Derby database)
  2. Selecting which ID types are used in the networks from a list of ID types.
    • The list of ID types may differ between databases and web services. For example, if the user chooses the GenMAPP gene database, he/she can select a few of the 19 supported ID types.
    • Alternatively, following GaryBader's suggestion, we could limit the supported ID types in order to minimize the chance of errors. For that purpose, the list would contain only the most commonly used ID types (e.g. Entrez Gene, RefSeq, UniProt and some others), no matter which web service/database the user has chosen.
    • If the user is not sure which ID types are used in the networks, he/she can choose all.
  3. After retrieving the data, report to the user how many IDs were found, how many were not, and how many are ambiguous (the same ID exists in more than one ID type; this should be rare). For the ambiguous ones, ask the user to decide. (This is similar to what DAVID does.)
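
The report in step 3 could be computed along these lines. This is a sketch under assumptions: the known IDs of each type are held in memory as a map from ID type to a set of IDs, and the names `IdReportSketch`, `typesPerId` and `summarize` are hypothetical. The IDs in the demo are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the found / not found / ambiguous report.
final class IdReportSketch {

    /** Returns, for each input ID, the set of ID types in which it occurs. */
    static Map<String, Set<String>> typesPerId(
            Set<String> ids, Map<String, Set<String>> knownIdsByType) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String id : ids) {
            Set<String> types = new TreeSet<>();
            for (Map.Entry<String, Set<String>> entry : knownIdsByType.entrySet()) {
                if (entry.getValue().contains(id)) {
                    types.add(entry.getKey());
                }
            }
            result.put(id, types);
        }
        return result;
    }

    /** Returns the counts as {found, notFound, ambiguous}. */
    static int[] summarize(Map<String, Set<String>> typesPerId) {
        int found = 0;
        int notFound = 0;
        int ambiguous = 0;
        for (Set<String> types : typesPerId.values()) {
            if (types.isEmpty()) {
                notFound++;      // the ID occurs in none of the selected types
            } else if (types.size() == 1) {
                found++;         // the ID occurs in exactly one type
            } else {
                ambiguous++;     // the ID occurs in several types: ask the user
            }
        }
        return new int[] {found, notFound, ambiguous};
    }

    public static void main(String[] args) {
        Map<String, Set<String>> known = Map.of(
                "TypeA", Set.of("x1", "shared"),
                "TypeB", Set.of("y1", "shared"));
        Map<String, Set<String>> perId =
                typesPerId(Set.of("x1", "y1", "shared", "unknown"), known);
        int[] counts = summarize(perId);
        System.out.println("found=" + counts[0] + ", not found=" + counts[1]
                + ", ambiguous=" + counts[2]);
    }
}
```

Keeping the per-ID type sets, instead of only the counts, makes it possible to show the user the candidate types for each ambiguous ID.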

A simple case study

Assumptions

Because of the internal architecture of Cytoscape, we assume the following for the merge operation:

Examples

Implementation Plan

General Architecture

system_design.png

Code Base

GUI Design

Some Requirements Based on Other Developers' Feedback

API Design

API for ID mapping

Basically, the interface takes a list of original IDs, the original ID type and the destination ID type as parameters and returns a list of destination IDs.

    import java.util.List;

    public interface IDMapper {
        /**
         * Map a list of IDs of one type to a list of IDs of another type.
         * @param idSrc The list of source IDs, i.e., the list of IDs to be mapped
         * @param typeSrc The type (e.g. UniProt ID) of the source IDs
         * @param typeDst The destination ID type
         * @return a list of destination IDs
         */
        public List<String> mapID(List<String> idSrc, String typeSrc, String typeDst);
    }
    public interface IDMapperDB extends IDMapper {...}
    public class IDMapperDerby implements IDMapperDB {...}
    public class IDMapperWebservice implements IDMapper {...}
    public class IDMapperFile implements IDMapper {...}

    import java.util.Map;
    import java.util.Set;

    public interface IDMapper {
        // Supports one-to-one mapping and one-to-many mapping.
        public Map<String, Set<String>> mapID(Set<String> ids, String srcType, String tgtType);

        // Checks whether an ID exists in a specific ID type.
        public boolean idExistsInSrcIDType(String srcID, String srcType);

        // Returns the supported source ID types.
        public Set<String> getSupportedSrcIDType();

        // Returns the supported target ID types.
        public Set<String> getSupportedTgtIDType();
    }

    public interface IDMapperFile extends IDMapper {}
    public interface IDMapperRDB extends IDMapper {}
    public interface IDMapperWebService extends IDMapper {}

    // IDMapperFile, IDMapperRDB and IDMapperWebService are interfaces,
    // so concrete classes implement them rather than extend them.
    public class IDMapperText implements IDMapperFile {
        // Delimited text file mapper implementation
    }
    public class IDMapperExcel implements IDMapperFile {
        // Excel file mapper implementation
    }
    public class IDMapperDerby implements IDMapperRDB {
        // Apache Derby specific mapper implementation
    }
    public class IDMapperMySQL implements IDMapperRDB {
        // MySQL specific mapper implementation
    }
    public class IDMapperUniprotWS implements IDMapperWebService {
        // Uniprot web service specific mapper implementation
    }
    ...
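
To illustrate how the `Map<String, Set<String>>` version of `IDMapper` might be implemented and called, here is a minimal in-memory sketch. The class `InMemoryIDMapper` and its `addMapping` method are hypothetical and not part of the proposal. The interface is repeated, without `public`, only so that the block compiles on its own. Mappings are stored in one direction only, and source IDs without a mapping are simply left out of the result; both are assumptions of this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory mapper; a real one would read a file, database or web service.
final class InMemoryIDMapper implements IDMapper {
    // source type -> source ID -> target type -> target IDs
    private final Map<String, Map<String, Map<String, Set<String>>>> table = new HashMap<>();
    private final Set<String> tgtTypes = new TreeSet<>();

    /** Registers one directed mapping (hypothetical helper). */
    void addMapping(String srcType, String srcID, String tgtType, String tgtID) {
        table.computeIfAbsent(srcType, k -> new HashMap<>())
             .computeIfAbsent(srcID, k -> new HashMap<>())
             .computeIfAbsent(tgtType, k -> new TreeSet<>())
             .add(tgtID);
        tgtTypes.add(tgtType);
    }

    @Override
    public Map<String, Set<String>> mapID(Set<String> ids, String srcType, String tgtType) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        Map<String, Map<String, Set<String>>> bySrcId =
                table.getOrDefault(srcType, Collections.emptyMap());
        for (String id : ids) {
            Set<String> targets =
                    bySrcId.getOrDefault(id, Collections.emptyMap()).get(tgtType);
            if (targets != null) {
                result.put(id, Collections.unmodifiableSet(targets));
            }
        }
        return result;
    }

    @Override
    public boolean idExistsInSrcIDType(String srcID, String srcType) {
        return table.getOrDefault(srcType, Collections.emptyMap()).containsKey(srcID);
    }

    @Override
    public Set<String> getSupportedSrcIDType() {
        return Collections.unmodifiableSet(table.keySet());
    }

    @Override
    public Set<String> getSupportedTgtIDType() {
        return Collections.unmodifiableSet(tgtTypes);
    }

    public static void main(String[] args) {
        InMemoryIDMapper mapper = new InMemoryIDMapper();
        mapper.addMapping("UniProt", "P04637", "Entrez Gene", "7157");
        System.out.println(mapper.mapID(Set.of("P04637"), "UniProt", "Entrez Gene"));
    }
}

// Copied from the proposal so that the sketch is self-contained.
interface IDMapper {
    Map<String, Set<String>> mapID(Set<String> ids, String srcType, String tgtType);
    boolean idExistsInSrcIDType(String srcID, String srcType);
    Set<String> getSupportedSrcIDType();
    Set<String> getSupportedTgtIDType();
}
```

Returning a set per source ID covers one-to-many mappings without a special case, which the `List<String>`-based signature cannot express unambiguously.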

Comments on API

File format

  1. Free format text table file or MS Excel file (ref: import network from table)
    • Each column for one ID type
    • Each row except the first one represents IDs of different types mapping to each other
    • First row contains ID types
    • A cell may contain multiple IDs (one-to-many mapping, or IDs of the same type that map to each other). A special character (e.g. ';' or '/', or a user-defined one) separates the IDs.
  2. Relational database tables to store ID mapping
    • Two solutions:
      • 1. Three tables, similar to PathVisio:
        • DataNode (ID, Type): each tuple contains the ID of a node and its ID type.
        • Link (ID1, ID2): each tuple contains two IDs that map to each other.
        • Info (Version): containing the version of data.
      • 2. Two tables:
        • Data(VID,ID,Type): each tuple contains a virtual ID, the ID of a node and its ID type. IDs mapped to each other have the same virtual ID.
        • Info (Version)
    • Question:
      • Which solution to use?
      • How to store cached data? Separate tables or additional fields?
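
Reading the delimited text format described in item 1 might look like the sketch below. It assumes the file has already been split into lines and that the column and ID delimiters are plain strings chosen by the user. The names `MappingTableParser` and `parse` are hypothetical, and the IDs in the demo are illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical parser for the table format: first row = ID types,
// every further row = IDs of different types that map to each other.
final class MappingTableParser {

    /**
     * @param lines           lines of the mapping file, header row first
     * @param columnDelimiter string separating columns, e.g. a tab
     * @param idDelimiter     string separating multiple IDs in one cell, e.g. ";"
     * @return one map per data row, from ID type to the IDs in that cell
     */
    static List<Map<String, Set<String>>> parse(
            List<String> lines, String columnDelimiter, String idDelimiter) {
        List<Map<String, Set<String>>> rows = new ArrayList<>();
        if (lines.isEmpty()) {
            return rows;
        }
        // Quote the delimiters so that characters such as '|' are taken literally.
        String colRegex = Pattern.quote(columnDelimiter);
        String idRegex = Pattern.quote(idDelimiter);
        String[] types = lines.get(0).split(colRegex, -1);
        for (String line : lines.subList(1, lines.size())) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(colRegex, -1);
            Map<String, Set<String>> row = new LinkedHashMap<>();
            for (int i = 0; i < types.length && i < cells.length; i++) {
                Set<String> ids = new LinkedHashSet<>();
                for (String id : cells[i].split(idRegex)) {
                    String trimmed = id.trim();
                    if (!trimmed.isEmpty()) {
                        ids.add(trimmed);
                    }
                }
                if (!ids.isEmpty()) {
                    row.put(types[i].trim(), ids); // empty cells are omitted
                }
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "UniProt\tEntrez Gene",
                "P04637\t7157",
                "A1;A2\t42");
        System.out.println(parse(lines, "\t", ";"));
    }
}
```

Each parsed row is one group of equivalent IDs, which is also what a virtual ID identifies in the two-table database design, so the same in-memory structure could be loaded into either schema.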

References/Resources

Web services and Relational DB

Tools

Project Management

Project Timeline


Tasks and Milestones

This project will use an incremental approach.


Tentative Schedule

Milestones
  1. Define ID Mapping API

  2. Design the overall system structure and a mock UI

  3. Implement existing attribute-based network merge - First version available. Advanced Network Merge plugin jar file

  4. Finish File-Based ID Mapping

  5. RDB-based ID Mapping

  6. Web Service Based ID Mapping

  7. Integration and Testing

  8. Documentation and Public Release

  9. Optional Tasks
    • XML file readers for ID import
    • Use multiple ID data sources for mapping at once

Project Dependencies

Another GSoC project, taken by ArmanAksoy, is also on Advanced Network Merge. His approach is different: he proposes to merge nodes according to a similarity score rather than ID mappings. However, for the GUI and for resolving attribute conflicts, the two projects can share a common mechanism.

Issues/Discussions/Further investigations

  1. How to design the public interface for ID mapping function? How should Derby DB, Web service clients and ID mapper interact with each other?
    • Better to keep the ID mapping interface simple: it just takes original ID, original ID type, destination ID type as parameters and returns destination ID.
    • Implementation of the ID mapping interface should take the ID mapping sources (Derby DB, custom file and web service) into consideration
      • If the user chooses Derby DB to get ID mapping, just search local Derby DB and return the result.
      • If the user chooses custom file or web service,
        • Try to search the data in Derby first
        • For IDs not found in Derby, try custom file or web service
        • Cache the ID mappings retrieved from the web service in Derby (merging data in Derby may be needed)
  2. Should we implement a number of validations to detect mistakes in the ID mappings? How? -- Need further investigation.
    • GaryBader: It would be good to have a validation and cleaning system that implements a number of validation rules to detect mistakes in the input ID mapping files. Bad ID mappings downloaded from external sources can cause errors in the network merge, which will degrade the data. For instance, if you download uniprot and extract all of the ID mappings, you will find that some of the supposedly equivalent IDs actually point to proteins from different species. So organism, molecule type (rna, dna, protein, gene), etc. should be part of the system and associated with each ID to enable better validation.

    • GaryBader: What I'm thinking here is a module (separate from Cytoscape) that validates and cleans ID mappings gathered from different relatively trusted sources e.g. Ensembl, uniprot, entrez gene, refseq. We've found ID mapping errors in these sources, so it would be useful to create a cleaned set for our users. This could be a separate module to simplify the main API i.e. the main API could assume that the IDs are clean and will simply use them as is. This would also simplify the input formats (text file or database), since they would just need ID1:DB1, ID2:DB2, ID mapping type. If large file size is an issue, these could be compressed in interesting ways given there is a lot of repetition in IDs i.e. all human ensembl IDs start with ENSP000.. and you can convert the strings to integers (but this optimization is getting ahead of ourselves and should be the last thing considered, only if necessary)

  3. For ID mappings, should we limit the types of IDs supported?
    • GaryBader: The user should be allowed to use whatever IDs they want, however we may only want to provide a limited 'suggested' or 'recommended' subset in any files or databases we provide to encourage users to use more consistent IDs in their work and reduce errors and our maintenance costs.

  4. Should it be required that one network can only contain one type of IDs?
    • ThomasKelder: Maybe there are also different options here, depending on the network type. Sometimes the ID type is fixed for an attribute (e.g. an attribute called 'uniprot' typically only contains uniprot IDs), but sometimes the type is defined in another attribute (e.g. an attribute 'ID' and 'system') or concatenated to the ID (e.g. 'Uniprot:U1234'). It will be hard to cover all these options, maybe some investigation is needed on what situations occur most.

    • KeiichiroOno: Ideally, each network contains only one ID set. However, in some cases this is impossible, since one ID set does not contain all of the macro/small molecules (see this example from IntAct: node IDs are UniProt unified accession numbers, which are not available for small molecules). As in IntAct, it is probably OK to use a DBName:ID style ID if the object does not exist in the target ID set. In that case, the DBName:ID should be taken from the primary source of the database, i.e., if the object only exists in KEGG and not in UniProt, it should be KEGG:xxxxx

  5. How to maintain a Derby database on the server and client? Is it affordable?
    • For very large ID lists, getting ID mappings from a web service may take too much time; a local database may save a lot of it. The issue is that effort is needed to maintain/update such a database, and Cytoscape needs to update the local database (either manually or automatically) from time to time. Is it affordable?
    • A possible way: we maintain an ID mapping database/file on the server and update it every few days from online web services/databases (e.g. NCBI, UniProt). The Cytoscape user can manually download/update it. When the user chooses to get ID mappings from web services/online databases, search the local database first; then, for the IDs not contained in the local database, try to get them from the online web services/databases and, if successful, save these ID mappings in the local database. The advantage of this method is that only the used IDs are updated in the local database, rather than the whole database (though the user can still update it manually). This is desirable because a user will most likely use only a subset of the IDs, so it is not necessary to update the whole database.

    • One question: can existing ID mappings change? That is, does updating ID mappings mean only adding new mappings, or can old mappings also change? If the ID mappings can change, it is better to add a time attribute to each ID mapping in the database, so that old (e.g. >3 months) ID mappings in the local database can be refreshed (updating only the IDs used when searching).

    • Even if we do not maintain an ID mapping database on the server, it is still good to maintain a local database/file containing the ID mappings the user has used, so that the user does not need to get them from a web service every time. This may be especially desirable when a tool for mapping IDs is implemented.
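
The "local database first, then web service, then cache" strategy from items 1 and 5, together with the time attribute for refreshing old mappings, could be sketched as below. Everything here is an assumption for illustration: a `HashMap` stands in for the local Derby database, a `Function` stands in for the web service client, the source and target ID types are fixed, and the name `CachingLookupSketch` is hypothetical. IDs the remote source does not know are not cached in this sketch.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a cache-first lookup with per-entry timestamps.
final class CachingLookupSketch {

    /** A cached mapping plus the time at which it was fetched. */
    private static final class CachedEntry {
        final Set<String> targets;
        final long fetchedAtMillis;

        CachedEntry(Set<String> targets, long fetchedAtMillis) {
            this.targets = targets;
            this.fetchedAtMillis = fetchedAtMillis;
        }
    }

    // Stands in for the local database.
    private final Map<String, CachedEntry> cache = new HashMap<>();
    // Stands in for the web service client: takes IDs, returns the mappings it knows.
    private final Function<Set<String>, Map<String, Set<String>>> remote;
    private final long maxAgeMillis;

    CachingLookupSketch(Function<Set<String>, Map<String, Set<String>>> remote,
                        long maxAgeMillis) {
        this.remote = remote;
        this.maxAgeMillis = maxAgeMillis;
    }

    /** Looks up the IDs, asking the remote source only for missing or stale ones. */
    Map<String, Set<String>> lookup(Set<String> ids, long nowMillis) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        Set<String> missing = new LinkedHashSet<>();
        for (String id : ids) {
            CachedEntry entry = cache.get(id);
            if (entry != null && nowMillis - entry.fetchedAtMillis <= maxAgeMillis) {
                result.put(id, entry.targets);   // fresh local hit
            } else {
                missing.add(id);                 // unknown or too old
            }
        }
        if (!missing.isEmpty()) {
            Map<String, Set<String>> fetched = remote.apply(missing);
            for (Map.Entry<String, Set<String>> f : fetched.entrySet()) {
                cache.put(f.getKey(), new CachedEntry(f.getValue(), nowMillis));
                result.put(f.getKey(), f.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        CachingLookupSketch mapper = new CachingLookupSketch(
                wanted -> {
                    System.out.println("remote call for " + wanted);
                    Map<String, Set<String>> found = new HashMap<>();
                    if (wanted.contains("P04637")) {
                        found.put("P04637", Set.of("7157"));
                    }
                    return found;
                },
                1000L);
        System.out.println(mapper.lookup(Set.of("P04637"), 0L));    // fetched remotely
        System.out.println(mapper.lookup(Set.of("P04637"), 500L));  // served from the cache
        System.out.println(mapper.lookup(Set.of("P04637"), 2000L)); // stale, fetched again
    }
}
```

Passing the current time as a parameter keeps the staleness rule easy to test. With this structure only the IDs a user actually looks up are ever refreshed, which matches the advantage described in item 5.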

Comments

How to Comment

Edit the page and add your comments under the provided header. By adding your ideas to the Wiki directly, we can more easily organize everyone's ideas, and keep clear records. Be sure to include today's date and your name for each comment. Try to keep your comments as concrete and constructive as possible. For example, if you find a part of the RFC makes no sense, please say so, but don't stop there. Take the extra step and propose alternatives.

Advanced_Network_Merge_and_ID_Mapping (last edited 2010-04-02 04:45:58 by asp)

Funding for Cytoscape is provided by a federal grant from the U.S. National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH) under award number GM070743-01. Corporate funding is provided through a contract from Unilever PLC.
