RFC Name : Advanced Network Merge and ID Mapping API

Editor(s): Jianjiong Gao, Keiichiro Ono, Thomas Kelder, Martijn van Iersel, Gary Bader

Date: 2008-05-02

Status: open for comment


Proposal

Cytoscape already provides a Merge Networks plugin, which lets the user compute the union, intersection and difference of networks based on node identifiers (IDs). However, nodes with different IDs may have the same biological meaning (e.g. the same protein or gene), especially when they come from different sources. This project enhances the Merge Networks feature so that nodes are matched according to ID mappings, i.e. nodes whose IDs are mapped to each other are treated as the same node.

Use Cases

Five related use cases have been identified on the [http://baderlab.org/IdentifierMapping Bader Lab ID Mapping page]. Two of them are closely related to this project:

Implementation Plan

General Architecture

attachment:system_design.png

Code Base

Workflows

  1. Get ID mappings
    • The user can choose one of the following ID mapping sources:
      • Choose an attribute to use as the ID for matching.
      • Provide a custom mapping in a text file. (The IDs in each row match, i.e. they are biologically the same.)
      • Get ID mappings from a web service.
      • Get ID mappings from a relational database.
      • Get ID mappings from a local database (e.g. an embedded Derby database).
    • Ask the user to select, from a list of ID types, which ID types are used in the networks.
      • The list of ID types may differ between databases and web services. For example, if the user chooses the GenMAPP gene database, he/she can select a few of the 19 supported ID types.
      • Alternatively, following GaryBader's suggestion, we could limit the supported ID types to minimize the chance of errors. To that end, the list would contain only the most commonly used ID types (e.g. Entrez Gene, RefSeq, UniProt and some others), no matter which web service/database the user has chosen.

      • If the user is not sure about what ID types are used in the networks, he/she can choose all.
    • After retrieving the data, report to the user how many IDs were found, how many were not found and how many are ambiguous (the same ID exists in more than one ID type; this should be rare). For the ambiguous ones, ask the user to decide. (This is similar to what [http://david.abcc.ncifcrf.gov/ DAVID] does.)

  2. Merge the networks according to the ID mappings
    • For those nodes whose IDs cannot be found in the ID mappings, merge them according to IDs.
    • For those nodes whose IDs are found in the ID mappings, merge them according to ID mapping.
    • When merging nodes in either of the two cases above, attribute conflicts must be resolved as described in step 3.
  3. Deal with attribute conflicts
    • Let the user rank the networks to be merged by priority, and follow the network with the higher priority when conflicts occur.
    • Let the user choose which attribute value to use for each node. After each choice, prompt with “apply to all?” and a “don’t ask again” check box.
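
The merge and the priority-based conflict rule can be sketched roughly as follows. This is only an illustration, not the plugin's implementation: the dictionary-based network model, the function name `merge_nodes`, the group key `g1`, the node ID `X1` and all attribute values are assumptions made for this example, and edges are left out for brevity.

```python
from typing import Dict, List

# Hypothetical model: a network is {node ID: {attribute name: value}}.
Network = Dict[str, Dict[str, str]]


def merge_nodes(networks: List[Network], id_to_group: Dict[str, str]) -> Network:
    """Merge nodes from networks given in descending priority order.

    id_to_group maps each mapped ID to a key shared by all IDs that are
    mapped to each other. IDs missing from the mapping are merged by
    their own ID (workflow step 2).
    """
    merged: Network = {}
    for network in networks:  # highest priority first
        for node_id, attributes in network.items():
            key = id_to_group.get(node_id, node_id)
            target = merged.setdefault(key, {})
            for name, value in attributes.items():
                # setdefault keeps the value already set by a
                # higher-priority network when a conflict occurs.
                target.setdefault(name, value)
    return merged


# Two toy networks; the first one has the higher priority.
high = {"P15700": {"label": "from network 1"}}
low = {
    "853844": {"label": "from network 2", "score": "0.9"},
    "X1": {"label": "unmapped node"},
}
mapping = {"P15700": "g1", "853844": "g1"}
result = merge_nodes([high, low], mapping)
```

Here `P15700` and `853844` collapse into one node whose `label` comes from the higher-priority network, while `score` is carried over because it does not conflict. The interactive option (asking the user per node) would replace the `setdefault` call with a callback.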

GUI Design

API Design

API for ID mapping

Basically, the interface takes a list of original IDs, the original ID types and a destination ID type as parameters, and returns a list of destination IDs.
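
A minimal sketch of what such an interface could look like is given below. The names `IDMapper`, `InMemoryIDMapper` and `map_ids` are hypothetical, and the return shape (one list of destination IDs per input ID, possibly empty) is an assumption made so that unmapped and multiply mapped IDs can both be represented. The in-memory implementation follows direct links only.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Set, Tuple

TypedID = Tuple[str, str]  # (ID, ID type)


class IDMapper(ABC):
    """Hypothetical ID mapping interface."""

    @abstractmethod
    def map_ids(self, ids: List[str], source_types: Iterable[str],
                target_type: str) -> List[List[str]]:
        """Return, for each input ID, the destination IDs of target_type."""


class InMemoryIDMapper(IDMapper):
    """Toy implementation backed by a list of typed ID pairs."""

    def __init__(self, links: Iterable[Tuple[TypedID, TypedID]]):
        self._table: Dict[TypedID, Set[TypedID]] = {}
        for left, right in links:
            # Mappings are symmetric, so store both directions.
            self._table.setdefault(left, set()).add(right)
            self._table.setdefault(right, set()).add(left)

    def map_ids(self, ids, source_types, target_type):
        source_types = list(source_types)
        results = []
        for identifier in ids:
            found = set()
            for source_type in source_types:
                linked = self._table.get((identifier, source_type), ())
                for other_id, other_type in linked:
                    if other_type == target_type:
                        found.add(other_id)
            results.append(sorted(found))
        return results


mapper = InMemoryIDMapper([(("P15700", "UniProt"), ("853844", "EntrezGene"))])
mapped = mapper.map_ids(["P15700", "UNKNOWN"], ["UniProt"], "EntrezGene")
```

Implementations backed by a file, a Derby database or a web service could then share the same `map_ids` signature.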

Comments on API

File format

  1. User custom file format
    • Similar to the text file for importing networks, because an ID mapping can be viewed as a graph: nodes represent IDs and edges represent mappings between IDs.
    • The file comprises three sections:
      • First line: n, the number of nodes in this file
      • Each of the next n lines: an ID followed by its ID type
      • Each remaining line: a link (ID to ID)
    • An example
         3
         P15700   UniProt
         M31455   GenBank
         853844   EntrezGene
         P15700   M31455
         P15700   853844
         M31455   853844
    • Note:
      • Redundant ID mappings in the file are allowed (e.g. the last row “M31455 853844” is redundant, since it can be inferred from “P15700 M31455” and “P15700 853844”).
    • Questions:
      • Could two IDs of the same type be linked?
  2. Relational database tables to store ID mapping
    • Two solutions:
      • 1. Three tables similar to those in PathVisio:
        • DataNode (ID, Type): each tuple contains the ID of a node and its ID type.
        • Link (ID1, ID2): each tuple contains two IDs that map to each other.
        • Info (Version): contains the version of the data.
      • 2. Two tables:
        • Data(VID,ID,Type): each tuple contains a virtual ID, the ID of a node and its ID type. IDs mapped to each other have the same virtual ID.
        • Info (Version)
    • Questions:
      • Which solution to use?
      • How to store cached data? Separate tables or additional fields?
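
To make the relation between the custom file format and the two-table solution concrete, the sketch below parses the example file and groups linked IDs into rows of the form (VID, ID, Type). It is an illustration only: the function names are hypothetical, fields are assumed to be whitespace-separated, and every link is assumed to reference an ID declared in the node section.

```python
from typing import Dict, List, Tuple


def parse_mapping_file(text: str) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Parse the three-section format: count, ID/type lines, link lines."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    count = int(lines[0][0])
    id_types = {fields[0]: fields[1] for fields in lines[1:1 + count]}
    links = [(fields[0], fields[1]) for fields in lines[1 + count:]]
    return id_types, links


def assign_virtual_ids(id_types: Dict[str, str],
                       links: List[Tuple[str, str]]) -> List[Tuple[int, str, str]]:
    """Give all IDs connected by links the same virtual ID (union-find)."""
    parent = {identifier: identifier for identifier in id_types}

    def find(item: str) -> str:
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for left, right in links:
        parent[find(left)] = find(right)

    virtual_ids: Dict[str, int] = {}
    rows = []
    for identifier, id_type in id_types.items():
        root = find(identifier)
        vid = virtual_ids.setdefault(root, len(virtual_ids) + 1)
        rows.append((vid, identifier, id_type))
    return rows


example = """3
P15700   UniProt
M31455   GenBank
853844   EntrezGene
P15700   M31455
P15700   853844
M31455   853844
"""
id_types, links = parse_mapping_file(example)
rows = assign_virtual_ids(id_types, links)
```

All three IDs in the example end up with the same virtual ID, and dropping the redundant last link would not change the result. Grouping by connectivity treats the mappings as transitive, which is itself an assumption that the validation discussion below may call into question.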

References/Resources

Web services and Relational DB

Tools

Project Management

Project Timeline


Tasks and Milestones

This project will use an incremental approach.


Tentative Schedule

  1. Milestone 1: Define ID Mapping API

  2. Milestone 2: Design overall system structure and mock UI

  3. Milestone 3: Finish File-Based network Merge

  4. Milestone 4: DB based Merge

  5. Milestone 5: Web-Service Based Merge

  6. Milestone 6: Integration and Testing

  7. Milestone 7: Documentation and Public Release

Project Dependencies

Another GSoC project, taken by ArmanAksoy, also addresses advanced network merging. His approach is different: he proposes to merge nodes according to a similarity score rather than ID mappings. However, the GUI and the resolution of attribute conflicts can share a common mechanism between the two projects.

Issues/Discussions/Further investigations

  1. How should the public interface for the ID mapping function be designed? How should the Derby DB, the web service clients and the ID mapper interact with each other?
    • Better to keep the ID mapping interface simple: it just takes the original ID, the original ID type and the destination ID type as parameters and returns the destination ID.
    • The implementation of the ID mapping interface should take the ID mapping sources (Derby DB, custom file and web service) into consideration.
      • If the user chooses the Derby DB to get ID mappings, just search the local Derby DB and return the result.
      • If the user chooses a custom file or a web service,
        • Try to search the data in Derby first.
        • For IDs not found in Derby, try the custom file or web service.
        • Cache the ID mappings retrieved from the web service in Derby (merging data in Derby may be needed).
  2. Should we implement a number of validations to detect mistakes in the ID mappings? How? -- Need further investigation.
    • GaryBader: It would be good to have a validation and cleaning system that implements a number of validation rules to detect mistakes in the input ID mapping files. Bad ID mappings downloaded from external sources can cause errors in the network merge, which will degrade the data. For instance, if you download uniprot and extract all of the ID mappings, you will find that some of the supposedly equivalent IDs actually point to proteins from different species. So organism, molecule type (rna, dna, protein, gene), etc. should be part of the system and associated with each ID to enable better validation.

    • GaryBader: What I'm thinking here is a module (separate from Cytoscape) that validates and cleans ID mappings gathered from different relatively trusted sources e.g. Ensembl, uniprot, entrez gene, refseq. We've found ID mapping errors in these sources, so it would be useful to create a cleaned set for our users. This could be a separate module to simplify the main API i.e. the main API could assume that the IDs are clean and will simply use them as is. This would also simplify the input formats (text file or database), since they would just need ID1:DB1, ID2:DB2, ID mapping type. If large file size is an issue, these could be compressed in interesting ways given there is a lot of repetition in IDs i.e. all human ensembl IDs start with ENSP000.. and you can convert the strings to integers (but this optimization is getting ahead of ourselves and should be the last thing considered, only if necessary)

  3. For ID mappings, should we limit the types of IDs supported?
    • GaryBader: The user should be allowed to use whatever IDs they want, however we may only want to provide a limited 'suggested' or 'recommended' subset in any files or databases we provide to encourage users to use more consistent IDs in their work and reduce errors and our maintenance costs.

  4. Should it be required that one network can only contain one type of IDs?
    • ThomasKelder: Maybe there are also different options here, depending on the network type. Sometimes the ID type is fixed for an attribute (e.g. an attribute called 'uniprot' typically only contains uniprot IDs), but sometimes the type is defined in another attribute (e.g. an attribute 'ID' and 'system') or concatenated to the ID (e.g. 'Uniprot:U1234'). It will be hard to cover all these options, maybe some investigation is needed on what situations occur most.

    • KeiichiroOno: Ideally, each network contains only one ID set. However, in some cases this is impossible, since one ID set does not contain all of the macro/small molecules (see this example from IntAct. Node IDs are UniProt unified acc. numbers, but these are not available for small molecules). Just like IntAct, it is probably OK to use a DBName:ID style ID if the object does not exist in the target ID set. In that case, the DBName:ID should be taken from the primary source of the database, i.e., if the object only exists in KEGG and not in UniProt, it should be KEGG:xxxxx

  5. How to maintain a Derby database on the server and client? Is it affordable?
    • For very large ID lists, getting ID mappings from a web service may take too much time, and a local database may save a lot of it. The issue is that effort is needed to maintain/update such a database, and Cytoscape needs to update the local database (either manually or automatically) from time to time. Is it affordable?
    • One possible approach: we maintain an ID mapping database/file on the server and update it every few days from online web services/databases (e.g. NCBI, UniProt). The Cytoscape user can manually download or update it. When the user chooses to get ID mappings from web services/online databases, search the local database first; then, for the IDs that are not in the local database, try to get them from the online web services/databases and, if successful, save these ID mappings in the local database. The advantage of this method is that only the IDs actually used are updated in the local database, rather than the whole database (though the user can still update it manually). This is desirable because a user will most likely use only a subset of the IDs, so it is not necessary to update the whole database.

    • One question: can existing ID mappings change? In other words, does updating ID mappings only add new mappings, or can it also change old ones? If mappings can change, it is better to store a time attribute with each ID mapping in the database, so that old mappings (e.g. older than 3 months) can be refreshed in the local database (updating only the IDs that are actually used when searching).

    • Even if we do not maintain an ID mapping database on the server, it is still good to maintain a local database/file containing the ID mappings the user has used, so that the user does not need to get them from the web service every time. This may be especially desirable once a tool for mapping IDs is implemented.
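
The local-first lookup with time-stamped cache entries discussed in items 1 and 5 could look roughly like the sketch below. It is only an illustration under several assumptions: a plain dictionary stands in for the Derby database, `remote_lookup` is a hypothetical callable standing in for a web service client, and the 90-day limit merely mirrors the "3 months" example above.

```python
import time
from typing import Callable, Dict, List, Tuple

Key = Tuple[str, str, str]  # (source ID, source ID type, destination ID type)


class CachingMapper:
    """Search the local cache first, then fall back to a remote source."""

    def __init__(self, remote_lookup: Callable[[str, str, str], List[str]],
                 max_age_seconds: float,
                 clock: Callable[[], float] = time.time):
        self._remote = remote_lookup      # hypothetical web service client
        self._max_age = max_age_seconds
        self._clock = clock
        # A dict stands in for the local Derby database in this sketch.
        self._cache: Dict[Key, Tuple[List[str], float]] = {}

    def lookup(self, source_id: str, source_type: str,
               target_type: str) -> List[str]:
        key = (source_id, source_type, target_type)
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now - entry[1] <= self._max_age:
            return entry[0]  # fresh local hit, no remote call
        targets = self._remote(source_id, source_type, target_type)
        if targets:
            # Only successful lookups are saved, with their retrieval time.
            self._cache[key] = (targets, now)
        return targets


calls = []


def fake_remote(source_id, source_type, target_type):
    calls.append(source_id)
    return ["853844"] if source_id == "P15700" else []


now = [0.0]
day = 24 * 3600
mapper = CachingMapper(fake_remote, max_age_seconds=90 * day, clock=lambda: now[0])
first = mapper.lookup("P15700", "UniProt", "EntrezGene")   # remote call
second = mapper.lookup("P15700", "UniProt", "EntrezGene")  # served locally
now[0] += 91 * day
third = mapper.lookup("P15700", "UniProt", "EntrezGene")   # stale, refreshed
```

Only the IDs that are actually looked up get refreshed, which matches the idea of updating the used subset rather than the whole database.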

Comments

How to Comment

Edit the page and add your comments under the provided header. By adding your ideas to the Wiki directly, we can more easily organize everyone's ideas, and keep clear records. Be sure to include today's date and your name for each comment. Try to keep your comments as concrete and constructive as possible. For example, if you find a part of the RFC makes no sense, please say so, but don't stop there. Take the extra step and propose alternatives.

Funding for Cytoscape is provided by a federal grant from the U.S. National Institute of General Medical Sciences (NIGMS) of the National Institutes of Health (NIH) under award number GM070743-01. Corporate funding is provided through a contract from Unilever PLC.
