Advanced_Network_Merge_and_ID_Mapping

RFC Name : Advanced Network Merge and ID Mapping API

Editor(s): Jianjiong Gao, Keiichiro Ono, Thomas Kelder, Martijn van Iersel, Gary Bader

Date: 2008-05-02

Status: open for comment

Proposal

Cytoscape already ships a Merge Networks plugin, which lets the user compute the union, intersection and difference of networks based on node identifiers (IDs). However, nodes with different IDs may represent the same biological entity (e.g. the same protein or gene), especially when they come from different sources. This project will enhance the Merge Networks feature so that nodes are matched according to ID mappings, i.e. nodes whose IDs map to each other are treated as the same node.

Use Cases

Five related use cases have been identified on the [http://baderlab.org/IdentifierMapping Bader Lab ID Mapping page]. Two of them are closely related to this project:

Unification during dataset merging: During a merge operation e.g. of two protein-protein interaction datasets from independently created databases, it is vital to recognize that two protein objects, one from each data source, represent the same protein molecule, even if the protein objects don’t share any database accession numbers. Unification requires knowledge of record type e.g. you cannot reliably use a gene ID to unify proteins (mostly because splice variants exist).
Identifier translation: Some analysis methods require specific translations from one set of identifiers to another. For instance, our 'activity centers' analysis requires translation from protein or gene identifiers in a pathway database to Affymetrix probe set identifiers or other gene expression array platform identifiers.

Implementation Plan

General Architecture

attachment:system_design.png

This project will be divided into 3 parts.
1. ID Mapping Module
  - Implements the function that maps one ID set (UniProt, NCBI Gene ID, etc.) to another.
  - This function should be accessible from other modules (plugins). A public API will be published for other plugin developers.
2. Database Module
  - Wraps Derby with a generic interface for Cytoscape.
  - Stores mappings locally.
  - Caches data retrieved from remote sources (web services).
3. Network Merge Module
  - Performs the actual merging of multiple networks.
  - Deals with attribute conflicts.
  - Includes a new GUI.
The following module is optional and will be implemented if time permits:
- ID mapping validator

Code Base

Based on the Cytoscape 2.6.x branch, since 3.0 will not be available until next year.
However, we should try to minimize the work needed to port to the 3.0 series. This should be done through interface-based design, i.e. designing a clean API that faces the outside world and is usable by other plugin developers.
Interoperability between modules (plugins) should be considered.

Workflows

Get ID mappings
- The user can choose one of the following ID mapping sources:
  - Choose an attribute as ID to match.
  - Provide custom mapping with a text file. (IDs in each row match--biologically the same)
  - Get ID mappings from a web service
  - Get ID mappings from a relational database.
  - Get ID mappings from a local database (e.g. an embedded Derby database)
- Ask the user to select which ID types are used in the networks from a list of ID types.
  - The list of ID types may differ between databases or web services. For example, if the user chooses the GenMAPP gene database, he/she can select a few out of the 19 supported ID types.
  - Or, following GaryBader's suggestion, we should limit the types of IDs supported in order to minimize the chance of errors. For that purpose, we should only provide the most commonly used ID types (e.g. Entrez Gene, RefSeq, UniProt and some others) in the list of ID types, no matter which web service/database the user has chosen.
  - If the user is not sure which ID types are used in the networks, he/she can choose all.
- After retrieving the data, report to the user how many IDs were found, how many were not, and how many are ambiguous (the same ID existing under different ID types; this should be rare). For the ambiguous ones, ask the user to decide. (This is similar to what [http://david.abcc.ncifcrf.gov/ DAVID] has done.)
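The reporting step above can be sketched as follows. This is a minimal illustration, not part of the proposed API: the class name `LookupReport` and the input shape (a map from each known ID to the set of ID types it occurs under) are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not part of the RFC): sorts the input IDs into
// found / not found / ambiguous, given the ID types each ID occurs under.
class LookupReport {
    final List<String> found = new ArrayList<>();
    final List<String> notFound = new ArrayList<>();
    final List<String> ambiguous = new ArrayList<>();

    static LookupReport classify(List<String> ids, Map<String, Set<String>> typesById) {
        LookupReport report = new LookupReport();
        for (String id : ids) {
            Set<String> types = typesById.get(id);
            if (types == null || types.isEmpty()) {
                report.notFound.add(id);      // unknown to the mapping source
            } else if (types.size() > 1) {
                report.ambiguous.add(id);     // same ID string under several ID types
            } else {
                report.found.add(id);
            }
        }
        return report;
    }
}
```

The sizes of the three lists give the counts to show to the user, and the `ambiguous` list is what the user would be asked to resolve.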
Merge the networks according to the ID mappings
- For those nodes whose IDs cannot be found in the ID mappings, merge them according to IDs.
- For those nodes whose IDs are found in the ID mappings, merge them according to ID mapping.
- When merging nodes in either of the two cases above, attribute conflicts need to be resolved as described under "Deal with attribute conflicts" below.
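One way to decide which nodes end up merged is to treat every ID mapping as a link and group IDs with a union-find structure. The sketch below is only an illustration under that assumption; `NodeMatcher` and its methods are hypothetical names, and the RFC does not prescribe this algorithm.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: groups node IDs that should become one merged node.
class NodeMatcher {
    private final Map<String, String> parent = new HashMap<>();

    // Returns the representative ID of the group that contains 'id'.
    String find(String id) {
        parent.putIfAbsent(id, id);          // an unseen ID starts in its own group
        String p = parent.get(id);
        if (!p.equals(id)) {
            p = find(p);
            parent.put(id, p);               // path compression
        }
        return p;
    }

    // Records one ID mapping: the two IDs refer to the same entity.
    void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (!ra.equals(rb)) {
            parent.put(ra, rb);
        }
    }

    // Nodes match if their IDs are equal or connected through ID mappings.
    boolean sameNode(String a, String b) {
        return find(a).equals(find(b));
    }
}
```

IDs that never appear in a mapping stay in their own group, so they are matched only by identical ID, which corresponds to the first case above.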
Deal with attribute conflicts
- Let the user rank the networks to be merged by priority; when conflicts occur, use the value from the network with the higher priority.
- Let the user choose which attribute value should be used for each node. After each choice, show a prompt asking “apply to all?” with a “don’t ask again” check box.
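The priority-based option can be sketched as below. This is a minimal example assuming each matched node's attributes are available as a name-to-value map with non-null values; `AttributeMerger` is a hypothetical name, and the interactive option is not covered.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of priority-based conflict resolution.
class AttributeMerger {
    // 'byPriority' holds the attributes of the same matched node in each
    // network, ordered from highest to lowest priority.
    static Map<String, String> merge(List<Map<String, String>> byPriority) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map<String, String> attrs : byPriority) {
            for (Map.Entry<String, String> e : attrs.entrySet()) {
                // Keep the first value seen, i.e. the highest-priority one.
                merged.putIfAbsent(e.getKey(), e.getValue());
            }
        }
        return merged;
    }
}
```

Attributes that exist in only one network are carried over unchanged; only attributes present in several networks are subject to the priority rule.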

GUI Design

A GUI Mockup: http://web.missouri.edu/~jg722/GSoC/GUIMockUp/launch.jnlp

API Design

API for ID mapping

Basically, the interface takes a list of original IDs, the original ID type and the destination ID type as parameters and returns a list of destination IDs.

 import java.util.List;

 public interface IDMapper {
     /**
      * Map a list of IDs of one type to a list of IDs of another type.
      * @param idSrc The list of source IDs, i.e., the list of IDs to be mapped
      * @param typeSrc The type (e.g. UniProt ID) of the source IDs
      * @param typeDst The destination ID type
      * @return a list of destination IDs
      */
     public List<String> mapID(List<String> idSrc, String typeSrc, String typeDst);
 }

 public interface IDMapperDB extends IDMapper {...}

 public class IDMapperDerby implements IDMapperDB {...}
 public class IDMapperWebservice implements IDMapper {...}
 public class IDMapperFile implements IDMapper {...}

Question:
- Interface VS abstract Class?
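On the interface versus abstract class question, the two are not mutually exclusive. The sketch below shows one possible arrangement: callers depend on the `IDMapper` interface, while an abstract base class holds shared convenience code. `AbstractIDMapper` and `MapBackedIDMapper` are hypothetical names introduced for this example. The sketch also assumes that `mapID` returns all destination IDs found, flattened into one list, which the RFC does not specify.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Repeated from the RFC so that the sketch compiles on its own.
interface IDMapper {
    List<String> mapID(List<String> idSrc, String typeSrc, String typeDst);
}

// Hypothetical base class: shared helpers live here.
abstract class AbstractIDMapper implements IDMapper {
    // Convenience overload for a single source ID.
    public List<String> mapID(String idSrc, String typeSrc, String typeDst) {
        return mapID(Collections.singletonList(idSrc), typeSrc, typeDst);
    }
}

// Toy implementation backed by an in-memory table, for illustration only.
class MapBackedIDMapper extends AbstractIDMapper {
    // Key: "typeSrc|idSrc|typeDst"; value: the destination IDs.
    private final Map<String, List<String>> table = new HashMap<>();

    void add(String typeSrc, String idSrc, String typeDst, String idDst) {
        String key = typeSrc + "|" + idSrc + "|" + typeDst;
        table.computeIfAbsent(key, k -> new ArrayList<>()).add(idDst);
    }

    @Override
    public List<String> mapID(List<String> idSrc, String typeSrc, String typeDst) {
        List<String> result = new ArrayList<>();
        for (String id : idSrc) {
            List<String> hits = table.get(typeSrc + "|" + id + "|" + typeDst);
            if (hits != null) {
                result.addAll(hits);   // assumption: results are flattened
            }
        }
        return result;
    }
}
```

With this layout, implementations such as the Derby, web service and file mappers could either extend the base class or implement the interface directly.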

Comments on API

KeiichiroOno (2008-05-02): It (API for ID mapping) should be as simple as possible; just takes original ID (or ID set), original ID type (entrez gene id, uniprot id, etc.), and destination ID type.
GaryBader (2008-05-03): I think the default query should be as Kei suggested, but some users will want to merge in specific ways (e.g. unification in one species vs. cross-species), so having an optional 'id mapping type' controlled vocabulary could be useful. I would suggest creating some test cases of actual different types of ID mappings that users typically do - these could be used to test the generality of the API.
PietMolenaar (2008-05-05): Would it be possible to introduce an additional interface IDMapperDb, of which IDMapperDerby is a sample implementation, in order to facilitate generalized mapping to relational databases?

File format

User custom file format
- Similar to the text file used for importing networks, because an ID mapping can be viewed as a graph: nodes represent IDs, edges represent mappings between IDs.
- The file consists of 3 sections:
  - First line: n, the number of nodes (IDs) in this file
  - Each of the next n lines: an ID followed by its ID type
  - Each of the remaining lines: a link (ID to ID)
- An example
```
   3
   P15700   UniProt
   M31455   GenBank
   853844   EntrezGene
   P15700   M31455
   P15700   853844
   M31455   853844
```
- Note:
  - Redundant ID mappings in the file are allowed (e.g. the last row “M31455 853844” is redundant, since it can be inferred from “P15700 M31455” and “P15700 853844”).
- Questions:
  - Could two IDs of the same type be linked?
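A reader for this format could look like the sketch below. It assumes fields are separated by whitespace and that IDs contain no whitespace, neither of which the RFC states; `MappingFile` is a hypothetical name. Keying the type table by ID alone also assumes that an ID string appears under only one type within a file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for the custom mapping file described above.
class MappingFile {
    final Map<String, String> typeById = new LinkedHashMap<>(); // ID -> ID type
    final List<String[]> links = new ArrayList<>();             // pairs of mapped IDs

    static MappingFile parse(String text) {
        // Collect non-blank lines, split into whitespace-separated fields.
        List<String[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            line = line.trim();
            if (!line.isEmpty()) {
                rows.add(line.split("\\s+"));
            }
        }
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("empty mapping file");
        }
        int n = Integer.parseInt(rows.get(0)[0]);   // first line: number of IDs
        if (rows.size() < n + 1) {
            throw new IllegalArgumentException("expected " + n + " ID lines");
        }
        MappingFile file = new MappingFile();
        for (int i = 1; i < rows.size(); i++) {
            String[] row = rows.get(i);
            if (row.length != 2) {
                throw new IllegalArgumentException("expected two fields on data line " + i);
            }
            if (i <= n) {
                file.typeById.put(row[0], row[1]);  // section 2: ID and its type
            } else {
                file.links.add(row);                // section 3: ID to ID link
            }
        }
        return file;
    }
}
```

Parsing the example above yields three entries in `typeById` and three entries in `links`.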
Relational database tables to store ID mapping
- Two solutions:
  - 1. Three tables, similar to those in PathVisio:
    - DataNode (ID, Type): each tuple contains the ID of a node and its ID type.
    - Link (ID1,ID2): each tuple contains two IDs that map to each other.
    - Info (Version): contains the version of the data.
  - 2. Two tables:
    - Data(VID,ID,Type): each tuple contains a virtual ID, the ID of a node and its ID type. IDs mapped to each other have the same virtual ID.
    - Info (Version)
- Questions:
  - Which solution to use?
  - How to store cached data? Separate tables or additional fields?
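To make the second solution more concrete, the sketch below models the `Data(VID, ID, Type)` table in memory and shows how a lookup would work: find the virtual ID of the source ID, then return every ID of the destination type that shares it. In a relational database this would be a self-join on VID. `VirtualIdTable` is a hypothetical name, and the in-memory list only stands in for the real table.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the proposed Data(VID, ID, Type) table.
class VirtualIdTable {
    // One row of the table.
    static final class Row {
        final int vid;
        final String id;
        final String type;

        Row(int vid, String id, String type) {
            this.vid = vid;
            this.id = id;
            this.type = type;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    void add(int vid, String id, String type) {
        rows.add(new Row(vid, id, type));
    }

    // Returns the IDs of type 'typeDst' that share a virtual ID with the source ID.
    List<String> map(String idSrc, String typeSrc, String typeDst) {
        List<String> result = new ArrayList<>();
        for (Row src : rows) {
            if (!src.id.equals(idSrc) || !src.type.equals(typeSrc)) {
                continue;
            }
            for (Row dst : rows) {
                if (dst.vid == src.vid && dst.type.equals(typeDst)) {
                    result.add(dst.id);
                }
            }
        }
        return result;
    }
}
```

Compared with the three-table solution, a lookup here needs no graph traversal over links, at the cost of having to reassign virtual IDs whenever a new mapping joins two previously separate groups.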

References/Resources

Related work

SampleWebServiceClients by KeiichiroOno
5 related use cases identified on [http://baderlab.org/IdentifierMapping Bader Lab ID Mapping page]
[http://www.pathvisio.org/ PathVisio] synonym databases store the mappings from Ensembl using Derby database ( [http://ftp2.bigcat.unimaas.nl/~martijn.vaniersel/pathvisio/daily/javadoc package] & [http://svn.bigcat.unimaas.nl/pathvisio/trunk/src/core/org/pathvisio/data src]--GDB classes in the org.pathvisio.data)
GeneNameMapping

Web services and Relational DB

[http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html NCBI web service]
[http://www.ebi.ac.uk/uniprot/remotingAPI/index.html UniProt API]
[http://www.ebi.ac.uk/Tools/webservices/ Ensembl web service]
[http://www.biomart.org/martservice.html BioMart]
[http://www.pathvisio.org/Help#Supported_database_systems PathVisio/WikiPathways synonym database]
[http://genmapp.org/help_v2/GeneDatabase.htm GenMAPP gene database]

Tools

[http://db.apache.org/derby/ Derby database]
ScriptingPlugins can be used to test web services before writing actual Java code. The Ruby plugin includes SOAP utilities and BioRuby.
[http://www.netbeans.org/kb/55/websvc-jax-ws-asynch.html NetBeans IDE]: a nice tool for mocking up GUIs. It has a visual GUI editor for Swing, and it also has tools for developing web service clients easily from the GUI.

Project Management

Project Timeline

Overall schedule for Google Summer of Code is available [http://www.google.com/calendar/embed?src=gsummerofcode@gmail.com&ctz=America/Los_Angeles here].


Tasks and Milestones

This project will use an incremental approach.


Tentative Schedule

Milestone 1: Define ID Mapping API
Milestone 2: Design overall system structure and mock UI
Milestone 3: Finish File-Based network Merge
Milestone 4: DB based Merge
Milestone 5: Web-Service Based Merge
Milestone 6: Integration and Testing
Milestone 7: Documentation and Public Release

Project Dependencies

Another GSoC project taken by ArmanAksoy is also Advanced Network Merge. His approach is different: he proposed to merge nodes according to similarity score rather than ID mappings. However, for the GUI part and the part of deciding attribute conflicts, these two projects can share a common mechanism.

Related RFCs

[:WebServicesIDMapping:RFC 29]: Web services API for ID mapping/translator service
[:DataIntegration:RFC 39]: Cytoscape Data Integration
[:BioWebServiceConnectivity:RFC 45]: Web Services Client Manager and Unified Network/Attribute Import Mechanism

Issues/Discussions/Further investigations

How to design the public interface for ID mapping function? How should Derby DB, Web service clients and ID mapper interact with each other?
- Better to keep the ID mapping interface simple: it just takes original ID, original ID type, destination ID type as parameters and returns destination ID.
- Implementation of the ID mapping interface should take the ID mapping sources (Derby DB, custom file and web service) into consideration
  - If the user chooses Derby DB to get ID mapping, just search local Derby DB and return the result.
  - If the user chooses custom file or web service,
    - Try to search the data in Derby first
    - For IDs not found in Derby, try custom file or web service
    - Cache the ID mappings retrieved from the web service in Derby (merging with existing data in Derby may be needed)
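The local-first lookup described above can be sketched as a wrapper around two mappers. `CachingStore`, `MapStore` and `CachedIDMapper` are hypothetical names; `MapStore` is an in-memory stand-in for the Derby-backed store, and the sketch queries the remote source one ID at a time for simplicity, whereas a real client would probably batch the misses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Repeated from the RFC so that the sketch compiles on its own.
interface IDMapper {
    List<String> mapID(List<String> idSrc, String typeSrc, String typeDst);
}

// Hypothetical extension: a local mapper that can also store new mappings.
interface CachingStore extends IDMapper {
    void store(String idSrc, String typeSrc, String typeDst, List<String> idDst);
}

// In-memory stand-in for the Derby-backed store.
class MapStore implements CachingStore {
    private final Map<String, List<String>> cache = new HashMap<>();

    private static String key(String id, String typeSrc, String typeDst) {
        return typeSrc + "|" + id + "|" + typeDst;
    }

    public void store(String idSrc, String typeSrc, String typeDst, List<String> idDst) {
        cache.put(key(idSrc, typeSrc, typeDst), new ArrayList<>(idDst));
    }

    public List<String> mapID(List<String> idSrc, String typeSrc, String typeDst) {
        List<String> result = new ArrayList<>();
        for (String id : idSrc) {
            List<String> hit = cache.get(key(id, typeSrc, typeDst));
            if (hit != null) {
                result.addAll(hit);
            }
        }
        return result;
    }
}

// Tries the local store first, falls back to the remote source, caches hits.
class CachedIDMapper implements IDMapper {
    private final CachingStore local;
    private final IDMapper remote;

    CachedIDMapper(CachingStore local, IDMapper remote) {
        this.local = local;
        this.remote = remote;
    }

    public List<String> mapID(List<String> idSrc, String typeSrc, String typeDst) {
        List<String> result = new ArrayList<>();
        for (String id : idSrc) {
            List<String> one = Collections.singletonList(id);
            List<String> hits = local.mapID(one, typeSrc, typeDst);
            if (hits.isEmpty()) {
                hits = remote.mapID(one, typeSrc, typeDst);   // custom file or web service
                if (!hits.isEmpty()) {
                    local.store(id, typeSrc, typeDst, hits);  // cache for the next query
                }
            }
            result.addAll(hits);
        }
        return result;
    }
}
```

Because the wrapper itself implements `IDMapper`, callers do not need to know whether a result came from the local database or from a remote source.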
Should we implement a number of validations to detect mistakes in the ID mappings? How? -- Need further investigation.
- GaryBader: It would be good to have a validation and cleaning system that implements a number of validation rules to detect mistakes in the input ID mapping files. Bad ID mappings downloaded from external sources can cause errors in the network merge, which will degrade the data. For instance, if you download uniprot and extract all of the ID mappings, you will find that some of the supposedly equivalent IDs actually point to proteins from different species. So organism, molecule type (rna, dna, protein, gene), etc. should be part of the system and associated with each ID to enable better validation.
- GaryBader: What I'm thinking here is a module (separate from Cytoscape) that validates and cleans ID mappings gathered from different relatively trusted sources e.g. Ensembl, uniprot, entrez gene, refseq. We've found ID mapping errors in these sources, so it would be useful to create a cleaned set for our users. This could be a separate module to simplify the main API i.e. the main API could assume that the IDs are clean and will simply use them as is. This would also simplify the input formats (text file or database), since they would just need ID1:DB1, ID2:DB2, ID mapping type. If large file size is an issue, these could be compressed in interesting ways given there is a lot of repetition in IDs i.e. all human ensembl IDs start with ENSP000.. and you can convert the strings to integers (but this optimization is getting ahead of ourselves and should be the last thing considered, only if necessary)
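As an illustration of what one validation rule might look like, the sketch below flags mappings whose two sides are annotated with different organisms. It assumes each ID carries an organism annotation, as suggested above; `AnnotatedId` and `MappingValidator` are hypothetical names, and a real validator would need further rules (molecule type, for example).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record of an ID together with its source database and organism.
class AnnotatedId {
    final String id;
    final String db;
    final String organism;

    AnnotatedId(String id, String db, String organism) {
        this.id = id;
        this.db = db;
        this.organism = organism;
    }
}

// Hypothetical validator with a single rule: both sides of a mapping
// should belong to the same organism.
class MappingValidator {
    // Each mapping is a two-element array {left, right}.
    static List<AnnotatedId[]> crossSpeciesPairs(List<AnnotatedId[]> mappings) {
        List<AnnotatedId[]> suspicious = new ArrayList<>();
        for (AnnotatedId[] pair : mappings) {
            if (!pair[0].organism.equals(pair[1].organism)) {
                suspicious.add(pair);
            }
        }
        return suspicious;
    }
}
```

Flagged pairs could be dropped from the cleaned set or reported for manual review; which of the two is appropriate is left open here.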
For ID mappings, should we limit the types of IDs supported?
- GaryBader: The user should be allowed to use whatever IDs they want, however we may only want to provide a limited 'suggested' or 'recommended' subset in any files or databases we provide to encourage users to use more consistent IDs in their work and reduce errors and our maintenance costs.
Should it be required that one network can only contain one type of IDs?
- ThomasKelder: Maybe there are also different options here, depending on the network type. Sometimes the ID type is fixed for an attribute (e.g. an attribute called 'uniprot' typically only contains uniprot IDs), but sometimes the type is defined in another attribute (e.g. an attribute 'ID' and 'system') or concatenated to the ID (e.g. 'Uniprot:U1234'). It will be hard to cover all these options, maybe some investigation is needed on what situations occur most.
- KeiichiroOno: Ideally, each network contains only one ID set. However, in some cases it is impossible since one ID set does not contain all of the macro/small molecules (see this example from IntAct. Node IDs are UniProt unified acc. number, but it is not available for small molecules). Just like IntAct, probably it is OK to use DBName:ID style ID if the object does not exists in the target ID set. In that case, the DBName:ID should be taken from the primary source of the database, i.e., if the object only exists in KEGG and not in UniProt, it should be KEGG:xxxxx
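Handling the DBName:ID style mentioned in both comments could start from a small parser like the one below. `TypedId` is a hypothetical name, and the rule used here (split at the first colon, otherwise fall back to a default type) is an assumption. It would misread identifiers that contain a colon as part of the ID itself, such as Gene Ontology terms like GO:0008150, so a real implementation would need a list of known prefixes.

```java
// Hypothetical value class for an ID together with its ID type.
class TypedId {
    final String type;
    final String id;

    TypedId(String type, String id) {
        this.type = type;
        this.id = id;
    }

    // Splits "DBName:ID" at the first colon. IDs without a usable prefix
    // get 'defaultType', e.g. the ID type chosen for the whole attribute.
    static TypedId parse(String raw, String defaultType) {
        int colon = raw.indexOf(':');
        if (colon > 0 && colon < raw.length() - 1) {
            return new TypedId(raw.substring(0, colon), raw.substring(colon + 1));
        }
        return new TypedId(defaultType, raw);
    }
}
```

For example, parsing 'Uniprot:U1234' gives type "Uniprot" and ID "U1234", while a bare ID keeps the default type.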
How to maintain a Derby database on the server and client? Is it affordable?
- For very large ID lists, it may take too much time to get ID mappings from a web service, and a local database may save a lot of time. The issue is that effort may be needed to maintain/update such a database, and Cytoscape needs to update the local database (either manually or automatically) from time to time. Is it affordable?
- A possible approach: we maintain an ID mapping database/file on the server and update it every few days from online web services/databases (e.g. NCBI, UniProt). The Cytoscape user can manually download/update it. When the user chooses to get ID mappings from web services/online databases, search the local database first; then, for the IDs not contained in the local database, try to get them from online web services/databases and, if successful, save these ID mappings in the local database. The advantage of this method is that only the IDs actually used are updated in the local database, rather than the whole database (though the user can still update it manually). This is desirable because a given user will most likely use only a subset of the IDs, so it is not necessary to update the whole database.
- One question: can existing ID mappings change? That is, does updating ID mappings only add new mappings, or can it also change old ones? If mappings can change, it is better to add a time attribute to each ID mapping in the database, so that old (e.g. >3 months) ID mappings in the local database can be refreshed (only for the IDs actually used in a search).
- Even if we do not maintain an ID mapping database on the server, it is still good to maintain a local database/file containing the ID mappings the user has used, so that the user does not need to get them from the web service every time. This may be especially desirable once a tool for mapping IDs is implemented.
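The time attribute idea can be sketched as a cached mapping that remembers when it was retrieved and can report whether it is due for a refresh. `CachedMapping` is a hypothetical name, and the maximum age is a parameter; 90 days is used in the usage note only as a stand-in for the "3 months" mentioned above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical cached mapping with a retrieval timestamp.
class CachedMapping {
    final String idSrc;
    final String idDst;
    final Instant retrievedAt;

    CachedMapping(String idSrc, String idDst, Instant retrievedAt) {
        this.idSrc = idSrc;
        this.idDst = idDst;
        this.retrievedAt = retrievedAt;
    }

    // True if the mapping is older than 'maxAge' at time 'now'
    // and should be fetched again from the remote source.
    boolean isStale(Instant now, Duration maxAge) {
        return retrievedAt.plus(maxAge).isBefore(now);
    }
}
```

A lookup would call `isStale(Instant.now(), Duration.ofDays(90))` on each local hit and treat stale hits like misses, so only the IDs actually used get refreshed.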

Comments

KeiichiroOno (2008-04-22):
1. Separate Derby out as an embedded database plugin and make it accessible from other parts of Cytoscape (for flexibility and reuse).
2. Cache the ID sets retrieved from web services and, from the next query on, try the local DB first for performance.
3. Design some GUI mockups and check feasibility.
4. For the local database, we may need to start with 3 big tables with the following primary keys: NCBI Gene ID, Ensembl Gene ID, and UniProt unified accession number. These three cover a lot of objects in biological databases.

How to Comment

Edit the page and add your comments under the provided header. By adding your ideas to the Wiki directly, we can more easily organize everyone's ideas, and keep clear records. Be sure to include today's date and your name for each comment. Try to keep your comments as concrete and constructive as possible. For example, if you find a part of the RFC makes no sense, please say so, but don't stop there. Take the extra step and propose alternatives.