Advanced_Network_Merge_and_ID_Mapping

RFC Name : Advanced Network Merge and ID Mapping API

Editor(s): Jianjiong Gao, Keiichiro Ono, Thomas Kelder, Martijn van Iersel, Gary Bader

Date: 2008-05-02

Status: Under Construction

Proposal

Cytoscape already provides a Merge Networks plugin, which lets the user compute the union, intersection and difference of networks based on node identifiers (IDs). However, nodes with different IDs may actually have the same biological meaning (e.g. the same protein or gene), especially when they come from different sources.

In this project, the Merge Networks feature will be enhanced. Nodes will be matched according to ID mappings, i.e. nodes whose IDs are mapped to each other are treated as the same node. The basic idea is as follows:

Get ID mappings from a user-supplied file, a local database or a web service;
Merge the nodes according to the ID mappings;
Let the user resolve attribute conflicts when they occur.

Use Cases

1. Get ID mappings

The user can choose among the following ID mapping sources:
- Choose an attribute as ID to match.
- Provide a custom mapping in a text file. (The IDs in each row match, i.e. they are biologically the same.)
- Get ID mappings from a relational database.
  - [http://www.pathvisio.org/Help#Supported_database_systems PathVisio/WikiPathways synonym database]
  - [http://genmapp.org/help_v2/GeneDatabase.htm GenMAPP gene database]
- Get ID mappings from a web service
  - [http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html NCBI web service]
  - [http://www.ebi.ac.uk/uniprot/remotingAPI/index.html UniProt API]
  - [http://www.ebi.ac.uk/Tools/webservices/ Ensembl web service]
- Get ID mapping from a local database
  - [http://db.apache.org/derby/ Derby database] for ID mappings
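As an illustration of the custom mapping file option, here is a minimal Python sketch of reading such a file. The RFC only says that the IDs in each row match; the tab delimiter, the function name `read_mapping_rows` and the sample rows are assumptions made for this example, not part of the proposal.

```python
import io

def read_mapping_rows(lines, delimiter="\t"):
    """Return one set of equivalent IDs per non-empty row.

    Assumption: IDs in a row are separated by tabs; the RFC does not
    specify a delimiter.
    """
    groups = []
    for line in lines:
        ids = {field.strip() for field in line.split(delimiter) if field.strip()}
        if ids:  # skip blank rows
            groups.append(ids)
    return groups

# Hypothetical file content: each row lists IDs treated as the same entity.
sample_file = io.StringIO("P04637\t7157\tENSG00000141510\nP38398\t672\n\n")
groups = read_mapping_rows(sample_file)
```

Each returned set can then be handed to the merge step as one group of equivalent IDs.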
Ask the user to select, from a list of ID types, which ID types are used in the networks.
- The list of ID types may differ between databases and web services. For example, if the user chooses the GenMAPP gene database, he/she can select a few of the 19 supported ID types.
- Alternatively, following GaryBader's suggestion, we should limit the supported ID types in order to minimize the chance of errors. To that end, the list should offer only the most commonly used ID types (e.g. Entrez Gene, RefSeq, UniProt and some others), whichever web service or database the user has chosen.
- If the user is not sure about what ID types are used in the networks, he/she can choose all.
When retrieving data from web services, try the local DB first for performance, and cache the retrieved ID sets so that the next query can use them.
After retrieving the data, report to the user how many IDs were found, how many were not found and how many are ambiguous (the same ID exists in more than one ID type, which should be rare). For the ambiguous ones, ask the user to decide. (This is similar to what [http://david.abcc.ncifcrf.gov/ DAVID] does.)
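The report described above can be sketched as a simple classification. This is only an illustration: the function name `classify_ids` and the shape of `id_types` (a dict from each ID to the set of ID types in which it was found) are assumptions, not an agreed data model.

```python
def classify_ids(node_ids, id_types):
    """Split node IDs into found, missing and ambiguous lists.

    id_types maps an ID to the set of ID types in which the mapping
    source knows it (hypothetical structure for this sketch).
    """
    found, missing, ambiguous = [], [], []
    for node_id in node_ids:
        types = id_types.get(node_id, set())
        if not types:
            missing.append(node_id)      # not known to the mapping source
        elif len(types) == 1:
            found.append(node_id)        # exactly one ID type matches
        else:
            ambiguous.append(node_id)    # same ID exists in several types
    return found, missing, ambiguous

# Made-up lookup result for three node IDs.
lookup = {"7157": {"Entrez Gene"}, "1234": {"Entrez Gene", "OtherType"}}
found, missing, ambiguous = classify_ids(["7157", "1234", "unknown1"], lookup)
```

The lengths of the three lists give the counts to show the user, and the ambiguous list is what the user would be asked to decide on.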

2. Merge the networks according to the ID mappings

Nodes whose IDs cannot be found in the ID mappings are merged only when their IDs are identical.
Nodes whose IDs are found in the ID mappings are merged according to the ID mappings.
In both cases, attribute conflicts that arise when merging nodes need to be resolved as described in 3.
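One possible way to realise this step is to group node IDs into equivalence classes with a union-find structure. The sketch below is an assumption about how the merge could work, not the plugin's existing code; `merge_by_mapping` and its arguments are hypothetical names. IDs that appear in no mapping group stay in their own class, so they merge only with identical IDs.

```python
def merge_by_mapping(node_ids, mapping_groups):
    """Group node IDs from all networks into sets of equivalent nodes."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # IDs in the same mapping group are equivalent, even if some of
    # them do not occur in any network.
    for group in mapping_groups:
        ids = sorted(group)
        for other in ids[1:]:
            union(ids[0], other)

    merged = {}
    for node_id in node_ids:
        merged.setdefault(find(node_id), set()).add(node_id)
    return sorted(sorted(members) for members in merged.values())

# Node IDs of two hypothetical networks, concatenated.
all_nodes = ["P1", "G1", "X9", "P2"] + ["G1", "E2", "X9"]
merged_nodes = merge_by_mapping(all_nodes, [{"P1", "G1"}, {"P2", "E2"}])
```

Here `X9` has no mapping and is merged purely because both networks use the same ID, while `P1`/`G1` and `P2`/`E2` are merged through the mappings.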

3. Deal with attribute conflicts

Let the user rank the networks to be merged by priority; when conflicts occur, use the value from the network with the higher priority.
Let the user choose which attribute value should be used for each node. After each choice, prompt “apply to all?” with a “don’t ask again” check box.
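The priority-based option can be sketched as follows. This is a hedged illustration only: `merge_attributes`, the network names and the attribute dictionaries are invented for the example. The function also returns the names of the conflicting attributes, which is the information an interactive dialog would need.

```python
def merge_attributes(attrs_by_network, priority):
    """Merge the attributes of one merged node.

    attrs_by_network: network name -> {attribute name: value}
    priority: network names, highest priority first
    Returns (merged attributes, names of attributes that conflicted).
    """
    merged, conflicts = {}, []
    for network in priority:
        for name, value in attrs_by_network.get(network, {}).items():
            if name not in merged:
                merged[name] = value          # first (highest priority) wins
            elif merged[name] != value and name not in conflicts:
                conflicts.append(name)        # remember for user review
    return merged, conflicts

# Hypothetical attributes of the same node in two networks.
node_attrs = {
    "netA": {"name": "TP53", "score": 0.9},
    "netB": {"name": "p53", "species": "human"},
}
merged_attrs, conflict_names = merge_attributes(node_attrs, ["netB", "netA"])
```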

Implementation Plan

Modules

This project can be divided into the following parts: (Ref: [http://web.missouri.edu/~jg722/GSoC/networkmerge.pdf diagram] by KeiOno)

General-purpose embedded DB plugin
- Wrap Derby with an interface for Cytoscape. The interface should be generic rather than specific to ID mapping, because the embedded DB may be used by other plugins or by the core in the future.
Web Service Clients development
- KeiOno has implemented a framework to manage web service clients in Cytoscape 2.6
  - Some of the web service clients (e.g. NCBI, BioMart, and PICR) could be used in this project (Ref: SampleWebServiceClients)
  - KeiOno has also prototyped a client to use UniProt API
  - Add additional clients (e.g. Ensembl web service) to the framework
ID Mapper Core
- Implement a generic interface to provide ID mappings, which integrates
  - Custom ID mapping files
  - Embedded DB
  - Web services
Build a new Network Merge plugin
- New GUI for the plugin
- Utilize the old code for merging nodes
- Deal with the conflicts
(Optional) ID mapping plugin
- Implement a UI to use ID Mapper module alone

Discussion/Further investigation

How to design the public interface for ID mapping function? How should Derby DB, Web service clients and ID mapper interact with each other?
- Better to keep the ID mapping interface simple: it just takes original ID, original ID type, destination ID type as parameters and returns destination ID.
- Implementation of the ID mapping interface should take the ID mapping sources (Derby DB, custom file and web service) into consideration
  - If the user chooses Derby DB to get ID mapping, just search local Derby DB and return the result.
  - If the user chooses custom file or web service,
    - Try to search the data in Derby first
    - For IDs not found in Derby, try custom file or web service
    - Cache the ID mappings retrieved from the web service in Derby (merging data in Derby may be needed)
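The lookup order discussed above (local store first, then the slower source, then cache the result) can be sketched like this. Everything here is hypothetical: `CachingIdMapper` is not an existing Cytoscape class, a plain dict stands in for the Derby DB, and `fake_web_service` stands in for a real web service client.

```python
class CachingIdMapper:
    """Maps (original ID, original type, destination type) to a destination ID."""

    def __init__(self, remote_lookup):
        self._local = {}               # stand-in for the embedded Derby DB
        self._remote = remote_lookup   # stand-in for a web service client
        self.remote_calls = 0

    def map_id(self, src_id, src_type, dst_type):
        key = (src_id, src_type, dst_type)
        if key in self._local:         # 1. try the local store first
            return self._local[key]
        self.remote_calls += 1         # 2. fall back to the remote source
        result = self._remote(src_id, src_type, dst_type)
        if result is not None:
            self._local[key] = result  # 3. cache for the next query
        return result

def fake_web_service(src_id, src_type, dst_type):
    # Made-up lookup table used only for this example.
    table = {("P04637", "UniProt", "Entrez Gene"): "7157"}
    return table.get((src_id, src_type, dst_type))

mapper = CachingIdMapper(fake_web_service)
first = mapper.map_id("P04637", "UniProt", "Entrez Gene")
second = mapper.map_id("P04637", "UniProt", "Entrez Gene")  # served locally
```

The second call does not reach the remote source. This sketch returns a single destination ID, as in the simple interface proposed above; a one-to-many mapping would need a different return type.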
How to design the DB tables for ID mapping? How to store caching data?
- Three tables similar to those in PathVisio: DataNode(ID, Type), Link(ID1,ID2), Info(Version).
- Separate tables for caching data or additional fields in DataNode and Link?
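To make the three-table idea concrete, here is a small sketch that uses Python's built-in SQLite module in place of Derby. The table and column names follow the list above; the column types, the primary keys, the sample rows, the version string and the `mapped_ids` helper are assumptions for illustration, and the real PathVisio schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite used here instead of Derby
conn.executescript("""
CREATE TABLE DataNode (ID TEXT NOT NULL, Type TEXT NOT NULL, PRIMARY KEY (ID, Type));
CREATE TABLE Link (ID1 TEXT NOT NULL, ID2 TEXT NOT NULL, PRIMARY KEY (ID1, ID2));
CREATE TABLE Info (Version TEXT NOT NULL);
""")
conn.executemany("INSERT INTO DataNode VALUES (?, ?)", [
    ("ENSG00000141510", "Ensembl"),
    ("P04637", "UniProt"),
    ("7157", "Entrez Gene"),
])
# Assumption: links are stored in one direction, from ID1 to ID2.
conn.executemany("INSERT INTO Link VALUES (?, ?)", [
    ("ENSG00000141510", "P04637"),
    ("ENSG00000141510", "7157"),
])
conn.execute("INSERT INTO Info VALUES (?)", ("example-1",))

def mapped_ids(conn, src_id, dst_type):
    """Return the IDs of dst_type that are linked from src_id."""
    rows = conn.execute(
        "SELECT n.ID FROM Link l JOIN DataNode n ON n.ID = l.ID2 "
        "WHERE l.ID1 = ? AND n.Type = ? ORDER BY n.ID",
        (src_id, dst_type),
    ).fetchall()
    return [row[0] for row in rows]

uniprot_ids = mapped_ids(conn, "ENSG00000141510", "UniProt")
```

Because Link stores only IDs, an ID that exists under two types cannot be told apart in the join; that is one of the cases the open questions above would have to settle.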
Should we implement validations to detect mistakes in the ID mappings (raised by GaryBader)? How? This needs further investigation.
- (According to GaryBader) It would be good to have a validation and cleaning system that implements a number of validation rules to detect mistakes in the input ID mapping files. Bad ID mappings downloaded from external sources can cause errors in the network merge, which will degrade the data. For instance, if you download UniProt and extract all of the ID mappings, you will find that some of the supposedly equivalent IDs actually point to proteins from different species. So organism, molecule type (RNA, DNA, protein, gene), etc. should be part of the system and associated with each ID to enable better validation.
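One such validation rule, flagging groups of supposedly equivalent IDs that span more than one organism, could look like the sketch below. The function name, the `organism_of` lookup and the sample IDs are hypothetical; how organism information would actually be stored is still open.

```python
def find_cross_species_groups(mapping_groups, organism_of):
    """Return mapping groups whose IDs belong to more than one organism.

    organism_of maps an ID to its organism; IDs without an entry are
    ignored rather than treated as errors (an assumption of this sketch).
    """
    suspicious = []
    for group in mapping_groups:
        organisms = {organism_of[i] for i in group if i in organism_of}
        if len(organisms) > 1:
            suspicious.append(sorted(group))
    return suspicious

# Made-up IDs and organisms for illustration.
organisms = {"A1": "human", "A2": "human", "B1": "human", "B2": "mouse"}
flagged = find_cross_species_groups([{"A1", "A2"}, {"B1", "B2"}], organisms)
```

A similar check could compare molecule types within a group once that information is associated with each ID.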
For ID mappings, should we limit the types of IDs supported?
- According to GaryBader, yes, in order to minimize the chance of errors. We could start with a few of the most common ID types (e.g. Entrez Gene, RefSeq, UniProt), and then extend the list gradually.
Should each network be required to contain only one type of ID?
- (According to ThomasKelder) Maybe there are also different options here, depending on the network type. Sometimes the ID type is fixed for an attribute (e.g. an attribute called 'uniprot' typically only contains UniProt IDs), but sometimes the type is defined in another attribute (e.g. an attribute 'ID' and 'system') or concatenated to the ID (e.g. 'Uniprot:U1234'). It will be hard to cover all these options, so some investigation is needed into which situations occur most often.
- KeiOno: Ideally, each network contains only one ID set. However, in some cases this is impossible, since one ID set does not contain all of the macromolecules and small molecules (see this example from IntAct. Node IDs are UniProt unified acc. number, but it is not available for small molecules). Just like IntAct, it is probably OK to use a DBName:ID style ID if the object does not exist in the target ID set. In that case, the DBName:ID should be taken from the primary source of the database, i.e., if the object only exists in KEGG and not in UniProt, it should be KEGG:xxxxx
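For the DBName:ID convention mentioned in both comments, a minimal parsing sketch might look like this. The function name `split_prefixed_id` and the fallback to a per-attribute default type are assumptions for illustration.

```python
def split_prefixed_id(raw_id, default_type=None):
    """Split 'DBName:ID' into (DBName, ID).

    IDs without a prefix are returned with default_type, e.g. the type
    implied by the attribute the ID was read from.
    """
    prefix, separator, rest = raw_id.partition(":")
    if separator and prefix and rest:
        return prefix, rest
    return default_type, raw_id

prefixed = split_prefixed_id("Uniprot:U1234")
plain = split_prefixed_id("P04637", default_type="UniProt")
```

IDs that themselves contain a colon would need extra handling, for example by checking the prefix against a list of known database names.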

Resource

KeiOno has implemented a new framework to manage web service clients in Cytoscape, some of which (possibly BioMart, NCBI, Uniprot and PICR) can be used here. (Ref: SampleWebServiceClients)
GaryBader has identified 5 use cases for ID mapping listed at http://baderlab.org/IDMapping. Use cases 1 and 3 seem to be the priority for Cytoscape, but only use case 1 for network merge. On the web page, he also listed a number of ID mapping sources.
ScriptingPlugins can be used to test web services before writing actual Java code. The Ruby plugin has SOAP utilities and BioRuby inside.
[http://www.netbeans.org/kb/55/websvc-jax-ws-asynch.html NetBeans IDE]: a nice tool for mocking up GUIs. It has a visual GUI editor for Swing. It also has tools to develop web service clients easily from the GUI.
[http://www.pathvisio.org/ PathVisio] synonym databases store the mappings from Ensembl using Derby database ( [http://ftp2.bigcat.unimaas.nl/~martijn.vaniersel/pathvisio/daily/javadoc package] & [http://svn.bigcat.unimaas.nl/pathvisio/trunk/src/core/org/pathvisio/data src]--GDB classes in the org.pathvisio.data)

Project Management

Project Timeline

Provide a timeline for implementation. Insert a graphic if you can. Try this free online tool for making project timelines -> [http://www.helpuplan.com/index.asp Help-u-Plan] (create a new chart; modify; right-click to save gif; then attach to this page)

Tasks and Milestones

Outline the major milestones and tasks involved in implementation.

Milestone 1: …
1. Task 1: ...
2. Task 2: ...
Milestone 2: …

Project Dependencies

Another GSoC project, taken on by ArmanAksoy, is also about Advanced Network Merge. His approach is different: he proposes to merge nodes according to a similarity score rather than ID mappings. However, the two projects can share a common mechanism for the GUI and for resolving attribute conflicts.

Related RFCs

Link to other related RFCs

Issues

List any issues, conflict, or dependencies raised by this proposal

Comments

Add comment here…
KeiOno (2008-04-22):
1. Separate Derby out as an embedded database plugin and make it accessible from other parts of Cytoscape (for flexibility and reuse).
2. Cache the ID sets retrieved from web services, and from the next query on, try the local DB first for performance.
3. Design some GUI mockups and check their feasibility.
4. For the local database, we may need to start with 3 big tables with the following primary keys: NCBI Gene ID, Ensembl Gene ID, and UniProt unified accession number. These three cover a lot of objects in biological databases.

How to Comment

Edit the page and add your comments under the provided header. By adding your ideas to the Wiki directly, we can more easily organize everyone's ideas, and keep clear records. Be sure to include today's date and your name for each comment. Try to keep your comments as concrete and constructive as possible. For example, if you find a part of the RFC makes no sense, please say so, but don't stop there. Take the extra step and propose alternatives.