Diff for "Advanced_Network_Merge_and_ID_Mapping"

Differences between revisions 1 and 2

RFC Name : Advanced Network Merge and ID Mapping API

Editor(s): Jianjiong Gao, Keiichiro Ono, Thomas Kelder, Martijn van Iersel, Gary Bader

Date: 2008-05-02

Status: Under Construction

Proposal

Cytoscape already provides a Merge Networks plugin, which lets the user compute the union, intersection and difference of networks based on node identifiers (IDs). However, nodes with different IDs may actually have the same biological meaning (e.g. the same protein/gene), especially when they come from different sources. This project will enhance the Merge Networks feature so that nodes are matched according to ID mappings, i.e. nodes whose IDs are mapped to each other are considered the same node.

Use Cases

5 related use cases have been identified on [http://baderlab.org/IdentifierMapping Bader Lab ID Mapping page]. 2 of them are closely related to this project:

- Unification during dataset merging: During a merge operation, e.g. of two protein-protein interaction datasets from independently created databases, it is vital to recognize that two protein objects, one from each data source, represent the same protein molecule, even if the protein objects don’t share any database accession numbers. Unification requires knowledge of the record type, e.g. you cannot reliably use a gene ID to unify proteins (mostly because splice variants exist).
- Identifier translation: Some analysis methods require specific translations from one set of identifiers to another. For instance, our 'activity centers' analysis requires translation from protein or gene identifiers in a pathway database to Affymetrix probe set identifiers or other gene expression array platform identifiers.

Implementation Plan

Workflows

1. Get ID mappings
- The user chooses one of the following ID mapping sources:
  - Choose an attribute to use as the ID for matching.
  - Provide a custom mapping in a text file. (The IDs in each row match, i.e. they are biologically the same.)
  - Get ID mappings from a web service.
  - Get ID mappings from a relational database.
  - Get ID mappings from a local database (e.g. an embedded Derby database).
- Ask the user to select, from a list of ID types, which ID types are used in the networks.
  - The list of ID types may differ between databases and web services. For example, if the user chooses the GenMAPP gene database, he/she can select a few of the 19 supported ID types.
  - Alternatively, following GaryBader's suggestion, we should limit the supported ID types in order to minimize the chance of errors. For that purpose, the list would offer only the most commonly used ID types (e.g. Entrez Gene, RefSeq, UniProt and some others), no matter which web service/database the user has chosen.
  - If the user is not sure which ID types are used in the networks, he/she can choose all of them.
- After retrieving the data, report to the user how many IDs were found, how many were not, and how many are ambiguous (the same ID existing in different types, which should be rare). For the ambiguous ones, ask the user to decide. (This is similar to what [http://david.abcc.ncifcrf.gov/ DAVID] does.)
2. Merge the networks according to the ID mappings
- Nodes whose IDs cannot be found in the ID mappings are merged according to their IDs.
- Nodes whose IDs are found in the ID mappings are merged according to the ID mappings.
- When merging nodes in either case, attribute conflicts need to be addressed as in step 3.
3. Deal with attribute conflicts
- Let the user rank the networks to be merged by priority; when conflicts occur, follow the network with the higher priority.
- Let the user choose which attribute value should be used for each node. After each choice, pop up a dialog asking “apply to all?” with a “don’t ask again” check box.
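
The merge and conflict-handling steps above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the Cytoscape API: the class `MergeSketch`, its `merge` method, and the map-based representation of networks and nodes are all hypothetical. It assumes the ID mapping has already been reduced to a map from each node ID to one canonical ID, and it resolves attribute conflicts by network priority only (the interactive per-node choice is not modeled).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: merge nodes across networks using an ID mapping. */
final class MergeSketch {

    /**
     * @param networks  networks in priority order (highest priority first);
     *                  each maps node ID -> (attribute name -> value)
     * @param canonical ID mapping: node ID -> canonical ID shared by equivalent IDs
     * @return merged nodes keyed by canonical ID
     */
    static Map<String, Map<String, String>> merge(
            List<Map<String, Map<String, String>>> networks,
            Map<String, String> canonical) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        for (Map<String, Map<String, String>> network : networks) {
            for (Map.Entry<String, Map<String, String>> node : network.entrySet()) {
                // IDs missing from the mapping fall back to plain ID matching.
                String key = canonical.getOrDefault(node.getKey(), node.getKey());
                Map<String, String> target =
                        merged.computeIfAbsent(key, k -> new LinkedHashMap<>());
                for (Map.Entry<String, String> attr : node.getValue().entrySet()) {
                    // On conflict, keep the value from the higher-priority network,
                    // which was processed earlier.
                    target.putIfAbsent(attr.getKey(), attr.getValue());
                }
            }
        }
        return merged;
    }
}
```

Processing the networks in priority order and using `putIfAbsent` means a lower-priority network can only add attributes that are still missing; it never overwrites an existing value.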

Modules

This project can be divided into the following parts: (Ref: [http://web.missouri.edu/~jg722/GSoC/networkmerge.pdf diagram] by KeiichiroOno)

General-purpose embedded DB plugin
- Wrap Derby with an interface for Cytoscape. The interface should be generic rather than specific to ID mapping, because the embedded DB may be used by other plugins or by the core in the future.
Web Service Clients development
- KeiichiroOno has implemented a framework to manage web service clients in Cytoscape 2.6
  - Some of the web service clients (e.g. NCBI, BioMart, and PICR) could be used in this project (Ref: SampleWebServiceClients)
  - KeiichiroOno has also prototyped a client for the UniProt API
  - Add additional clients (e.g. Ensembl web service) to the framework
ID Mapper Core
- Implement a generic interface to provide ID mappings, which integrates
  - Custom ID mapping files
  - Embedded DB
  - Web services
Build a new Network Merge plugin
- New GUI for the plugin
- Reuse the existing code for merging nodes
- Deal with the conflicts
ID mapping plugin
- Implement a UI that makes the ID Mapper module usable on its own
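
As an illustration of the custom ID mapping file that the ID Mapper Core would integrate, the sketch below turns rows of equivalent IDs into a lookup map. The RFC does not specify the file format beyond "IDs in each row match", so the tab separator, the class name `CustomMappingFile`, and the choice of the first ID in a row as the canonical ID are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: turn custom mapping rows into an ID -> canonical ID map. */
final class CustomMappingFile {

    /**
     * Assumes one group of equivalent IDs per line, separated by tabs.
     * The first non-empty ID of a line serves as the canonical ID for that line.
     * Overlapping groups on different lines are NOT merged in this sketch;
     * an ID seen earlier keeps its first canonical ID.
     */
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> canonical = new LinkedHashMap<>();
        for (String line : lines) {
            String first = null;
            for (String id : line.split("\t")) {
                String trimmed = id.trim();
                if (trimmed.isEmpty()) {
                    continue; // skip empty cells and blank lines
                }
                if (first == null) {
                    first = trimmed;
                }
                canonical.putIfAbsent(trimmed, first);
            }
        }
        return canonical;
    }
}
```

The lines could come from `java.nio.file.Files.readAllLines(path)`. A fuller implementation would probably need to merge rows that share an ID, for example with a union-find structure.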

Discussion/Further investigation

How to design the public interface for ID mapping function? How should Derby DB, Web service clients and ID mapper interact with each other?
- It is better to keep the ID mapping interface simple: it takes the original ID, the original ID type and the destination ID type as parameters and returns the destination ID.
- Implementation of the ID mapping interface should take the ID mapping sources (Derby DB, custom file and web service) into consideration
  - If the user chooses Derby DB to get ID mapping, just search local Derby DB and return the result.
  - If the user chooses custom file or web service,
    - Try to search the data in Derby first
    - For IDs not found in Derby, try custom file or web service
    - Cache the ID mappings retrieved from the web service in Derby (merging data in Derby may be needed)
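
A minimal sketch of the interface and lookup order discussed above, under stated assumptions: the names `IdMapper` and `CachingIdMapper` are hypothetical, the "Derby" cache is replaced by an in-memory map so the example stays self-contained, and a `null` return value stands for "no mapping found".

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical mapping interface: source ID + source type + target type -> target ID. */
interface IdMapper {
    /** Returns the mapped ID, or null if no mapping is known. */
    String map(String sourceId, String sourceType, String targetType);
}

/** Tries a local store first, then a slower source, caching whatever it finds. */
final class CachingIdMapper implements IdMapper {
    // In-memory stand-in for the embedded Derby cache.
    private final Map<String, String> localCache = new HashMap<>();
    // For example a custom file reader or a web service client.
    private final IdMapper fallback;

    CachingIdMapper(IdMapper fallback) {
        this.fallback = fallback;
    }

    @Override
    public String map(String sourceId, String sourceType, String targetType) {
        // Simplistic composite key; a real table would use separate columns.
        String key = sourceType + "|" + sourceId + "|" + targetType;
        String cached = localCache.get(key);
        if (cached != null) {
            return cached; // found locally, no call to the fallback needed
        }
        String result = fallback.map(sourceId, sourceType, targetType);
        if (result != null) {
            localCache.put(key, result); // remember for the next query
        }
        return result;
    }
}
```

Because `CachingIdMapper` itself implements `IdMapper`, callers such as the merge plugin would not need to know which source answered the query.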
How to design the DB tables for ID mapping? How to store caching data?
- Three tables similar to those in PathVisio: DataNode(ID, Type), Link(ID1, ID2), Info(Version).
- Separate tables for caching data or additional fields in DataNode and Link?
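
To make the proposed layout concrete, here is a hedged in-memory sketch of how a lookup over DataNode(ID, Type) and Link(ID1, ID2) could work. It is modeled only on the table names listed above, not on PathVisio's actual schema; the class `MappingTables` and its methods are hypothetical, the Info(Version) table is left out, and keying DataNode by ID alone assumes an ID never appears under two types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for the DataNode and Link tables. */
final class MappingTables {
    private final Map<String, String> dataNode = new HashMap<>(); // DataNode(ID, Type)
    private final List<String[]> link = new ArrayList<>();        // Link(ID1, ID2)

    void addNode(String id, String type) {
        dataNode.put(id, type);
    }

    void addLink(String id1, String id2) {
        link.add(new String[] {id1, id2});
    }

    /** IDs of the requested type that are linked to the given ID in either direction. */
    List<String> lookup(String id, String targetType) {
        List<String> result = new ArrayList<>();
        for (String[] row : link) {
            String other = null;
            if (row[0].equals(id)) {
                other = row[1];
            } else if (row[1].equals(id)) {
                other = row[0];
            }
            if (other != null && targetType.equals(dataNode.get(other))) {
                result.add(other);
            }
        }
        return result;
    }
}
```

In a real database the loop in `lookup` would correspond to a join between Link and DataNode filtered by type.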
Should we implement validation rules to detect mistakes in the ID mappings (as suggested by GaryBader)? How? This needs further investigation.
- (According to GaryBader) It would be good to have a validation and cleaning system that implements a number of validation rules to detect mistakes in the input ID mapping files. Bad ID mappings downloaded from external sources can cause errors in the network merge, which will degrade the data. For instance, if you download UniProt and extract all of the ID mappings, you will find that some of the supposedly equivalent IDs actually point to proteins from different species. So organism, molecule type (RNA, DNA, protein, gene), etc. should be part of the system and associated with each ID to enable better validation.
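
One way such validation rules could look, as a sketch only: each ID carries its organism and molecule type, and a proposed equivalence between two IDs is checked against both. The classes `AnnotatedId` and `MappingValidator` are hypothetical. The molecule-type rule fits the unification use case; for identifier translation (e.g. gene to protein) it would have to be relaxed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical annotated ID: the ID plus the metadata used for validation. */
final class AnnotatedId {
    final String id;
    final String organism;     // e.g. an organism name or taxonomy ID
    final String moleculeType; // e.g. "protein", "gene", "rna", "dna"

    AnnotatedId(String id, String organism, String moleculeType) {
        this.id = id;
        this.organism = organism;
        this.moleculeType = moleculeType;
    }
}

final class MappingValidator {
    /** Returns one message per rule violated by a proposed "a equals b" mapping. */
    static List<String> validate(AnnotatedId a, AnnotatedId b) {
        List<String> problems = new ArrayList<>();
        if (!a.organism.equals(b.organism)) {
            problems.add(a.id + " and " + b.id + " belong to different organisms");
        }
        if (!a.moleculeType.equals(b.moleculeType)) {
            problems.add(a.id + " and " + b.id + " have different molecule types");
        }
        return problems;
    }
}
```

An empty result means the pair passed both rules; mappings with problems could be dropped or shown to the user before the merge.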
For ID mappings, should we limit the types of IDs supported?
- According to GaryBader, yes, in order to minimize the chance of errors. We could start with the most common ID types (e.g. Entrez Gene, RefSeq, UniProt) and then extend the list gradually.
Should it be required that one network contains only one type of ID?
- (According to ThomasKelder) There may be different options here, depending on the network type. Sometimes the ID type is fixed for an attribute (e.g. an attribute called 'uniprot' typically contains only UniProt IDs), but sometimes the type is defined in another attribute (e.g. attributes 'ID' and 'system') or concatenated to the ID (e.g. 'Uniprot:U1234'). It will be hard to cover all of these options, so some investigation is needed into which situations occur most often.
- KeiichiroOno: Ideally, each network contains only one ID set. However, in some cases this is impossible, since one ID set does not contain all of the macromolecules and small molecules (see this example from IntAct: node IDs are UniProt unified accession numbers, which are not available for small molecules). As in IntAct, it is probably OK to use a DBName:ID style ID if the object does not exist in the target ID set. In that case, the DBName:ID should be taken from the object's primary source database, i.e. if the object exists only in KEGG and not in UniProt, it should be KEGG:xxxxx
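
The DBName:ID convention mentioned in both comments could be handled with a small parser like the one below. This is a sketch with a hypothetical class name, `QualifiedId`; it splits on the first colon and treats anything without a usable prefix as a bare ID. IDs that themselves contain a colon would need extra handling.

```java
/** Hypothetical parser for IDs written in the "DBName:ID" style. */
final class QualifiedId {
    final String database; // null when the ID carries no database prefix
    final String id;

    private QualifiedId(String database, String id) {
        this.database = database;
        this.id = id;
    }

    static QualifiedId parse(String raw) {
        int colon = raw.indexOf(':');
        // No colon, a leading colon, or a trailing colon: treat as a bare ID.
        if (colon <= 0 || colon == raw.length() - 1) {
            return new QualifiedId(null, raw);
        }
        return new QualifiedId(raw.substring(0, colon), raw.substring(colon + 1));
    }
}
```

For a bare ID, the caller would fall back to the network's default ID type, for example the type the user selected in the workflow.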

References/Resources

Related work

SampleWebServiceClients by KeiichiroOno
5 related use cases identified on [http://baderlab.org/IdentifierMapping Bader Lab ID Mapping page]
[http://www.pathvisio.org/ PathVisio] synonym databases store the mappings from Ensembl using Derby database ( [http://ftp2.bigcat.unimaas.nl/~martijn.vaniersel/pathvisio/daily/javadoc package] & [http://svn.bigcat.unimaas.nl/pathvisio/trunk/src/core/org/pathvisio/data src]--GDB classes in the org.pathvisio.data)

Web services and Relational DB

[http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html NCBI web service]
[http://www.ebi.ac.uk/uniprot/remotingAPI/index.html UniProt API]
[http://www.ebi.ac.uk/Tools/webservices/ Ensembl web service]
[http://www.biomart.org/martservice.html Biomart]
[http://www.pathvisio.org/Help#Supported_database_systems PathVisio/WikiPathways synonym database]
[http://genmapp.org/help_v2/GeneDatabase.htm GenMAPP gene database]

Tools

[http://db.apache.org/derby/ Derby database]
ScriptingPlugins can be used to test web services before writing the actual Java code. The Ruby plugin includes SOAP utilities and BioRuby.
[http://www.netbeans.org/kb/55/websvc-jax-ws-asynch.html NetBeans IDE]: a nice tool for mocking up GUIs. It has a visual GUI editor for Swing, as well as tools for developing web service clients easily from the GUI.

Project Management

Project Timeline

Provide a timeline for implementation. Insert a graphic if you can. Try this free online tool for making project timelines -> [http://www.helpuplan.com/index.asp Help-u-Plan] (create a new chart; modify; right-click to save gif; then attach to this page)

Tasks and Milestones

Outline the major milestones and tasks involved in implementation.

Milestone 1: …
1. Task 1: ...
2. Task 2: ...
Milestone 2: …

Project Dependencies

Another GSoC project, taken by ArmanAksoy, also addresses Advanced Network Merge. His approach is different: he proposes to merge nodes according to a similarity score rather than ID mappings. However, the two projects can share a common mechanism for the GUI and for resolving attribute conflicts.

Related RFCs

[:WebServicesIDMapping:RFC 29]: Web services API for ID mapping/translator service
[:BioWebServiceConnectivity:RFC 45]: Web Services Client Manager and Unified Network/Attribute Import Mechanism

Issues

List any issues, conflict, or dependencies raised by this proposal

Comments

KeiichiroOno (2008-04-22):
1. Separate Derby into an embedded database plugin and make it accessible from other parts of Cytoscape (for flexibility and reuse).
2. Cache retrieved ID sets from web services and, from the next query on, try the local DB first for performance.
3. Design some GUI mockups and check feasibility.
4. For the local database, we may need to start with 3 big tables with the following primary keys: NCBI Gene ID, Ensembl Gene ID, and UniProt unified accession number. These three cover a lot of objects in biological databases.
Add comment here…

How to Comment

Edit the page and add your comments under the provided header. By adding your ideas to the Wiki directly, we can more easily organize everyone's ideas, and keep clear records. Be sure to include today's date and your name for each comment. Try to keep your comments as concrete and constructive as possible. For example, if you find a part of the RFC makes no sense, please say so, but don't stop there. Take the extra step and propose alternatives.

-  ← Revision 1 as of 2008-05-03 05:36:38 →
  Size: 11791
  Editor: asp
  Comment:
+  ← Revision 2 as of 2008-05-03 16:28:34 →
  Size: 12440
  Editor: asp
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 9:
-Cytocape has implemented a plugin of Merge Networks, which allows the user to find the union, intersection and difference of networks based on node identifiers/IDs. However, nodes with different IDs may actually have the same biological meaning (e.g. the same protein/gene), especially when they come from different sources. 



In this project, the Merge Networks feature will be enhanced. Nodes will be matched according to ID mappings, i.e. the nodes whose IDs are mapped to each other are considered as the same node. The basic idea is as follows: 



 1. Get ID mappings by a user-custom file, a local database or a web service; 

 1. Merge the nodes according the ID mappings; 

 1. Let the user decide attributes conflicts if occurring.
+Cytocape has implemented a plugin of Merge Networks, which allows the user to find the union, intersection and difference of networks based on node identifiers/IDs. However, nodes with different IDs may actually have the same biological meaning (e.g. the same protein/gene), especially when they come from different sources. In this project, the Merge Networks feature will be enhanced. Nodes will be matched according to ID mappings, i.e. the nodes whose IDs are mapped to each other are considered as the same node.
-Line 18:
+Line 12:
-=== 1. Get ID mappings ===



 * Below are the possible options from which the user can choose which ID mapping source to use: 

  * Choose an attribute as ID to match. 

  * Provide custom mapping with a text file. (IDs in each row match--biologically the same) 

  * Get ID mappings from a relational database. 

   * [http://www.pathvisio.org/Help#Supported_database_systems PathVisio/WikiPathways synonym database]

   * [http://genmapp.org/help_v2/GeneDatabase.htm GenMAPP gene database]

  * Get ID mappings from a web service 

   * [http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html NCBI web service]

   * [http://www.ebi.ac.uk/uniprot/remotingAPI/index.html UniProt API]

   * [http://www.ebi.ac.uk/Tools/webservices/ Ensembl web service]

  * Get ID mapping from a local database 

   * [http://db.apache.org/derby/ Derby database] for ID mappings

 * Ask the user to selected which ID types are used in the networks from a list of ID types. 

  * The list of ID types may be different for different database or web service. For example, if the user chooses GenMAPP gene database, he/she can select a few out of the  19 supported ID types. 

  * Or according to GaryBader's suggestion, we should limit the types of IDs supported in order to minimize the chances for errors. On that purpose, we should only provide the most common used ID types (e.g. Entrez Gene, RefSeq, UniProt and some others) in the list of ID types, no matter what web service/database the user has chosen. 

  * If the user is not sure about what ID types are used in the networks, he/she can choose all. 

 * When retrieving data from web services, try local DB first for performance; and cache retreved ID sets from web services for the next query. 

 * After retrieving the data, report to the user how many IDs are found, how many are not and how many are ambiguous (the same ID existing in different type--should be rare). For the ambiguous ones, ask the user to decide. (This is similar to what  [http://david.abcc.ncifcrf.gov/ DAVID] has done.) 



=== 2. Merge the networks according to the ID mappings ===



 * For those nodes whose IDs cannot be found in the ID mappings, merge them according to IDs. 

 * For those nodes whose IDs are found in the ID mappings, merge them according to ID mapping. 

 * When merging node in both of above two cases, conflicts of attributes need to be addressed as in 3. 



=== 3. Deal with attribute conflicts ===



 * Let the user sort the priorities of the networks to be merged, follow the one with higher priority when conflicts occur. 

 * Let the user assign which attribute value should be used for each node. After each assignment, pop out to ask “apply to all?” with check box “don’t ask again”.
+related use cases have been identified on [http://baderlab.org/IdentifierMapping Bader Lab ID Mapping page]. 2 of them are closely related to this project:

 * '''Unification''' during dataset merging: During a merge operation e.g. of two protein-protein interaction datasets from independently created databases, it is vital to recognize that two protein objects, one from each data source, represent the same protein molecule, even if the protein objects don’t share any database accession numbers. Unification requires knowledge of record type e.g. you cannot reliably use a gene ID to unify proteins (mostly because splice variants exist). 

 * '''Identifier translation''': Some analysis methods require specific translations from one set of identifiers to another. For instance, our 'activity centers' analysis requires translation from protein or gene identifiers in a pathway database to Affymetrix probe set identifiers or other gene expression array platform identifiers.
-Line 53:
+Line 18:
+=== Workflows ===



 1. Get ID mappings

  * Below are the possible options from which the user can choose which ID mapping source to use: 

   * Choose an attribute as ID to match. 

   * Provide custom mapping with a text file. (IDs in each row match--biologically the same) 

   * Get ID mappings from a web service 

   * Get ID mappings from a relational database. 

   * Get ID mapping from a local database (e.g. embeded Derby database)

  * Ask the user to selected which ID types are used in the networks from a list of ID types. 

   * The list of ID types may be different for different database or web service. For example, if the user chooses GenMAPP gene database, he/she can select a few out of the  19 supported ID types. 

   * Or according to GaryBader's suggestion, we should limit the types of IDs supported in order to minimize the chances for errors. On that purpose, we should only provide the most common used ID types (e.g. Entrez Gene, RefSeq, UniProt and some others) in the list of ID types, no matter what web service/database the user has chosen. 

   * If the user is not sure about what ID types are used in the networks, he/she can choose all. 

  * After retrieving the data, report to the user how many IDs are found, how many are not and how many are ambiguous (the same ID existing in different type--should be rare). For the ambiguous ones, ask the user to decide. (This is similar to what  [http://david.abcc.ncifcrf.gov/ DAVID] has done.) 

 1. Merge the networks according to the ID mappings

  * For those nodes whose IDs cannot be found in the ID mappings, merge them according to IDs. 

  * For those nodes whose IDs are found in the ID mappings, merge them according to ID mapping. 

  * When merging node in both of above two cases, conflicts of attributes need to be addressed as in 3. 

 1. Deal with attribute conflicts

  * Let the user sort the priorities of the networks to be merged, follow the one with higher priority when conflicts occur. 

  * Let the user assign which attribute value should be used for each node. After each assignment, pop out to ask “apply to all?” with check box “don’t ask again”.
-Line 55:
+Line 42:
-This project can be divided into the following parts: (Ref:  [http://web.missouri.edu/~jg722/GSoC/networkmerge.pdf diagram] by KeiOno)
+This project can be divided into the following parts: (Ref:  [http://web.missouri.edu/~jg722/GSoC/networkmerge.pdf diagram] by KeiichiroOno)
-Line 60:
+Line 47:
-  * KeiOno has implemented a framework to manage web service clients in Cytoscape 2.6
+  * KeiichiroOno has implemented a framework to manage web service clients in Cytoscape 2.6
-Line 62:
+Line 49:
-   * KeiOno has also prototyped a client to use UniProt API
+   * KeiichiroOno has also prototyped a client to use UniProt API
-Line 73:
+Line 60:
-. (Optional) ID mapping plugin
+. ID mapping plugin
-Line 99:
+Line 86:
-  * KeiOno: Ideally, each network contains only one ID set. However, in some cases it is impossible since one ID set does not contain all of the macro/small molecules (see  this example from IntAct. Node IDs are UniProt unified acc. number, but it is not available for small molecules). Just like IntAct, probably it is OK to use DBName:ID style ID if the object does not exists in the target ID set. In that case, the DBName:ID should be taken from the primary source of the database, i.e., if the object only exists in KEGG and not in UniProt, it should be KEGG:xxxxx
+  * KeiichiroOno: Ideally, each network contains only one ID set. However, in some cases it is impossible since one ID set does not contain all of the macro/small molecules (see  this example from IntAct. Node IDs are UniProt unified acc. number, but it is not available for small molecules). Just like IntAct, probably it is OK to use DBName:ID style ID if the object does not exists in the target ID set. In that case, the DBName:ID should be taken from the primary source of the database, i.e., if the object only exists in KEGG and not in UniProt, it should be KEGG:xxxxx
-Line 101:
+Line 88:
-=== Resource ===
+== References/Resources ==
-Line 103:
+Line 90:
- * KeiOno has implemented a new framework to manage web service clients in Cytoscape, some of which (possibly BioMart, NCBI, Uniprot and PICR) can be used here. (Ref: SampleWebServiceClients)
+=== Related work ===
-Line 105:
+Line 92:
- * GaryBader has identified 5 use cases for ID mapping listed at  http://baderlab.org/IDMapping. Use cases 1 and 3 seem to be the priority for Cytoscape, but only use case 1 for network merge. On the web page, he also listed a number of ID mapping sources.
+ * SampleWebServiceClients by KeiichiroOno

 * 5 related use cases identified on [http://baderlab.org/IdentifierMapping Bader Lab ID Mapping page] 

 * [http://www.pathvisio.org/ PathVisio] synonym databases store the mappings from Ensembl using Derby database ( [http://ftp2.bigcat.unimaas.nl/~martijn.vaniersel/pathvisio/daily/javadoc package] & [http://svn.bigcat.unimaas.nl/pathvisio/trunk/src/core/org/pathvisio/data src]--GDB classes in the org.pathvisio.data)
-Line 107:
+Line 96:
+=== Web services and Relational DB ===



 * [http://www.ncbi.nlm.nih.gov/entrez/query/static/esoap_help.html NCBI web service]

 * [http://www.ebi.ac.uk/uniprot/remotingAPI/index.html UniProt API]

 * [http://www.ebi.ac.uk/Tools/webservices/ Ensembl web service]

 * [http://www.biomart.org/martservice.html Biomart]

 * [http://www.pathvisio.org/Help#Supported_database_systems PathVisio/WikiPathways synonym database]

 * [http://genmapp.org/help_v2/GeneDatabase.htm GenMAPP gene database]



=== Tools ===



 * [http://db.apache.org/derby/ Derby database]
-Line 108:
+Line 109:
 Line 110:
- * [http://www.pathvisio.org/ PathVisio] synonym databases store the mappings from Ensembl using Derby database ( [http://ftp2.bigcat.unimaas.nl/~martijn.vaniersel/pathvisio/daily/javadoc package] & [http://svn.bigcat.unimaas.nl/pathvisio/trunk/src/core/org/pathvisio/data src]--GDB classes in the org.pathvisio.data)
-Line 130:
+Line 128:
-~-''Link to other related RFCs''-~
+ * [:WebServicesIDMapping:RFC 29]: Web services API for ID mapping/translator service

 * [:BioWebServiceConnectivity:RFC 45]: Web Services Client Manager and Unified Network/Attribute Import Mechanism
-Line 138:
+Line 137:
- *''Add comment here…''

 * KeiOne (2008-04-22):
+ * KeiichiroOno (2008-04-22):
-Line 145:
+Line 143:
+ *''Add comment here…''