Handlers for the following format still exist in Cytoscape as legacy code. However, we strongly recommend using the new formats (OBO + Gene Association) described in the previous section, since files in those formats can be downloaded from the Gene Ontology project and used directly. Currently, there is no import interface for this old format.
Building your own annotation files
The annotation server requires that the gene annotations and associated ontology of controlled vocabulary terms follow a simple format. This simple format was chosen because it is efficient to parse and easy to use.
The flat file formats are explained below:
The Ontology Format
By example (the Gene Ontology - GO):
(curator=GO) (type=all)
0003673 = Gene_Ontology
0003674 = molecular_function [partof: 0003673 ]
0008435 = anticoagulant [isa: 0003674 ]
0016172 = antifreeze [isa: 0003674 ]
0016173 = ice nucleation inhibitor [isa: 0016172 ]
0016209 = antioxidant [isa: 0003674 ]
0045174 = glutathione dehydrogenase (ascorbate) [isa: 0009491 0015038 0016209 0016672 ]
0004362 = glutathione reductase (NADPH) [isa: 0015038 0015933 0016209 0016654 ]
0017019 = myosin phosphatase catalyst [partof: 0017018 ]
...
A second example (KEGG pathway ontology):
(curator=KEGG) (type=Metabolic Pathways)
90001 = Metabolism
80001 = Carbohydrate Metabolism [isa: 90001 ]
80003 = Lipid Metabolism [isa: 90001 ]
80002 = Energy Metabolism [isa: 90001 ]
80004 = Nucleotide Metabolism [isa: 90001 ]
80005 = Amino Acid Metabolism [isa: 90001 ]
80006 = Metabolism of Other Amino Acids [isa: 90001 ]
80007 = Metabolism of Complex Carbohydrates [isa: 90001 ]
...
The format has these required features:
- The first line contains two parenthesized assignments for curator and type. In the GO example above, the ontology file (which is created from the XML that GO provides) nests all three specific ontologies (molecular function, biological process, cellular component) below the 'root' ontology, named 'Gene_Ontology'. (type=all) tells you that all three ontologies are included in that file.
- Following the mandatory title line, there are one or more category lines, each with the form:
number0 = name [isa:|partof: number1 number2 ...]
where isa and partof are terms used in GO; they describe the relation between parent and child terms in the ontology hierarchy.
The trailing blank before each left square bracket is not required; it is an artifact of the python script that creates these files.
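To make the category-line format concrete, here is a minimal Python sketch that splits one line into its ID, name, and parent terms. The function name parse_category_line is hypothetical and not part of Cytoscape, and the sketch assumes that term names never contain a left square bracket.

```python
import re

# Matches one bracketed relation list, e.g. "[isa: 0015038 0016209 ]".
RELATION = re.compile(r'\[(isa|partof):\s*([^\]]*)\]')


def parse_category_line(line):
    """Return (term_id, name, parents) for a category line, or None.

    parents maps 'isa' and 'partof' to lists of parent term IDs.
    The title line and other non-category lines yield None.
    """
    term_id, sep, rest = line.partition('=')
    term_id = term_id.strip()
    if not sep or not term_id.isdigit():
        return None
    # Assumption: the name is everything before the first '['.
    name = rest.split('[', 1)[0].strip()
    parents = {'isa': [], 'partof': []}
    for relation, ids in RELATION.findall(rest):
        parents[relation].extend(ids.split())
    return term_id, name, parents
```

Because the IDs are split on whitespace, the sketch accepts lines with or without the blank before the closing bracket.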
The Annotation Format
By example (from the GO biological process annotation file):
(species=Saccharomyces cerevisiae) (type=Biological Process) (curator=GO)
YMR056C = 0006854
YBR085W = 0006854
YJR155W = 0006081
...
and from KEGG:
(species=Mycobacterium tuberculosis) (type=Metabolic Pathways) (curator=KEGG)
RV0761C = 10
RV0761C = 71
RV0761C = 120
RV0761C = 350
RV0761C = 561
RV1862 = 10
...
The format has these required features:
- The first line contains three parenthesized assignments, for species, type and curator. In the first example above, the annotation file (created for budding yeast from the flat text file maintained by SGD for the Gene Ontology project and available both at their web site and at GO's) shows three yeast ORFs annotated for biological process with respect to GO.
- Following the mandatory title line, there are one or more annotation lines, each with the form:
canonicalName = ontologyTermID
Once loaded, this annotation (along with the accompanying ontology) can be assigned to nodes in a Cytoscape network. For this to work, the species type of the node must exactly match the species named on the first line of the annotation file. The canonicalName of your node must exactly match the canonicalName present in the annotation file. If you don’t see the expected results when using this feature of Cytoscape, check this again, as getting either of these wrong is a common mistake.
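The exact-match requirement can be illustrated with a short Python sketch. The function names parse_annotation and terms_for_node are hypothetical; the sketch only mirrors the matching rule described above and is not Cytoscape's own implementation.

```python
import re

# Matches one "(key=value)" assignment on the title line.
HEADER_FIELD = re.compile(r'\(([^=()]+)=([^()]*)\)')


def parse_annotation(text):
    """Return (header, annotations) for the contents of an annotation file.

    header maps 'species', 'type' and 'curator' to their values;
    annotations maps each canonical name to a list of ontology term IDs.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError('annotation text is empty')
    header = {key.strip(): value.strip()
              for key, value in HEADER_FIELD.findall(lines[0])}
    annotations = {}
    for line in lines[1:]:
        name, sep, term_id = line.partition('=')
        if sep:
            annotations.setdefault(name.strip(), []).append(term_id.strip())
    return header, annotations


def terms_for_node(header, annotations, node_species, canonical_name):
    """Return the node's term IDs, or [] unless species and name match exactly."""
    if header.get('species') != node_species:
        return []
    return annotations.get(canonical_name, [])
```

Both comparisons are case-sensitive string equality, so a node whose species is 'mycobacterium tuberculosis' or whose name is 'rv0761c' would receive no terms from the KEGG example.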
Load Data into Cytoscape
The easiest way to make annotations available to Cytoscape is by loading annotations into the Cytoscape annotation server. This is the default behavior for the official release of Cytoscape.
The Annotation Manifest
You must first create a text file to specify the files you want Cytoscape to load. Here is an example, from a file which (for convenience) we usually call manifest:
ontology=goOntology.txt
annotation=yeastBiologicalProcess.txt
annotation=yeastMolecularFunction.txt
annotation=yeastCellularComponent.txt
Use the Cytoscape -b command line argument to specify the annotation manifest file to read (e.g. -b manifest). Please note that the value given to the -s switch, which sets the default species for your data, must exactly match the species named in any annotation file you wish to use.
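Because a species mismatch is easy to overlook, it can help to check the manifest before starting Cytoscape. The following Python sketch is one way to do that; the function name mismatched_annotations is hypothetical, and the sketch assumes that file names in the manifest are relative to the manifest's own directory, which may not hold for your setup.

```python
import os
import re

# Extracts the species from a title line such as "(species=Homo sapiens) ...".
SPECIES_FIELD = re.compile(r'\(species=([^()]*)\)')


def mismatched_annotations(manifest_path, default_species):
    """Return (filename, species) pairs for annotation files whose title-line
    species differs from default_species (species is None if not declared)."""
    # Assumption: file names in the manifest are relative to its directory.
    base_dir = os.path.dirname(os.path.abspath(manifest_path))
    with open(manifest_path) as manifest:
        entries = [line.strip().partition('=') for line in manifest]
    mismatches = []
    for key, sep, filename in entries:
        if not sep or key.strip() != 'annotation':
            continue  # skip ontology entries and blank lines
        filename = filename.strip()
        with open(os.path.join(base_dir, filename)) as annotation_file:
            title_line = annotation_file.readline()
        match = SPECIES_FIELD.search(title_line)
        species = match.group(1).strip() if match else None
        if species != default_species:
            mismatches.append((filename, species))
    return mismatches
```

An empty result means every annotation file listed in the manifest names exactly the species you intend to pass with -s.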
Getting and Reformatting GO Data
The Gene Ontology (GO) project is a valuable source of annotation for the genes of many organisms. In this section we will explain how to:
- Obtain the GO ontology file
- Reformat it into the simpler flat file Cytoscape uses
- Obtain an annotation file (we illustrate with yeast and human annotation)
- Reformat the annotation files into the simple Cytoscape format
Obtain the GO ontology file
Go to the GO XML FTP (ftp://ftp.geneontology.org/pub/go/xml/) page. Download the latest go-YYYYMM-termdb.xml.gz file.
Reformat GO XML ontology file into a flat file
gunzip go-YYYYMM-termdb.xml.gz
python parseGoTermsToFlatFile.py go-YYYYMM-termdb.xml > goOntology.txt
(see below for Python script listing)
Obtain the association file for your organism
GO maintains a list of association files for many organisms; these files associate genes with GO terms. The next step is to get the file for the organism(s) you are interested in, and parse it into the form Cytoscape needs. A list of files may be seen at http://www.geneontology.org/GO.current.annotations.shtml. The rightmost column contains links to tab-delimited files of gene associations, by species. Choose the species you are interested in, and click 'Download'.
Let's use "GO Annotations @ EBI: Human" as an example. After you have downloaded and saved the file, look at the first few lines:
SPTR O00115 DRN2_HUMAN GO:0003677 PUBMED:9714827 TAS F Deoxyribonuclease II precursor IPI00010348 protein taxon:9606 SPTR
SPTR O00115 DRN2_HUMAN GO:0004519 GOA:spkw IEA F Deoxyribonuclease II precursor IPI00010348 protein taxon:9606 20020425 SPTR
SPTR O00115 DRN2_HUMAN GO:0004531 PUBMED:9714827 TAS F Deoxyribonuclease II precursor IPI00010348 protein taxon:9606 SPTR
...
Each record starts with SPTR and occupies a single tab-delimited line in the actual file; the tabs appear as spaces here. The goal is to create from these lines the following lines:
(species=Homo sapiens) (type=Molecular Function) (curator=GO)
IPI00010348 = 0003677
IPI00010348 = 0004519
IPI00010348 = 0004531
...
or
(species=Homo sapiens) (type=Biological Process) (curator=GO)
NP_001366 = 0006259
NP_001366 = 0006915
NP_005289 = 0007186
NP_647593 = 0006899
...
The first sample contains molecular function annotations for proteins, and each protein is identified by its IPI number. IPI is the International Protein Index, which maintains cross references to the main databases for human, mouse and rat proteomes. The second sample contains biological process annotation, and each protein is identified by its NP (RefSeq) number. These two naming systems, IPI and RefSeq, are two of many that you can use to define canonical names when you run Cytoscape. For budding yeast, it is much easier: the yeast community always uses standard ORF names, and so Cytoscape uses these as canonical names. For human proteins and genes, there is no single standard.
The solution (for those working with human genes or proteins) is, once you have downloaded the annotations file, to:
- Decide which naming system you want to use.
- Download ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/xrefs.goa. This cross-reference file, when used strategically, allows you to create Cytoscape-compatible annotation files in which the canonical name is the one most meaningful to you.
- Examine xrefs.goa to figure out which column contains the names you wish to use.
- Make a very slight modification to the Python script described below.
- Run that script, supplying both the annotation file and xrefs.goa as arguments.
Here are a few sample lines from xrefs.goa:
SP O00115 IPI00010348 ENSP00000222219; NP_001366; BAA28623;AAC77366;AAC35751;AAC39852;BAB55598;AAB51172;AAH10419; 2960,DNASE2 1777,DNASE2
SP O00116 IPI00010349 ENSP00000324567;ENSP00000264167; NP_003650; CAA70591; 327,AGPS 8540,AGPS
SP O00124 IPI00010353 ENSP00000265616;ENSP00000322580; NP_005662; BAA18958;BAA18959;AAH20694; 7993,D8S2298E
...
Each record starts with the letters SP and occupies a single line in the actual file. See the README file for more information (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/README).
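One way to find the column that holds your preferred names is to search each tab-separated field for a known identifier prefix. The Python sketch below does this; find_column and first_identifier are hypothetical helpers, and the tab positions in any sample line you try are an assumption until checked against the real file, which may contain additional or empty columns.

```python
def find_column(lines, prefix):
    """Return the index of the first tab-separated column whose value starts
    with prefix in any of the given lines, or None if no column does."""
    for line in lines:
        for index, field in enumerate(line.rstrip('\n').split('\t')):
            if field.strip().startswith(prefix):
                return index
    return None


def first_identifier(field):
    """Return the first entry of a semicolon-separated identifier list."""
    return field.split(';')[0].strip()
```

For example, calling find_column on the first few lines of xrefs.goa with the prefix 'NP_' reports the zero-based index of the RefSeq column, and first_identifier reduces a field such as 'ENSP00000324567;ENSP00000264167;' to its first entry.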
Finally, run the supplied Python script to create your three annotation files for human proteins:
- bioproc.txt (GO biological process annotation)
- molfunc.txt (GO molecular function annotation)
- cellcomp.txt (GO cellular component annotation)
It may be necessary to modify this script slightly if RefSeq identifiers are not used as canonical names or if you are using a more recent version of Python.
python parseAssignmentsToFlatFileFromGoaProject.py gene_association.goa_human xrefs.goa
(See below for Python script listing)
Python script examples
These scripts, as described above, require Python version 2.2 or later.
Script 1 - parseGoTermsToFlatFile.py
# parseGoTermsToFlatFile.py: translate a GO XML ontology file into a simpler
# Cytoscape flat file
#-----------------------------------------------------------------------------------
# RCS: $Revision: 1.3 $ $Date: 2003/05/18 00:38:43 $
#-----------------------------------------------------------------------------------
import re, sys
#-----------------------------------------------------------------------------------
def flatFilePrint (id, name, isaIDs, partofIDs):
    # build the optional '[isa: ...]' and '[partof: ...]' suffixes
    isa = ''
    if (len (isaIDs) > 0):
        isa = '[isa: '
        for isaID in isaIDs:
            isa += isaID
            isa += ' '
        isa += ']'

    partof = ''
    if (len (partofIDs) > 0):
        partof = '[partof: '
        for partofID in partofIDs:
            partof += partofID
            partof += ' '
        partof += ']'

    result = '%s = %s %s %s' % (id, name, isa, partof)
    result = result.strip ()
    if (result == 'isa = isa' or result == 'partof = partof'):
        sys.stderr.write ('meaningless term: %s\n' % result)
    else:
        sys.stdout.write (result + '\n')
#-----------------------------------------------------------------------------------
if (len (sys.argv) != 2):
    sys.stderr.write ('usage: %s <someFile.xml>\n' % sys.argv [0])
    sys.exit (1)

inputFilename = sys.argv [1]
sys.stderr.write ('reading %s...\n' % inputFilename)
text = open (inputFilename).read ()
sys.stderr.write ('read %d characters\n' % len (text))

# each <go:term> element describes one ontology term
regex = r'<go:term .*?>(.*?)</go:term>'
cregex = re.compile (regex, re.DOTALL)  # DOTALL: '.' also matches newlines
m = cregex.findall (text)
sys.stderr.write ('number of go terms: %d\n' % len (m))

regex2 = r'<go:accession>GO:(.*?)</go:accession>.*?<go:name>(.*?)</go:name>'
cregex2 = re.compile (regex2, re.DOTALL)
regex3 = r'<go:isa\s*rdf:resource="http://www.geneontology.org/go#GO:(.*?)"\s*/>'
cregex3 = re.compile (regex3, re.DOTALL)
regex4 = r'<go:part-of\s*rdf:resource="http://www.geneontology.org/go#GO:(.*?)"\s*/>'
cregex4 = re.compile (regex4, re.DOTALL)

goodElements = 0
badElements = 0
sys.stdout.write ('(curator=GO) (type=all)\n')  # mandatory title line

for term in m:
    m2 = cregex2.search (term)
    if (m2):
        goodElements += 1
        id = m2.group (1)
        name = m2.group (2)
        isaIDs = cregex3.findall (term)
        partofIDs = cregex4.findall (term)
        flatFilePrint (id, name, isaIDs, partofIDs)
    else:
        badElements += 1
        sys.stderr.write ('no match to m2...\n')
        sys.stderr.write ("---------------\n%s\n------------------\n" % term)

sys.stderr.write ('goodElements %d\n' % goodElements)
sys.stderr.write ('badElements %d\n' % badElements)
#-----------------------------------------------------------------------------------
Script 2 - parseAssignmentsToFlatFileFromGoaProject.py
import sys
#-----------------------------------------------------------------------------------
def fixCanonicalName (rawName):
    # for instance, trim 'YBR085W|ANC3' to 'YBR085W'
    bar = rawName.find ('|')
    if (bar < 0):
        return rawName
    return rawName [:bar]
#-----------------------------------------------------------------------------------
def fixGoID (rawID):
    # for instance, trim 'GO:0003677' to '0003677'
    bar = rawID.find (':') + 1
    return rawID [bar:]
#-----------------------------------------------------------------------------------
def readGoaXrefFile (filename):
    # map each IPI identifier to the first RefSeq (NP) identifier on its line
    lines = open (filename).read ().split ('\n')
    result = {}
    for line in lines:
        if (len (line) < 10):
            continue
        tokens = line.split ('\t')
        if (len (tokens) < 6):
            continue
        ipi = tokens [2]
        np = tokens [5]
        semicolon = np.find (';')
        if (semicolon >= 0):
            np = np [:semicolon]
        if (len (ipi) > 0 and len (np) > 0):
            result [ipi] = np
    return result
#-----------------------------------------------------------------------------------
if (len (sys.argv) != 3):
    sys.stderr.write ('error!  parse <gene_associations file from GO> <goa xrefs file>\n')
    sys.exit (1)

associationFilename = sys.argv [1]
xrefsFilename = sys.argv [2]
species = 'Homo sapiens'

ipiToNPHash = readGoaXrefFile (xrefsFilename)
tester = 'IPI00099416'  # sample identifier, used only for progress messages
sys.stdout.write ('hash size: %d\n' % len (ipiToNPHash))
sys.stdout.write ('test map: %s -> NP_054861: %s\n' % (tester, ipiToNPHash.get (tester)))

bioproc = open ('bioproc.txt', 'w')
molfunc = open ('molfunc.txt', 'w')
cellcomp = open ('cellcomp.txt', 'w')

# mandatory title line of each annotation file
bioproc.write ('(species=%s) (type=Biological Process) (curator=GO)\n' % species)
molfunc.write ('(species=%s) (type=Molecular Function) (curator=GO)\n' % species)
cellcomp.write ('(species=%s) (type=Cellular Component) (curator=GO)\n' % species)

lines = open (associationFilename).read ().split ('\n')
sys.stderr.write ('found %d lines\n' % len (lines))
for line in lines:
    # skip comment lines (starting with '!') and blank lines
    if (line.find ('!') == 0 or len (line) < 2):
        continue
    tokens = line.split ('\t')
    if (len (tokens) < 11):
        continue
    goOntology = tokens [8]
    goID = fixGoID (tokens [4])
    ipiName = fixCanonicalName (tokens [10])
    if (len (ipiName) < 1):
        continue
    if (ipiName not in ipiToNPHash):
        continue
    refseqName = ipiToNPHash [ipiName]
    printName = refseqName
    #printName = ipiName    # use this line instead to keep IPI names
    if (ipiName == tester):
        sys.stdout.write ('%s (%s) has go term %s\n' % (tester, printName, goID))
    if (goOntology == 'C'):
        cellcomp.write ('%s = %s\n' % (printName, goID))
    elif (goOntology == 'P'):
        bioproc.write ('%s = %s\n' % (printName, goID))
    elif (goOntology == 'F'):
        molfunc.write ('%s = %s\n' % (printName, goID))

bioproc.close ()
molfunc.close ()
cellcomp.close ()
#-----------------------------------------------------------------------------------