Cytoscape_User_Manual/Old_Annotation_Server

Handlers for the following format still exist in Cytoscape as legacy code. However, we strongly recommend using the new formats (OBO + Gene Association) described in the previous section, since those files can be downloaded from the Gene Ontology project and used without conversion. Currently, users have no access to an import interface for this old format.

Building your own annotation files

The annotation server requires that the gene annotations and associated ontology of controlled vocabulary terms follow a simple format. This simple format was chosen because it is efficient to parse and easy to use.

The flat file formats are explained below:

The Ontology Format

By example (the Gene Ontology - GO):

(curator=GO) (type=all)
0003673 = Gene_Ontology
0003674 = molecular_function [partof: 0003673 ]
0008435 = anticoagulant [isa: 0003674 ]
0016172 = antifreeze [isa: 0003674 ]
0016173 = ice nucleation inhibitor [isa: 0016172 ]
0016209 = antioxidant [isa: 0003674 ]
0045174 = glutathione dehydrogenase (ascorbate) [isa: 0009491 0015038 0016209 0016672 ]
0004362 = glutathione reductase (NADPH) [isa: 0015038 0015933 0016209 0016654 ]
0017019 = myosin phosphatase catalyst [partof: 0017018 ]
...

A second example (KEGG pathway ontology):

(curator=KEGG) (type=Metabolic Pathways)
90001 = Metabolism
80001 = Carbohydrate Metabolism [isa: 90001 ]
80003 = Lipid Metabolism [isa: 90001 ]
80002 = Energy Metabolism [isa: 90001 ]
80004 = Nucleotide Metabolism [isa: 90001 ]
80005 = Amino Acid Metabolism [isa: 90001 ]
80006 = Metabolism of Other Amino Acids [isa: 90001 ]
80007 = Metabolism of Complex Carbohydrates [isa: 90001 ]
...

The format has these required features:

The first line contains two parenthesized assignments for curator and type. In the GO example above, the ontology file (which is created from the XML that GO provides) nests all three specific ontologies (molecular function, biological process, cellular component) below the 'root' ontology, named 'Gene_Ontology'. (type=all) tells you that all three ontologies are included in that file.
Following the mandatory title line, there are one or more category lines, each with the form:
- number0 = name [isa:|partof: number1 number2 ...]
where isa and partof are terms used in GO; they describe the relation between parent and child terms in the ontology hierarchy.
The trailing blank before each right square bracket is not required; it is an artifact of the Python script that creates these files.
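
As an illustration of the format just described, the following Python 3 sketch parses a title line and its category lines into plain dictionaries. It is not part of Cytoscape: the function names (parse_title_line, parse_ontology) are hypothetical, and the sketch assumes that a category line may carry both an isa and a partof bracket, which is what the conversion script listed later in this document can emit.

```python
import re

# Hypothetical helpers for illustration only; not part of Cytoscape.
TITLE_FIELD = re.compile(r'\(([^=()]+)=([^()]*)\)')
BRACKET = re.compile(r'\[(isa|partof):([^\]]*)\]')

def parse_title_line(line):
    """Return the parenthesized assignments of a title line as a dict."""
    return {key.strip(): value.strip() for key, value in TITLE_FIELD.findall(line)}

def parse_ontology(text):
    """Parse ontology flat-file text into (header, terms).

    terms maps each term ID to a dict with 'name', 'isa' and 'partof'.
    """
    lines = [ln for ln in text.splitlines() if ln.strip() and ln.strip() != '...']
    header = parse_title_line(lines[0])
    terms = {}
    for line in lines[1:]:
        term_id, rest = line.split('=', 1)
        relations = {'isa': [], 'partof': []}
        for kind, ids in BRACKET.findall(rest):
            relations[kind].extend(ids.split())
        # The name is whatever precedes the first relation bracket.
        first_bracket = BRACKET.search(rest)
        name = rest[:first_bracket.start()] if first_bracket else rest
        terms[term_id.strip()] = {'name': name.strip(), **relations}
    return header, terms

sample = """(curator=GO) (type=all)
0003673 = Gene_Ontology
0003674 = molecular_function [partof: 0003673 ]
0045174 = glutathione dehydrogenase (ascorbate) [isa: 0009491 0015038 0016209 0016672 ]
"""
header, terms = parse_ontology(sample)
print(header)
print(terms['0045174'])
```

Because the parent IDs are split on whitespace, the optional blank before the closing bracket makes no difference to the result.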

The Annotation Format

By example (from the GO biological process annotation file):

(species=Saccharomyces cerevisiae) (type=Biological Process) (curator=GO)
YMR056C = 0006854
YBR085W = 0006854
YJR155W = 0006081
...

and from KEGG:

(species=Mycobacterium tuberculosis) (type=Metabolic Pathways) (curator=KEGG)
RV0761C = 10
RV0761C = 71
RV0761C = 120
RV0761C = 350
RV0761C = 561
RV1862 = 10
...

The format has these required features:

The first line contains three parenthesized assignments, for species, type and curator. The first example above is an annotation file for budding yeast, created from the flat text file that SGD maintains for the Gene Ontology project (available both at the SGD web site and at GO's); it shows three yeast ORFs annotated for biological process with respect to GO.
Following the mandatory title line, there are one or more annotation lines, each with the form:
- canonicalName = ontologyTermID
Once loaded, this annotation (along with the accompanying ontology) can be assigned to nodes in a Cytoscape network. For this to work, the species type of the node must exactly match the species named on the first line of the annotation file. The canonicalName of your node must exactly match the canonicalName present in the annotation file. If you don’t see the expected results when using this feature of Cytoscape, check this again, as getting either of these wrong is a common mistake.
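
When annotations fail to appear, a quick check outside Cytoscape can help. The Python 3 sketch below parses an annotation file and reports which node names would receive no annotation. The helper names (parse_annotation, unmatched_nodes) are hypothetical, and the sketch assumes that "exactly match" means case-sensitive string equality; it does not reproduce Cytoscape's own loading code.

```python
import re

# Hypothetical helpers for illustration only; not part of Cytoscape.
TITLE_FIELD = re.compile(r'\(([^=()]+)=([^()]*)\)')

def parse_annotation(text):
    """Parse annotation flat-file text into (header, [(name, term ID), ...])."""
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and ln != '...']
    header = {key.strip(): value.strip() for key, value in TITLE_FIELD.findall(lines[0])}
    pairs = []
    for line in lines[1:]:
        name, term_id = line.split('=', 1)
        pairs.append((name.strip(), term_id.strip()))
    return header, pairs

def unmatched_nodes(node_names, node_species, header, pairs):
    """Return the node names that would receive no annotation.

    Assumes exact, case-sensitive matching of both species and name.
    """
    if node_species != header.get('species'):
        return sorted(node_names)   # wrong species: nothing matches
    annotated = {name for name, _ in pairs}
    return sorted(set(node_names) - annotated)

sample = """(species=Saccharomyces cerevisiae) (type=Biological Process) (curator=GO)
YMR056C = 0006854
YBR085W = 0006854
YJR155W = 0006081
"""
header, pairs = parse_annotation(sample)
nodes = ['YMR056C', 'ymr056c', 'YAL001C']
missing = unmatched_nodes(nodes, 'Saccharomyces cerevisiae', header, pairs)
wrong_species = unmatched_nodes(nodes, 'S. cerevisiae', header, pairs)
print(missing)
print(wrong_species)
```

In this example the lower-case name and the unannotated ORF are reported as unmatched, and an abbreviated species name leaves every node unmatched.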

Load Data into Cytoscape

The easiest way to make annotations available to Cytoscape is by loading annotations into the Cytoscape annotation server. This is the default behavior for the official release of Cytoscape.

The Annotation Manifest

You must first create a text file to specify the files you want Cytoscape to load. Here is an example, from a file which (for convenience) we usually call manifest:

ontology=goOntology.txt
annotation=yeastBiologicalProcess.txt
annotation=yeastMolecularFunction.txt
annotation=yeastCellularComponent.txt

Use the Cytoscape -b command line argument to specify the annotation manifest file to read (e.g. -b manifest). Please note that the species set with the -s switch, which is the default species for your data, must exactly match the species named in any annotation file you wish to use.
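
Since a species mismatch is easy to overlook, it can be worth checking a manifest before starting Cytoscape. The Python 3 sketch below reads a manifest and lists every annotation file whose title line names a different species from the one you intend to pass with -s. The function names are hypothetical, and the sketch assumes that file names in the manifest are resolved relative to the manifest's own directory, which this manual does not state.

```python
import os
import re
import tempfile

# Hypothetical helpers for illustration only; not part of Cytoscape.
def read_manifest(path):
    """Return a list of (kind, filename) pairs, e.g. ('annotation', 'x.txt')."""
    entries = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            kind, filename = line.split('=', 1)
            entries.append((kind.strip(), filename.strip()))
    return entries

def species_mismatches(manifest_path, default_species):
    """List (filename, species) for annotation files naming another species."""
    # Assumption: manifest entries are relative to the manifest's directory.
    base = os.path.dirname(os.path.abspath(manifest_path))
    problems = []
    for kind, filename in read_manifest(manifest_path):
        if kind != 'annotation':
            continue
        with open(os.path.join(base, filename)) as handle:
            title = handle.readline()
        found = re.search(r'\(species=([^()]*)\)', title)
        species = found.group(1).strip() if found else None
        if species != default_species:
            problems.append((filename, species))
    return problems

# Demonstration on throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    files = {
        'manifest': 'ontology=goOntology.txt\nannotation=yeastBiologicalProcess.txt\n',
        'goOntology.txt': '(curator=GO) (type=all)\n0003673 = Gene_Ontology\n',
        'yeastBiologicalProcess.txt':
            '(species=Saccharomyces cerevisiae) (type=Biological Process) (curator=GO)\n'
            'YMR056C = 0006854\n',
    }
    for name, content in files.items():
        with open(os.path.join(tmp, name), 'w') as handle:
            handle.write(content)
    manifest = os.path.join(tmp, 'manifest')
    ok = species_mismatches(manifest, 'Saccharomyces cerevisiae')
    bad = species_mismatches(manifest, 'Homo sapiens')
print(ok)
print(bad)
```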

Getting and Reformatting GO Data

The Gene Ontology (GO) project is a valuable source of annotation for the genes of many organisms. In this section we will explain how to:

Obtain the GO ontology file
Reformat it into the simpler flat file Cytoscape uses
Obtain an annotation file (we illustrate with yeast and human annotation)
Reformat the annotation files into the simple Cytoscape format

Obtain the GO ontology file

Go to the GO XML FTP (ftp://ftp.geneontology.org/pub/go/xml/) page. Download the latest go-YYYYMM-termdb.xml.gz file.

Reformat GO XML ontology file into a flat file

 gunzip go-YYYYMM-termdb.xml.gz
 python parseGoTermsToFlatFile.py go-YYYYMM-termdb.xml > goOntology.txt

(see below for Python script listing)

Obtain the association file for your organism

GO maintains a list of association files for many organisms; these files associate genes with GO terms. The next step is to get the file for the organism(s) you are interested in, and parse it into the form Cytoscape needs. A list of files may be seen at http://www.geneontology.org/GO.current.annotations.shtml. The rightmost column contains links to tab-delimited files of gene associations, by species. Choose the species you are interested in, and click 'Download'.

Let's use "GO Annotations @ EBI: Human" as an example. After you have downloaded and saved the file, look at the first few lines:

SPTR    O00115  DRN2_HUMAN              GO:0003677      PUBMED:9714827  TAS             F       Deoxyribonuclease II precursor  IPI00010348     protein taxon:9606              SPTR
SPTR    O00115  DRN2_HUMAN              GO:0004519      GOA:spkw        IEA             F       Deoxyribonuclease II precursor  IPI00010348     protein taxon:9606        20020425      SPTR
SPTR    O00115  DRN2_HUMAN              GO:0004531      PUBMED:9714827  TAS             F       Deoxyribonuclease II precursor  IPI00010348     protein taxon:9606              SPTR
...

Each record in the actual file is a single tab-delimited line, although it may appear wrapped onto two lines when displayed. The goal is to create from these lines the following lines:

(species=Homo sapiens) (type=Molecular Function) (curator=GO)
IPI00010348 = 0003677
IPI00010348 = 0004519
IPI00010348 = 0004531
...

(species=Homo sapiens) (type=Biological Process) (curator=GO)
NP_001366 = 0006259
NP_001366 = 0006915
NP_005289 = 0007186
NP_647593 = 0006899
...

The first sample contains molecular function annotations for proteins, and each protein is identified by its IPI number. IPI is the International Protein Index, which maintains cross references to the main databases for human, mouse and rat proteomes. The second sample contains biological process annotation, and each protein is identified by its NP (RefSeq) number. These two naming systems, IPI and RefSeq, are two of many that you can use to define canonical names when you run Cytoscape. For budding yeast, it is much easier: the yeast community always uses standard ORF names, and so Cytoscape uses these as canonical names. For human proteins and genes, there is no single standard.
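
The mapping from one association record to one annotation line can be sketched in a few lines of Python 3. The column positions used here (4 for the GO ID, 8 for the ontology letter, 10 for the protein name) are taken from Script 2 later in this document; the function name and the sample record, rebuilt with explicit tabs and empty fields, are assumptions made for illustration.

```python
# Hypothetical helper for illustration only; not part of Cytoscape.
ASPECT_TO_TYPE = {
    'P': 'Biological Process',
    'F': 'Molecular Function',
    'C': 'Cellular Component',
}

def association_to_annotation(line):
    """Return (type, 'name = termID') for one record, or None to skip it."""
    if line.startswith('!') or not line.strip():
        return None                          # comment or blank line
    tokens = line.rstrip('\n').split('\t')
    if len(tokens) < 11:
        return None                          # too short to hold a name
    go_id = tokens[4].split(':', 1)[1]       # 'GO:0003677' -> '0003677'
    aspect = tokens[8]                       # 'P', 'F' or 'C'
    name = tokens[10].split('|', 1)[0]       # keep the first of several names
    if not name or aspect not in ASPECT_TO_TYPE:
        return None
    return ASPECT_TO_TYPE[aspect], '%s = %s' % (name, go_id)

# Sample record rebuilt with explicit tabs; empty strings are empty columns.
record = '\t'.join([
    'SPTR', 'O00115', 'DRN2_HUMAN', '', 'GO:0003677', 'PUBMED:9714827', 'TAS',
    '', 'F', 'Deoxyribonuclease II precursor', 'IPI00010348', 'protein',
    'taxon:9606', '', 'SPTR',
])
result = association_to_annotation(record)
print(result)
```

Here the protein is named by its IPI number; substituting another naming system requires the cross-reference step described next.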

The solution (for those working with human genes or proteins) is, once you have downloaded the annotations file, to:

Decide which naming system you want to use.
Download ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/xrefs.goa. This cross-reference file, when used strategically, allows you to create Cytoscape-compatible annotation files in which the canonical name is the one most meaningful to you.
Examine xrefs.goa to figure out which column contains the names you wish to use.
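Make a very slight modification to the Python script listed below (Script 2) so that it reads that column; as listed, the script takes the RefSeq (NP) name from the column at index 5 of xrefs.goa.
Run that script, supplying both the annotation file and xrefs.goa as arguments, in that order.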

Here are a few sample lines from xrefs.goa:

SP      O00115  IPI00010348             ENSP00000222219;        NP_001366;              BAA28623;AAC77366;AAC35751;AAC39852;BAB55598;AAB51172;AAH10419; 2960,DNASE2     1777,DNASE2
SP      O00116  IPI00010349             ENSP00000324567;ENSP00000264167;        NP_003650;              CAA70591;       327,AGPS        8540,AGPS
SP      O00124  IPI00010353             ENSP00000265616;ENSP00000322580;        NP_005662;              BAA18958;BAA18959;AAH20694;             7993,D8S2298E
...

Each record in this example starts with the letters SP and occupies a single line in the actual file, although it may appear wrapped when displayed. See the README file for more information (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/README).
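
To make the choice of naming system concrete, here is a Python 3 sketch that builds a lookup table from IPI number to the name in a column of your choosing. The function name and its parameters are hypothetical. The default column indices (2 for IPI, 5 for RefSeq) follow Script 2 in this document, and the sample record is rebuilt with explicit tabs on the assumption that the file has an empty column at index 3; check your own copy of xrefs.goa before relying on these positions.

```python
# Hypothetical helper for illustration only; not part of Cytoscape.
def build_name_map(xref_lines, key_column=2, name_column=5):
    """Map the value in key_column to the first name found in name_column."""
    mapping = {}
    for line in xref_lines:
        tokens = line.rstrip('\n').split('\t')
        if len(tokens) <= max(key_column, name_column):
            continue                         # short or blank line
        key = tokens[key_column].strip()
        # A field may hold several ';'-separated names; keep the first.
        name = tokens[name_column].split(';', 1)[0].strip()
        if key and name:
            mapping[key] = name
    return mapping

# Assumed column layout: an empty column sits between the IPI and ENSP fields.
record = '\t'.join([
    'SP', 'O00115', 'IPI00010348', '', 'ENSP00000222219;', 'NP_001366;',
    'BAA28623;AAC77366;', '2960,DNASE2', '1777,DNASE2',
])
to_refseq = build_name_map([record])
to_ensembl = build_name_map([record], name_column=4)
print(to_refseq)
print(to_ensembl)
```

Passing a different name_column is the equivalent of the slight script modification mentioned in the steps above.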

Finally, run the supplied Python script to create your three annotation files for human proteins:

bioproc.txt (GO biological process annotation)
molfunc.txt (GO molecular function annotation)
cellcomp.txt (GO cellular component annotation)

It may be necessary to modify this script slightly if RefSeq identifiers are not used as canonical names or if you are using a more recent version of Python.

python parseAssignmentsToFlatFileFromGoaProject.py gene_association.goa_human xrefs.goa

(See below for Python script listing)

Python script examples

These scripts, as described above, require Python 2 (version 2.2 or later). They use Python 2 print statements and will not run unmodified under Python 3.

Script 1 - parseGoTermsToFlatFile.py

# parseGoTermsToFlatFile.py:  translate a GO XML ontology file into a simpler
#  Cytoscape flat file
#-----------------------------------------------------------------------------------
# RCS: $Revision: 1.3 $   $Date: 2003/05/18 00:38:43 $
#-----------------------------------------------------------------------------------
import re, sys   # the obsolete 'pre' module is not needed; 're' provides the same calls
#-----------------------------------------------------------------------------------
def flatFilePrint (id, name, isaIDs, partofIDs):
  isa = ''
  if (len (isaIDs) > 0):
    isa = '[isa: '
    for isaID in isaIDs:
      isa += isaID
      isa += ' '
    isa += ']'
  partof = ''
  if (len (partofIDs) > 0):
    partof = '[partof: '
    for partofID in partofIDs:
      partof += partofID
      partof += ' '
    partof += ']'
  result = '%s = %s %s %s' % (id, name, isa, partof)
  result = result.strip ()
  if (result == 'isa = isa' or result == 'partof = partof'):
    print >> sys.stderr, 'meaningless term: %s' % result
  else:
    print result
#-----------------------------------------------------------------------------------
if (len (sys.argv) != 2):
  print 'usage:  %s <someFile.xml>' % sys.argv [0]
  sys.exit ()
inputFilename = sys.argv [1]
print >> sys.stderr,  'reading %s...' % inputFilename
text = open (inputFilename).read ()
print >> sys.stderr,  'read %d characters' % len (text)
regex = '<go:term .*?>(.*?)</go:term>'
cregex = re.compile (regex, re.DOTALL)   # . matches newlines
m = re.findall (cregex, text)
print >> sys.stderr, 'number of go terms: %d' % len (m)
regex2 = '<go:accession>GO:(.*?)</go:accession>.*?<go:name>(.*?)</go:name>'
cregex2 = re.compile (regex2, re.DOTALL)
regex3 = '<go:isa\s*rdf:resource="http://www.geneontology.org/go#GO:(.*?)"\s*/>'
cregex3 = re.compile (regex3, re.DOTALL)
regex4 = '<go:part-of\s*rdf:resource="http://www.geneontology.org/go#GO:(.*?)"\s*/>'
cregex4 = re.compile (regex4, re.DOTALL)
goodElements = 0
badElements = 0
print '(curator=GO) (type=all)'
for term in m:
  m2 = re.search (cregex2, term)
  if (m2):
    goodElements += 1;
    id = m2.group (1)
    name = m2.group (2)
    isaIDs = []
    m3 = re.findall (cregex3, term);
    for ref in m3:
      isaIDs.append (ref)
    m4 = re.findall (cregex4, term);
    partofIDs = []
    for ref in m4:
      partofIDs.append (ref)
    flatFilePrint (id, name, isaIDs, partofIDs)
  else:
    badElements += 1;
    print >> sys.stderr, 'no match to m2...'
    print >> sys.stderr, "---------------\n%s\n------------------" % term
print >> sys.stderr,  'goodElements %d' % goodElements
print >> sys.stderr, 'badElements %d' % badElements
#--------------------------------------

Script 2 - parseAssignmentsToFlatFileFromGoaProject.py

import sys
#-----------------------------------------------------------------------------------
def fixCanonicalName (rawName):
# for instance, trim 'YBR085W|ANC3' to 'YBR085W'
  bar = rawName.find ('|')
  if (bar < 0):
    return rawName
  return rawName [:bar]
#-----------------------------------------------------------------------------------
def fixGoID (rawID):
  bar = rawID.find (':') + 1
  return rawID [bar:]
#-----------------------------------------------------------------------------------
def readGoaXrefFile (filename):
  lines = open (filename).read().split ('\n')
  result = {}
  for line in lines:
    if (len (line) < 10):
      continue
    tokens = line.split ('\t')
    ipi = tokens [2]
    np = tokens [5]
    semicolon = np.find (';')
    if (semicolon >= 0):
      np = np [:semicolon]
    if (len (ipi) > 0 and len (np) > 0):
      result [ipi] = np
  return result
#-----------------------------------------------------------------------------------
if (len (sys.argv) != 3):
  print 'error!  parse   <gene_associations file from GO> <goa xrefs file> '
  sys.exit ()
associationFilename = sys.argv [1];
xrefsFilename = sys.argv [2]
species = 'Homo sapiens'
ipiToNPHash = readGoaXrefFile (xrefsFilename)
tester = 'IPI00099416'
print 'hash size: %d' % len (ipiToNPHash)
# .get avoids a KeyError when the test ID is absent from the xrefs file
print 'test map: %s -> NP_054861: %s ' % (tester, ipiToNPHash.get (tester))
bioproc = open ('bioproc.txt', 'w')
molfunc = open ('molfunc.txt', 'w')
cellcomp = open ('cellcomp.txt', 'w')
bioproc.write ('(species=%s) (type=Biological Process) (curator=GO)\n' % species)
molfunc.write ('(species=%s) (type=Molecular Function) (curator=GO)\n' % species)
cellcomp.write ('(species=%s) (type=Cellular Component) (curator=GO)\n' % species)
lines=open(associationFilename).read().split('\n')
sys.stderr.write ('found %d lines\n' % len (lines))

for line in lines:
  if (line.find ('!') == 0 or len (line) < 2):
    continue
  tokens = line.split ('\t')
  goOntology = tokens [8]
  goIDraw = tokens [4]
  goID = goIDraw.split (':')[1]
  ipiName = fixCanonicalName (tokens [10])
  if (len (ipiName) < 1):
    continue


  if (ipiName not in ipiToNPHash):
    continue
  refseqName = ipiToNPHash [ipiName]
  printName = refseqName
  #printName = ipiName
  if (ipiName == tester):
    print '%s (%s) has go term %s' % (tester, printName, goID)
  if (goOntology == 'C'):
    cellcomp.write ('%s = %s\n' % (printName, goID))
  elif (goOntology == 'P'):
    bioproc.write ('%s = %s\n' % (printName, goID))
  elif (goOntology == 'F'):
    molfunc.write ('%s = %s\n' % (printName, goID))
#-----------------------------------------------------------------------------------

Cytoscape_User_Manual/Old_Annotation_Server (last edited 2009-02-12 01:03:15 by localhost)