Cytoscape_User_Manual/Expression_Data

In addition to ordinary node and edge attribute data, Cytoscape supports importing gene expression data. Gene expression data are imported using a different file format than ordinary attributes, but Cytoscape treats the resulting attributes no differently. Like attribute data, gene expression data can be loaded at any time, but are generally only relevant once a network has been loaded.

Data File Format

Gene expression ratios or values are specified over one or more experiments using a text file. Ratios result from a comparison of two expression measurements (experiment vs. control). Some expression platforms, such as Affymetrix, directly measure expression values, without a comparison. The file consists of a header line followed by one line per gene; each line holds space- or tab-delimited fields in the following format:

Identifier [CommonName] value1 value2 ... valueN [pval1 pval2 ... pvalN]

Brackets [ ] indicate fields that are optional.

The first field identifies which Cytoscape node the data refers to. In the simplest case, this is the gene name, exactly as it appears in the network displayed by Cytoscape (the match is case sensitive). Alternatively, this can be a node attribute that identifies the node uniquely, such as a probeset identifier for commercial microarrays.

The next field is an optional common name. It is not used by Cytoscape, and is provided strictly for the user's convenience. With this common name field, the input format is the same as for commonly used expression data analysis packages such as SAM (http://www-stat.stanford.edu/~tibs/SAM/).

The next set of columns contains the expression values, one column per experiment. These can be either absolute expression values or fold change ratios. Each experiment is identified by its experiment name, given in the header (first) line.

Optionally, significance measures such as P values may be provided. These values, generated by many microarray data analysis packages, indicate whether the level of gene expression or the fold change appears to be greater than expected by random chance. If you are using significance measures, your expression file should contain them in a second set of columns after the expression values. The column names for the significance measures must match those of the expression values exactly.

For example, here is an excerpt from the file galExpData.pvals in the Cytoscape sampleData directory:

GENE COMMON gal1RG gal4RG gal80R gal1RG gal4RG gal80R
YHR051W COX6 -0.034 0.111 -0.304 3.75720e-01 1.56240e-02 7.91340e-06
YHR124W NDT80 -0.090 0.007 -0.348 2.71460e-01 9.64330e-01 3.44760e-01
YKL181W PRS1 -0.167 -0.233 0.112 6.27120e-03 7.89400e-04 1.44060e-01
YGR072W UPF3 0.245 -0.471 0.787 4.10450e-04 7.51780e-04 1.37130e-05

This indicates that there are data for three experiments: gal1RG, gal4RG, and gal80R. Each name appears twice in the header line: the first occurrence labels the expression values, and the second labels the significance measures. For instance, the second line tells us that in experiment gal1RG, the gene YHR051W has an expression value of -0.034 with significance measure 3.75720e-01.
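The pairing of expression values with significance measures can be sketched in a few lines of Python. This is only an illustration of the layout described above, not Cytoscape's own parser; the function name parse_expression and the returned dictionary shape are assumptions made for this example, and it handles only the case where a common name column and significance columns are both present.

```python
# Illustrative sketch only; not Cytoscape's parser.
# Assumes a common-name column and a full set of significance columns.
SAMPLE = """\
GENE COMMON gal1RG gal4RG gal80R gal1RG gal4RG gal80R
YHR051W COX6 -0.034 0.111 -0.304 3.75720e-01 1.56240e-02 7.91340e-06
YHR124W NDT80 -0.090 0.007 -0.348 2.71460e-01 9.64330e-01 3.44760e-01
"""


def parse_expression(text):
    """Return {gene_id: {experiment: (value, significance)}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    names = header[2:]              # skip identifier and common-name columns
    half = len(names) // 2
    # The second set of column names must repeat the first set exactly.
    if half == 0 or names[:half] != names[half:]:
        raise ValueError("experiment names are not duplicated exactly")
    experiments = names[:half]

    result = {}
    for line in lines[1:]:
        fields = line.split()       # any run of spaces/tabs is one delimiter
        numbers = [float(x) for x in fields[2:]]
        values, pvals = numbers[:half], numbers[half:]
        result[fields[0]] = dict(zip(experiments, zip(values, pvals)))
    return result


data = parse_expression(SAMPLE)
print(data["YHR051W"]["gal1RG"])    # expression value and significance for gal1RG
```

Keying the result by the first column mirrors how the identifier field is used to find the matching node.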

Some variations on this basic format are recognized; see the formal file format specification below for more information. Expression data files commonly have the file extensions ".mrna" or ".pvals", and these file extensions are recognized by Cytoscape when browsing for data files.

General Procedure

Load an expression attribute matrix file using File → Import → Attribute/Expression Matrix... to bring up the import window, or by specifying the filename using the -m option at the command line. If you use the command line input, you must enter your expression data by node ID. If you use the dialog box, then you can either load expression data by node ID (the default option), or you can select a node attribute to use in assigning your expression data to your Cytoscape nodes. If you do use a node attribute, then (1) the attribute should already be loaded, and (2) the node attribute value must match the first column in your matrix file.

Worked Example

For the sample network file sampleData/galFiltered.sif:

Option A.

Load a sample gene expression data set by going to File → Import → Attribute/Expression Matrix... . In the resulting window, in the field labeled "Please select an attribute or expression matrix file...", use the Select button to enter sampleData/galExpData.pvals. The identifiers used in this file are the same ones used in the network file sampleData/galFiltered.sif, so you do not need to touch the field labeled "Assign values to nodes using...". A few lines of this file are shown below:

GENE COMMON gal1RG gal4RG gal80R gal1RG gal4RG gal80R
YHR051W COX6 -0.034 0.111 -0.304 3.75720e-01 1.56240e-02 7.91340e-06
YHR124W NDT80 -0.090 0.007 -0.348 2.71460e-01 9.64330e-01 3.44760e-01
YKL181W PRS1 -0.167 -0.233 0.112 6.27120e-03 7.89400e-04 1.44060e-01

Option B.

Step 1. After loading the network, load the node attribute file sampleData/gal.probeset.na, using File → Import → Node attributes... . This file is shown in part below:

Probeset
YHR051W = probeset2
YHR124W = probeset3
YKL181W = probeset4

Step 2. After loading the node attribute file, select the expression data file sampleData/galExpPvals.probeset.pvals, shown in part below:

GENE COMMON gal1RG gal4RG gal80R gal1RG gal4RG gal80R
probeset2 COX6 -0.034 0.111 -0.304 3.75720e-01 1.56240e-02 7.91340e-06
probeset3 NDT80 -0.090 0.007 -0.348 2.71460e-01 9.64330e-01 3.44760e-01
probeset4 PRS1 -0.167 -0.233 0.112 6.27120e-03 7.89400e-04 1.44060e-01

After selecting this file, in the field labeled "Assign values to nodes using...", select Probeset. You will see that this loads exactly the same expression data as in Option A, but provides extra flexibility in case the node name cannot be used as an identifier.
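The lookup that Option B relies on can be sketched as follows. This is a rough Python illustration of the idea (match the first matrix column against a node attribute value, then attach the row to that node), not a description of Cytoscape's internals; the helper names attribute_to_node and rekey_rows are hypothetical.

```python
# Illustrative sketch only; helper names are hypothetical.
NODE_ATTRS = """\
Probeset
YHR051W = probeset2
YHR124W = probeset3
YKL181W = probeset4
"""

EXPRESSION = """\
GENE COMMON gal1RG gal4RG gal80R gal1RG gal4RG gal80R
probeset2 COX6 -0.034 0.111 -0.304 3.75720e-01 1.56240e-02 7.91340e-06
probeset3 NDT80 -0.090 0.007 -0.348 2.71460e-01 9.64330e-01 3.44760e-01
probeset4 PRS1 -0.167 -0.233 0.112 6.27120e-03 7.89400e-04 1.44060e-01
"""


def attribute_to_node(na_text):
    """Map each attribute value (e.g. a probeset ID) back to its node ID."""
    mapping = {}
    for line in na_text.splitlines()[1:]:   # first line is the attribute name
        if "=" not in line:
            continue
        node, value = (part.strip() for part in line.split("=", 1))
        mapping[value] = node
    return mapping


def rekey_rows(expr_text, mapping):
    """Return {node_id: [numeric fields as text]} for rows whose key is known."""
    rows = {}
    for line in expr_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if not fields:
            continue
        node = mapping.get(fields[0])
        if node is not None:                 # rows with unknown keys are skipped
            rows[node] = fields[2:]
    return rows


rekeyed = rekey_rows(EXPRESSION, attribute_to_node(NODE_ATTRS))
print(sorted(rekeyed))
```

After rekeying, each row is addressed by the same node IDs used in Option A, which is why both options load the same data.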

Detailed file format (Advanced users)

In all expression data files, any whitespace (spaces and/or tabs) is considered a delimiter between adjacent fields. Every line of text is either the header line or contains all the measurements for a particular gene. No name conversion is applied to expression data files.

The names given in the first column of the expression data file should match exactly the names used elsewhere (for example, in SIF or GML files).

The first line is a header line with one of the following three header formats:

<text> <text> cond1 cond2 ... cond1 cond2 ... [NumSigConds]

<text> <text> cond1 cond2 ...

<tab><tab>RATIOS<tab><tab>...LAMBDAS

The first format specifies that both expression ratios and significance values are included in the file. The first two text tokens (in angle brackets) are labels for the two gene name columns, such as the formal and common gene names. The condX tokens specify the names of the experimental conditions; these columns will contain ratio values. This list of condition names must then be duplicated exactly, each spelled the same way and in the same order; the duplicated columns will contain the significance values. Optionally, a final column with the title NumSigConds may be present. If present, this column will contain integer values indicating the number of conditions in which each gene had a statistically significant change according to some threshold.

The second format is similar to the first except that the duplicate column names are omitted, and there is no NumSigConds field. This format specifies data with ratios but no significance values.

The third format is an MTX header, a commonly used format. Two tab characters precede the RATIOS token. This token is followed by a number of tabs equal to the number of conditions, followed by the LAMBDAS token. This format specifies both ratios and significance values.
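As a rough illustration of how the three header formats differ, the following Python sketch classifies a header line. It is an assumption-laden example rather than Cytoscape's actual logic: the function name classify_header and the returned labels are invented here, and the MTX check looks only at the RATIOS and LAMBDAS tokens.

```python
# Illustrative sketch only; function name and labels are hypothetical.
def classify_header(line):
    """Return 'mtx', 'ratios+significance', or 'ratios-only'."""
    tokens = line.split()
    # MTX header: only the RATIOS and LAMBDAS tokens, separated by tabs.
    if tokens == ["RATIOS", "LAMBDAS"]:
        return "mtx"

    conds = tokens[2:]                      # skip the two gene-name labels
    if conds and conds[-1] == "NumSigConds":
        conds = conds[:-1]                  # optional trailing column
    half = len(conds) // 2
    # Significance values are present when the condition list is repeated exactly.
    if half > 0 and len(conds) % 2 == 0 and conds[:half] == conds[half:]:
        return "ratios+significance"
    return "ratios-only"


print(classify_header("GENE COMMON gal1RG gal4RG gal80R gal1RG gal4RG gal80R"))
print(classify_header("GENE COMMON gal1RG gal4RG gal80R"))
print(classify_header("\t\tRATIOS\t\t\tLAMBDAS"))
```

Comparing the two halves of the condition list is what makes exact spelling and ordering of the duplicated names matter.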

Each line after the first is a data line with the following format:

FormalGeneName CommonGeneName ratio1 ratio2 ... [lambda1 lambda2 ...] [numSigConds]

The first two tokens are gene names. The names in the first column are the keys used for node name lookup; they should match the names used elsewhere in Cytoscape (for example, in the SIF, GML, or XGMML files). By the convention of the gene expression microarray community, which defined these file formats, the first token is the formal name of the gene (in systems that have a formal naming scheme for genes) and the second is a synonym commonly used by biologists; Cytoscape does not use the common name column. The next columns contain floating point values for the ratios, followed by columns with the significance values if the header line specifies them. The final column, if specified by the header line, should contain an integer giving the number of significant conditions for that gene. Missing values are not allowed and will confuse the parser. For example, two consecutive tabs cannot be used to indicate a missing value: the parser treats both tabs as a single delimiter and cannot parse the line correctly.
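The effect of a missing value can be seen with a short Python sketch. Python's str.split() with no arguments also treats any run of whitespace as one delimiter, so it behaves like the delimiter rule described above; the helper split_row is hypothetical and is not part of Cytoscape.

```python
# Illustrative sketch only; split_row is a hypothetical helper.
def split_row(line, n_conditions):
    """Split a ratios-only data line and check that every condition has a value."""
    fields = line.split()           # runs of spaces/tabs collapse to one delimiter
    ratios = fields[2:]             # skip formal and common gene names
    if len(ratios) != n_conditions:
        raise ValueError(
            f"expected {n_conditions} ratios, found {len(ratios)}: {line!r}"
        )
    return fields[0], [float(x) for x in ratios]


complete = "YHR051W\tCOX6\t-0.034\t0.111\t-0.304"
missing = "YHR051W\tCOX6\t-0.034\t\t-0.304"   # second ratio left empty

print(split_row(complete, 3))

try:
    split_row(missing, 3)
except ValueError as err:
    print("rejected:", err)
```

Because the two tabs collapse, the line with the empty field yields only two ratios, and the value for the third condition silently shifts into the second position unless the column count is checked.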

Optionally, the last line of the file may be a special footer line with the following format:

NumSigGenes int1 int2 ...

This line specifies the number of genes that were significantly differentially expressed in each condition. The first text token must be spelled exactly as shown; the rest of the line should contain one integer value for each experimental condition.
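A minimal Python sketch of recognizing the footer line follows. The function name parse_footer is hypothetical, and the counts in the example line are made-up numbers used only to show the shape of the line, not values from any sample file.

```python
# Illustrative sketch only; parse_footer is a hypothetical helper.
def parse_footer(line, n_conditions):
    """Return the per-condition counts, or None if the line is not a footer."""
    tokens = line.split()
    if not tokens or tokens[0] != "NumSigGenes":   # token must match exactly
        return None
    counts = [int(tok) for tok in tokens[1:]]
    if len(counts) != n_conditions:
        raise ValueError("footer needs one integer per experimental condition")
    return counts


# Made-up counts for three conditions, for illustration only.
print(parse_footer("NumSigGenes 12 7 30", 3))
# An ordinary data line is not a footer.
print(parse_footer("YHR051W COX6 -0.034 0.111 -0.304", 3))
```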

Cytoscape_User_Manual/Expression_Data (last edited 2009-02-12 01:03:03 by localhost)