Diff for "Cytoscape_2.8/EquationAttributes"

Differences between revisions 20 and 21

RFC65: Equation Attributes

Editor(s): Johannes Ruscheinski

Date: 2009-12-22

Status: Open for comment

Contents

Proposal
Use Cases
Implementation Plan
Informal Description of Equation Syntax and Functions
Java API
Equation Evaluation
Comments

Proposal

Add the ability to Cytoscape to describe attributes as being derived or computed based on expression which may involve references to other attributes.

Background

This capability would seem to be a natural extension in the attribute browser. Especially for numerical-valued attributes it seems obvious that Cytoscape users might want to derive new quantities based on the values of other attributes. A similar argument holds for boolean attributes.

Use Cases

Allows users to easily colour a node based on a threshold value. Example: Given a Double attribute "someValue" and a threshold called "someThreshold" we could create a new boolean attribute called "exceedsThreshold" which would be defined by an equation attribute that is "$(someValue) > someThreshold".
Assume we have an attribute called "someAttrib" and we would like to assign colour values based on the log of this attribute's value then we could assign a new new Double attribute "shiftedLog" defined by "log10($(someAttrib)) + 22.7"
Consider an attribute that has missing values called "maybeMissing" and assume we could like to provide a derived attribute that has a default value of -10.2 when ever "maybeMissing" is not set, then we could create a new attribute called "neverMissing" defined by "$(maybeMissing=-10.2)"

Many more complex examples can be easily construed.

Implementation Plan

Define a syntax for the expressions (Cull this as a subset from some programming language's grammar?)
Write a parser (LL(1) grammar, recursive descent) implementing the syntax and error handling. (Pay special attention to easy extension with more built-in functions.)
Test and debug the stand-alone parser
Identify the location in the Cytoscape code base for integration and integrate the parser
Test and debug the new capability within Cytoscape

Project Timeline

Define syntax (est. 1/2 day to 1 day)
Implement and test stand-alone parser (4 to 5 days)
Integrate the parser into the Cytoscape code base (3 to 4 days)

Total time: 7 1/2 to 10 days.

Tasks and Milestones

Define syntax
1. Cull a subset of the "C" syntax that corresponds to what we want to implement
2. Potentially rearrange parts of the syntax to make it correspond to our precedence requirements
3. Check for violations of LL(1) and transform to LL(1), if necessary
Implement and test stand-alone parser
1. Write a scanner/tokenizer (GetToken/UngetToken)
2. Convert the syntax into a series of recursive function calls
3. Add type checking and error handling
4. Create unit tests
5. Implement the built-in functions (Possibly add the capability to register Java function objects as built-in functions.)
6. Test and debug
Integrate the parser into the Cytoscape code base
1. Find where in the Cytoscape code base would be the natural location for integration
2. Actually integrate the new code
3. Create unit tests for the integrated code
4. Test the new code as part of Cytoscape
5. Manually test and debug the new code and possibly extend the unit tests

Project Dependencies

This project does not depend on any other projects.

Informal Description of Equation Syntax and Functions

Attribute References

References to existing, possibly empty attributes can take four possible forms:

Unconditional references look like $(attribName). This type of reference results in an error should the attribute "attribName" not exist or if it exists but has no assigned value.
Conditional references look like $(attribName=defaultValue). This type of reference results in an error should the attribute "attribName" not exist or if the type of "defaultValue" is inconsistent with the type of "attribName". If attribName exists but is empty, the default value "defaultValue" will be substituted instead.
Unconditional map references look like $(attribName[key]). This type of reference results in an error should the attribute "attribName" not exist or if the type of "attribName" is not a Map or if "key" is not a valid key for the Map "attribName".
Conditional map references look like $(attribName[key]=defaultValue). This type of reference results in an error should the attribute "attribName" not exist or if "defaultValue" is not compatible with the value type for the Map "attribName".

In the case of Boolean attribute references the two predefined constants true and false will be the only possible default values.

Expressions

Expressions are composed of constants, operators, function calls and attribute references. Their syntax follows that of various programming languages and everyday expectations for arithmetic expressions. The result type of an expression can be either a Double or a Boolean value. Grouping and nesting of subexpressions may be accomplished by using parentheses.

Built-In Functions

Function name	Description	Argument Type(s)	Return Type	Comments
abs	absolute value	Double/Int	Double
log10	logarithm to base 10	Double/Int	Double
exp	e raised to a power	Double/Int	Double
round	nearest integer	Double/Int	Double
trunk	returns the integer part of a number, e.g. trunk(-1.3) = trunk(-1.9) = -1	Double/Int	Double
tolower	convert a string to lowercase	String	String
toupper	convert a string to uppercase	String	String
substring	returns the substring of "string" starting at 0-based offset "startIndex" and of length "length"	(string: String, startIndex: Int, length: Int)	String	This function is robust in that, if startIndex is beyond the end of "string", an empty string will be returned and if "length" would extend beyond the end of "string", as much as possible of "string" will be returned. On the other hand, a negative "startIndex" will result in an error.
format	convert an arithmetic value or expression to a string	(arithmeticExpression: Double/Int, width: Int, precision: Int [,padding: String [, type: (N\|NE)]])	String	padding and type are optional and default to " " for padding and NE for type. Padding must be a string literal of length 1 and type can be either E (for exponential) or NE (for non-exponential). Padding will be used to extend a result, if its width is less than the specified width. Use a width of 0 if you do not want any padding. Precision specifies the number of significant digits in the generated string. When using the E format, generated strings will look like "2.3e5" rather than "23000".
first	first element of a List	List	element type of the list	Please note that lists of lists or maps of lists etc. will not be supported!
last	last element of a List	List	element type of the list	Please note that lists of lists or maps of lists etc. will not be supported!
nth	0-based n-th element of a List	List, Int	element type of the list	Please note that lists of lists or maps of lists etc. will not be supported!
size	# of elements in a List or Map	List/Map	Double
defined	test for existence of an attribute	an attribute name	Boolean	this will require a special implementation effort as this function is fundamentally different from all other built-in functions
length	character count of a String	String	Double

Operators

The usual complement of arithmetic as well as boolean comparison operators will be provided. Operator precedence will probably follow the C/Java/FORTRAN conventions. Arithmetic operator precedence will follow that of common mathematic, i.e. unary plus and minus having the highest precedence, followed by exponentiation and then the other four arithmetic operators. Comparison operators will have intermediate precedence and boolean operators the lowest precedence.

Symbol	Description	Operand Type(s)	Result Type	Associativity	Comments
"-"	unary minus	Double,Int	Double	non-associative
"+"	unary plus	Double,Int	Double	non-associative
"+"	addition	Double/Int, Double/Int	Double	left	may result in -Inf or +Inf
"-"	subtraction	Double/Int, Double/Int	Double	left	may result in -Inf or +Inf
"*"	multiplication	Double/Int, Double/Int	Double	left	may result in NaN, +Inf, or -Inf
"/"	division	Double/Int, Double/Int	Double	left	may result in NaN, +Inf, or -Inf
"^"	exponentiation	Double/Int, Double/Int	Double	right	may result in NaN, +Inf, or -Inf
"=="	equality	Double/Int, Double/Int	Boolean	non-associative
"=="	equality	String, String	Boolean	non-associative
"!="	inequality	Double/Int, Double/Int	Boolean	non-associative
"!="	inequality	String, String	Boolean	non-associative
"not"	logical inversion	Boolean	Boolean	non-associative
"and"	logical conjunction	Boolean, Boolean	Boolean	non-associative
"or"	logical disjunction	Boolean, Boolean	Boolean	non-associative
"<"	less than	Double/Int, Double/Int	Boolean	non-associative
">"	greater than	Double/Int, Double/Int	Boolean	non-associative
"<="	less than or equal	Double/Int, Double/Int	Boolean	non-associative
">="	greater than or equal	Double/Int, Double/Int	Boolean	non-associative
"."	string concatenation	String, String	String	left

String Constants a.k.a. String Literals

String literals are sequences of characters enclosed in double-quotes. Double-quotes, newlines or backslashes may be embedded in a string literal by preceding them with a backslash. Arbitrary unicode characters may be embedded in a string with a \uXXXX escape sequence where each X stands for a hexadecimal digit. This allows for the portable embedding of characters from foreign scripts or rarely used English characters with diacritical marks or ligatures etc.

Error Reporting

Type errors or arithmetic errors are reported in the field that contains the equation, for example if the function attempts to add a string and an integer or if an equation references a non-existent or empty attribute without providing a default. (Note: This may need some more thought, for example we may consider to provide an empty value in case of an error.) An important type of error would be circular references which would always be illegal. For efficiency reasons this type of error should be checked for immediately after a user edits an equation.

Java API

The Java API is illustrated by these code fragments: EquationAPI.java

Equation Evaluation

Evaluation Time

Conceptually, at least two choices present themselves:

Evaluate when the attribute value is being requested. This has the potential problem of being costly if the attribute value will be request$
Evaluate whenever the calculated value changes. This has the problem that any references need to track an attribute so that can notify it wh$

Choice 1. appears to lead to a far simpler implementation. Furthermore, should an attribute never be referenced, choice 2. would perform unnece$

Performance Issues

It is very likely that unless users actually use equations that there should be no measurable loss in performance since all that is required is $ 1. Compile the expressions to a simple stack-based language. This could potentially lead to a multiple speed-up and should be very easy to impl$ 2. If 1. is not enough, actual Java byte code could be generated. cglib would seem like a useful library for t$ Every time an equation changed, the corresponding byte code would be regenerated. Equations must be executed in topological order of the inter-$

Comments

How would the UI, specifically the attribute browser, look while Cytoscape is computing the values? For example, it might be useful to have a generic progress icon instead of the attribute value while Cytoscape is in the process of computing it value.
Why not use Jython for a scripting language instead of implementing an entire new language? Perhaps using a well-established scripting language as a basis would be more flexible for adding new features. This would allow people to enter in a script that queries a database to do a mapping from one attribute value to another.

How to Comment

Edit the page and add your comments under the provided header. By adding your ideas to the Wiki directly, we can more easily organize everyone's ideas, and keep clear records. Be sure to include today's date and your name for each comment. Try to keep your comments as concrete and constructive as possible. For example, if you find a part of the RFC makes no sense, please say so, but don't stop there. Take the extra step and propose alternatives.

-  ← Revision 20 as of 2010-02-04 20:09:58 →
  Size: 13356
  Editor: scsiapat1
  Comment:
+  ← Revision 21 as of 2010-02-08 17:16:05 →
  Size: 13657
  Editor: barbera
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 124:
+== Equation Evaluation ==



=== Evaluation Time ===

Conceptually, at least two choices present themselves:

 1. Evaluate when the attribute value is being requested.  This has the potential problem of being costly if the attribute value will be request$

 2. Evaluate whenever the calculated value changes. This has the problem that any references need to track an attribute so that can notify it wh$



Choice 1. appears to lead to a far simpler implementation.  Furthermore, should an attribute never be referenced, choice 2. would perform unnece$
-Line 125:
+Line 134:
-It is very likely that unless users actually use equations that there should be no measurable loss in performance.  Should equation-heavy attribute lists lead to noticable slowdowns there are two steps that could be taken:

1. Compile the expressions to a simple stack-based language.  This could potentially lead to a multiple speed-up and should be very easy to implement.

2. If 1. is not enough, actual Java byte code could be generated. [[http://cglib.sourceforge.net/|cglib]] would seem like a useful library for this purpose.

Every time an equation changed, the corresponding byte code would be regenerated.  Equations must be executed in topological order of the inter-attribute dependencies.  To avoid performance problems this order should be redetermined every time a user edits an equation and not at equation evaluation time!
+It is very likely that unless users actually use equations that there should be no measurable loss in performance since all that is required is $

1. Compile the expressions to a simple stack-based language.  This could potentially lead to a multiple speed-up and should be very easy to impl$

2. If 1. is not enough, actual Java byte code could be generated. [[http://cglib.sourceforge.net/|cglib]] would seem like a useful library for t$

Every time an equation changed, the corresponding byte code would be regenerated.  Equations must be executed in topological order of the inter-$