RFC65: Equation Attributes |
Editor(s): Johannes Ruscheinski |
Date: 2009-12-22 |
Status: Open for comment |
Contents
Proposal
Add the ability to Cytoscape to describe attributes as being derived or computed based on expression which may involve references to other attributes.
Background
This capability would seem to be a natural extension in the attribute browser. Especially for numerical-valued attributes it seems obvious that Cytoscape users might want to derive new quantities based on the values of other attributes. A similar argument holds for boolean attributes.
Use Cases
Allows users to easily colour a node based on a threshold value. Example: Given a Double attribute "someValue" and a threshold called "someThreshold" we could create a new boolean attribute called "exceedsThreshold" which would be defined by an equation attribute that is "$(someValue) > someThreshold".
- Assume we have an attribute called "someAttrib" and we would like to assign colour values based on the log of this attribute's value then we could assign a new new Double attribute "shiftedLog" defined by "log10($(someAttrib)) + 22.7"
- Consider an attribute that has missing values called "maybeMissing" and assume we could like to provide a derived attribute that has a default value of -10.2 whenever "maybeMissing" is not set, then we could create a new attribute called "neverMissing" defined by "$(maybeMissing=-10.2)"
Many more complex examples can be easily construed.
Implementation Plan
- Define a syntax for the expressions (Cull this as a subset from some programming language's grammar?)
Write a parser (LL(1) grammar, recursive descent) implementing the syntax and error handling. (Pay special attention to easy extension with more built-in functions.)
- Test and debug the stand-alone parser
- Identify the location in the Cytoscape code base for integration and integrate the parser
- Test and debug the new capability within Cytoscape
Project Timeline
- Define syntax (est. 1/2 day to 1 day)
- Implement and test stand-alone parser (4 to 5 days)
- Integrate the parser into the Cytoscape code base (3 to 4 days)
Total time: 7 1/2 to 10 days.
Tasks and Milestones
- Define syntax
- Cull a subset of the "C" syntax that corresponds to what we want to implement
- Potentially rearrange parts of the syntax to make it correspond to our precedence requirements
- Check for violations of LL(1) and transform to LL(1), if necessary
- Implement and test stand-alone parser
Write a scanner/tokenizer (GetToken/UngetToken)
- Convert the syntax into a series of recursive function calls
- Add type checking and error handling
- Create unit tests
- Implement the built-in functions (Possibly add the capability to register Java function objects as built-in functions.)
- Test and debug
- Integrate the parser into the Cytoscape code base
- Find where in the Cytoscape code base would be the natural location for integration
- Actually integrate the new code
- Create unit tests for the integrated code
- Test the new code as part of Cytoscape
- Manually test and debug the new code and possibly extend the unit tests
Project Dependencies
This project does not depend on any other projects.
Informal Description of Equation Syntax and Functions
Attribute References
References to existing, possibly empty attributes can take four possible forms:
Unconditional references look like $(attribName). This type of reference results in an error should the attribute "attribName" not exist or if it exists but has no assigned value.
Conditional references look like $(attribName=defaultValue). This type of reference results in an error should the attribute "attribName" not exist or if the type of "defaultValue" is inconsistent with the type of "attribName". If attribName exists but is empty, the default value "defaultValue" will be substituted instead.
Unconditional map references look like $(attribName[key]). This type of reference results in an error should the attribute "attribName" not exist or if the type of "attribName" is not a Map or if "key" is not a valid key for the Map "attribName".
Conditional map references look like $(attribName[key]=defaultValue). This type of reference results in an error should the attribute "attribName" not exist or if "defaultValue" is not compatible with the value type for the Map "attribName".
In the case of Boolean attribute references the two predefined constants true and false will be the only possible default values.
Expressions
Expressions are composed of constants, operators, function calls and attribute references. Their syntax follows that of various programming languages and everyday expectations for arithmetic expressions. The result type of an expression can be either a Double or a Boolean value. Grouping and nesting of subexpressions may be accomplished by using parentheses.
Built-In Functions
Function name |
Description |
Argument Type(s) |
Return Type |
Comments |
abs |
absolute value |
Double/Int |
Double |
|
log10 |
logarithm to base 10 |
Double/Int |
Double |
|
exp |
e raised to a power |
Double/Int |
Double |
|
round |
nearest integer |
Double/Int |
Double |
|
trunk |
returns the integer part of a number, e.g. trunk(-1.3) = trunk(-1.9) = -1 |
Double/Int |
Double |
|
tolower |
convert a string to lowercase |
String |
String |
|
toupper |
convert a string to uppercase |
String |
String |
|
substring |
returns the substring of "string" starting at 0-based offset "startIndex" and of length "length" |
(string: String, startIndex: Int, length: Int) |
String |
This function is robust in that, if startIndex is beyond the end of "string", an empty string will be returned and if "length" would extend beyond the end of "string", as much as possible of "string" will be returned. On the other hand, a negative "startIndex" will result in an error. |
format |
convert an arithmetic value or expression to a string |
(arithmeticExpression: Double/Int, width: Int, precision: Int [,padding: String [, type: (N|NE)]]) |
String |
padding and type are optional and default to " " for padding and NE for type. Padding must be a string literal of length 1 and type can be either E (for exponential) or NE (for non-exponential). Padding will be used to extend a result, if its width is less than the specified width. Use a width of 0 if you do not want any padding. Precision specifies the number of significant digits in the generated string. When using the E format, generated strings will look like "2.3e5" rather than "23000". |
first |
first element of a List |
List |
element type of the list |
Please note that lists of lists or maps of lists etc. will not be supported! |
last |
last element of a List |
List |
element type of the list |
Please note that lists of lists or maps of lists etc. will not be supported! |
nth |
0-based n-th element of a List |
List, Int |
element type of the list |
Please note that lists of lists or maps of lists etc. will not be supported! |
size |
# of elements in a List or Map |
List/Map |
Double |
|
defined |
test for existence of an attribute |
an attribute name |
Boolean |
this will require a special implementation effort as this function is fundamentally different from all other built-in functions |
length |
character count of a String |
String |
Double |
Operators
The usual complement of arithmetic as well as boolean comparison operators will be provided. Operator precedence will probably follow the Excel conventions. Arithmetic operator precedence will follow that of common mathematic, i.e. unary plus and minus having the highest precedence, followed by exponentiation and then the other four arithmetic operators. Comparison operators will have intermediate precedence and boolean operators the lowest precedence.
Symbol |
Description |
Operand Type(s) |
Result Type |
Associativity |
Comments |
"-" |
unary minus |
Double,Int |
Double |
non-associative |
|
"+" |
unary plus |
Double,Int |
Double |
non-associative |
|
"+" |
addition |
Double/Int, Double/Int |
Double |
left |
may result in -Inf or +Inf |
"-" |
subtraction |
Double/Int, Double/Int |
Double |
left |
may result in -Inf or +Inf |
"*" |
multiplication |
Double/Int, Double/Int |
Double |
left |
may result in NaN, +Inf, or -Inf |
"/" |
division |
Double/Int, Double/Int |
Double |
left |
may result in NaN, +Inf, or -Inf |
"^" |
exponentiation |
Double/Int, Double/Int |
Double |
right |
may result in NaN, +Inf, or -Inf |
"==" |
equality |
Double/Int, Double/Int |
Boolean |
non-associative |
|
"==" |
equality |
String, String |
Boolean |
non-associative |
|
"!=" |
inequality |
Double/Int, Double/Int |
Boolean |
non-associative |
|
"!=" |
inequality |
String, String |
Boolean |
non-associative |
|
"not" |
logical inversion |
Boolean |
Boolean |
non-associative |
|
"and" |
logical conjunction |
Boolean, Boolean |
Boolean |
non-associative |
|
"or" |
logical disjunction |
Boolean, Boolean |
Boolean |
non-associative |
|
"<" |
less than |
Double/Int, Double/Int |
Boolean |
non-associative |
|
">" |
greater than |
Double/Int, Double/Int |
Boolean |
non-associative |
|
"<=" |
less than or equal |
Double/Int, Double/Int |
Boolean |
non-associative |
|
">=" |
greater than or equal |
Double/Int, Double/Int |
Boolean |
non-associative |
|
"." |
string concatenation |
String, String |
String |
left |
String Constants a.k.a. String Literals
String literals are sequences of characters enclosed in double-quotes. Double-quotes, newlines or backslashes may be embedded in a string literal by preceding them with a backslash. Arbitrary unicode characters may be embedded in a string with a \uXXXX escape sequence where each X stands for a hexadecimal digit. This allows for the portable embedding of characters from foreign scripts or rarely used English characters with diacritical marks or ligatures etc.
Error Reporting
Type errors or arithmetic errors are reported in the field that contains the equation, for example if the function attempts to add a string and an integer or if an equation references a non-existent or empty attribute without providing a default. (Note: This may need some more thought, for example we may consider to provide an empty value in case of an error.) An important type of error would be circular references which would always be illegal. For efficiency reasons this type of error should be checked for immediately after a user edits an equation.
Java API
The Java API is illustrated by these code fragments: EquationAPI.java
Equation Evaluation
Evaluation Time
Conceptually, at least two choices present themselves:
- Evaluate when the attribute value is being requested. This has the potential problem of being costly if the attribute value will be requested frequently or if the
- expressions that are being evaluated are computationally expensive.
- Evaluate whenever the calculated value changes. This has the problem that any references need to track an attribute so that can notify it when they change.
Choice 1. appears to lead to a far simpler implementation. Furthermore, should an attribute never be referenced, choice 2. would perform unnecessary updates.
Performance Issues
It is very likely that unless users actually use equations that there should be no measurable loss in performance since all that is required is a fast tests of whether an attribute contains an equation or not. This can implemented with a Java type check or simply testing an enum value. Should equation attributes be used we can speed them up as follows:
- Compile the expressions to a simple stack-based language. This could potentially lead to a multiple speed-up and should be very easy to implement.
If 1. is not enough, actual Java byte code could be generated. cglib would seem like a useful library for this purpose. Every time an equation changed, the corresponding byte code would be regenerated.
Equations must be executed in topological order of the inter-equation dependencies. Finally, we can eat and have our cake, too by combining lazy evaluation and caching! When we need an equation's value, we fetch the input values. Then we compare the current input values with the old input values. If they match we return a cached result, otherwise we calculate the expression and store both the result and the input values. This way expressions will only be evaluated the first time that they are needed and their input parameters have changed. The amortised cost of repeated expression evaluation consists of the fetching of the referenced value(s) and the comparison against the stored expression input parameters. This cost will depend on how efficiently we can look up attributes in Cytoscape.
Documentation
The section of the Cytoscape Manual describing the current actual implementation from a user's perspective is Cytoscape_User_Manual/Attribute_Formulas. There is also a section on how to write a simple plug-in to add one's own built-in functions plugin_developer_tutorial#How_to_add_new_attribute_functions_via_a_Cytoscape_plug-in.
Comments
- How would the UI, specifically the attribute browser, look while Cytoscape is computing the values? For example, it might be useful to have a generic progress icon instead of the attribute value while Cytoscape is in the process of computing it value.
Johannes: The computations would be very fast. There is no iteration or recursion. It would be very hard for users to create expressions that require a significant fraction of a second to compute. Should this ever happend we would certainly consider implementing some kind of visual feedback.
- Why not use Jython for a scripting language instead of implementing an entire new language? Perhaps using a well-established scripting language as a basis would be more flexible for adding new features. This would allow people to enter in a script that queries a database to do a mapping from one attribute value to another.
Johannes: This is not a proposal to implement a scripting language. This would only add simple Excel-style expressions. There'd be no control flow, function definitions, recursion, iteration etc. Also calling Jython for "$A + $B" is probably far slower than implementing our own expression evaluation. Furthermore if we stick to common spreadsheet syntax a much larger fraction of people would already be familiar with this.
This is very cool, and will be extremely useful! One of the things that this enables is the possibility of having additional attribute tables beyond the existing 3 tables. This was one of the goals for 3.0 (CyDataTables), but should be reasonably possible in 2.8. Have you thought about adding references to other tables similar to the way Excel allows references to other Worksheets? In the same vein, there are some very useful topological equations that would be really, really nice. For example, summing weights over all attached edges to add a weight to a node, and various counts of attached edges (e.g. degree), etc.
Johannes: No, I have not considered this but am very open to the idea of extending this in all kinds of ways.
- Is there a natural set of 'macros' for dealing with time course data and mapping to visual properties? Say we have n data points at representing n time slices, what's the easiest way to retrieve and visually encode the attribute value at that time point? Would it be to use map references?
Johannes: It is fairly trivial to add more built-in functions and I would be happy to expand on the current set. In fact new functions can also easily implemented in a Cytoscape plugin and registered at runtime.
- I appreciate the reasons for banning circular references, but can the system be designed to allow subclassing in a way that might permit some constrained forms of circular references. This could eventually enable some forms of quantitative modeling with equation solvers. Or should this just be outside the scope of Cytoscape?
Johannes: I am having a lot of fun with this and would love to explore this idea. One concern is performance. With circular references, how many iterations do we want to perform? What are the convergence criteria etc. Still, I'd be happy to think along those lines. The basic engine (compiler plus interpreter) certainly do not impose any restrictions that would make this difficult.
- Gary: sounds like we mainly want to focus on the updating visual styles based on formula use case. The quantitative modeling use case probably requires it's own framework, as it can be quite complex.
MikeSmoot: Agreed.
- Gary: updating the UI with the results of formulas may be the most time consuming part - how often will this be done? Will test cases be developed to evaluate this performance e.g. 10,000 formulas in an attribute column referencing multiple columns?
MikeSmoot: Yes, we've already tried this with 100,000 equations and haven't seen any impact (i.e. the VizMapper is just as fast with equations as it is with normal values). This is pretty counter-intuitive, so perhaps we've missed something, but so far we haven't found a case where equations cause performance problems. Because speed is a big concern we'll be particularly interested in seeing if people are able to come up with test cases where we do see an impact. The good news is that we haven't really done any optimization to the code, so if we do encounter performance problems then we have ideas on how to deal with them and if nothing turns up then we have quite simple code!
- Gary: Is there a create list function? This would take values from other attributes and create a list out of them. This would help the custom graphics feature rely only on single attributes, not multiple attributes.
Johannes: Not yet, but I see no problems in adding one.
How to Comment
Edit the page and add your comments under the provided header. By adding your ideas to the Wiki directly, we can more easily organize everyone's ideas, and keep clear records. Be sure to include today's date and your name for each comment. Try to keep your comments as concrete and constructive as possible. For example, if you find a part of the RFC makes no sense, please say so, but don't stop there. Take the extra step and propose alternatives.