
Hi Stephen, others,

On Friday 23 October 2009 21:54:39 stephen.burke@stfc.ac.uk wrote:
A point just came up about the representation of strings. In the GLUE 2 specification it seems we have no definition of the "string" type, which looks like an oversight (other than an implication in the placeholder section that strings may be UTF-8).
Agreed. This does seem to be an oversight.

My humble suggestion:

GLUE 2 strings are Unicode (see ISO/IEC 10646), but GLUE does not specify how a string is to be represented. A GLUE binding MUST describe which encodings are available for strings.

If the underlying storage has one or more encodings that allow a round-trip (decoding an encoded Unicode string) without any collisions (a collision is when two distinct Unicode strings are the same after a round-trip) then the binding MUST allow only these encodings. If the underlying storage allows multiple collision-less round-trip string encodings then the GLUE binding MAY allow alternative encodings. If one or more of these encodings is a Unicode standard encoding then the GLUE binding SHOULD allow at least one of the available standard Unicode encodings. If UTF-8 is an available encoding then the binding SHOULD allow UTF-8 encoded strings.

If none of the string encodings available from the underlying storage supports a collision-less Unicode round-trip then the binding SHOULD use the encoding that minimises the number of string collisions. The GLUE binding MUST document which Unicode strings have a collision-less round-trip and SHOULD document the expected encoded value for the remaining Unicode strings.

The text is to be included in the GLUE 2.0 errata and in the next revision.
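To make the round-trip and collision wording concrete, here is a rough Python sketch. The helper names round_trip and find_collisions are made up for illustration and are not part of GLUE or of any binding, and Python's ASCII codec with errors="replace" is only a stand-in for a storage encoding that cannot represent every Unicode string.

```python
def round_trip(text: str, encoding: str) -> str:
    """Encode then decode; unrepresentable characters become '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


def find_collisions(strings, encoding):
    """Map each round-trip result to the distinct inputs that produced it,
    keeping only results shared by two or more inputs."""
    groups = {}
    for s in dict.fromkeys(strings):  # drop duplicate inputs, keep order
        groups.setdefault(round_trip(s, encoding), []).append(s)
    return {out: ins for out, ins in groups.items() if len(ins) > 1}


# U+00DF is the German sharp-s, U+00E4 is a-umlaut.
samples = ["\u00df", "\u00e4", "ss"]

ascii_collisions = find_collisions(samples, "ascii")
utf8_collisions = find_collisions(samples, "utf-8")
```

With these samples the 7-bit encoding maps both non-ASCII strings to "?", which is a collision in the sense above, while UTF-8 reports none.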
In the GLUE 2 LDAP schema as currently implemented for EGEE, strings seem to be typed as IA5String, as they were in GLUE 1, which is basically 7-bit ASCII, so special characters are not allowed. Should we be allowing UTF-8 strings?
I've had one user complain that they couldn't include a German sharp-s (double-s or ß) in a name attribute. However, this doesn't really matter for German names: all German "weird" letters have 7-bit ASCII encoded versions ("ß" --> "ss", "ä" --> "ae", etc). I don't believe the same holds for every other language, so I would support adopting UTF-8 encoded strings for the GLUE LDAP binding. (IIRC, they are called "DirectoryString" in LDAP speak.)

Cheers,

Paul.