Strings

stephen.burke＠stfc.ac.uk

23 Oct 2009

7:54 p.m.

Hi, A point just came up about the representation of strings. In the GLUE 2 specification it seems we have no definition of the "string" type, which looks like an oversight (other than an implication in the placeholder section that strings may be UTF-8). In the GLUE 2 LDAP schema as currently implemented for EGEE, strings seem to be typed as IA5String as they were in glue 1, which is basically 7-bit ascii so special characters are not allowed. Should we be allowing UTF-8 strings? Stephen -- Scanned by iCritical.

Paul Millar

26 Oct 2009

11:34 a.m.

Hi Stephen, others, On Friday 23 October 2009 21:54:39 stephen.burke@stfc.ac.uk wrote:

...

A point just came up about the representation of strings. In the GLUE 2 specification it seems we have no definition of the "string" type, which looks like an oversight (other than an implication in the placeholder section that strings may be UTF-8).

Agreed. This does seem to be an oversight. My humble suggestion:

GLUE 2 strings are Unicode (see ISO/IEC 10646), but GLUE does not specify how a string is to be represented. A GLUE binding MUST describe which encodings are available for strings.

If the underlying storage has one or more encodings that allow a round-trip (decoding an encoded Unicode string) without any collisions (a collision is when two distinct Unicode strings are the same after a round-trip) then the binding MUST allow only these encodings. If the underlying storage allows multiple collision-less round-trip string encodings then the GLUE binding MAY allow alternative encodings. If one or more of these encodings is a Unicode standard encoding then the GLUE binding SHOULD allow at least one of the available standard Unicode encodings. If UTF-8 is an available encoding then the binding SHOULD allow UTF-8 encoded strings.

If none of the string encodings available from the underlying storage support a collision-less Unicode round-trip then the binding SHOULD use the encoding that minimises the number of string collisions. The GLUE binding MUST document which Unicode strings have a collision-less round-trip and SHOULD document the expected encoded value for the remaining Unicode strings.

The text to be included in Glue 2.0 errata and included in the next revision.
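As an illustration of the round-trip and collision terms used in the suggestion above, here is a minimal Python sketch. The function names and sample strings are made up for this example, and replacing unencodable characters with "?" is only one assumed way a storage layer might behave.

```python
def round_trip(text: str, encoding: str) -> str:
    """Encode text for storage and decode it again.

    Characters the encoding cannot represent are replaced with '?',
    which is an assumption about how a lossy store might behave.
    """
    return text.encode(encoding, errors="replace").decode(encoding)


def find_collisions(strings, encoding):
    """Return groups of distinct strings that become identical after a round-trip."""
    groups = {}
    for s in dict.fromkeys(strings):  # drop exact duplicates, keep order
        groups.setdefault(round_trip(s, encoding), []).append(s)
    return {stored: originals
            for stored, originals in groups.items()
            if len(originals) > 1}


samples = ["ausgeführt", "ausgefährt", "ausgefuhrt"]
print(find_collisions(samples, "utf-8"))  # UTF-8 round-trips every sample
print(find_collisions(samples, "ascii"))  # the two umlaut variants collide
```

Under these assumptions UTF-8 is a collision-less encoding for the samples, while 7-bit ASCII is not.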

...

In the GLUE 2 LDAP schema as currently implemented for EGEE, strings seem to be typed as IA5String as they were in glue 1, which is basically 7-bit ascii so special characters are not allowed. Should we be allowing UTF-8 strings?

I've had one user complain that they couldn't include a German sharp-s (double-s or ß) in a name attribute. However, this doesn't really matter for German names: all German "weird" letters have 7-bit ASCII encoded versions ("ß" --> "ss", "ä" --> "ae", etc). I believe the same isn't true for all other languages, so I would support adopting UTF-8 encoded strings for the GLUE LDAP binding. (IIRC, they are called "DirectoryString" in LDAP speak.) Cheers, Paul.
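For reference, the conventional German ASCII spellings (ä to ae, ö to oe, ü to ue, ß to ss) can be applied mechanically. The sketch below is only an illustration in Python; the table and function name are invented here and nothing like it is part of GLUE or of any info provider.

```python
# Conventional German transliterations into 7-bit ASCII.
GERMAN_TO_ASCII = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_german(text: str) -> str:
    """Replace German umlauts and sharp-s with their ASCII spellings."""
    return text.translate(GERMAN_TO_ASCII)


print(transliterate_german("Straße"))           # Strasse
print(transliterate_german("wird ausgeführt"))  # wird ausgefuehrt
```

Characters from other languages pass through unchanged, so this does not by itself guarantee 7-bit output.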

stephen.burke＠stfc.ac.uk

12:07 p.m.

Paul Millar [mailto:paul.millar@desy.de] said:

...

The text to be included in Glue 2.0 errata and included in the next revision.

These sound reasonable to me. However, for our current implementation technologies do we know if there is in fact a problem with using UTF-8 everywhere?

...

I've had one user complain that they couldn't include a German sharp-s (double-s or ß) in a name attribute. However, this doesn't really matter for German names: all German "weird" letters have 7-bit ASCII encoded versions ("ß" --> "ss", "ä" --> "ae", etc).

As some people may have seen, the particular problem that triggered this was a German-localised output from a unix "service xxx status", so even if there are alternative spellings that doesn't mean that you'll get them without some special translation. (Actually a google search finds http://www.manticmoo.com/articles/jeff/programming/perl/converting-from-utf8... which looks like a pretty good quick fix if it works.) Stephen PS I have to say I'm a bit surprised in retrospect that I didn't see this coming until we hit a real example, especially after the long discussion about non-standard characters in DNs a year or so back! -- Scanned by iCritical.

Paul Millar

1:47 p.m.

On Monday 26 October 2009 13:07:31 stephen.burke@stfc.ac.uk wrote:

...

Paul Millar [mailto:paul.millar@desy.de] said:

...
The text to be included in Glue 2.0 errata and included in the next revision.

These sound reasonable to me.

Ta.

...

However, for our current implementation technologies do we know if there is in fact a problem with using UTF-8 everywhere?

I know of no problems with switching to UTF-8. From [1], there are two printable characters that are incompatible:

  Code   IA5String     UTF-8 (and ASCII)
  0x24   (currency)    Dollar
  0x7E   (over-line)   Tilde

[1] http://www.zytrax.com/tech/ia5.html

Since information is updated periodically from UTF-8 (or, perhaps, ASCII) LDIF data, any problem with this transition should be short-lived.

...

...
[Snip: encoding German names]

As some people may have seen, the particular problem that triggered this was a German-localised output from a unix "service xxx status", so even if there are alternative spellings that doesn't mean that you'll get them without some special translation. (Actually a google search finds http://www.manticmoo.com/articles/jeff/programming/perl/converting-from-utf8-to-ascii.php which looks like a pretty good quick fix if it works.)

I don't know the details here but I'd imagine that, if we supported UTF-8 then publishing arbitrary UTF-8 information would just work. Irrespective of encoding issues, (and with the benefit of hindsight ;-) I'm not sure publishing the values returned from running commands on a machine (e.g., the result of "service xxx status") as computer-interpretable values is such a good idea. The output could be from some i18n software, which could be localised to their local language. Wouldn't this force GLUE clients to understand all possible languages? To my mind, it would be better to publish values taken from a (short) list of acceptable values and to choose the value from the return-code of executing commands (or something similar). If the published value is the name of something (e.g., GlueSEName) then there isn't the same problem since it doesn't have to be machine understandable.
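The idea of choosing a published value from a short list, based on the return code instead of the command's text output, could look roughly like the Python sketch below. The status names are hypothetical, and the mapping assumes the init script follows the LSB convention for "status" exit codes (0 running, 1 and 2 dead, 3 not running); real services may not.

```python
import subprocess

# Hypothetical closed vocabulary of publishable values, keyed by exit code.
# Assumes LSB-style "status" exit codes; this is not taken from the GLUE spec.
STATUS_BY_EXIT_CODE = {
    0: "running",
    1: "dead",
    2: "dead",
    3: "stopped",
}


def status_from_exit_code(code: int) -> str:
    """Map an exit code to one of a fixed set of language-neutral values."""
    return STATUS_BY_EXIT_CODE.get(code, "unknown")


def service_status(name: str) -> str:
    """Run 'service <name> status' and ignore its (possibly localised) text."""
    result = subprocess.run(
        ["service", name, "status"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return status_from_exit_code(result.returncode)
```

Because only the exit code is used, the locale of the machine no longer affects what is published.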

...

PS I have to say I'm a bit surprised in retrospect that I didn't see this coming until we hit a real example, especially after the long discussion about non-standard characters in DNs a year or so back!

Indeed! Paul.

stephen.burke＠stfc.ac.uk

2:23 p.m.

Paul Millar [mailto:paul.millar@desy.de] said:

...

I don't know the details here but I'd imagine that, if we supported UTF-8 then publishing arbitrary UTF-8 information would just work.

Hopefully yes, but right now the LDAP schema uses IA5, and even if we change glue 2 we'll probably leave glue 1 alone (?). Unfortunately that particular trick doesn't seem to work, it translates

  globus-gridftp-server (PID 3522) wird ausgefÃ¼hrt...

to

  globus-gridftp-server (PID 3522) wird ausgeführt...

which is all well and good, but if you feed it

  globus-gridftp-server (PID 3522) wird ausgeführt...

you get

  Wide character in print at encode.pl line 6, <> line 5.
  globus-gridftp-server (PID 3522) wird ausgefï¿½hrt...

Sigh ... however it looks like something along the lines of "tr/\0-\x{10ffff}/\0-\x7f?/;" may at least provide a minimal solution.
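For comparison, a rough Python counterpart of that tr expression is sketched below: every character outside 7-bit ASCII becomes "?". The function name is invented for this example, and it treats IA5String as plain ASCII, which glosses over the two code points where the tables differ.

```python
def to_seven_bit(text: str) -> str:
    """Replace every character outside 7-bit ASCII with '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


line = "globus-gridftp-server (PID 3522) wird ausgeführt..."
print(to_seven_bit(line))
# globus-gridftp-server (PID 3522) wird ausgef?hrt...
```

Applying the function to its own output changes nothing, so it is safe to run more than once. It is lossy, though: distinct inputs can end up as the same output.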

...

The output could be from some i18n software, which could be localised to their local language. Wouldn't this force GLUE clients to understand all possible languages?

By clients do you mean computers or people? In many cases, including this one, string attributes are designed to be human-readable and I'm not sure we should in general be forcing everyone to use English ... clearly if things are supposed to be digested by a program, e.g. the various enumerated lists, then you can't easily localise them without losing interoperability. Stephen -- Scanned by iCritical.

Paul Millar

3:02 p.m.

On Monday 26 October 2009 15:23:06 stephen.burke@stfc.ac.uk wrote:

...

Paul Millar [mailto:paul.millar@desy.de] said:

...
I don't know the details here but I'd imagine that, if we supported UTF-8 then publishing arbitrary UTF-8 information would just work.

Hopefully yes, but right now the LDAP schema uses IA5, and even if we change glue 2 we'll probably leave glue 1 alone (?).

That sounds reasonable: we can sell Glue 2 on its i18n :)

...

Unfortunately that particular trick doesn't seem to work, it translates [...]

I'm guessing you're typing stuff into the command line here, right? Also, it would be useful if you could pipe the output through "hexdump -C". Diagnosing these kinds of problems is hard when some programs are "helpfully" mapping strings back into UTF-8 (e.g., the email program, the terminal, Perl, etc). So, ü (u with dots) is Unicode 00FC, which (according to my terminal) is C3 BC in UTF-8. Misinterpreting this 2-byte sequence as Latin-1 would give Ã¼. I'm not sure where the upside-down question-mark ½-symbol comes from, though.
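The byte-level reasoning above can be reproduced in a few lines of Python. The last step is only a possible explanation for the ï¿½ sequence, offered as an assumption: it matches the Unicode replacement character U+FFFD when its UTF-8 bytes are read as Latin-1, but nothing in the thread confirms that this is what happened.

```python
u_umlaut = "\u00fc"  # ü

# UTF-8 encodes U+00FC as the two bytes C3 BC.
utf8_bytes = u_umlaut.encode("utf-8")
print(utf8_bytes.hex())  # c3bc

# Reading those two bytes as Latin-1 gives two separate characters.
misread = utf8_bytes.decode("latin-1")
print(misread)  # Ã¼

# Assumption: the replacement character U+FFFD (UTF-8 bytes EF BF BD),
# misread as Latin-1, yields the three characters seen in the error output.
replacement_misread = "\ufffd".encode("utf-8").decode("latin-1")
print(replacement_misread)  # ï¿½
```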

...

Wide character in print at encode.pl line 6, <> line 5. globus-gridftp-server (PID 3522) wird ausgefï¿½hrt...

This looks like a perl problem. Does the program know it's getting UTF-8 input?

...

...
The output could be from some i18n software, which could be localised to their local language. Wouldn't this force GLUE clients to understand all possible languages?

By clients do you mean computers or people? In many cases, including this one, string attributes are designed to be human-readable and I'm not sure we should in general be forcing everyone to use English

That's what I was wondering: if it's for local human consumption then supporting l10n text is desirable.

...

... clearly if things are supposed to be digested by a program, e.g. the various enumerated lists, then you can't easily localise them without losing interoperability.

Yup, agreed! Cheers, Paul.

stephen.burke＠stfc.ac.uk

3:12 p.m.

Paul Millar [mailto:paul.millar@desy.de] said:

...

I'm guessing you're typing stuff into the command line here, right?

Effectively yes - the thing that gave the error was just a cut-and-paste from the previous output, i.e. it seems that the decode() function is not idempotent (if that's the right word). Anyway I don't really want to spend a lot of time understanding the intricacies of unicode, or indeed perl, I just want some way to ensure that what I print is a valid IA5String so the schema validation doesn't throw it out.

...

This looks like a perl problem. Does the program know it's getting UTF-8 input?

It seems to be the opposite - with unicode input it seems to translate it sensibly, but fed its own output it gives an error. (Modulo whatever the terminal emulator is doing with the characters ...) Stephen -- Scanned by iCritical.

stephen.burke＠stfc.ac.uk

12:12 p.m.

New subject: Placeholder for IDs?

...

(other than an implication in the placeholder section that strings may be UTF-8).

While I'm at it, another thing that came up recently was that we don't seem to have a placeholder for unique IDs, e.g. a mandatory foreign key. Do you have any suggestions? (So far I went for "$UNDEFINED$" in the absence of inspiration ...) Stephen -- Scanned by iCritical.

David Horat

12:55 p.m.

New subject: Placeholder for IDs?

In my opinion, there should not exist any placeholder for Foreign Keys. If a FK is mandatory, you need to publish it. If not, just make them optional. If you put a placeholder in a FK, then the word mandatory has no real meaning. On Mon, Oct 26, 2009 at 1:12 PM, <stephen.burke@stfc.ac.uk> wrote:

...

...
(other than an implication in the placeholder section that strings may be UTF-8).

While I'm at it, another thing that came up recently was that we don't seem to have a placeholder for unique IDs, e.g. a mandatory foreign key. Do you have any suggestions? (So far I went for "$UNDEFINED$" in the absence of inspiration ...)

Stephen

-- David Horat Software Engineer – IT/GD – Grid Deployment Group CERN – European Organization for Nuclear Research » Where the web was born Address: 1211 Geneva - Switzerland, Office: 28/R-003 Phone +41 22 76 77996 Fax +41 22 76 68178 (fax to email service) Web: http://cern.ch/horat Web: http://davidhorat.com/ Profile: http://linkedin.com/in/davidhorat

stephen.burke＠stfc.ac.uk

1:13 p.m.

New subject: Placeholder for IDs?

David Horat [mailto:david.horat@cern.ch] said:

...

In my opinion, there should not exist any placeholder for Foreign Keys. If a FK is mandatory, you need to publish it. If not, just make them optional. If you put a placeholder in a FK, then the word mandatory has no real meaning.

It's really the other way around: if an attribute is mandatory you have to publish *something*, even if for some reason the info provider is unable to determine the correct value. In the particular case I encountered (Policy -> UserDomain) I think it is a mistake for it to be mandatory, but there are other cases, e.g. ToComputingService -> ComputingService, where it clearly is correct for it to be mandatory but nevertheless the info provider may for some reason fail to determine the correct ID. You then want a value which is clearly erroneous so you can detect the failure. Also for example the way the gip works is to have a template file, with values which should be changed by the dynamic provider. If that provider fails the values in the template will become visible, and again you want to use something which is easily detectable. Stephen -- Scanned by iCritical.
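To show why an obviously erroneous value is useful, here is a small Python sketch of a check that flags attributes still carrying a placeholder. The attribute names and the entry layout are hypothetical; only the "$UNDEFINED$" value comes from this thread.

```python
# Placeholder values to look for; "$UNDEFINED$" is the one mentioned above.
PLACEHOLDERS = {"$UNDEFINED$"}


def find_placeholder_attributes(entry: dict) -> list:
    """Return the names of attributes whose values include a placeholder."""
    return sorted(
        name
        for name, values in entry.items()
        if any(value in PLACEHOLDERS for value in values)
    )


# Hypothetical published entry: attribute name -> list of values.
entry = {
    "PolicyScheme": ["basic"],
    "UserDomainForeignKey": ["$UNDEFINED$"],
    "ComputingServiceForeignKey": ["urn:example:service-1"],
}
print(find_placeholder_attributes(entry))  # ['UserDomainForeignKey']
```

A check like this only works if the placeholder can never be a legitimate value, which is why an ordinary-looking word is a poor choice.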

Paul Millar

2 p.m.

New subject: Placeholder for IDs?

On Monday 26 October 2009 13:12:55 stephen.burke@stfc.ac.uk wrote:

...

...
(other than an implication in the placeholder section that strings may be UTF-8).

While I'm at it, another thing that came up recently was that we don't seem to have a placeholder for unique IDs, e.g. a mandatory foreign key. Do you have any suggestions? (So far I went for "$UNDEFINED$" in the absence of inspiration ...)

Well, if we're talking GLUE 2.0, then IDs are URIs, so section A.11 (page 64) should apply. If we're talking Glue 1.3, then $UNDEFINED$ is fine. Cheers, Paul.

stephen.burke＠stfc.ac.uk

2:51 p.m.

New subject: Placeholder for IDs?

Paul Millar [mailto:paul.millar@desy.de] said:

...

Well, if we're talking GLUE 2.0, then IDs are URIs, so section A.11 (page 64) should apply.

Hmm, I see a can of worms opening here :) Basic question, are our IDs supposed to have a scheme, and if so what? So far I've assumed that the URI spec doesn't require one: http://tools.ietf.org/html/rfc2396#appendix-A Anyway I don't entirely see that A.11 helps since the placeholders vary depending on the scheme. In this particular case we have an extra few worms since the UserDomain ID is (probably) going to be the VO name, and things like UNDEFINED or UNDEFINEDVALUE are in fact valid VO names, albeit they would be a bit eccentric. You might object that they aren't especially unique, but then nor is "atlas" ... in EGEE I suspect we will avoid all this by the simple expedient of ignoring UserDomains completely, but unfortunately the schema currently marks the UserDomain relation as mandatory so you're forced to have the reference even if there is not, and will never be, an object to refer to! Stephen PS You could have a similar problem with the AdminDomainID - I doubt that there will be a site called UNDEFINED but again there is probably nothing to say that it isn't allowed. -- Scanned by iCritical.
