
Hi all,

As an exercise, I've tried to jot down as precise and complete a description of each GLUE storage object as possible, also describing how they relate to each other. I've also tried to do this without any forward-references (so, in theory, the document is readable in a single pass). In almost all cases, I've left out the attributes.

I don't know how useful this is. It's just my point-of-view of things as they stand now. I'm sure there are bits that are "wrong" (either I've misunderstood and/or this description breaks a use-case), but if so, hopefully people can point out which bits are wrong and (perhaps) it will stimulate some discussion.

BTW, I'm implicitly assuming that StorageEnvironment.RetentionPolicy can be multivalued. If this isn't true and we have the use-case of the same physical disks being part of, for example, both Custodial and Output storage, then it starts to get complicated.

As always, comments appreciated.

Cheers,

Paul.

---

UserDomain:

A collection of one or more end-users; a VO is an instance of a UserDomain. All end-users that interact with the physical storage are a member of a UserDomain and, in general, derive their authorisation from that membership.

StorageCapacity:

A StorageCapacity object describes the ability to store data within a homogeneous storage technology. This storage technology provides a common access latency.

All StorageCapacity objects are specified within a certain context. The context is determined by an association between the StorageCapacity object and precisely one other higher-level object. These associations are not listed here, but are described in later sections.

In general, a StorageCapacity object will record some context-specific information. Examples of such information include the total storage capacity of the underlying technology and how much of that total has been used.

The underlying storage technology may affect which of the context-specific attributes are available. For example, tape storage may be considered semi-infinite, so the total and free attributes have no meaning. If this is so, then it affects all StorageCapacity objects with the same underlying technology, independent of their context.

Different contexts may also affect what context-specific attributes are recorded. This is a policy decision when implementing GLUE, as recording all possible information may be costly and provide no great benefit.

[Aside: these two reasons are why many of the attributes within StorageCapacity are optional. Rather than explicitly subclassing the objects and making the values required, it is left deliberately vague which attributes are published.]

A StorageCapacity may represent a logical aggregation of multiple underlying storage technology instances; for example, a StorageCapacity might represent many disk storage nodes, or many tapes stored within a tape silo. GLUE makes no effort to record information at this deeper level; but by not doing so, it requires that the underlying storage technology be homogeneous. Homogeneous means that the underlying storage technology is either identical or sufficiently similar that the differences don't matter.

In most cases, the homogeneity is fairly obvious (e.g., tape storage vs disk-based storage), but there may be times where this distinction becomes contentious and judgement may be required; for example, the quality of disk-based storage might indicate that one subset is useful for a higher-quality service.
If this is so, then it may make sense to represent the different classes of disk by different StorageCapacities.

StorageEnvironment:

A StorageEnvironment is a collection of one or more StorageCapacities with a set of associated (enforced) storage management policies. Examples of these policies are Type (Volatile, Durable, Permanent) and RetentionPolicy (Custodial, Output, Replica).

StorageEnvironments act as a logical aggregation of StorageCapacities, so each StorageEnvironment must have at least one associated StorageCapacity. It is the associated StorageCapacities that allow a StorageEnvironment to store data with its advertised policies; for example, to act as (Permanent, Custodial) storage of data.

Since a StorageEnvironment may contain multiple StorageCapacities, it may describe a heterogeneous environment. An example of this is "tape storage", which has both tape back-end and disk front-end into which users can pin files. Such a StorageEnvironment would have two associated StorageCapacities: one describing the disk storage and another describing the tape.

If a StorageCapacity is associated with a StorageEnvironment, it is associated with only one. A StorageCapacity may not be shared between different StorageEnvironments.

StorageCapacities associated with a StorageEnvironment must be non-overlapping with any other such StorageCapacity and the set of all such StorageCapacities must represent the complete storage available to end-users. Each physical storage device (e.g., individual disk drive or tape) that an end-user can utilise must be represented by (some part of) precisely one StorageCapacity associated with a StorageEnvironment.

Nevertheless, the StorageCapacities associated with StorageEnvironments may be incomplete as a site may deploy physical storage devices that are not directly under end-user control; for example, disk storage used to cache incoming transfers. GLUE makes no effort to record information about such storage.

StorageResource:

A StorageResource is an aggregation of one or more StorageEnvironments and describes the hardware that a particular software instance has under its control.

A StorageResource must have at least one StorageEnvironment, otherwise there wouldn't be much point publishing information about it. [This isn't a strict requirement, but I think it makes sense to include it.]

All StorageEnvironments must be part of precisely one StorageResource. StorageEnvironments may not be shared between StorageResources. This means that all physical hardware must be published under precisely one StorageResource.

StorageShare:

A StorageShare is a logical partitioning of one or more StorageEnvironments.

Perhaps the simplest example of a StorageShare is one associated with a single StorageEnvironment with a single associated StorageCapacity, and that represents all the available storage of that StorageCapacity. An example of a storage that could be represented by this trivial StorageShare is the classic-SE.

StorageSpaces must have one or more associated StorageCapacities. These StorageCapacities provide a complete description of the different homogeneous underlying technologies that are available under the space.

In general, the number of StorageCapacities associated with a StorageShare is the sum of the number of StorageCapacities associated with each of the StorageShare's associated StorageEnvironments.
Following from this, there is an implicit association between the StorageCapacity associated with a StorageShare and the corresponding StorageCapacity associated with a StorageEnvironment. Intuitively, this association is from the fact that the two StorageCapacities share the same underlying physical storage. This implicit association is not recorded in GLUE.

StorageSpaces may overlap. Specifically, given a StorageCapacity (SC_E) that is associated with some StorageEnvironment and which has totalSize TS_E, let the sum of the totalSize attributes for all StorageCapacities that are: 1. associated with a StorageSpace, and 2. that are implicitly associated with SC_E be TS_S. If the StorageSpaces are covering then TS_S = TS_E. If the StorageSpaces overlap, then TS_S > TS_E. [sorry, I couldn't easily describe this with just words without it sounding awful!]

StorageSpaces may be incomplete. Following the same definitions as above, this is when TS_S < TS_E. Intuitively, this happens if the site-admin has not yet assigned all available storage.

End-users within a UserDomain may wish to store or retrieve files. The StorageShares provide a complete, abstract description of the underlying storage at their disposal. No member of a UserDomain may interact with the physical hardware except through a StorageShare.

The partitioning is persistent through file creation and deletion. The totalSize attributes (of a StorageSpace's associated StorageCapacities) do not change as a result of file creation or deletion. [Does GLUE need to stipulate this, or should we leave this vague?]

A single StorageShare may allow multiple UserDomains to access storage; if so, the StorageShare is "shared" between the different UserDomains. Such a shared StorageShare is typical if a site provides storage described by the trivial StorageShare (one that covers a complete StorageEnvironment) whilst supporting multiple UserDomains.

StorageMappingPolicy:

The StorageMappingPolicy describes how a particular UserDomain is allowed to access a particular StorageShare. No member of a UserDomain may interact with a StorageShare except as described by a StorageMappingPolicy.

The StorageMappingPolicies may contain information that is specific to that UserDomain, such as one or more associated StorageCapacities. If provided, these provide a UserDomain-specific view of their usage of the underlying physical storage technology as a result of their usage within the StorageShare. If StorageCapacities are associated with a StorageMappingPolicy, there will be the same number as are associated with the corresponding StorageShare.

StorageEndpoint:

A StorageEndpoint specifies that storage may be controlled through a particular interface. The SRM protocol is an example of such an interface and a StorageEndpoint would be advertised for each instance of SRM.

The access policies describing which users of a UserDomain may use the StorageEndpoint are not published. On observing that a site publishes a StorageEndpoint, one may deduce only that it is valid for at least one user of one supported UserDomain.

StorageAccessProtocol:

A StorageAccessProtocol describes how data may be sent or received. The presence of a StorageAccessProtocol indicates that data may be fetched or stored using this interface.

Access to the interface may be localised; that is, only available from certain computers. It may also be restricted to specified UserDomains. However, neither policy restriction is published in GLUE.
On observing a StorageAccessProtocol, one may deduce only that it is valid for at least one user of one supported UserDomain.

StorageService:

A StorageService is an aggregation of StorageEndpoints, StorageAccessProtocols and StorageResources. It is the top-level description of the ability to transfer files to and from a site, and manipulate the files once stored.
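As an illustrative aside (not part of the schema; the numbers below are invented), the covering/overlapping/incomplete distinction for StorageShares can be sketched as follows:

# Minimal sketch: classify the StorageShares against the StorageCapacity SC_E
# of a StorageEnvironment by comparing totalSize values.

def classify(ts_e, share_capacity_sizes):
    """ts_e: totalSize of SC_E (the Environment-level StorageCapacity).
    share_capacity_sizes: totalSize of each Share-level StorageCapacity
    that is implicitly associated with SC_E."""
    ts_s = sum(share_capacity_sizes)
    if ts_s == ts_e:
        return "covering"      # the Shares exactly account for SC_E
    if ts_s > ts_e:
        return "overlapping"   # some physical storage is counted more than once
    return "incomplete"        # part of SC_E is not yet assigned to any Share

# Example: a 100 TiB disk Capacity divided between two Shares.
print(classify(100, [60, 40]))  # covering
print(classify(100, [80, 40]))  # overlapping
print(classify(100, [30, 40]))  # incomplete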

Hi Paul, various comments inline.
[...]
BTW, I'm implicitly assuming that StorageEnvironment.RetentionPolicy can be multivalued. If this isn't true and we have the use-case of the same physical disks being part of, for example, both Custodial and Output storage, then it starts to get complicated.
I think it is OK if the RetentionPolicy _can_ be multivalued, but in WLCG it would be published single-valued, viz. along with an AccessLatency to describe the Storage Class that is implemented by the Environment.
[...]
StorageEnvironment:
A StorageEnvironment is a collection of one or more StorageCapacities with a set of associated (enforced) storage management policies. Examples of these policies are Type (Volatile, Durable, Permanent) and RetentionPolicy (Custodial, Output, Replica).
Note that we should get rid of the obsolete, confusing Type and Lifetime attributes in favor of the ExpirationMode copied from SRM v3.
[...]
StorageResource:
A StorageResource is an aggregation of one or more StorageEnvironments and describes the hardware that a particular software instance has under its control.
See my reply to Sergio: we may rather want to allow an Environment to be linked to multiple Resources, e.g. a disk and a tape Resource, such that we can publish the back-end implementation name and version for each of them.
A StorageResource must have at least one StorageEnvironment, otherwise there wouldn't be much point publishing information about it. [This isn't a strict requirement, but I think it makes sense to include it.]
OK.
All StorageEnvironments must be part of precisely one StorageResource. StorageEnvironments may not be shared between StorageResources. This means that all physical hardware must be published under precisely one StorageResource.
See my comment above.
StorageShare:
A StorageShare is a logical partitioning of one or more StorageEnvironments.
Perhaps the simplest example of a StorageShare is one associated with a single StorageEnvironment with a single associated StorageCapacity, and that represents all the available storage of that StorageCapacity. An example of a storage that could be represented by this trivial StorageShare is the classic-SE.
StorageSpaces must have one or more associated StorageCapacities.
       ^^^^^^
       Shares
These StorageCapacities provide a complete description of the different homogeneous underlying technologies that are available under the space.
In general, the number of StorageCapacities associated with a StorageShare is the sum of the number of StorageCapacities associated with each of the StorageShare's associated StorageEnvironments.
Following from this, there is an implicit association between the StorageCapacity associated with a StorageShare and the corresponding StorageCapacity associated with a StorageEnvironment. Intuitively, this association is from the fact that the two StorageCapacities share the same underlying physical storage. This implicit association is not recorded in GLUE.
StorageSpaces may overlap. Specifically, given a StorageCapacity
       ^^^^^^
       Shares
(SC_E) that is associated with some StorageEnvironment and which has totalSize TS_E,
let the sum of the totalSize attributes for all
^^^
let TS_S be .....
StorageCapacities that are:
1. associated with a StorageSpace, and
                            ^^^^^
                            Share
2. that are implicitly associated with SC_E
be TS_S. If the StorageSpaces are covering then TS_S = TS_E. If
^^^^^^^                ^^^^^^
XXXXXXX                Shares
the StorageSpaces overlap, then TS_S > TS_E.
           ^^^^^^
           Shares
[sorry, I couldn't easily describe this with just words without it sounding awful!]
StorageSpaces may be incomplete. Following the same definitions
       ^^^^^^
       Shares
as above, this is when TS_S < TS_E. Intuitively, this happens if the site-admin has not yet assigned all available storage.
End-users within a UserDomain may wish to store or retrieve files. The StorageShares provide a complete, abstract description of the underlying storage at their disposal. No member of a UserDomain may interact with the physical hardware except through a StorageShare.
The partitioning is persistent through file creation and deletion.
The totalSize attributes (of a StorageSpace's associated StorageCapacities)
                                      ^^^^^
                                      Share
do not change as a result of file creation or deletion. [Does GLUE need to stipulate this, or should we leave this vague?]
Why mention it at all? You do not make statements about the behavior of the other sizes, and I think there is no need to go there...
A single StorageShare may allow multiple UserDomains to access storage; if so, the StorageShare is "shared" between the different UserDomains. Such a shared StorageShare is typical if a site provides storage described by the trivial StorageShare (one that covers a complete StorageEnvironment) whilst supporting multiple UserDomains.
[...]
StorageAccessProtocol:
A StorageAccessProtocol describes how data may be sent or received. The presence of a StorageAccessProtocol indicates that data may be fetched or stored using this interface.
Access to the interface may be localised; that is, only available from certain computers. It may also be restricted to specified UserDomains. However, neither policy restriction is published in GLUE. On observing a StorageAccessProtocol, one may deduce only that it is valid for at least one user of one supported UserDomain.
..... from at least one computer. Thanks, Maarten

Hi Maarten,

Thanks for the comments; my comments are interleaved below.

I've updated the document, but do people feel this is useful? We could: fold the information into the GLUE 2.0 spec, keep it as an informative (non-normative) document, or drop it as being too confusing?

On Monday 31 March 2008 01:48:16 Maarten.Litmaath@cern.ch wrote:
[...]
BTW, I'm implicitly assuming that StorageEnvironment.RetentionPolicy can be multivalued. If this isn't true and we have the use-case of the same physical disks being part of, for example, both Custodial and Output storage, then it starts to get complicated.
I think it is OK if the RetentionPolicy _can_ be multivalued, but in WLCG it would be published single-valued, viz. along with an AccessLatency to describe the Storage Class that is implemented by the Environment.
OK, I've added a paragraph on that.
[...] StorageEnvironment:
A StorageEnvironment is a collection of one or more StorageCapacities with a set of associated (enforced) storage management policies. Examples of these policies are Type (Volatile, Durable, Permanent) and RetentionPolicy (Custodial, Output, Replica).
Note that we should get rid of the obsolete, confusing Type and Lifetime attributes in favor of the ExpirationMode copied from SRM v3.
Should we do this with GLUE v2.0? (I would be happy with that).
[...] A StorageResource is an aggregation of one or more StorageEnvironments and describes the hardware that a particular software instance has under its control.
See my reply to Sergio: we may rather want to allow an Environment to be linked to multiple Resources, e.g. a disk and a tape Resource, such that we can publish the back-end implementation name and version for each of them.
Yes, I agree the bit about Environment being hosted in a single Resource is wrong; for example, dCache (a StorageResource) and TSM (a StorageResource) together host "D1T1" (a StorageEnvironment), which has a disk-based StorageCapacity and a tape-based StorageCapacity. I'll try to reword that bit. [...]
StorageShare: [...]
StorageSpaces must have one or more associated StorageCapacities.
       ^^^^^^
       Shares
Yes, I'm not sure what happened there: too many words beginning with "S", I guess. [...]
(SC_E) that is associated with some StorageEnvironment and which has totalSize TS_E,
let the sum of the totalSize attributes for all
^^^
let TS_S be .....
Err, I think that one should be TS_E: the totalSize of the StorageShare associated with the Environment (the one "underneath" the StorageEnvironment). In this context the StorageShare represents all of the physical medium.

Somehow, accurately describing what overlapping and incomplete StorageShares means takes a lot of words!

[...]
do not change as a result of file creation or deletion. [Does GLUE need to stipulate this, or should we leave this vague?]
Why mention it at all? You do not make statements about the behavior of the other sizes, and I think there is no need to go there...
Well, we could just not mention it; however, I'm a little concerned about tacit assumptions, and how not everyone has the same set. I'd hope we can make all the assumptions explicit.

In this particular case, there's (at least) a couple of models for how a space could work:

a. partitioning: I'm allocating 10 TiB, I can store files up to that size of data (*). Once I've written 10 TiB of data, I can delete files to create more space. This is like having a 10 TiB hard disk to store data.

(*) real-life systems have some complications, but considering an idealised storage system.

b. consumable: I'm allocated 10 TiB storage. I can store files up to that size of data. I can delete the files if I like, but deleting files doesn't allow me to store more files. Once I've used up that 10 TiB of storage, I have to ask for more.

(perhaps option b. seems a little crazy, but it might be how StorageShares would work with archival WORM media).

It seems that everyone in HEP assumes a partitioning model but I don't think I've seen it stated anywhere. Other communities might assume a different model. If we want information to be sharable (or mergable) I think we should state clearly any assumptions we're making. If we don't assume anything, that should be noted too, so people consuming information know that either: 1. they can't assume what happens, or 2. if they assume something, that it's a WLCG convention that they must revisit when combining data with other information sources.

[...]
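To make the difference concrete, here is a minimal, idealised sketch of the two models (invented numbers; purely illustrative, not a schema proposal):

class PartitionedShare:
    """Model (a): deleting files frees space that can be reused."""
    def __init__(self, total):
        self.total, self.used = total, 0
    def write(self, size):
        if self.used + size > self.total:
            raise IOError("share full")
        self.used += size
    def delete(self, size):
        self.used -= size           # freed space becomes available again

class ConsumableShare:
    """Model (b): the allocation is consumed once written, e.g. WORM media."""
    def __init__(self, total):
        self.remaining = total
    def write(self, size):
        if size > self.remaining:
            raise IOError("allocation exhausted; ask for more")
        self.remaining -= size
    def delete(self, size):
        pass                        # deletion does not return any capacity

# With a 10 TiB allocation, after writing 10 and deleting 5:
# a PartitionedShare can accept 5 more; a ConsumableShare cannot accept any.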
On observing a StorageAccessProtocol, one may deduce only that it is valid for at least one user of one supported UserDomain.
..... from at least one computer.
True :) I've added that, too. Cheers, Paul.

Hi Paul,
I've updated the document, but do people feel this is useful?
We could: fold the information into the GLUE 2.0 spec,
That would be ideal. Note that it will need to be adapted if we cannot reach agreement on some statement it currently makes.
keep it as an informative (non-normative) document, or drop it as being too confusing?
[...]
A StorageEnvironment is a collection of one or more StorageCapacities with a set of associated (enforced) storage management policies. Examples of these policies are Type (Volatile, Durable, Permanent) and RetentionPolicy (Custodial, Output, Replica).
Note that we should get rid of the obsolete, confusing Type and Lifetime attributes in favor of the ExpirationMode copied from SRM v3.
Should we do this with GLUE v2.0?
Yes! The ExpirationMode was invented to clear up the confusion that has arisen from the (mis)use of Type/Lifetime for other purposes.
(I would be happy with that).
[...]
(SC_E) that is associated with some StorageEnvironment and which has totalSize TS_E,
let the sum of the totalSize attributes for all
^^^
let TS_S be .....
Err, I think that one should be TS_E: the totalSize of the StorageShare associated with the Environment (the one "underneath" the StorageEnvironment). In this context the StorageShare represents all of the physical medium.
Well, you wrote this before:

| let the sum of the totalSize attributes for all
| StorageCapacities that are:
| 1. associated with a StorageSpace, and
| 2. that are implicitly associated with SC_E
| be TS_S.

I suggested this instead:

| let TS_S be the sum of the totalSize attributes for all
| StorageCapacities that are:
| 1. associated with a StorageShare, and
| 2. that are implicitly associated with SC_E
| .

Spot the 3 differences! It reads more easily...
Somehow, accurately describing what overlapping and incomplete StorageShares means takes a lot of words!
[...]
do not change as a result of file creation or deletion. [Does GLUE need to stipulate this, or should we leave this vague?]
Why mention it at all? You do not make statements about the behavior of the other sizes, and I think there is no need to go there...
Well, we could just not mention it; however, I'm a little concerned about tacit assumptions, and how not everyone has the same set. I'd hope we can make all the assumptions explicit.
In this particular case, there's (at least) a couple of models for how a space could work:
a. partitioning: I'm allocating 10 TiB, I can store files up to that size of data (*). Once I've written 10 TiB of data, I can delete files to create more space. This is like having a 10 TiB hard disk to store data.
(*) real-life systems have some complications, but considering an idealised storage system.
b. consumable: I'm allocated 10 TiB storage. I can store files up to that size of data. I can delete the files if I like, but deleting files doesn't allow me to store more files. Once I've used up that 10 TiB of storage, I have to ask for more.
(perhaps option b. seems a little crazy, but it might be how StorageShares would work with archival WORM media).
That example proves my point: one cannot predict/prescribe how a particular size attribute will be affected when a file is stored or deleted!
It seems that everyone in HEP assumes a partitioning model but I don't think I've seen it stated anywhere. Other communities might assume a different model. If we want information to be sharable (or mergable) I think we should state clearly any assumptions we're making. If we don't assume anything, that should be noted too, so people consuming information know that either: 1. they can't assume what happens, or 2. if they assume something, that it's a WLCG convention that they must revisit when combining data with other information sources.
We must explicitly mention that in general any of the various sizes may be an approximation whose exact behavior depends on the underlying technology, implementation, load, concurrent usage, ... A grid may impose strict(er) requirements on the behavior. GLUE just provides a "coat rack" for information. Thanks, Maarten

glue-wg-bounces@ogf.org [mailto:glue-wg-bounces@ogf.org] On Behalf Of Maarten.Litmaath@cern.ch said:
That would be ideal. Note that it will need to be adapted if we cannot reach agreement on some statement it currently makes.
Just as a quick comment, I've been on holiday and I'm currently in the middle of a three-day python course, but I'll try to comment on this discussion by the end of the week, and I should be able to join the phone meetings. Stephen

glue-wg-bounces@ogf.org [mailto:glue-wg-bounces@ogf.org] On Behalf Of Paul Millar said:
StorageSpaces must have one or more associated StorageCapacities.
       ^^^^^^
       Shares
Yes, I'm not sure what happened there: too many words beginning with "S", I guess.
Well, possibly more than that ... I think there may be a misunderstanding here. In the current (1.3) schema there is a general intent that SAs map to spaces (space tokens) - that was somewhat controversial but seems to be broadly right, e.g. look at what RAL is currently publishing for its SRM 2s (one SA per space token) which to me seems entirely natural. In any case, VOInfos do *not* map to spaces, they map to space token *descriptions* (Tags) - potentially you could have one space shared by multiple VOs, each of which would have their own Tag (and/or Path) for the same space.

For Glue 2 the VOInfo seems to have turned into the Share, at least judging by the attributes. However, that means that a Share is *not* a space, and your slip above seems to suggest that that's the way you're thinking. If several VOs shared a space there would be several Shares - basically one Share per MappingPolicy, not one per space, although a MappingPolicy might consist of multiple rules.

If anything represents a space in the SRM sense it would be the Environment - but as I said in the previous mail that's become a bit unclear since Environment now has no real attributes, just a type which represents a storage class, so we're left with the ambiguity (which we also ended up with in 1.3) of whether one Environment represents all spaces of a given class, or whether you have one per space (or maybe something else?).
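As a purely illustrative sketch of that reading (the space token, tags and paths below are invented): one SRM space shared by two VOs would show up as two Shares, one per MappingPolicy, each carrying its own VO-specific Tag for the same space.

# Illustrative only: one space, two VOs, hence two Shares (one per
# MappingPolicy), each with its own space token *description* (Tag).
space = {"space_token": "SPACE-42", "storage_class": "Disk1Tape0"}

shares = [
    {"space": space, "tag": "ATLAS_DATA", "mapping_policy": "VO:atlas",
     "path": "/data/atlas"},
    {"space": space, "tag": "LHCB_DATA", "mapping_policy": "VO:lhcb",
     "path": "/data/lhcb"},
]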
Somehow, accurately describing what overlapping and incomplete StorageShares means takes a lot of words!
Maybe because we are trying to represent distinct concepts with the same words, and maybe the same Glue objects? There are at least two different underlying concepts: physical disk and tape systems, and logical spaces (reservations) which may span many instances of the physical storage, may be contained within them, and may be shared between VOs or not. (And which might perhaps be nested ...) This was always a problem in glue and we never dealt with it.

For example, take a classic SE with a file system organised like:

/data/atlas
/data/lhcb
...

each of which would be published as one SA. However, in terms of the physical disk servers you might have one big server mounted at /data, n smaller ones mounted at /data/atlas, /data/lhcb, ... or m even smaller ones mounted at /data/atlas/mc, /data/atlas/cosmic/, ... or any combination of those, and we had no way to describe those variations in the schema. With SRM we no longer worry about that level of detail, but the general distinction between physical and logical space is still there.

Stephen

Hi Stephen, Thanks for the comments. I've added some, too :-)
For Glue 2 the VOInfo seems to have turned into the Share, at least judging by the attributes. However, that means that a Share is *not* a space, and your slip above seems to suggest that that's the way you're thinking. If several VOs shared a space there would be several Shares - basically one Share per MappingPolicy,
not one per space, although a MappingPolicy might consist of multiple rules.
Generally, I think that comparisons between Glue 1.x and 2.0 are helpful only to some extent. Concepts of 1.3 do not fit into 2.0 and therefore the basis for comparisons is not really given. It's more important to agree on what we want to express.

In the case of the VOInfo, the information can be expressed by the StorageMappingPolicy, which describes how a VO may utilize a Share representing storage space. Both keep StorageCapacity info to give the opportunity to distinguish more than one viewpoint of accounting. For example, all sizes of mapping policies pointing to one share might not sum up to the total size of the share. On the other hand, all associated mapping policies generally should see the SAME free space as specified in the Share. The Share also keeps capacity information in order not to run over all MappingPolicies and count the (e.g.) usedSizes. There is also the case of having no free space in the associated MappingPolicies but some free space published in the Share - not sure if this makes sense. However, I think that GLUE should not judge these specialities. They should be treated by the instances which use the info.
If anything represents a space in the SRM sense it would be the Environment - but as I said in the previous mail that's become a bit unclear since Environment now has no real attributes, just a type which represents a storage class, so we're left with the ambiguity (which we also ended up with in 1.3) of whether one Environment represents all spaces of a given class, or whether you have one per space (or maybe something else?).
The environment linked to the capacity was introduced to give summarized stats about the total spaces of all types of environments. This does not necessarily mean that all used spaces of the linked shares should sum up to the (total) used space of the environment. On the contrary, it would be odd if you would publish a total size for the environment which is lower than the sum of the associated shares. This is a known problem.

If you mean by SRM space something which is accessible via a SpaceTokenDescription then the first way is to put it into the share. All associated MappingPolicies would then see the same 'default' STD and path (unless you specify your own 'non-default' path/STD in a MappingPolicy).
Maybe because we are trying to represent distinct concepts with the same words, and maybe the same Glue objects? There are at least two different underlying concepts: physical disk and tape systems, and logical spaces (reservations) which may span many instances of the physical storage, may be contained within them, and may be shared between VOs or not. (And which might perhaps be nested ...) This was always a problem in glue and we never dealt with it. For example, take a classic SE with a file system organised like:
/data/atlas
/data/lhcb
...
each of which would be published as one SA. However, in terms of the physical disk servers you might have one big server mounted at /data, n smaller ones mounted at /data/atlas, /data/lhcb, ... or m even smaller ones mounted at /data/atlas/mc, /data/atlas/cosmic/, ... or any combination of those, and we had no way to describe those variations in the schema. With SRM we no longer worry about that level of detail,
but the general distinction between physical and logical space is still there.
I understand your use case, but is this an information system use case for the future? I would say that this is rather a deployment issue than something which needs to be published for the middleware/user to take care of. In the case above you would store, on the server side, all incoming files under /data/atlas on server1, /data/cms/ on server2, etc. Can this be done like this or am I completely wrong?

Cheers, Felix

Felix Nikolaus Ehm [mailto:Felix.Ehm@cern.ch] said:
Generally, I think that comparisons between Glue 1.x and 2.0 are helpful only to some extent. Concepts of 1.3 do not fit into 2.0 and therefore the basis for comparisons is not really given.
That may be true, but it should at least be raising a flag to look more carefully. In this case the Share as defined in the current draft *can't* be something that could be shared between VOs, because it has Tag as an attribute and the Tag names (space token descriptions) are VO-specific, so if people are conceiving it as something that can be shared (i.e. an SRM space or equivalent) then something is wrong. (For classic SE use the same applies to the Path, typically different VOs would have different Paths even if the space is shared). That also suggests to me that ExpirationMode probably should not be in the Share, although I'm not entirely sure how SRM works - is the ExpirationMode a property of a space, or can it be VO-specific like the Tag?
For example, all sizes of mapping policies pointing to one share might not sum up to the total size of the share.
Can you give a concrete example? To me this doesn't make sense: an SRM space is a single entity, and anyone authorised to use it shares that space. The only way you could subdivide it would be to reserve sub-spaces within the space, and then you'd need a more complex description. If you're saying that the *used* space could be subdivided according to who actually owns the files then I think that's the wrong way to go, then you're getting into detailed accounting which is not the purpose of Glue (among other things the owners don't necessarily map to the rules in the MappingPolicy).
The environment linked to the capacity was introduced to give summarized stats about the total spaces of all types of environments.
Is there actually a good reason to want to know that? To me it's still much more useful to have the information either per space, or summarised for the entire SE. In WLCG it makes less difference anyway as we only use three classes - the only real difference is having disk space usage separated between Disk1Tape1 and Disk1Tape0 as opposed to the combined value you get for the whole SE. What is the use case to need that? At the very least I think you'd want it broken down by VO - that was the compromise we officially ended up with for 1.3 although personally I still think it's a mistake.
If you mean by SRM space something which is accessible via a SpaceTokenDescription then the first way is to put it into the share.
We should be clear about this: spaces are accessible by space token descriptions, but that is *not* unique: one space (token) can have many space token descriptions (typically one per VO).
All associated MappingPolicies would then see the same 'default' STD and path (unless you specify your own 'non-default' path/STD in a MappingPolicy).
??? I don't think I understand that at all. A Mapping policy is basically an ACL, i.e. something like VO:atlas or VOMS:/cms/Role=Production. I wouldn't be inclined to say that MappingPolicies "see" anything, it's the other way round: a share sees (or has) a mapping policy, i.e. a rule to say who can use it. (Would you be inclined to say that some particular unix file permission "sees" all the directories which have that permission?)
I understand your use case but is this an information system use case for the future? I would say that this is rather a deployment issue than something which needs to be published so that the middleware/user has to take care of.
Actually it's a use case for the past, we just never satisfied it! For classic SEs it is in principle something the middleware needs to know about. Say you have a VO Path of /data/atlas, and you publish a free space value for that. If you actually have two disk servers mounted at /data/atlas/mc and /data/atlas/data then that information isn't enough, e.g. /data/atlas/mc may be full even if the total free space is still large.

Whether we need something similar in the SRM world, i.e. whether you need to know the underlying hardware situation as well as the logical "space token" view, I'm not sure - but if we don't then why are we considering publishing at least some of it, and why are we having this discussion about completeness/covering at all?!

Stephen
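A tiny sketch (invented numbers) of why the aggregate free-space value for /data/atlas is not enough on its own:

# Invented numbers: the published free space for the VO Path /data/atlas
# aggregates two underlying disk servers, one of which is already full.
servers = {"/data/atlas/mc": {"free": 0}, "/data/atlas/data": {"free": 5000}}
published_free = sum(s["free"] for s in servers.values())   # 5000 "free"
# Yet a write destined for /data/atlas/mc will fail: that server has 0 free.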

Paul Millar wrote:
Hi all,
Hi Paul, *, [..]
I don't know how useful this is. It's just my point-of-view of things as they stand now. I'm sure there are bits that are "wrong" (either I've misunderstood and/or this description breaks a use-case), but if so, hopefully people can point out which bits are wrong and (perhaps) it will stimulate some discussion.
IMO it is useful. We know from past experience that different communities interpret different things, er, differently. Sometimes that is useful - it makes the schema reusable and adaptable - but for a service provider, publishing the same attributes with different semantics is a nightmare. I think it's an excellent attempt and I found it useful (it just took me some time before I had time to parse it :-) [..]
UserDomain:
A collection of one or more end-users; a VO is an instance of a UserDomain. All end-users that interact with the physical storage are a member of a UserDomain and, in general, derive their authorisation from that membership.
If we do it like this then we should use the hierarchical feature in UserDomain:

UserDomain : WLCG       UserDomain : NGS        UserDomain : Diamond
     |                       |                       |
UserDomain : LHCb       UserDomain : biomed     UserDomain : Pr234
     |                       |                       |
UserDomain : prod       UserDomain : NHS        UserDomain : Beamline

In general, I think the high level entity is more useful than a low level entity such as the VO. Within WLCG for example, most SE info is available to all VOs in the sense that they (SEs) publish information that the VOs know how to make sense of. Outside WLCG (yes, there are people not in WLCG), the same schema could be used but e.g. with different attributes published (e.g. some req'd by WLCG could be left blank for NGS). The UserDomain could be consulted by the information consumer to check whether they should even probe further.
StorageCapacity:
A StorageCapacity object describes the ability to store data within a homogeneous storage technology. This storage technology provides a common access latency.
We used to call this a StorageComponent. http://storage.esc.rl.ac.uk/GLUE-SE-1.3-input-1.03.pdf It was debated when we discussed 1.3 whether it was even useful to expose this level of detail to users via the information system, but I would suggest in the interim we have some users who do care.
All StorageCapacity objects are specified within a certain context. The context is determined by an association between the StorageCapacity object and precisely one other higher-level object. These associations are not listed here, but are described in later sections.
What if I have an
In general, a StorageCapacity object will record some context-specific information. Examples of such information include the total storage capacity of the underlying technology and how much of that total has been used.
The underlying storage technology may affect which of the context-specific attributes are available. For example, tape storage may be considered semi-infinite, so the total and free attributes have no meaning. If this is so, then it affects all StorageCapacity objects with the same underlying technology, independent of their context.
That is too prescriptive (OK, it was just an example). For another example, some customers pay directly for the media used; they will want to know how much space is available on the tapes. Better just to leave the space thingies as optional attributes. In any case, we know the concepts of "free" and "used" (etc) are so difficult to pin down that each User Domain may well have its own interpretations. Which is why the UserDomain attr could be handy.
Different contexts may also affect what context-specific attributes are recorded. This is a policy decision when implementing GLUE, as recording all possible information may be costly and provide no great benefit.
Mm hm. Which I think is why the User Domain is a good thing if we can use it for that.
[Aside: these two reasons are why many of the attributes within StorageCapacity are optional. Rather than explicitly subclassing the objects and making the values required, it is left deliberately vague which attributes are published.]
Yep, domains will define what they want and what it means.
A StorageCapacity may represent a logical aggregation of multiple underlying storage technology instances; for example, a StorageCapacity might represent many disk storage nodes, or many tapes stored within a tape silo. GLUE makes no effort to record information at this deeper level; but by not doing so, it requires that the underlying storage technology be homogeneous. Homogeneous means that the underlying storage technology is either identical or sufficiently similar that the differences don't matter.
All that is really required is homogeneity in attributes. For example, a community that does not care about AccessLatency being published (if one such exists) would see disk and tape as homogeneous.
In most cases, the homogeneity is fairly obvious (e.g., tape storage vs disk-based storage), but there may be times where this distinction becomes contentious and judgement may be required; for example, the quality of disk-based storage might indicate that one subset is useful for a higher-quality service. If this is so, then it may make sense to represent the different classes of disk by different StorageCapacities.
Yes, but the GLUE schema should not prescribe how to do this - provide attributes for communities to publish the most common capabilities but leave it to the communities to define what they mean. GLUE could include examples but they must not be normative.
StorageEnvironment:
A StorageEnvironment is a collection of one or more StorageCapacities with a set of associated (enforced) storage management policies. Examples of these policies are Type (Volatile, Durable, Permanent) and RetentionPolicy (Custodial, Output, Replica).
I completely agree with Maarten here, we need to get away from the old overloaded names (even as examples! unless we put it in big fat letters that their use is deprecated).

ExpirationMode : releaseWhenExpired, warnWhenExpired, neverExpire.
RetentionPolicy : Replica, Output, Custodial.

Note that these come from the SRM world (it is a bug in SRM 2.2 that the old volatile etc. names were retained). These names may again be meaningless to other communities, who may still wish to publish values for these. It may be better to provide attributes for ExpirationMode and RetentionPolicy (etc) and explain what they're for, but not to prescribe the values.
StorageEnvironments act as a logical aggregation of StorageCapacities, so each StorageEnvironment must have at least one associated StorageCapacity. It is the associated StorageCapacities that allow a StorageEnvironment to store data with its advertised policies; for example, to act as (Permanent, Custodial) storage of data.
OK
Since a StorageEnvironment may contain multiple StorageCapacities, it may describe a heterogeneous environment. An example of this is "tape storage", which has both tape back-end and disk front-end into which users can pin files. Such a StorageEnvironment would have two associated StorageCapacities: one describing the disk storage and another describing the tape.
We've had this case before. In that case, the StorageEnvironment should publish the minimal capabilities which it can support (or leave them blank, leaving it to the client to go and query the StorageCapacities). For example, if a StorageEnvironment contains both tape and disk, its AccessLatency should be Nearline - as it is the lowest common denominator. This only makes sense if you have a partial order on capabilities.
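A minimal sketch of that "lowest common denominator" selection, assuming the ordering Online < Nearline < Offline (whether this convention is the right one is questioned further down the thread):

# Sketch only: publish the Environment-level AccessLatency as the slowest
# latency of its constituent Capacities, under the assumed ordering.
ORDER = {"Online": 0, "Nearline": 1, "Offline": 2}

def environment_latency(capacity_latencies):
    return max(capacity_latencies, key=ORDER.__getitem__)

print(environment_latency(["Online", "Nearline"]))  # -> Nearline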
If a StorageCapacity is associated with a StorageEnvironment, it is associated with only one. A StorageCapacity may not be shared between different StorageEnvironments.
OK
StorageCapacities associated with a StorageEnvironment must be non-overlapping with any other such StorageCapacity and the set of all such StorageCapacities must represent the complete storage available to end-users. Each physical storage device (e.g., individual disk drive or tape) that an end-user can utilise must be represented by (some part of) precisely one StorageCapacity associated with a StorageEnvironment.
OK, except we shouldn't say "it must represent the complete storage available to end users" - it's up to the information publisher. We may have "secret" storage available, via endpoints communicated by other means.
Nevertheless, the StorageCapacities associated with StorageEnvironments may be incomplete as a site may deploy physical storage devices that are not directly under end-user control; for example, disk storage used to cache incoming transfers. GLUE makes no effort to record information about such storage.
Of course.
StorageResource:
A StorageResource is an aggregation of one or more StorageEnvironments and describes the hardware that a particular software instance has under its control.
Ummm. I would avoid using the word "hardware" here, or at least I find it confusing. For example, at RAL, CASTOR does not have exclusive use of the tapestore.
A StorageResource must have at least one StorageEnvironment, otherwise there wouldn't be much point publishing information about it. [This isn't a strict requirement, but I think it makes sense to include it.]
I would be less concerned about publishing an empty StorageResource - if a top level BDII publishes the resource and gathers the Environments from lower level BDIIs, it may at some point find itself publishing an empty StorageResource, e.g. during maintenance.
All StorageEnvironments must be part of precisely one StorageResource. StorageEnvironments may not be shared between StorageResources. This means that all physical hardware must be published under precisely one StorageResource.
OK.
StorageShare:
A StorageShare is a logical partitioning of one or more StorageEnvironments.
OK.
Perhaps the simplest example of a StorageShare is one associated with a single StorageEnvironment with a single associated StorageCapacity, and that represents all the available storage of that StorageCapacity. An example of a storage that could be represented by this trivial StorageShare is the classic-SE.
StorageSpaces must have one or more associated StorageCapacities. These StorageCapacities provide a complete description of the different homogeneous underlying technologies that are available under the space.
OK.
In general, the number of StorageCapacities associated with a StorageShare is the sum of the number of StorageCapacities associated with each of the StorageShare's associated StorageEnvironments.
You sort of contradict yourself in this para and in some of the paras below, so I have tried to summarise the conclusion:

Observation [**]
1. StorageEnvironments fully partition the StorageCapacities [that is: each StorageCapacity belongs to one and only one StorageEnvironment]
2. Each StorageShare contains one or more StorageCapacities
3. A StorageShare is associated with a StorageEnvironment if and only if they contain a common StorageCapacity.
Following from this, there is an implicit association between the StorageCapacity associated with a StorageShare and the corresponding StorageCapacity associated with a StorageEnvironment. Intuitively, this association is from the fact that the two StorageCapacities share the same underlying physical storage. This implicit association is not recorded in GLUE.
Well except the StorageShare has a unique id. Also I understood there is a 1..* - 1..* association between the Share and the Environment. s/StorageSpace/StorageShare/, cf Maarten's email. (6 occurrences)
StorageShares may overlap. Specifically, given a StorageCapacity (SC_E) that is associated with some StorageEnvironment and which has totalSize TS_E, let the sum of the totalSize attributes for all StorageCapacities that are: 1. associated with a StorageShare, and 2. that are implicitly associated with SC_E be TS_S. If the StorageShares are covering then TS_S = TS_E. If the StorageShares overlap, then TS_S > TS_E. [sorry, I couldn't easily describe this with just words without it sounding awful!]
See [**] above. Note 2 is still a cover - even if the StorageShares overlap they still _cover_ the set of all StorageCapacities. If $X$ is a space (e.g. a topological space) then $U:=\{A_i\subseteq X\vert i\in I\}$ is said to be a cover of $X$ if $\bigcup_{i\in I}A_i=X$.
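For contrast, in the same notation (a standard definition, not something from the schema): a cover $U$ is a partition of $X$ only if its members are also pairwise disjoint, i.e.

$\bigcup_{i\in I}A_i=X$ and $A_i\cap A_j=\emptyset$ for all $i\neq j$,

so overlapping StorageShares still give a cover of the StorageCapacities, but not a partition.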
StorageShares may be incomplete. Following the same definitions as above, this is when TS_S < TS_E. Intuitively, this happens if the site-admin has not yet assigned all available storage.
See [**] above.
End-users within a UserDomain may wish to store or retrieve files. The StorageShares provide a complete, abstract description of the underlying storage at their disposal. No member of a UserDomain may interact with the physical hardware except through a StorageShare.
In this case it makes sense to use hierarchical UserDomains.
The partitioning is persistent through file creation and deletion.
Which partitioning? The StorageShares do not partition anything in general.
The totalSize attributes (of a StorageShare's associated StorageCapacities) do not change as a result of file creation or deletion. [Does GLUE need to stipulate this, or should we leave this vague?]
This is actually not necessarily true: if I start adding files to the share, it may expand because the storage system chooses to add more Capacities to it.
A single StorageShare may allow multiple UserDomains to access storage; if so, the StorageShare is "shared" between the different UserDomains. Such a shared StorageShare is typical if a site provides storage described by the trivial StorageShare (one that covers a complete StorageEnvironment) whilst supporting multiple UserDomains.
This is getting complicated if the UserDomains themselves are hierarchical!
StorageMappingPolicy:
The StorageMappingPolicy describes how a particular UserDomain is allowed to access a particular StorageShare. No member of a UserDomain may interact with a StorageShare except as described by a StorageMappingPolicy.
This is also too prescriptive. Surely it is not up to GLUE to mandate rules for how storage systems are used. For example, I may publish a general read-only access rule for the VO, but a subset of the VO may have write access. I should not have to publish that explicitly.
The StorageMappingPolicies may contain information that is specific to that UserDomain, such as one or more associated StorageCapacities. If provided, these provide a UserDomain-specific view of their usage of the underlying physical storage technology as a result of their usage within the StorageShare.
I agree, this is consistent with how I think UserDomains should be used.
If StorageCapacities are associated with a StorageMappingPolicy, there will be the same number as are associated with the corresponding StorageShare.
This probably needs more careful checking. For example, a StorageShare can be contained in more than one StorageShare and those StorageShares can themselves be associated with different UserDomains.
StorageEndpoint:
A StorageEndpoint specifies that storage may be controlled through a particular interface. The SRM protocol is an example of such an interface and a StorageEndpoint would be advertised for each instance of SRM.
Yep.
The access policies describing which users of a UserDomain may use the StorageEndpoint are not published. On observing that a site publishes a StorageEndpoint, one may deduce only that it is valid for at least one user of one supported UserDomain.
That should be OK - as long as the endpoints themselves are interpreted the same way by all users - which seems reasonable.
StorageAccessProtocol:
A StorageAccessProtocol describes how data may be sent or received. The presence of a StorageAccessProtocol indicates that data may be fetched or stored using this interface.
Yep.
Access to the interface may be localised; that is, only available from certain computers. It may also be restricted to specified UserDomains. However, neither policy restriction is published in GLUE. On observing a StorageAccessProtocol, one may deduce only that it is valid for at least one user of one supported UserDomain.
Where did the network description go? We used to have one. The idea is that certain protocols can be used only locally, or on certain networks.

For example, a single StorageElement can have a range of GridFTP data movers on the WAN, a LAN protocol internally, and an OPN link which accepts UDP-based high speed data transfer like the astronomers use. If you are a local job you can ask it "do you support gridftp" and it would say yes, but you cannot necessarily access the GridFTP data movers from the worker nodes - and it would be less efficient than the LAN protocol.

I think we need to put it back, and StorageAccessProtocol seems to me the more obvious location.
StorageService: A StorageService is an aggregation of StorageEndpoints, StorageAccessProtocols and StorageResources. It is the top-level description of the ability to transfer files to and from a site, and manipulate the files once stored.
OK. Cheers --jens

Hi Jens,
Where did the network description go? We used to have one.
The idea is that certain protocols can be used only locally, or on certain networks.
For example, a single StorageElement can have a range of GridFTP data movers on the WAN, a LAN protocol internally, and an OPN link which accepts UDP-based high speed data transfer like the astronomers use.
If you are a local job you can ask it "do you support gridftp" and it would say yes, but you cannot necessarily access the GridFTP data movers from the worker nodes - and it would be less efficient than the LAN protocol.
I think we need to put it back, and StorageAccessProtocol seems to me the more obvious location.
We discussed it a few meetings ago and felt that it overly complicated the schema for the amount of gain in the short/medium term. For example, insecure RFIO and DCAP are published without restrictions, and in practice this is not a real problem today. Since we want to converge on 2.0 ASAP, we felt such enhancements were better considered for 2.1. Would that be OK for you? Thanks, Maarten

Hi Jens,

Since this is a quite specific use case, you might want to consider putting this characteristic into 'OtherInfo' of the StorageAccessProtocol. (I still need to add this field.) Would this be sufficient?

Cheers, Felix

---
Felix Ehm, IT-GD
tel : +41 22 7674580
CERN, Switzerland

glue-wg-bounces@ogf.org [mailto:glue-wg-bounces@ogf.org] On Behalf Of Felix Nikolaus Ehm said:
Since this is a quite specific use case, you might want to consider putting this characteristic into 'OtherInfo' of the StorageAccessProtocol. (I still need to add this field.) Would this be sufficient?
I don't think that would be the right solution: although it might be specific to a small number of *sites*, it would potentially affect all *users* of those sites. As we know very well publishing information is no use if nothing looks at it! If we really want this to be supported at all I think it would need to be in the schema properly so it could be built in to client tools etc. Stephen

Maarten Litmaath wrote:
Hi Jens,
Where did the network description go? We used to have one. [..]
I think we need to put it back, and StorageAccessProtocol seems to me the more obvious location.
We discussed it a few meetings ago and felt that it overly complicated the schema for the amount of gain in the short/medium term. For example, insecure RFIO and DCAP are published without restrictions, and in practice this is not a real problem today. Since we want to converge on 2.0 ASAP, we felt such enhancements were better considered for 2.1. Would that be OK for you?
So for those of you not in the telcon today, I learnt that the CESEBind aims to solve some of the same use cases, so we should aim to resolve CESEBind first (to be discussed in the telcon tomorrow at 12:00 UTC). And the idea was certainly not to introduce an overly complex solution, if it cannot be solved with a few attrs or at most one extra class then it's not worth doing. Maybe the CESEBind is this class. Cheers --jens

Still going back over the previous mails ...

glue-wg-bounces@ogf.org [mailto:glue-wg-bounces@ogf.org] On Behalf Of Jensen, J (Jens) said:
It may be better to provide attributes for ExpirationMode and RetentionPolicy (etc) and explain what they're for, but not to prescribe the values.
I think we have two possibilities there: either allow open enumerations for the values, as you suggest, or define those attributes to be SRM-specific and provide different attributes for non-SRM use. The first one is obviously easier, but might cause problems e.g. if a different technology also had a Replica QOS but meant something different by it. However, I guess we could just insist that the name mustn't clash. Since we don't really have any other examples we can't do much more. (Presumably even future SRM versions might have more/different categories?)
For example, if a StorageEnvironment contains both tape and disk, its AccessLatency should be Nearline - as it is the lowest common denominator.
This only makes sense if you have a partial order on capabilities.
In LCG that would clearly be wrong, we *define* Disk1Tape1 to mean Online! I think this needs more thought. Similarly you could argue that any SE which can provide Custodial storage must be good enough for Replica - but you may not want to use it for that if it costs more, and maybe the SRM does not in fact accept requests for Replica spaces (or does it have to?).
3. A StorageShare is associated with a StorageEnvironment if and only if they contain a common StorageCapacity.
I can't resist pointing out that this language implies that a Capacity has an identity - what does "common" mean? The thing which is common is the hardware (Datastore), not the Capacity, indeed the value of the capacity (number of bytes) is likely to be different (the Share usually doesn't fill the whole storage).
Where did the network description go? We used to have one.
The idea is that certain protocols can be used only locally, or on certain networks.
It was considered to be too complicated to represent all the possibilities (potentially everything could depend on everything else!).
I think we need to put it back, and StorageAccessProtocol seems to me the more obvious location.
Maybe, but we need a concrete use case which is likely to occur (and be important) in the real world. Stephen
participants (6)

- Burke, S (Stephen)
- Felix Nikolaus Ehm
- Jensen, J (Jens)
- Maarten Litmaath
- Maarten.Litmaath@cern.ch
- Paul Millar