ComputingService and Endpoints, a point of view

Hi all,

I've been thinking about the ComputingService <- exposes -> Endpoints association, and whether it is possible to implement it based on the UML diagrams contained in GFD1.47 (see the figures on pages 7 and 22).

I think the UML diagram doesn't clarify what happens to the <- exposes -> association with respect to inheritance when it comes to the specialized classes ComputingService and ComputingEndpoint. What we have is:

    Service <- exposes -> Endpoints
    ComputingService <- exposes -> ComputingEndpoints

If we write this as a piece of object-oriented pseudocode, we have:

    Class Service {
        // a list of exposed Endpoints
        Endpoint MyEndpoints
    }

    Class ComputingService extends Service {
        // a list of exposed ComputingEndpoints
        ComputingEndpoint MyComputingEndpoints
    }

What happens with inheritance wrt <- exposes -> ? I think there can be two implementations of a ComputingService constructor:

1) ComputingService includes the <- exposes -> from Service, that is, an instance of ComputingService would have both MyEndpoints and MyComputingEndpoints:

    ComputingService( EndpointList someEndpoints, ComputingEndpointList someComputingEndpoints ) {
        this.MyEndpoints = someEndpoints;
        this.MyComputingEndpoints = someComputingEndpoints;
    }

2) ComputingService overrides the <- exposes -> from Service, that is, an instance of ComputingService has:

    ComputingService( ComputingEndpointList someComputingEndpoints ) {
        this.MyEndpoints = someComputingEndpoints
    }

I favour 1), as it makes it possible to have Endpoints inside a ComputingService, which would give better readability and clarity for Endpoints which are NOT ComputingEndpoints.

These two implementations are, in my opinion, both possible, as UML does not really scope associations with inheritance.

What do you think? Which one should we favour? I am really annoyed at calling an Endpoint that has nothing to do with computing a ComputingEndpoint!

Thanks for your answers,
--
Florido Paganelli
Lund University - Particle Physics
ARC Middleware
EMI Project
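For readers who prefer something executable, the two constructor options in the message above can be sketched in Python. This is only an illustration under stated assumptions: the class names AdditiveComputingService and OverridingComputingService, the attribute names and the example IDs are made up here and do not come from the GLUE specification or from any middleware.

```python
class Endpoint:
    def __init__(self, endpoint_id):
        self.id = endpoint_id


class ComputingEndpoint(Endpoint):
    pass


class Service:
    def __init__(self, endpoints=()):
        # "exposes": the generic Endpoints of this Service
        self.endpoints = list(endpoints)


class AdditiveComputingService(Service):
    """Option 1 (hypothetical name): keep Service.endpoints and add a second list."""

    def __init__(self, endpoints=(), computing_endpoints=()):
        super().__init__(endpoints)
        self.computing_endpoints = list(computing_endpoints)


class OverridingComputingService(Service):
    """Option 2 (hypothetical name): the inherited list may hold only ComputingEndpoints."""

    def __init__(self, computing_endpoints=()):
        computing_endpoints = list(computing_endpoints)
        for ep in computing_endpoints:
            if not isinstance(ep, ComputingEndpoint):
                raise TypeError(f"{ep.id} is not a ComputingEndpoint")
        super().__init__(computing_endpoints)


# Illustrative IDs only.
ldap = Endpoint("urn:example:ce1:ldap")
submit = ComputingEndpoint("urn:example:ce1:submit")

option1 = AdditiveComputingService(endpoints=[ldap], computing_endpoints=[submit])
option2 = OverridingComputingService([submit])
```

With option 1 the plain Endpoint sits next to the ComputingEndpoint inside the same service object; with option 2 passing `ldap` to the constructor raises a TypeError, so it would have to be published as a ComputingEndpoint or moved to another Service.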

glue-wg-bounces@ogf.org [mailto:glue-wg-bounces@ogf.org] On Behalf Of Florido Paganelli said:
I favour 1), as it makes it possible to have Endpoints inside a ComputingService, which would give better readability and clarity for Endpoints which are NOT ComputingEndpoints.
What would you propose to do with Share, Resource and Manager? Also, would you allow a ComputingEndpoint to be part of a non-Computing Service?
What do you think? Which one should we favour?
In LDAP I don't think it makes very much difference. With the current way of doing it you can find all Endpoints which belong to computing services with (objectclass=GLUE2ComputingEndpoint). If you remove the objectclass from a few of them you lose that ability, and I think there is no gain other than a very small reduction in data volume. (If you want to find only the Endpoints which allow job submission you can select on the Capability.)
I am really annoyed at calling an Endpoint that has nothing to do with computing a ComputingEndpoint!
As I said before, if it has nothing to do with computing maybe it shouldn't be part of a ComputingService at all.

Stephen
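The point about objectclass filters in the reply above can be illustrated with a small in-memory sketch. It does not talk to a real LDAP server: the entries, the dictionary keys and the example IDs are assumptions for illustration, and the capability strings are only meant to resemble GLUE 2 capability values.

```python
# Hypothetical published entries; keys and IDs are illustrative only.
entries = [
    {"id": "urn:example:ce1:submit",
     "objectClass": {"GLUE2Endpoint", "GLUE2ComputingEndpoint"},
     "capability": {"executionmanagement.jobexecution"}},
    {"id": "urn:example:ce1:info",
     "objectClass": {"GLUE2Endpoint", "GLUE2ComputingEndpoint"},
     "capability": {"information.discovery"}},
    {"id": "urn:example:se1:srm",
     "objectClass": {"GLUE2Endpoint", "GLUE2StorageEndpoint"},
     "capability": {"data.management.storage"}},
]


def by_objectclass(entries, name):
    """Rough equivalent of the filter (objectclass=<name>)."""
    return [e["id"] for e in entries if name in e["objectClass"]]


def by_capability(entries, capability):
    """Select endpoints by Capability instead of by objectclass."""
    return [e["id"] for e in entries if capability in e["capability"]]


def strip_objectclass(entries, entry_id, name):
    """Return a copy of the entries with one objectclass removed from one entry."""
    result = []
    for e in entries:
        classes = set(e["objectClass"])
        if e["id"] == entry_id:
            classes.discard(name)
        result.append({**e, "objectClass": classes})
    return result


all_computing = by_objectclass(entries, "GLUE2ComputingEndpoint")
job_submission = by_capability(entries, "executionmanagement.jobexecution")
stripped = strip_objectclass(entries, "urn:example:ce1:info", "GLUE2ComputingEndpoint")
after_strip = by_objectclass(stripped, "GLUE2ComputingEndpoint")
```

In this sketch, once the information endpoint is published as a plain Endpoint, the objectclass query no longer returns it, while the Capability query for job submission is unaffected.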

On 2012-08-24 10:55, stephen.burke@stfc.ac.uk wrote:
glue-wg-bounces@ogf.org [mailto:glue-wg-bounces@ogf.org] On Behalf Of Florido Paganelli said:
I favour 1), as it makes it possible to have Endpoints inside a ComputingService, which would give better readability and clarity for Endpoints which are NOT ComputingEndpoints.
What would you propose to do with Share, Resource and Manager?
Same approach. As I said, this depends if we want to override the associations or not. This cannot be represented in UML, but makes sense in realizations.
Also, would you allow a ComputingEndpoint to be part of a non-Computing Service?
Well, this is already impossible by looking at the model, because: 1) Service is the only entity that has an association with Endpoint; 2) Service is the most generic object in the hierarchy; 3) ComputingService is the only entity that has an association with ComputingEndpoint. So any non-computing service, whether it is Service or some other non-ComputingService, does NOT have any relationship whatsoever with ComputingEndpoint, but only with Endpoint at most. A non-ComputingService _different_ from Service can inherit an association with ComputingEndpoint in only one way: by extending ComputingService, that is, ComputingService must be the parent (super) class of the non-ComputingService. So no, I wouldn't, as inheritance does not allow that.
What do you think? Which one should we favour?
In LDAP I don't think it makes very much difference. With the current way of doing it you can find all Endpoints which belong to computing services with (objectclass=GLUE2ComputingEndpoint). If you remove the objectclass from a few of them you lose that ability, and I think there is no gain other than a very small reduction in data volume.
In LDAP, I would scope the search for endpoints starting from the ComputingService, but we will have a debate if we go into this. I told you I like the LDAP tree structure for such reasons. But if implementations are keen on adding associations, I can even scope the search by filtering on associations, which unfortunately performs worse.
(If you want to find only the Endpoints which allow job submission you can select on the Capability.)
yes I completely agree on the above
I am really annoyed at calling an Endpoint that has nothing to do with computing a ComputingEndpoint!
As I said before, if it has nothing to do with computing maybe it shouldn't be part of a ComputingService at all.
But then give me something to relate a local information service and its endpoints (some OpenLDAP service), or an independent delegation Service, to the box where the ComputingService is. Otherwise I run the risk of querying the information system(s) twice for no reason, and of submitting jobs twice to the same endpoint, because I cannot distinguish between them. I tried to express this in previous emails but I clearly didn't succeed.

There is, however, a third approach to this service-to-service thing. In my initial implementation I wanted to use the service-to-service association described in GFD1.47 (page 7, page 13); however, I was told that this was not the purpose for it to be there, but that it was more to reflect some hierarchy between Services. I think the flaw in such an association-based approach would be that the unique ID might be wrong at a certain point in time (for example because of ID renewal) and no longer refer to the record it points to.

What do you think about this one?

Cheers,
--
Florido Paganelli
Lund University - Particle Physics
ARC Middleware
EMI Project

Florido Paganelli [mailto:florido.paganelli@hep.lu.se] said:
What would you propose to do with Share, Resource and Manager?
Same approach. As I said, this depends if we want to override the associations or not. This cannot be represented in UML, but makes sense in realizations.
And what about the relations between them? And the same for the storage classes? I think this would be quite a big change which would need a significant advantage to be worthwhile, and so far I don't think you've given one.
In LDAP, I would scope the search for endpoints starting from the ComputingService,
You can do that if you have chosen one specific ComputingService, but in your own example of a delegation endpoint which could serve computing or other kinds of service, the current definition lets you search for all the endpoints which serve computing services but not the others.
but then give me something to relate a local information service and its endpoints (some OpenLDAP service), or an independent delegation Service to the box where the ComputingService is, otherwise I run the
As I already said, I think an information endpoint should be a separate Service. For a delegation service I can't say, it would depend on how closely it's bound to the computing service and what the use cases are.
risk of querying twice the information system(s) for no reason, and submit jobs twice to the same endpoint because I cannot distinguish between them.
Queries are normally very lightweight compared with real service interactions like job submission, unless you're doing a very large number of them - querying twice is not a problem. Being able to recognise that you have the same Endpoint multiple times obviously is important, but I don't see why it would be difficult to recognise duplicates.
In my initial implementation I wanted to use the service-to-service association described in GFD1.47 (page 7, page 13); however I was told that this was not the purpose for it to be there, but it was more to reflect some hierarchy between Services.
I don't see how it could represent a hierarchy unless you had some other way to express it - Service-Service is a peer relation, there is no directionality (unlike e.g. Domain-Domain). In any case, as I've said repeatedly, the question is not what the purpose was when the schema was defined (none in particular as far as I remember) but whether it can be used to satisfy whatever requirements you have now in a specific case. For the things you're describing this may well be sufficient.
I think the flaw in such an association based approach would be that the unique ID might be wrong at a certain point in time (for example because of ID renewal) and not refer anymore to the record it points to.
Persistency of IDs is a separate question, and a general one - IDs must be persistent for as long as necessary for all the possible uses. ServiceIDs in particular should probably change only when services are reconfigured in a major way. If references to IDs can't be followed the whole schema will be unusable! Stephen
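As a rough illustration of the two points in this reply (Service-Service as an undirected peer relation, and references that can no longer be followed when IDs are not persistent), here is a minimal sketch. The data layout, the function names and the IDs are assumptions made for the example; nothing here is taken from the schema or from an existing client.

```python
# Known services, keyed by (illustrative) ID.
services = {
    "urn:example:svc:computing": "ComputingService",
    "urn:example:svc:info": "Service",
}

# Undirected Service-Service associations: a frozenset has no "from" or "to".
associations = {
    frozenset({"urn:example:svc:computing", "urn:example:svc:info"}),
    # This one points at an ID that is no longer published.
    frozenset({"urn:example:svc:computing", "urn:example:svc:renewed"}),
}


def related(service_id, associations):
    """IDs of all services associated with service_id, in either direction."""
    found = set()
    for pair in associations:
        if service_id in pair:
            found |= pair - {service_id}
    return found


def dangling(associations, known_ids):
    """Referenced IDs that do not resolve to any published service."""
    return {sid for pair in associations for sid in pair if sid not in known_ids}


peers_of_info = related("urn:example:svc:info", associations)
broken_refs = dangling(associations, services.keys())
```

Because the pairs are unordered, looking up the information service finds the computing service and vice versa. The second association shows what a renewed or expired ID looks like from the client side: a reference that resolves to nothing.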

Hi Stephen, On 2012-08-25 12:12, stephen.burke@stfc.ac.uk wrote:
Florido Paganelli [mailto:florido.paganelli@hep.lu.se] said:
What would you propose to do with Share, Resource and Manager?
Same approach. As I said, this depends if we want to override the associations or not. This cannot be represented in UML, but makes sense in realizations.
And what about the relations between them? And the same for the storage classes? I think this would be quite a big change which would need a significant advantage to be worthwhile, and so far I don't think you've given one.
There are no changes. As I said, UML cannot express the inheritance of associations very well, whereas in an implementation it is straightforward. But we have the opportunity to fix it in the realization documents, which are not final yet. I did not spend time reasoning about the other associations, but if we agree on a composition-driven approach (every specialization adds, does not overload) rather than a bare inheritance-driven approach (every specialization overloads associations) I see no problem whatsoever. We're still fully consistent with the model, and everything works as expected.
In LDAP, I would scope the search for endpoints starting from the ComputingService,
You can do that if you have chosen one specific ComputingService, but in your own example of a delegation endpoint which could serve computing or other kinds of service, the current definition lets you search for all the endpoints which serve computing services but not the others.
Yes I understand what you mean. But what if I have a delegation endpoint that can be used both for computing and for storage? Should I replicate such an endpoint in a ComputingService and in a StorageService? In that case the same delegation endpoint would be a ComputingEndpoint and a StorageEndpoint, with two different IDs, but in the end it is the same endpoint! How do I express that it is the same endpoint? The same ID? But then the record would have different objectclasses and associations... It's kind of bad to have different records with the same ID. I would rather call it Endpoint, add associations pointing to both the StorageService and ComputingService it serves, give it the same ID and place it in both Computing and Storage services.
but then give me something to relate a local information service and its endpoints (some OpenLDAP service), or an independent delegation Service to the box where the ComputingService is, otherwise I run the
As I already said, I think an information endpoint should be a separate Service. For a delegation service I can't say, it would depend on how closely it's bound to the computing service and what the use cases are.
risk of querying twice the information system(s) for no reason, and submit jobs twice to the same endpoint because I cannot distinguish between them.
Queries are normally very lightweight compared with real service interactions like job submission, unless you're doing a very large number of them - querying twice is not a problem. Being able to recognise that you have the same Endpoint multiple times obviously is important, but I don't see why it would be difficult to recognise duplicates.
Querying twice is a problem with big numbers. Say I have 20 information endpoints and 40 submission endpoints in an index, such as EMIR, in which every Endpoint record also has the Service.ID of the Service the endpoint belongs to. A client retrieves all 60 of them. Then, it might want to query information endpoints to scan for submission endpoints.

Scenario 1) I have Endpoints and ComputingEndpoints in a ComputingService. I'll make it easy here. A single box might have more than one information/submission endpoint, which means deciding which of the information/submission endpoints belonging to the same box one doesn't want to query. So, let's simplify the scenario and suppose submission endpoints belong to different boxes and information endpoints belong to different boxes, BUT there might be information endpoints on the same box as at least one submission endpoint. Then, since Endpoints and ComputingEndpoints are in the same ComputingService, IF the information endpoint has the same Service.ID as a submission endpoint, the client might decide not to query it. Operation cost: at most one comparison for each pair of information endpoint and submission endpoint, 20*40 = 800 ops.

Scenario 2) Different services: Endpoints in an Information Service and ComputingEndpoints in a ComputingService. We then have different Service.IDs for each endpoint, because information endpoints belong to different services than submission endpoints. The client cannot know which relationship exists between services, and then it must query information endpoints. Suppose every information endpoint outputs 10 submission endpoints, some registered in the index (i.e. belonging to the set of 40 taken from the index) and some not (i.e. not among those 40 present in the index), ~200 endpoints. As said, since there is no information on how information and submission endpoints are coupled, I need to scan the information endpoints, as I can gather more submission endpoints there. A client cannot just suppose that all the useful submission endpoints are in the index. Hence I must check all the 40 submission endpoints in the index against the 200 retrieved from the information endpoints, in order not to submit twice to the same endpoint. In the worst case that is 20 queries to information endpoints + 40*200 = 8000 comparison operations, 8020 operations in total, and we have gone up an order of magnitude.

The numbers are arbitrary, but I can tell you that ARC will have at least 3 submission endpoints per box, and you know what happens if you take a site-bdii as an information endpoint (one might easily reach 10 there on big sites). It is easy to see that as the number of job requests increases we might incur an incredible amount of work just to submit a single job. Of course clients can use fancy ranking algorithms and/or dynamic programming to solve the problem better.
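The operation counts in the two scenarios above can be reproduced with a few lines of Python. The function names are hypothetical, and the set-based variant at the end is an addition for comparison, not something proposed in the thread: it assumes endpoint IDs are stable strings that can be hashed, so the client does not need to compare every pair.

```python
def scenario1_comparisons(n_info, n_submit):
    """Worst case: compare the Service.ID of every info endpoint with every submission endpoint."""
    return n_info * n_submit


def scenario2_operations(n_info, n_submit, found_per_info):
    """One query per info endpoint, then pairwise comparison of indexed vs retrieved endpoints."""
    retrieved = n_info * found_per_info
    return n_info + n_submit * retrieved


def new_endpoints(indexed_ids, retrieved_ids):
    """Set-based alternative: IDs retrieved from info endpoints that are not yet known.

    Each retrieved ID costs one hash lookup instead of one comparison per indexed endpoint.
    """
    seen = set(indexed_ids)
    fresh = []
    for endpoint_id in retrieved_ids:
        if endpoint_id not in seen:
            seen.add(endpoint_id)
            fresh.append(endpoint_id)
    return fresh


cost1 = scenario1_comparisons(20, 40)
cost2 = scenario2_operations(20, 40, 10)
fresh = new_endpoints(["a", "b"], ["b", "c", "c", "d"])
```

With the numbers from the message, `cost1` is 800 and `cost2` is 8020. The set-based check does not remove the 20 queries in scenario 2; it only addresses the comparison step.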
In my initial implementation I wanted to use the service-to-service association described in GFD1.47 (page 7, page 13); however I was told that this was not the purpose for it to be there, but it was more to reflect some hierarchy between Services.
I don't see how it could represent a hierarchy unless you had some other way to express it - Service-Service is a peer relation, there is no directionality (unlike e.g. Domain-Domain). In any case, as I've said repeatedly, the question is not what the purpose was when the schema was defined (none in particular as far as I remember) but whether it can be used to satisfy whatever requirements you have now in a specific case. For the things you're describing this may well be sufficient.
It might then be worth pushing these association records into an index. Many developers underestimate these associations in implementations, and I tend not to consider them reliable. I can see that they were meant as an approach to database integrity with a relational DB in mind. These things nowadays are better realized via graph databases. Maybe the IDs in the associations might be used as a foundation to query and build a graph database of relationships between services, but this is dreaming of the future :)
I think the flaw in such an association based approach would be that the unique ID might be wrong at a certain point in time (for example because of ID renewal) and not refer anymore to the record it points to.
Persistency of IDs is a separate question, and a general one - IDs must be persistent for as long as necessary for all the possible uses. ServiceIDs in particular should probably change only when services are reconfigured in a major way. If references to IDs can't be followed the whole schema will be unusable!
I agree with both of these comments! We must push for those IDs to be treated as crucial in implementations. Their value and importance for making distributed deployments work has been underestimated, especially regarding the rules regulating their persistence. I guess it is already part of your EGI profile, Stephen.

Cheers,
--
Florido Paganelli
Lund University - Particle Physics
ARC Middleware
EMI Project

Florido Paganelli [mailto:florido.paganelli@hep.lu.se] said:
Yes I understand what you mean. But what if I have a delegation endpoint that can be used both for computing and for storage?
Then it should clearly be a separate Service.
I would rather call it Endpoint, add associations pointing to both the StorageService and ComputingService it serves, give it the same ID and place it in both Computing and Storage services.
You can't do that - an Endpoint is related to exactly one Service. Also an object with a given ID should only be published once.
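The two constraints stated in this reply (an Endpoint belongs to exactly one Service, and an object with a given ID is published only once) could be checked on a set of published records roughly as follows. The record layout with `id` and `service_ids` keys, and the example IDs, are assumptions made for this sketch.

```python
from collections import Counter


def duplicate_ids(records):
    """IDs that are published more than once."""
    counts = Counter(r["id"] for r in records)
    return sorted(i for i, n in counts.items() if n > 1)


def wrong_service_count(records):
    """IDs of endpoints that are not related to exactly one Service."""
    return sorted({r["id"] for r in records if len(r["service_ids"]) != 1})


# Illustrative records: the delegation endpoint is published under two services.
records = [
    {"id": "urn:example:ep:submit", "service_ids": ["urn:example:svc:computing"]},
    {"id": "urn:example:ep:delegation", "service_ids": ["urn:example:svc:computing"]},
    {"id": "urn:example:ep:delegation", "service_ids": ["urn:example:svc:storage"]},
    {"id": "urn:example:ep:shared",
     "service_ids": ["urn:example:svc:computing", "urn:example:svc:storage"]},
]

dupes = duplicate_ids(records)
multi = wrong_service_count(records)
```

Both ways of sharing one endpoint between a computing and a storage service are flagged here: publishing the same ID twice shows up in `dupes`, and a single record pointing at two services shows up in `multi`.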
A client retrieves all the 60 of them. Then, it might want to query information endpoints to scan for submission endpoints.
I don't see the point of that - if you have an index it should contain all the endpoints, so why would you need to scan again? But anyway it still doesn't take long, e.g. I can scan the entire glue 1 BDII for CEs which support the atlas VO in about 0.1 sec, which is tiny compared with anything which has a GSI overhead - e.g. copying a small file into an SE takes 13 secs.
I'll make it easy here. A single box might have more than one information/submission endpoint, that means deciding which information/submission endpoints belonging to the same box one doesn't want to query.
I still don't know why you're obsessed with boxes, it should not be relevant to anything what physical machine something runs on.
We then have different Service.IDs for each endpoint, because information endpoints belong to different services than submission endpoints.
The client cannot know which relationship exists between services,
Yes it can, if you add it as a Service-Service association.
then it must query information endpoints.
No, it should only need to query one index, whether BDII, EMIR or indeed GOC DB, which should contain everything. I can't see any point in having an index which is not complete.
Hence I must check all the 40 submission endpoints in the index against the 200 retrieved from the information endpoints, in order not to submit twice to the same endpoint.
I think that would be a crazy way to do it - it certainly doesn't correspond to our current architecture.
It is easy to see that as the number of job requests increases we might incur an incredible amount of work just to submit a single job.
It's easy to see that if you do things in an inefficient way it will be inefficient, but I don't think that's any kind of argument. Stephen

Hi all, On 2012-08-27 12:08, Florido Paganelli wrote:
Hi Stephen,
On 2012-08-25 12:12, stephen.burke@stfc.ac.uk wrote:
Florido Paganelli [mailto:florido.paganelli@hep.lu.se] said:
What would you propose to do with Share, Resource and Manager?
Same approach. As I said, this depends if we want to override the associations or not. This cannot be represented in UML, but makes sense in realizations.
And what about the relations between them? And the same for the storage classes? I think this would be quite a big change which would need a significant advantage to be worthwhile, and so far I don't think you've given one.
There are no changes. As I said, UML cannot express the inheritance of associations very well, whereas in an implementation it is straightforward.
But we have the opportunity to fix it in the realization documents that are not final yet.
I just wanted to notify everybody that I was wrong on the above. Carefully re-reading GFD1.47, I found out that ComputingService actually explicitly redefines the <- exposes -> association. It's not done in the UML schema, but on page 24, in the table representing ComputingService, it says that the ComputingEndpoint.ID association *redefines* Endpoint.ID.

Stephen was right: a ComputingService can only host ComputingEndpoints. I think this is bad design, but I guess it's too late to change it now. The reasons for this restrictive behaviour are hard for me to understand... but that's what we have.

Cheers,
--
Florido Paganelli
Lund University - Particle Physics
ARC Middleware
EMI Project