
Hi Stephen,

First, I must apologise if you felt my emails were in any way abusive; they were certainly not intended that way. Rather, I would like the effort we have all invested in GLUE and the grid infrastructure to be used properly. Currently, I see different groups developing their own information systems, running in parallel with GLUE+BDII, because of problems (both perceived and actual) with the BDII. I would like these problems addressed, and I find the very slow progress frustrating.

On to the specific points...

On 21/04/15 13:00, stephen.burke@stfc.ac.uk wrote:
> Paul Millar [mailto:paul.millar@desy.de] said:
>> From your replies, you appear to have an internal definition of CreationTime that is, to you, clear, self-evident and almost axiomatic. Unfortunately, you cannot seem to express that idea in the terms defined within GLUE-2.
> OK, let's have one more try. The concept which you seem to think is missing is "entity instance". That may not be explicitly defined, but it's a general computing concept, and I find it hard to see that you could make much sense of the schema without it. The schema defines entities as collections of attributes with types and definitions; an instance of that entity has specific values for the attributes. One of those attributes is CreationTime.
>
> Instances are created in a way completely unspecified by the schema document, but whatever the method, the CreationTime is the time at which that creation occurs (necessarily approximate, since creation will take a finite time). If a new instance is created it gets a new CreationTime, even if all the other attributes happen to be the same. However, if an instance is copied, the copy preserves *all* the attribute values, including CreationTime; if you change that, it's a new instance and not a copy.
Thanks, that makes sense. Just to confirm: you define two general mechanisms through which data is acquired: creating an entity instance and copying an entity instance. In concrete terms, the resource-level BDII plus info-provider creates entity instances, while site- and top-level BDIIs copy entity instances. This breaks the symmetry: CreationTime is assigned only at resource-level BDIIs and merely propagated by the others. Perhaps such a description is trivial or "well known", but it seems to me that GLUE-2, when used in a hierarchy (like the WLCG information system), would benefit from such a description. This could go in GLUE-2 itself, or perhaps in a hierarchy-profile document.
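If I've understood correctly, this is directly observable: the GLUE2EntityCreationTime published by a resource-level BDII should appear verbatim in any site- or top-level BDII that copies the entry. Something like the following would check it (the resource-BDII host name here is a placeholder):

  # Fetch CreationTime for service entries from a resource-level BDII
  # (placeholder host) and from the top-level BDII; if the hierarchy
  # copies instances, the values for the same entry should be identical.
  ldapsearch -LLL -x -H ldap://resource-bdii.example.org:2170 -b o=glue \
      '(objectClass=GLUE2Service)' GLUE2EntityCreationTime
  ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
      '(objectClass=GLUE2Service)' GLUE2EntityCreationTime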
>> The validator should expose bugs, not hide them. How else are sites going to fix these bugs?
> The point is that sites can't fix middleware bugs [..]
What you say is correct. I would also say that only sites can deploy the bug-fixes.
> and hence shouldn't get tickets for them. If tickets were raised for errors which would always occur, and which can't be fixed until a new middleware release is available, the validator would have been rejected: sites must be able to clear alarms in a reasonably short time. That's also why only ERRORs generate alarms; ERRORs are always wrong, whereas WARNINGs may be correct, so a site may be unable to remove them. Of course, the validator can still be run outside the Nagios framework, without the known-issues mask.
Yes, it's always a bit fiddly dealing with a new test where the production instance currently fails.
>> It would be good if we could check this: I think there's a bug in BDII where stale data is not being flushed.
> Maria has been on maternity leave for several months, so all this has been on hold. I think she should be back fairly soon, but no doubt it will take a while to catch up. A couple of years ago there was a bug where old data wasn't being deleted, but it should be out of the system by now. Also bear in mind that top BDIIs can cache data for up to four days.
Sure, I knew Maria was away, but I was hoping there would be someone covering for her, and that the process wasn't based on her heroic efforts alone.
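On the stale-data question, a crude external check is possible: since GLUE2EntityCreationTime values are ISO 8601 timestamps (so they sort lexicographically), one can list the oldest values a top-level BDII is still publishing and see whether anything predates the four-day cache window. A sketch:

  # List the ten oldest CreationTime values published by the top-level
  # BDII; anything much older than the four-day cache window suggests
  # stale entries are not being flushed.
  ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
      '(GLUE2EntityCreationTime=*)' GLUE2EntityCreationTime \
    | awk '/^GLUE2EntityCreationTime:/ {print $2}' \
    | sort | head -10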
>> If the validator is hiding bugs, and the policy is to do so whenever bugs are found, then it is useless.
> The policy is to submit a ticket to the middleware developers and keep track of it. There's no point in repeatedly finding the same bug.
Yes, that is certainly a sound policy.
>> AFAIK, there's no intrinsic reason why there should be anything beyond a 2--3 minute delay: the time taken to fetch the updated information from a site-level BDII.
> The top BDII has to fetch information from several hundred site BDIIs, and the total data volume is large. It takes several minutes to do that. And site BDIIs themselves have to collect information from the resource BDIIs at the site. Back in 2012 Laurence did some tests to see if the top BDII could scale to read from the resource BDIIs directly, but the answer was no: it can cope with O(1000) sources but not O(10000). Also, the resource BDII runs on the service and loads it to some extent, so it can't update too often; that's a particular issue for the CE, which is the service with the fastest-changing data.
I'm not sure I agree here. First, the site-level BDII should cache information from resource-level BDIIs, just as resource-level BDIIs cache information from the info-providers. This means that load from top-level BDIIs is only experienced by site-level BDIIs.

Taking a complete (top-level) dump only takes a few seconds:

paul@celebrimbor:~$ /usr/bin/time -f %e ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue > /dev/null
4.49
paul@celebrimbor:~$ /usr/bin/time -f %e ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=grid > /dev/null
5.15

Let's say it takes about 10--15 seconds in total. A top-level BDII updates itself by essentially this process: invoking ldapsearch against each site-level BDII. Assuming the process is bandwidth-limited, a full update should also take ~10--15 seconds, since the total amount of information sent over the network is about the same. (This doesn't take TCP slow-start into account, so it may be a slight underestimate, but see below for why I don't believe this is a real problem.)

Now let's assume the problem isn't bandwidth-limited, but rather that the update frequency is limited by the latency of the individual requests to site-level BDIIs. I surveyed the currently registered site-level BDIIs:

for url in $(
    ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
        $(ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
              '(GLUE2ServiceType=bdii_site)' GLUE2ServiceID |
          perl -p00e 's/\n //' |
          awk 'BEGIN{printf "(|"}/^GLUE2ServiceID/{printf "(GLUE2EndpointServiceForeignKey="$2")"}END{print ")"}') \
        GLUE2EndpointURL |
    perl -p00e 's/\n //g' |
    sed -n 's%^GLUE2EndpointURL: \(ldap://[^:]*:[0-9]*/\).*%\1%p'
  ); do
    /usr/bin/time -a -o times.dat -f %e \
        ldapsearch -LLL -x -H $url -o nettimeout=30 -b o=glue > /dev/null
done

This query covered some 318 sites. The ldapsearch command failed for 5 endpoints and timed out for a further 3. Of the remaining 310 sites, the maximum time for ldapsearch to complete was 19.21 seconds and the median was 0.44 seconds. For 82% of sites, ldapsearch completed within one second; for 92%, within two seconds. Repeating this for GLUE-1.3 showed similar statistics.

This suggests to me that information from responsive sites could be maintained with a lag of order ten seconds to a minute; information from sites with badly performing site-level BDIIs would be updated less often.

I haven't investigated injecting this information: BDII now generates an LDIF diff which is injected into slapd. This is distinct from the original approach, which employed a "double buffer" with two slapd instances. Still, I currently don't see why a top-level BDII must lag by some 30 minutes.
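For what it's worth, the summary figures above can be extracted from times.dat with something like the following (a sketch, assuming one wall-clock time per line):

  # Sort the per-site times and print count, median, 82nd/92nd
  # percentiles and maximum.
  sort -n times.dat | awk '{ t[NR] = $1 }
      END { print "n=" NR, "median=" t[int(NR*0.50)],
                  "p82=" t[int(NR*0.82)], "p92=" t[int(NR*0.92)],
                  "max=" t[NR] }'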
>> Yeah, typical grid middleware response: rewrite the software rather than fix a bug.
> I could say that your response is typical: criticism without understanding.
Perhaps, but I have reviewed the BDII code-base in the past and I know roughly how it works. My simple investigation suggests that maintaining a top-level BDII with sub-minute latency is possible for at least 80--90% of site-level BDIIs. Of course, I may be missing something here, but it certainly seems feasible to do much better than is currently being done.

Cheers,

Paul.