
Hi Paul,

Without replying point by point to the benchmarking you've done, which was nice work, I would kindly suggest you not benchmark a technology you may not understand completely. As usual, everything is fine in theory, but not in practice :( .

Your claims are all true: LDAP is very fast at answering queries; this is why we use it... and it is not why BDII is "slow". Most of the time spent by BDII goes into restructuring the LDAP tree. LDAP indexing is a tree-structured index backed by a key-value Berkeley DB. That means that when aggregating data, all the data must be reindexed ("rewriting the DN" in LDAP slang) to fit into the tree. It is also this tree structure, plus the simplicity of a key-value database, that allows LDAP to answer queries so quickly. Unfortunately, all of this comes at a cost. Updating the DB requires the following steps (I haven't looked into the code recently, but I roughly remember this):

1) LDAP-query the sources (negligible time, as you discovered)
2) Rebuild the new tree(s), generating new LDIF document(s) (very time consuming; includes rekeying ALL objects)
3) Check the differences between the rebuilt tree(s) and the existing database entries
4) Modify existing entries that have changed (one ldap-modify per object)
5) Remove objects that are no longer there (ldap-delete)
6) ldap-add new objects -- which boils down to ldap-adding a whole new LDIF document (that is, the entire DB) in most cases because of -- guess what -- CreationTime and Validity, which are always changing!!! :D

As you can see, you have only benchmarked the tip of the iceberg. Laurence or Maria can correct me if the above is not accurate; I don't know the code that well, but I had to look into it during EMI times. Over the years Laurence managed to shorten this update time with several smart ideas, including enterprise-level techniques like replication, and probably partial LDIF documents where applicable. I think in this way he avoided having two LDAP servers.

You have to understand that LDAP technology is intended for data that changes rarely, and we are using it for an almost real-time system -- one more hint that it is a bad monitoring tool... Trust me, 30 minutes is a great achievement for a technology that was never meant to do what we use it for. I might have several arguments against the BDII code, but not about its performance.

The problem we're facing with ARC while trying to move to other technologies is that LDAP query times are faster than those of the other technologies we have investigated (e.g. REST web services). Update times are horrible, but it seems people care more about fast queries than about fresh information... And I can say this because in ARC we also put jobs in the LDAP database, which is EXTREME for today's numbers (i.e. O(10000) jobs). It's nice(?) that these numbers match those that Stephen mentioned.

Cheers,
Florido

On 2015-04-21 20:07, Paul Millar wrote:
Hi Stephen,
First, I must apologise if you felt my emails were in any way abusive --- they were certainly not intended that way; rather, I would like the effort we have all invested in GLUE and the grid infrastructure be used properly.
Currently, I see different groups developing their own information systems, running in parallel with GLUE+BDII, because of problems (both perceived and actual) with BDII. I would like these problems addressed and find the very slow progress frustrating.
Onto the specific points...
On 21/04/15 13:00, stephen.burke@stfc.ac.uk wrote:
Paul Millar [mailto:paul.millar@desy.de] said:
From your replies, you appear to have an internal definition of CreationTime that is, to you, clear, self-evident and almost axiomatic. Unfortunately, you cannot seem to express that idea in the terms defined within GLUE-2.
OK, let's have one more try. The concept which you seem to think is missing is "entity instance". That may not be explicitly defined but it's a general computing concept, and I find it hard to see that you could make much sense of the schema without it. The schema defines entities as collections of attributes with types and definitions; an instance of that entity has specific values for the attributes. One of those attributes is CreationTime. Instances are created in a way completely unspecified by the schema document, but whatever the method the CreationTime is the time at which that creation occurs (necessarily approximate since creation will take a finite time). If a new instance is created it gets a new CreationTime even if all the other attributes happen to be the same. However, if an instance is copied the copy preserves *all* the attribute values including CreationTime - if you change that it's a new instance and not a copy.
Thanks, that makes sense.
Just to confirm: you define two general mechanisms through which data is acquired: creating an entity instance and copying an entity instance.
In concrete terms, the resource-level BDII plus info-provider creates entity instances, while site- and top-level BDIIs copy entity instances. This breaks the symmetry: CreationTime is only ever assigned at resource-level BDIIs.
Perhaps such a description is trivial or "well known", but it seems to me that GLUE-2, when used in a hierarchy (like the WLCG info system), would benefit from having it written down. This could go in GLUE-2 itself, or perhaps in a hierarchy profile document.
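As a concrete illustration only (assuming the GLUE-2 LDAP rendering, and with a made-up resource-BDII hostname): the CreationTime published by a resource-level BDII should survive unchanged as the entry is copied up the hierarchy, which one could spot-check with something like

  # same query against a resource-level BDII (hostname invented here) ...
  ldapsearch -LLL -x -H ldap://some-resource-bdii.example.org:2170 -b o=glue \
      '(objectClass=GLUE2ComputingService)' GLUE2EntityCreationTime
  # ... and against the top-level BDII, then compare the values for the same DN
  ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
      '(objectClass=GLUE2ComputingService)' GLUE2EntityCreationTime

and comparing the returned values entry by entry.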
The validator should expose bugs, not hide them. How else are sites going to fix these bugs?
The point is that sites can't fix middleware bugs [..]
What you say is correct. I would also say that only sites can deploy the bug-fixes.
and hence shouldn't get tickets for them. If tickets were raised for errors which would always occur and can't be fixed until a new middleware release is available the validator would have been rejected - sites must be able to clear alarms in a reasonably short time. That's also why only ERRORs generate alarms - ERRORs are always wrong, WARNINGs may be correct so a site may be unable to remove them. Of course, the validator can still be run outside the Nagios framework without the known issues mask.
Yes, it's always a bit fiddly dealing with a new test where the production instance currently fails.
It would be good if we could check this: I think there's a bug in BDII where stale data is not being flushed.
Maria has been on maternity leave for several months, so all this has been on hold. I think she should be back fairly soon, but no doubt it will take a while to catch up. A couple of years ago there was a bug where old data wasn't being deleted, but it should be out of the system by now. Also bear in mind that top BDIIs can cache data for up to four days.
Sure, I knew Maria was away; but I was hoping there would be someone covering for her, and that the process wasn't based on her heroic efforts alone.
If the validator is hiding bugs, and the policy is to do so whenever bugs are found, then it is useless.
The policy is to submit a ticket to the middleware developers and keep track of it. There's no point in repeatedly finding the same bug.
Yes, that is certainly a sound policy.
AFAIK, there's no intrinsic reason why there should be anything beyond a 2--3 minute delay: the time taken to fetch the updated information from a site-level BDII.
The top BDII has to fetch information from several hundred site BDIIs and the total data volume is large. It takes several minutes to do that. And site BDIIs themselves have to collect information from the resource BDIIs at the site. Back in 2012 Laurence did some tests to see if the top BDII could scale to read from the resource BDIIs directly, but the answer was no, it can cope with O(1000) sources but not O(10000). Also the resource BDII runs on the service and loads it to some extent so it can't update too often - a particular issue for the CE, which is the service with the fastest-changing data.
I'm not sure I agree here.
First, the site-level BDII should cache information from resource-level BDIIs, as resource-level BDIIs cache information from info-providers. This means that load from top-level BDIIs is only experienced by site-level BDIIs.
Taking a complete (top-level) dump only takes a few seconds.
paul@celebrimbor:~$ /usr/bin/time -f %e ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue > /dev/null
4.49
paul@celebrimbor:~$ /usr/bin/time -f %e ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=grid > /dev/null
5.15
Let's say it takes about 10--15 seconds in total.
A top-level BDII updates by running this same process (invoking the ldapsearch command) against each site-level BDII. Assuming the process is bandwidth limited, this should also take ~10--15 seconds, as the total amount of information sent over the network should be about the same. (Note that this doesn't take into account TCP slow-start, so it may be a slight underestimate, but see below for why I don't believe this is a real problem.)
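As a rough sanity check of the bandwidth assumption (a sketch only; the point is the arithmetic, not any particular number):

  # size in bytes of a complete top-level GLUE-2 dump
  ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue | wc -c
  # estimated transfer time is then roughly 8 * <bytes> / <link speed in bit/s>,
  # e.g. S bytes over a 100 Mbit/s link takes about S / 12.5e6 seconds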
Let's assume instead that the problem isn't bandwidth limited, and that the update frequency is limited by the latency of the individual requests to site-level BDIIs.
I surveyed the currently registered site-level BDIIs:
for url in $(ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
        $(ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
            '(GLUE2ServiceType=bdii_site)' GLUE2ServiceID \
          | perl -p00e 's/\n //' \
          | awk 'BEGIN{printf "(|"}/^GLUE2ServiceID/{printf "(GLUE2EndpointServiceForeignKey="$2")"}END{print ")"}') \
        GLUE2EndpointURL \
      | perl -p00e 's/\n //g' \
      | sed -n 's%^GLUE2EndpointURL: \(ldap://[^:]*:[0-9]*/\).*%\1%p'); do
    /usr/bin/time -a -o times.dat -f %e ldapsearch -LLL -x -H $url -o nettimeout=30 -b o=glue > /dev/null
done
This query covered some 318 sites. The ldapsearch command failed for 5 endpoints and the query timed out for 3 endpoints.
Of the remaining 310 sites, the maximum time for ldapsearch to complete was about 19.21 seconds and the median was 0.44 seconds. For 82% of sites, ldapsearch completed within one second; for 92% it completed within two seconds.
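For reference, the statistics above can be reproduced from the times.dat file written by the loop, roughly as follows (a sketch only):

  # keep only the pure timing lines (time(1) also logs a note for failed runs)
  grep -E '^[0-9]+\.?[0-9]*$' times.dat | sort -n | awk '
      { t[NR] = $1 }
      END {
          print  "max:    " t[NR] " s"
          print  "median: " t[int((NR + 1) / 2)] " s"
          for (i = 1; i <= NR; i++) { if (t[i] <= 1) a++; if (t[i] <= 2) b++ }
          printf "<= 1 s: %.0f%%\n", 100 * a / NR
          printf "<= 2 s: %.0f%%\n", 100 * b / NR
      }'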
Repeating this for GLUE-1.3 showed similar statistics.
This suggests to me that information from responsive sites could be maintained with a lag of order 10 seconds to a minute (depending); information from sites with badly performing site-level BDIIs would be updated less often.
I haven't investigated injecting this information: BDII now generates an LDIF diff which is injected into slapd. This is distinct from the original approach, which employed a "double-buffer" with two slapd instances.
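To make concrete what injecting such a diff looks like (a minimal sketch only: the DN, timestamp and bind credentials below are invented, and the real update script may well differ in detail), the step essentially amounts to feeding per-entry change records to slapd with the standard tools:

  # changes.ldif -- one record per entry to modify/add/delete; an entry whose
  # attributes (e.g. CreationTime) changed needs one such record
  dn: GLUE2ServiceID=some-service,GLUE2GroupID=resource,o=glue
  changetype: modify
  replace: GLUE2EntityCreationTime
  GLUE2EntityCreationTime: 2015-04-21T20:07:00Z

  # apply it (bind DN and password are placeholders)
  ldapmodify -x -H ldap://localhost:2170 -D "$BIND_DN" -w "$BIND_PW" -f changes.ldif

Adds and deletes are handled the same way, with changetype: add and changetype: delete records.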
Still, I currently don't see why a top-level BDII must lag by some 30 minutes.
Yeah, typical grid middleware response: rewrite the software rather than fix a bug.
I could say that your response is typical: criticism without understanding.
Perhaps, but I have reviewed the BDII code-base in the past and I know roughly how it works.
My simple investigation suggests maintaining a top-level BDII with sub-minute latencies is possible with at least 80--90% of site-level BDIIs.
Of course I may be missing something here, but it certainly seems feasible to achieve much better than is currently being done.
Cheers,
Paul.
--
==================================================
 Florido Paganelli
   ARC Middleware Developer - NorduGrid Collaboration
   System Administrator
 Lund University
 Department of Physics
 Division of Particle Physics
 BOX118
 221 00 Lund
 Office Location: Fysikum, Hus B, Rum B313
 Office Tel: 046-2220272
 Email: florido.paganelli@REMOVE_THIShep.lu.se
 Homepage: http://www.hep.lu.se/staff/paganelli
==================================================