
Hi Paul,

Without replying point by point to the benchmarking you've done, which was nice work, I would kindly suggest you not benchmark a technology you may not understand completely. As usual, everything is fine in theory, but not in practice :( .

Your claims are all true: LDAP is very fast at answering queries; this is why we use it... and it is not why BDII is "slow". Most of the time spent by BDII goes into restructuring the LDAP tree. LDAP indexing is a tree-structured index backed by a key-value Berkeley DB. That means that when aggregating data, all the data must be reindexed ("rewriting the DN" in LDAP slang) to fit into the tree. It is also this tree structure, plus the simplicity of a key-value database, that allows LDAP to answer queries so quickly. Unfortunately, all of this comes at a cost. Updating the DB requires the following steps (I haven't looked into the code recently, but I roughly remember this):

1) LDAP-query the sources (negligible time, as you discovered)
2) Rebuild the new tree(s), generating new LDIF document(s) (very time consuming; includes rekeying ALL objects)
3) Check the differences between the rebuilt tree(s) and the existing database entries
4) Modify existing entries that have changed (one ldap-modify per object)
5) Remove objects that are no longer there (ldap-delete)
6) ldap-add new objects -- which boils down to ldap-adding a whole new LDIF document (that is, the entire DB) in most cases because of -- guess what -- CreationTime and Validity, which are always changing!!! :D

As you can see, you have only benchmarked the tip of the iceberg. Laurence or Maria can correct me if the above is not accurate; I don't know the code that well, but I had to look into it during EMI times. Over the years Laurence managed to shorten this update time with several smart ideas, including enterprise-level techniques like replication, and probably partial LDIF documents where applicable. I think in this way he avoided having two LDAP servers.

You have to understand that LDAP technology is intended for data that changes rarely, and we are using it for an almost real-time system -- one more hint that it is a bad monitoring tool... Trust me, 30 minutes is a great achievement for a technology that was never meant to do what we use it for. I might have several arguments against the BDII code, but not about its performance.

The problem we're facing with ARC while trying to move to other technologies is that LDAP query times are faster than those of the other technologies we have investigated (e.g. REST web services). Update times are horrible, but it seems people care more about fast queries than about fresh information... And I can say this because in ARC we also put jobs in the LDAP database, which is EXTREME for today's numbers (i.e. O(10000) jobs). It's nice(?) that these numbers match those that Stephen mentioned.

Cheers,
Florido

On 2015-04-21 20:07, Paul Millar wrote:
Hi Stephen,
First, I must apologise if you felt my emails were in any way abusive --- they were certainly not intended that way; rather, I would like the effort we have all invested in GLUE and the grid infrastructure be used properly.
Currently, I see different groups developing their own information systems, running in parallel with GLUE+BDII, because of problems (both perceived and actual) with BDII. I would like these problems addressed and find the very slow progress frustrating.
Onto the specific points...
On 21/04/15 13:00, stephen.burke@stfc.ac.uk wrote:
Paul Millar [mailto:paul.millar@desy.de] said:
From your replies, you appear to have an internal definition of CreationTime that is, to you, clear, self-evident and almost axiomatic. Unfortunately, you cannot seem to express that idea in the terms defined within GLUE-2.
OK, let's have one more try. The concept which you seem to think is missing is "entity instance". That may not be explicitly defined but it's a general computing concept, and I find it hard to see that you could make much sense of the schema without it. The schema defines entities as collections of attributes with types and definitions; an instance of that entity has specific values for the attributes. One of those attributes is CreationTime. Instances are created in a way completely unspecified by the schema document, but whatever the method the CreationTime is the time at which that creation occurs (necessarily approximate since creation will take a finite time). If a new instance is created it gets a new CreationTime even if all the other attributes happen to be the same. However, if an instance is copied the copy preserves *all* the attribute values including CreationTime - if you change that it's a new instance and not a copy.
Thanks, that makes sense.
Just to confirm: you define two general mechanisms through which data is acquired: creating an entity instance and copying an entity instance.
In concrete terms, the resource-level BDII plus info-provider creates entity instances, while site- and top-level BDIIs copy entity instances. This breaks the symmetry: CreationTime is only ever assigned at resource-level BDIIs.
Perhaps such a description is trivial or "well known", but it seems to me that GLUE-2, when used in a hierarchy (like the WLCG info system), would benefit from having it written down. This could go in GLUE-2 itself, or perhaps in a hierarchy profile document.
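As a concrete illustration only (assuming the GLUE-2 LDAP rendering, and with a made-up resource-BDII hostname): the CreationTime published by a resource-level BDII should survive unchanged as the entry is copied up the hierarchy, which one could spot-check with something like

  # same query against a resource-level BDII (hostname invented here) ...
  ldapsearch -LLL -x -H ldap://some-resource-bdii.example.org:2170 -b o=glue \
      '(objectClass=GLUE2ComputingService)' GLUE2EntityCreationTime
  # ... and against the top-level BDII, then compare the values for the same DN
  ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
      '(objectClass=GLUE2ComputingService)' GLUE2EntityCreationTime

and comparing the returned values entry by entry.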
The validator should expose bugs, not hide them. How else are sites going to fix these bugs?
The point is that sites can't fix middleware bugs [..]
What you say is correct. I would also say that only sites can deploy the bug-fixes.
and hence shouldn't get tickets for them. If tickets were raised for errors which would always occur and can't be fixed until a new middleware release is available the validator would have been rejected - sites must be able to clear alarms in a reasonably short time. That's also why only ERRORs generate alarms - ERRORs are always wrong, WARNINGs may be correct so a site may be unable to remove them. Of course, the validator can still be run outside the Nagios framework without the known issues mask.
Yes, it's always a bit fiddly dealing with a new test where the production instance currently fails.
It would be good if we could check this: I think there's a bug in BDII where stale data is not being flushed.
Maria has been on maternity leave for several months, so all this has been on hold. I think she should be back fairly soon, but no doubt it will take a while to catch up. A couple of years ago there was a bug where old data wasn't being deleted, but it should be out of the system by now. Also bear in mind that top BDIIs can cache data for up to four days.
Sure, I knew Maria was away; but I was hoping there would be someone covering for her, and that the process wasn't based on her heroic efforts alone.
If the validator is hiding bugs, and the policy is to do so whenever bugs are found, then it is useless.
The policy is to submit a ticket to the middleware developers and keep track of it. There's no point in repeatedly finding the same bug.
Yes, that is certainly a sound policy.
AFAIK, there's no intrinsic reason why there should be anything beyond a 2--3 minute delay: the time taken to fetch the updated information from a site-level BDII.
The top BDII has to fetch information from several hundred site BDIIs and the total data volume is large. It takes several minutes to do that. And site BDIIs themselves have to collect information from the resource BDIIs at the site. Back in 2012 Laurence did some tests to see if the top BDII could scale to read from the resource BDIIs directly, but the answer was no, it can cope with O(1000) sources but not O(10000). Also the resource BDII runs on the service and loads it to some extent so it can't update too often - a particular issue for the CE, which is the service with the fastest-changing data.
I'm not sure I agree here.
First, the site-level BDII should cache information from resource-level BDIIs, as resource-level BDIIs cache information from info-providers. This means that load from top-level BDIIs is only experienced by site-level BDIIs.
Taking a complete (top-level) dump only takes a few seconds.
paul@celebrimbor:~$ /usr/bin/time -f %e ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue > /dev/null
4.49
paul@celebrimbor:~$ /usr/bin/time -f %e ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=grid > /dev/null
5.15
Let's say it takes about 10--15 seconds in total.
A top-level BDII updates by running this same process (invoking the ldapsearch command) against each site-level BDII. Assuming the process is bandwidth limited, this should also take ~10--15 seconds, as the total amount of information sent over the network should be about the same. (Note that this doesn't take into account TCP slow-start, so it may be a slight underestimate, but see below for why I don't believe this is a real problem.)
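As a rough sanity check of the bandwidth assumption (a sketch only; the point is the arithmetic, not any particular number):

  # size in bytes of a complete top-level GLUE-2 dump
  ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue | wc -c
  # estimated transfer time is then roughly 8 * <bytes> / <link speed in bit/s>,
  # e.g. S bytes over a 100 Mbit/s link takes about S / 12.5e6 seconds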
Let's assume instead that the problem isn't bandwidth limited, and that the update frequency is limited by the latency of the individual requests to site-level BDIIs.
I surveyed the currently registered site-level BDIIs:
for url in $(ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
        $(ldapsearch -LLL -x -H ldap://lcg-bdii.cern.ch:2170 -b o=glue \
            '(GLUE2ServiceType=bdii_site)' GLUE2ServiceID \
          | perl -p00e 's/\n //' \
          | awk 'BEGIN{printf "(|"}/^GLUE2ServiceID/{printf "(GLUE2EndpointServiceForeignKey="$2")"}END{print ")"}') \
        GLUE2EndpointURL \
      | perl -p00e 's/\n //g' \
      | sed -n 's%^GLUE2EndpointURL: \(ldap://[^:]*:[0-9]*/\).*%\1%p'); do
    /usr/bin/time -a -o times.dat -f %e ldapsearch -LLL -x -H $url -o nettimeout=30 -b o=glue > /dev/null
done
This query covered some 318 sites. The ldapsearch command failed for 5 endpoints and the query timed out for 3 endpoints.
Of the remaining 310 sites, the maximum time for ldapsearch to complete was about 19.21 seconds and the median was 0.44 seconds. For 82% of sites, ldapsearch completed within one second; for 92% it completed within two seconds.
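For reference, the statistics above can be reproduced from the times.dat file written by the loop, roughly as follows (a sketch only):

  # keep only the pure timing lines (time(1) also logs a note for failed runs)
  grep -E '^[0-9]+\.?[0-9]*$' times.dat | sort -n | awk '
      { t[NR] = $1 }
      END {
          print  "max:    " t[NR] " s"
          print  "median: " t[int((NR + 1) / 2)] " s"
          for (i = 1; i <= NR; i++) { if (t[i] <= 1) a++; if (t[i] <= 2) b++ }
          printf "<= 1 s: %.0f%%\n", 100 * a / NR
          printf "<= 2 s: %.0f%%\n", 100 * b / NR
      }'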
Repeating this for GLUE-1.3 showed similar statistics.
This suggests to me that information from responsive sites could be maintained with a lag of order 10 seconds to a minute (depending); information from sites with badly performing site-level BDIIs would be updated less often.
I haven't investigated injecting this information: BDII now generates an LDIF diff which is injected into slapd. This is distinct from the original approach, which employed a "double-buffer" with two slapd instances.
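To make concrete what injecting such a diff looks like (a minimal sketch only: the DN, timestamp and bind credentials below are invented, and the real update script may well differ in detail), the step essentially amounts to feeding per-entry change records to slapd with the standard tools:

  # changes.ldif -- one record per entry to modify/add/delete; an entry whose
  # attributes (e.g. CreationTime) changed needs one such record
  dn: GLUE2ServiceID=some-service,GLUE2GroupID=resource,o=glue
  changetype: modify
  replace: GLUE2EntityCreationTime
  GLUE2EntityCreationTime: 2015-04-21T20:07:00Z

  # apply it (bind DN and password are placeholders)
  ldapmodify -x -H ldap://localhost:2170 -D "$BIND_DN" -w "$BIND_PW" -f changes.ldif

Adds and deletes are handled the same way, with changetype: add and changetype: delete records.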
Still, I currently don't see why a top-level BDII must lag by some 30 minutes.
Yeah, typical grid middleware response: rewrite the software rather than fix a bug.
I could say that your response is typical: criticism without understanding.
Perhaps, but I have reviewed the BDII code-base in the past and I know roughly how it works.
My simple investigation suggests maintaining a top-level BDII with sub-minute latencies is possible with at least 80--90% of site-level BDIIs.
Of course I may be missing something here, but it certainly seems feasible to achieve much better than is currently being done.
Cheers,
Paul.
--
==================================================
 Florido Paganelli
   ARC Middleware Developer - NorduGrid Collaboration
   System Administrator
 Lund University
 Department of Physics
 Division of Particle Physics
 BOX118
 221 00 Lund
 Office Location: Fysikum, Hus B, Rum B313
 Office Tel: 046-2220272
 Email: florido.paganelli@REMOVE_THIShep.lu.se
 Homepage: http://www.hep.lu.se/staff/paganelli
==================================================