Preliminary findings on query paging
Hi This is a preliminary assessment of paging for OCCI/JSON on HTTP/REST. This is not a completed list of findings. There was some debate and interest in categorizing the different methods used to implement a query page mechanism. The I've examined several specifications and standards sourced from SNIA, IETF, OMG, OGF, W3C ANSI and others. The odata specification is not considered due to protected proprietary technologies. Pertinent specifications an excepts are noted below. The preliminary findings discovered 9 methods for controlling the delivery of page results from query operations. 1. HTTP Range/Content-Range header (HTTP 1.1) 2. OFFSET/LIMIT parameters (SPARQL) 3. RANGE predicate (Xquery/Xpath) 4. pagedResultsControl.size/Cookie (LDAP) 5. Links to pages (ATOM) 6. LIMIT<offset>, <rowcount> (SQL) 7. TOP <rowcount> (SQL) 8. WHERE ROWNUM eval Value (SQL) 9. OFFSET <value> (SQL) We can distill this list down to four (4) basic approaches 1. Query parameters specifying result row Range (start_offset, end_offset) 2. Query parameters specifying result row offset and count 3. Query parameters specifying count, assuming first query and a opaque handle for context 4. Links to pages NOTE: We could force further consolidation of items one (1) and two (2) on the list above. I elected to separate each item based on the uniqueness of their semantics. Some additional information: There seems to be a growing trend in the html5 communities to adopt the use of the http 1.1 RANGE header to provide query pagination. For example, the dojo javascript library. An open, patent free approach to query pagination for URI query expression does not seems to exist. If OOCI/JSON elects to adopt that approach, it will need clear definition describing the behaviors surrounding insertions, deletes and modification of OCCI information records. I have included below an example document showing a the level detail we should consider for the OCCI specification. Since REST is a context free approach to API and there a no ordering and sequence constraints defined for the delivery, modification and insertion of OCCI information, there may be opportunities where duplicate and missing OCCI information can occur. We should also consider a well defined approach to reconcile inconsistencies in paged operations. I'll be following up with more information so we can develop recommendations to fully resolve paging issues for URI, HTTP and OCCI operations in payload data. cheers, gary mazzaferro http://snia.org/sites/default/files/CDMI%20v1.0.2.pdf Cloud Data Management Interface (CDMI^(TM)) Version 1.0.2 5.13.3 Range Support The server shall support HTTP Range headers and partial content responses (see Section 14.16 of RFC 2616). http://www.w3.org/TR/DOM-Level-2-Traversal-Range/traversal.html Document Object Model (DOM) Level 2 Traversal and Range Specification Version 1.0 W3C Recommendation 13 November, 2000 1. Document Object Model Traversal 1.1. Overview 1.1.1. NodeIterators 1.1.2. NodeFilters 1.1.3. TreeWalker 1.2. Formal Interface Definition NodeIterator, NodeFilter, TreeWalker, DocumentTraversal 2. Document Object Model Range 2.1. Introduction 2.2. Definitions and Notation 2.2.1. Position 2.2.2. Selection and Partial Selection 2.2.3. Notation 2.3. Creating a Range 2.4. Changing a Range's Position 2.5. Comparing Range Boundary-Points 2.6. Deleting Content with a Range 2.7. Extracting Content 2.8. Cloning Content 2.9. Inserting Content 2.10. Surrounding Content 2.11. Miscellaneous Members 2.12. Range modification under document mutation 2.12.1. Insertions 2.12.2. Deletions 2.13. Formal Description of the Range Interface Range, DocumentRange, RangeException, RangeExceptionCode http://www.w3.org/TR/xquery/ XQuery 1.0: An XML Query Language (Second Edition) W3C Recommendation 14 December 2010 (Link errors corrected 3 January 2011) 2.1 Path Expressions XQuery path expressions use the abbreviated syntax of XPath, extended with a new "dereference" operator and a new type of predicate called a "range predicate" (described below). XPath syntax is used in several XML-related applications including XSLT [XSLT] and XPointer [XPointer]. In XQuery, the result of a path expression is an ordered list of nodes (of course, each node includes its descendant nodes, so the result can be thought of as an ordered forest.) The top-level nodes in the path expression result are ordered according to their position in the original hierarchy, in top-down, left-to-right order. The result of a path expression may contain duplicate values (i.e., multiple nodes with the same type and content), but it will not contain duplicate nodes (i.e., multiple nodes with the same node identity). A path expression consists of a series of steps. Each step represents movement through a document in a particular direction, and each step can apply one or more predicates to eliminate nodes that fail to satisfy a given condition. The result of each step is a list of nodes that serves as a starting point for the next step. A path expression can begin with an expression that identifies a specific node, such as the function document(string), which returns the root node of a named document. A query can also contain a path expression beginning with "/" or "//" which represents an implicit root node, determined by the environment in which the query is executed. A complete discussion of XPath abbreviated syntax can be found in [XPath]. Briefly, the following symbols are used: . Denotes the current node. .. Denotes the parent of the current node. / Denotes the root node, or a separator between steps in a path. // Denotes descendants of the current node. @ Denotes attributes of the current node. * Denotes "any" (node with unrestricted name). [ ] Brackets enclose a Boolean expression that serves as a predicate for a given step. [n] When a predicate consists of an integer, it serves to select the element with the given ordinal number from a list of elements. ...... It is sometimes desirable to isolate a list of elements whose ordinal numbers span a certain range. For this purpose, XQuery provides a RANGE predicate that is adapted from XQL [XQL], as illustrated in the following example. The first element in a list has ordinal number 1. The ordinal numbers of elements in a list are not affected by the presence of other types of nodes such as comments or processing instructions. (Q2) Find all the figures in chapters 2 through 5 of the document named "zoo.xml." document("zoo.xml")/chapter[RANGE 2 TO 5]//figure In addition to the operators of the XPath abbreviated syntax, XQuery introduces an operator called the dereference operator ("->"). When a dereference operator follows an IDREF-type attribute, it returns the element(s) that are referenced by the attribute. A dereference operator is followed by a "name test" that specifies the target element. Following the usual XPath convention, a name test of "*" allows the target element to be of any type. 2.1.2 Dynamic Context [Definition: The dynamic context of an expression is defined as information that is available at the time the expression is evaluated.] If evaluation of an expression relies on some part of the dynamic context that has not been assigned a value, a dynamic error is raised [err:XPDY0002]. The individual components of the dynamic context are summarized below. Further rules governing the semantics of these components can be found in C.2 Dynamic Context Components. The dynamic context consists of all the components of the static context, and the additional components listed below. [Definition: The first three components of the dynamic context (context item, context position, and context size) are called the focus of the expression. ] The focus enables the processor to keep track of which items are being processed by the expression. Certain language constructs, notably the path expression E1/E2 and the predicate E1[E2], create a new focus for the evaluation of a sub-expression. In these constructs, E2 is evaluated once for each item in the sequence that results from evaluating E1. Each time E2 is evaluated, it is evaluated with a different focus. The focus for evaluating E2 is referred to below as the inner focus, while the focus for evaluating E1 is referred to as the outer focus. The inner focus exists only while E2 is being evaluated. When this evaluation is complete, evaluation of the containing expression continues with its original focus unchanged. [Definition: The context item is the item currently being processed. An item is either an atomic value or a node.][Definition: When the context item is a node, it can also be referred to as the context node.] The context item is returned by an expression consisting of a single dot (.). When an expression E1/E2 or E1[E2] is evaluated, each item in the sequence obtained by evaluating E1 becomes the context item in the inner focus for an evaluation of E2. [Definition: The context position is the position of the context item within the sequence of items currently being processed.] It changes whenever the context item changes. When the focus is defined, the value of the context position is an integer greater than zero. The context position is returned by the expression fn:position(). When an expression E1/E2 or E1[E2] is evaluated, the context position in the inner focus for an evaluation of E2 is the position of the context item in the sequence obtained by evaluating E1. The position of the first item in a sequence is always 1 (one). The context position is always less than or equal to the context size. [Definition: The context size is the number of items in the sequence of items currently being processed.] Its value is always an integer greater than zero. The context size is returned by the expression fn:last(). When an expression E1/E2 or E1[E2] is evaluated, the context size in the inner focus for an evaluation of E2 is the number of items in the sequence obtained by evaluating E1. Example: XPath="newsSection/news[position() < 6] http://www.w3.org/TR/sparql11-query/#sparqlOffsetLimit SPARQL 1.1 Query Language W3C Proposed Recommendation 08 November 2012 18.2.5.5 OFFSET and LIMIT If the query contains "OFFSET start" or "LIMIT length" M := Slice(M, start, length) start defaults to 0 length defaults to (size(M)-start). http://www.ietf.org/rfc/rfc2696.txt Network Working Group Request for Comments: 2696 LDAP Control Extension for Simple Paged Results Manipulation 2. The Control This control is included in the searchRequest and searchResultDone messages as part of the controls field of the LDAPMessage, as defined in Section 4.1.12 of [LDAPv3]. The structure of this control is as follows: pagedResultsControl ::= SEQUENCE { controlType 1.2.840.113556.1.4.319, criticality BOOLEAN DEFAULT FALSE, controlValue searchControlValue } The searchControlValue is an OCTET STRING wrapping the BER-encoded version of the following SEQUENCE: realSearchControlValue ::= SEQUENCE { size INTEGER (0..maxInt), -- requested page size from client -- result set size estimate from server cookie OCTET STRING } 3. Client-Server Interaction An LDAP client application that needs to control the rate at which results are returned MAY specify on the searchRequest a pagedResultsControl with size set to the desired page size and cookie set to the zero-length string. The page size specified MAY be greater than zero and less than the sizeLimit value specified in the searchRequest. If the page size is greater than or equal to the sizeLimit value, the server should ignore the control as the request can be satisfied in a single page. If the server does not support this control, the server MUST return an error of unsupportedCriticalExtension if the client requested it as critical, otherwise the server SHOULD ignore the control. The remainder of this section assumes the server does not ignore the client's pagedResultsControl. Each time the server returns a set of results to the client when processing a search request containing the pagedResultsControl, the server includes the pagedResultsControl control in the searchResultDone message. In the control returned to the client, the size MAY be set to the server's estimate of the total number of entries in the entire result set. Servers that cannot provide such an estimate MAY set this size to zero (0). The cookie MUST be set to an empty value if there are no more entries to return (i.e., the page of search results returned was the last), or, if there are more entries to return, to an octet string of the server's choosing,used to resume the search. The client MUST consider the cookie to be an opaque structure and make no assumptions about its internal organization or value. When the client wants to retrieve more entries for the result set, it MUST send to the server a searchRequest with all values identical to the initial request with the exception of the messageID, the cookie, and optionally a modified pageSize. The cookie MUST be the octet string on the last searchResultDone response returned by the server. Returning cookies from previous searchResultDone responses besides the last one is undefined, as the server implementation may restrict cookies from being reused. http://tools.ietf.org/html/rfc2616 RFC 2616 HTTP/1.1 June 1999 13.5.4 Combining Byte Ranges A response might transfer only a subrange of the bytes of an entity- body, either because the request included one or more Range specifications, or because a connection was broken prematurely. After several such transfers, a cache might have received several ranges of the same entity-body. If a cache has a stored non-empty set of subranges for an entity, and an incoming response transfers another subrange, the cache MAY combine the new subrange with the existing set if both the following conditions are met: - Both the incoming response and the cache entry have a cache validator. - The two cache validators match using the strong comparison function (see section 13.3.3). If either requirement is not met, the cache MUST use only the most recent partial response (based on the Date values transmitted with every response, and using the incoming response if these values are equal or missing), and MUST discard the other partial information. 14.16 Content-Range The Content-Range entity-header is sent with a partial entity-body to specify where in the full entity-body the partial body should be applied. Range units are defined in section 3.12. Content-Range = "Content-Range" ":" content-range-spec content-range-spec = byte-content-range-spec byte-content-range-spec = bytes-unit SP byte-range-resp-spec "/" ( instance-length | "*" ) byte-range-resp-spec = (first-byte-pos "-" last-byte-pos) | "*" instance-length = 1*DIGIT The header SHOULD indicate the total length of the full entity-body, unless this length is unknown or difficult to determine. The asterisk "*" character means that the instance-length is unknown at the time when the response was generated. Unlike byte-ranges-specifier values (see section 14.35.1), a byte- range-resp-spec MUST only specify one range, and MUST contain absolute byte positions for both the first and last byte of the range. A byte-content-range-spec with a byte-range-resp-spec whose last- byte-pos value is less than its first-byte-pos value, or whose instance-length value is less than or equal to its last-byte-pos value, is invalid. The recipient of an invalid byte-content-range- spec MUST ignore it and any content transferred along with it. A server sending a response with status code 416 (Requested range not satisfiable) SHOULD include a Content-Range field with a byte-range- resp-spec of "*". The instance-length specifies the current length of the selected resource. A response with status code 206 (Partial Content) MUST NOT include a Content-Range field with a byte-range- resp-spec of "*". Examples of byte-content-range-spec values, assuming that the entity contains a total of 1234 bytes: . The first 500 bytes: bytes 0-499/1234 . The second 500 bytes: bytes 500-999/1234 . All except for the first 500 bytes: bytes 500-1233/1234 . The last 500 bytes: bytes 734-1233/1234 When an HTTP message includes the content of a single range (for example, a response to a request for a single range, or to a request for a set of ranges that overlap without any holes), this content is transmitted with a Content-Range header, and a Content-Length header showing the number of bytes actually transferred. For example, HTTP/1.1 206 Partial content Date: Wed, 15 Nov 1995 06:25:24 GMT Last-Modified: Wed, 15 Nov 1995 04:58:08 GMT Content-Range: bytes 21010-47021/47022 Content-Length: 26012 Content-Type: image/gif When an HTTP message includes the content of multiple ranges (for example, a response to a request for multiple non-overlapping ranges), these are transmitted as a multipart message. The multipart media type used for this purpose is "multipart/byteranges" as defined in appendix 19.2. See appendix 19.6.3 for a compatibility issue. A response to a request for a single range MUST NOT be sent using the multipart/byteranges media type. A response to a request for multiple ranges, whose result is a single range, MAY be sent as a multipart/byteranges media type with one part. A client that cannot decode a multipart/byteranges message MUST NOT ask for multiple byte-ranges in a single request. When a client requests multiple byte-ranges in one request, the server SHOULD return them in the order that they appeared in the request. If the server ignores a byte-range-spec because it is syntactically invalid, the server SHOULD treat the request as if the invalid Range header field did not exist. (Normally, this means return a 200 response containing the full entity). If the server receives a request (other than one including an If- Range request-header field) with an unsatisfiable Range request- header field (that is, all of whose byte-range-spec values have a first-byte-pos value greater than the current length of the selected resource), it SHOULD return a response code of 416 (Requested range not satisfiable) (section 10.4.17). Note: clients cannot depend on servers to send a 416 (Requested range not satisfiable) response instead of a 200 (OK) response for an unsatisfiable Range request-header, since not all servers implement this request-header. 14.35 Range 14.35.1 Byte Ranges Since all HTTP entities are represented in HTTP messages as sequences of bytes, the concept of a byte range is meaningful for any HTTP entity. (However, not all clients and servers need to support byte- range operations.) Byte range specifications in HTTP apply to the sequence of bytes in the entity-body (not necessarily the same as the message-body). A byte range operation MAY specify a single range of bytes, or a set of ranges within a single entity. ranges-specifier = byte-ranges-specifier byte-ranges-specifier = bytes-unit "=" byte-range-set byte-range-set = 1#( byte-range-spec | suffix-byte-range-spec ) byte-range-spec = first-byte-pos "-" [last-byte-pos] first-byte-pos = 1*DIGIT last-byte-pos = 1*DIGIT The first-byte-pos value in a byte-range-spec gives the byte-offset of the first byte in a range. The last-byte-pos value gives the byte-offset of the last byte in the range; that is, the byte positions specified are inclusive. Byte offsets start at zero. If the last-byte-pos value is present, it MUST be greater than or equal to the first-byte-pos in that byte-range-spec, or the byte- range-spec is syntactically invalid. The recipient of a byte-range- set that includes one or more syntactically invalid byte-range-spec values MUST ignore the header field that includes that byte-range- set. If the last-byte-pos value is absent, or if the value is greater than or equal to the current length of the entity-body, last-byte-pos is taken to be equal to one less than the current length of the entity- body in bytes. By its choice of last-byte-pos, a client can limit the number of bytes retrieved without knowing the size of the entity. suffix-byte-range-spec = "-" suffix-length suffix-length = 1*DIGIT A suffix-byte-range-spec is used to specify the suffix of the entity-body, of a length given by the suffix-length value. (That is, this form specifies the last N bytes of an entity-body.) If the entity is shorter than the specified suffix-length, the entire entity-body is used. If a syntactically valid byte-range-set includes at least one byte- range-spec whose first-byte-pos is less than the current length of the entity-body, or at least one suffix-byte-range-spec with a non- zero suffix-length, then the byte-range-set is satisfiable. Otherwise, the byte-range-set is unsatisfiable. If the byte-range-set is unsatisfiable, the server SHOULD return a response with a status of 416 (Requested range not satisfiable). Otherwise, the server SHOULD return a response with a status of 206 (Partial Content) containing the satisfiable ranges of the entity-body. Examples of byte-ranges-specifier values (assuming an entity-body of length 10000): - The first 500 bytes (byte offsets 0-499, inclusive): bytes=0- 499 - The second 500 bytes (byte offsets 500-999, inclusive): bytes=500-999 - The final 500 bytes (byte offsets 9500-9999, inclusive): bytes=-500 - Or bytes=9500- - The first and last bytes only (bytes 0 and 9999): bytes=0-0,-1 - Several legal but not canonical specifications of the second 500 bytes (byte offsets 500-999, inclusive): bytes=500-600,601-999 bytes=500-700,601-999 14.35.2 Range Retrieval Requests HTTP retrieval requests using conditional or unconditional GET methods MAY request one or more sub-ranges of the entity, instead of the entire entity, using the Range request header, which applies to the entity returned as the result of the request: Range = "Range" ":" ranges-specifier A server MAY ignore the Range header. However, HTTP/1.1 origin servers and intermediate caches ought to support byte ranges when possible, since Range supports efficient recovery from partially failed transfers, and supports efficient partial retrieval of large entities. If the server supports the Range header and the specified range or ranges are appropriate for the entity: - The presence of a Range header in an unconditional GET modifies what is returned if the GET is otherwise successful. In other words, the response carries a status code of 206 (Partial Content) instead of 200 (OK). - The presence of a Range header in a conditional GET (a request using one or both of If-Modified-Since and If-None-Match, or one or both of If-Unmodified-Since and If-Match) modifies what is returned if the GET is otherwise successful and the condition is true. It does not affect the 304 (Not Modified) response returned if the conditional is false. In some cases, it might be more appropriate to use the If-Range header (see section 14.27) in addition to the Range header. If a proxy that supports ranges receives a Range request, forwards the request to an inbound server, and receives an entire entity in reply, it SHOULD only return the requested range to its client. It SHOULD store the entire received response in its cache if that is consistent with its cache allocation policies. 14.27 If-Range If a client has a partial copy of an entity in its cache, and wishes to have an up-to-date copy of the entire entity in its cache, it could use the Range request-header with a conditional GET (using either or both of If-Unmodified-Since and If-Match.) However, if the condition fails because the entity has been modified, the client would then have to make a second request to obtain the entire current entity-body. The If-Range header allows a client to "short-circuit" the second request. Informally, its meaning is `if the entity is unchanged, send me the part(s) that I am missing; otherwise, send me the entire new entity'. If-Range = "If-Range" ":" ( entity-tag | HTTP-date ) If the client has no entity tag for an entity, but does have a Last- Modified date, it MAY use that date in an If-Range header. (The server can distinguish between a valid HTTP-date and any form of entity-tag by examining no more than two characters.) The If-Range header SHOULD only be used together with a Range header, and MUST be ignored if the request does not include a Range header, or if the server does not support the sub-range operation. If the entity tag given in the If-Range header matches the current entity tag for the entity, then the server SHOULD provide the specified sub-range of the entity using a 206 (Partial content) response. If the entity tag does not match, then the server SHOULD return the entire entity using a 200 (OK) response. http://www.ietf.org/rfc/rfc5005.txt Feed Paging and Archiving Request for Comments: 5005 September 2007 3. Paged Feeds A paged feed is a set of linked feed documents that together contain the entries of a logical feed, without any guarantees about the stability of each document's contents. Paged feeds are lossy; that is, it is not possible to guarantee that clients will be able to reconstruct the contents of the logical feed at a particular time. Entries may be added or changed as the pages of the feed are accessed, without the client becoming aware of them. Therefore, clients SHOULD NOT present paged feeds as coherent or complete, or make assumptions to that effect. Paged feeds can be useful when the number of entries is very large, infinite, or indeterminate. Clients can "page" through the feed, only accessing a subset of the feed's entries as necessary. For example, a search engine might make query results available as a paged feed, so that queries with very large result sets do not overwhelm the server, the network, or the client. The feed documents in a paged feed are tied together with the following link relations: o "first" - A URI that refers to the furthest preceding document in a series of documents. o "last" - A URI that refers to the furthest following document in a series of documents. o "previous" - A URI that refers to the immediately preceding document in a series of documents. o "next" - A URI that refers to the immediately following document in a series of documents. Paged feed documents MUST have at least one of these link relations present, and should contain as many as practical and applicable. Example: Atom-formatted Paged Feed <?xml version="1.0" encoding="utf-8"?> <feed xmlns="http://www.w3.org/2005/Atom"> <title>Example Feed</title> <link href="http://example.org/"/> <link rel="self" href="http://example.org/index.atom"/> <link rel="next" href="http://example.org/index.atom?page=2"/> <updated>2003-12-13T18:30:02Z</updated> <author> <name>John Doe</name> </author> <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id> <entry> <title>Atom-Powered Robots Run Amok</title> <link href="http://example.org/2003/12/13/atom03"/> <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id> <updated>2003-12-13T18:30:02Z</updated> <summary>Some text.</summary> </entry> </feed> This specification does not address duplicate entries in paged feeds.
participants (1)
-
Gary Mazz