Apache Avro™ 1.7.6 Documentation

Dear DFDL folks,

Thought you would be interested in the following link. I'd be interested in comparisons of this (Apache Avro), the other two systems mentioned (Apache Thrift and Google Protocol Buffers) with DFDL in terms of goals, schema capabilities and general application potential.

Alan

Topic: Introduction

Apache Avro™ is a data serialization system.

Avro provides:

- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- Remote procedure call (RPC).
- Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation as an optional optimization, only worth implementing for statically typed languages.

Schemas

Avro relies on schemas. When Avro data is read, the schema used when writing it is always present. This permits each datum to be written with no per-value overheads, making serialization both fast and small. This also facilitates use with dynamic, scripting languages, since data, together with its schema, is fully self-describing.

When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program. If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.

When Avro is used in RPC, the client and server exchange schemas in the connection handshake. (This can be optimized so that, for most calls, no schemas are actually transmitted.) Since both client and server both have the other's full schema, correspondence between same named fields, missing fields, extra fields, etc. can all be easily resolved.

Avro schemas are defined with JSON. This facilitates implementation in languages that already have JSON libraries.

Comparison with other systems

Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.

- Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
- Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
- No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.

Apache Avro, Avro, Apache, and the Avro and Apache logos are trademarks of The Apache Software Foundation.

Link: http://avro.apache.org/docs/current/
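The "untagged data" and "resolved symbolically, using field names" points quoted above can be made concrete with a small sketch. It does not use the Avro library; it hand-rolls just enough of the idea in plain Python. The `User` schemas, the field names, and all helper functions (`write_long`, `encode_record`, `resolve`, and so on) are assumptions made up for illustration. It assumes Avro's binary encoding of `long` as a zigzag variable-length integer and of `string` as a length prefix followed by UTF-8 bytes; real Avro schema resolution covers much more (type promotion, aliases, unions) and is performed while decoding rather than as a separate step.

```python
import io
import json

# Hypothetical writer and reader schemas, defined in JSON as Avro schemas are.
WRITER_SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "age",  "type": "long"}]}
""")
READER_SCHEMA = json.loads("""
{"type": "record", "name": "User",
 "fields": [{"name": "name",  "type": "string"},
            {"name": "email", "type": "string", "default": "unknown"}]}
""")


def write_long(out, n):
    """Write a 64-bit integer as a zigzag-encoded variable-length integer."""
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        out.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    out.write(bytes([n]))


def read_long(inp):
    """Inverse of write_long."""
    shift = result = 0
    while True:
        b = inp.read(1)[0]
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def write_string(out, s):
    data = s.encode("utf-8")
    write_long(out, len(data))
    out.write(data)


def read_string(inp):
    return inp.read(read_long(inp)).decode("utf-8")


WRITERS = {"long": write_long, "string": write_string}
READERS = {"long": read_long, "string": read_string}


def encode_record(schema, record):
    """Untagged encoding: field values in schema order, no names or type tags."""
    out = io.BytesIO()
    for field in schema["fields"]:
        WRITERS[field["type"]](out, record[field["name"]])
    return out.getvalue()


def decode_record(schema, data):
    """Decoding is only possible because the writer's schema is available."""
    inp = io.BytesIO(data)
    return {f["name"]: READERS[f["type"]](inp) for f in schema["fields"]}


def resolve(reader_schema, record):
    """Match fields by name: drop extras, fill missing fields from defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        if field["name"] in record:
            resolved[field["name"]] = record[field["name"]]
        elif "default" in field:
            resolved[field["name"]] = field["default"]
        else:
            raise ValueError("no value or default for " + field["name"])
    return resolved


encoded = encode_record(WRITER_SCHEMA, {"name": "Ada", "age": 36})
decoded = decode_record(WRITER_SCHEMA, encoded)
print(len(encoded), decoded, resolve(READER_SCHEMA, decoded))
```

The encoded record carries only a string length, the string bytes, and one integer; the field names and types live in the schema, which is why the reader needs the writer's schema to make sense of the bytes at all.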

Avro, Protocol Buffers etc. are tuned towards data exchange over a wire -- one can encode/decode (serialize/de-serialize) user-defined data structures with relative ease, and the transport layer ensures data integrity etc. Essentially, those are data exchange protocols with flexible payload definitions. DFDL, on the other hand, focuses solely on the data description side of things, and thus cannot easily be compared to those protocols.

Focusing on the data description side, the expressiveness of the data descriptions used by Avro, PB etc. is usually much more limited than DFDL's capabilities -- they are geared toward ease of use more than toward completeness.

My $0.02, Andre.

On Sat, May 10, 2014 at 9:08 AM, Sill, Alan <alan.sill@ttu.edu> wrote:
Dear DFDL folks,
Thought you would be interested in the following link. I'd be interested in comparisons of this (Apache Avro), the other two systems mentioned (Apache Thrift and Google Protocol Buffers) with DFDL in terms of goals, schema capabilities and general application potential.
Alan
-- dfdl-wg mailing list dfdl-wg@ogf.org https://www.ogf.org/mailman/listinfo/dfdl-wg
-- It was a sad and disappointing day when I discovered that my Universal Remote Control did not, in fact, control the Universe. (Not even remotely.)

The distinction is prescriptive vs descriptive. All the things you mentioned are prescriptive: you use the format and gain the benefits thereof. DFDL is descriptive. You have data in some form a priori. You are not choosing to represent it in some way; you have to describe the way the data *is*.

DFDL is not something new. It is a standard designed from experience with many commercial software systems that each have their own distinct and proprietary way of describing data.

An important distinction is also many-to-one vs point-to-point. To communicate between two or a few systems you can select a preferred technology. When you need to enable data interchange among hundreds of different systems whose design you have no control or influence over, that is when the descriptive approach is most important.

On May 10, 2014 3:08 AM, "Sill, Alan" <alan.sill@ttu.edu> wrote:
Dear DFDL folks,
Thought you would be interested in the following link. I'd be interested in comparisons of this (Apache Avro), the other two systems mentioned (Apache Thrift and Google Protocol Buffers) with DFDL in terms of goals, schema capabilities and general application potential.
Alan
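The descriptive approach from the reply above can be illustrated with a rough sketch: the data already exists in a fixed layout that nobody gets to redesign, and a separate description tells a generic parser how to read it. This is not DFDL syntax (DFDL descriptions are written as annotated XML Schema); the record layout, the `DESCRIPTION` table, and the `parse` function are all hypothetical and only meant to show the direction of the relationship between data and description.

```python
# A hypothetical legacy record: fixed-width ASCII text whose layout already
# exists and is not under our control.
LEGACY_RECORD = "SMITH     0042014-05-10"

# A description of the data as it *is*: where each field sits and how to read it.
DESCRIPTION = [
    {"name": "customer", "offset": 0,  "length": 10, "type": "text"},
    {"name": "quantity", "offset": 10, "length": 3,  "type": "int"},
    {"name": "date",     "offset": 13, "length": 10, "type": "text"},
]


def parse(description, raw):
    """Generic parser driven entirely by the description, not by the format."""
    result = {}
    for field in description:
        chunk = raw[field["offset"]:field["offset"] + field["length"]]
        if field["type"] == "int":
            result[field["name"]] = int(chunk)       # zero-padded digits
        else:
            result[field["name"]] = chunk.rstrip()   # space-padded text
    return result


print(parse(DESCRIPTION, LEGACY_RECORD))
```

Supporting another existing format then means writing another description, not changing the data or the parser, which is the many-to-one situation the reply refers to.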
participants (3)
- Andre Merzky
- Mike Beckerle
- Sill, Alan