Apache Avro is a serialization framework that designed with compact, fast, extensibility and interoperability in mind. It is first started to provide better serialization mechanism for Apache Hadoop, the open source distributed computing framework.

Avro provides mechanisms to store object data or sending it over the network for RPC. In both case, the data is always serialized with its schema. Because the schema is always present at serialization and deserialization, there’s no need to tag the data. This can result in more compact serialized data. This post covers the Avro schema.

Avro schema defines how the data associated with it should be serialized and deserialized. It is represented in JSON format.

0. Primitive Types

Avro defines eight primitive types, including null (not value), boolean (a binary value), int (32-bit signed integer), long (64-bit signed integer), float (32-bit floating point), double (64-bit floating point), bytes (sequence of 8-bit unsigned bytes), and string (unicode character sequence). Primitive types have no attributes.

Note that Avro types are language independent. Different language can have different representation for the same Avro data type. For instance, the double type is represented as double in C/C++ and Java, but as float in Python.

1. Complex Types

Six complex types are supported, including records, enums, arrays, maps, unions and fixed. Complex types has attributes, and can be formed  by primitive types or complex types.

Record: record is the most commonly used complex type. An example of a record defined in a schema file is shown as below.

{

 "namespace": "avro", 

 "type": "record", 

 "name": "DemoRecord",

 "aliases": ["LinkedRecord"],                      // old name for this

 "fields" : [

{"name": "desp", "type": "string"},

{"name": "value", "type": "long"},             // each element has a long

{"name": "next", "type": ["DemoRecord", "null"]} // optional next element

 ]

}

The following attributes are required in a record schema.

  • name: the name of the record
  • type: a JSON string “record”
  • fields: a JSON array describing the data fields used to form the record. A data record can have the following attributes.
    • name: required. Indicating the name of the field.
    • type: required. Indicating the type of the field. It can be primitive or complex type.
    • default: optional. However, this is required when we deserialize data that doesn’t have this field.
    • order: optional. Specify how this field affect the sort ordering of the record.
    • doc: optional. a JSON string describing the field.
    • aliases: optional. a JSON array of strings, providing alternative names for this field.

The following attributes are optional for a record schema.

  • namespace: used to avoid name conflict. corresponds to Java package.
  • doc: describing the record
  • aliases: a JSON array of strings. indicating alternate names

Enum: An example of enum defined in a schema file is shown as below.

{ 

 "namespace": "avro", 

 "type": "enum",

 "name": "DemoEnum",

 "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]

}

It has the following required attributes.

  • name: the name of the enum
  • type: a JSON string “enum”
  • symbols: a JSON array of JSON strings, indicating all symbols defined for the enum. The symbols are required to be unique.

Similar to records, namespace, doc and aliases are optional.
Arrays: An example of a record containing array as one of its fields are as below,

{

 "namespace": "avro", 

 "type": "record", 

 "name": "DemoArray",

 "fields" : [

{"name": "anArr", "type": {"type": "array", "items": "string"}}

]

}

The following attributes are required.

  • type: must be a JSON string “array”
  • items: the schema of the array’s items. In the example above, the items are string.

Map: An example of a schema file defining a record with a map as one of its fields are as below.

{

 "namespace": "avro", 

 "type": "record", 

 "name": "DemoMap",

 "fields" : [

{"name": "aMap", "type": {"type": "map", "values": "long"}}

 ]

}

The following attributes are required.

  • type: a JSON string “map”
  • values: the schema of the map’s values. The keys are assumed to be strings. In the example above, the values are of long type.

Union: Below is a sample schema of a record containing a union as one of its fields.

{

 "namespace": "avro", 

 "type": "record", 

 "name": "DemoUnion",

 "fields" : [

{"name": "aUnion", "type": ["string", "null"]}

 ]

}

Unions are defined as a JSON array. In the example above, the field aUnion can either be a string or null.

Fixed: fixed define a data of fixed number of bytes. Below is a sample schema of a fixed.

{"namespace": "avro", "type": "fixed", "size": 16, "name": "DemoFixed"}

It has the following required attributes.

  • name: a JSON string contains the name of the fixed.
  • type: a JSON string “fixed”
  • size: the number of bytes in a value of the fixed type.

It has the following optional attributes.

  • namespace: similar to record.
  • aliases: similar to record.

References:
Apache Avro specification 1.7.1: http://avro.apache.org/docs/current/spec.html

 

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Set your Twitter account name in your settings to use the TwitterBar Section.