Apache Avro: Understand Schema
Apache Avro is a serialization framework designed to be compact, fast, extensible, and interoperable. It was originally created to provide a better serialization mechanism for Apache Hadoop, the open source distributed computing framework.
Avro provides mechanisms for storing object data and for sending it over the network via RPC. In both cases, the data is always serialized with its schema. Because the schema is always present at serialization and deserialization, there is no need to tag each value with its field name or type, which makes the serialized data more compact. This post covers the Avro schema.
An Avro schema defines how the data associated with it is serialized and deserialized. It is written in JSON.
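To illustrate why untagged data is compact, here is a minimal Python sketch of how a record's field values could be laid out back to back. It assumes Avro's binary encoding as described in the specification (a long is zig-zag and variable-length encoded; a string is its byte length followed by UTF-8 bytes). The function names and the two-field record are hypothetical, chosen only for illustration, and this is not how any particular Avro library is implemented.

```python
def encode_long(n: int) -> bytes:
    """Zig-zag, then variable-length encode a 64-bit signed integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """A string is its UTF-8 byte length (as a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_pair(desp: str, value: int) -> bytes:
    """Encode a hypothetical record with fields (desp: string, value: long).

    Values are written in schema order, with no field names or tags between them.
    """
    return encode_string(desp) + encode_long(value)


print(encode_pair("a", 1))  # b'\x02a\x02'
```

The three output bytes carry only a length, one character, and one number. A reader can make sense of them only because it knows from the schema that a string comes first and a long second.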
0. Primitive Types
Avro defines eight primitive types: null (no value), boolean (a binary value), int (32-bit signed integer), long (64-bit signed integer), float (32-bit floating point), double (64-bit floating point), bytes (a sequence of 8-bit unsigned bytes), and string (a Unicode character sequence). Primitive types have no attributes.
Note that Avro types are language independent. Different languages can represent the same Avro type differently. For instance, the double type is represented as double in C/C++ and Java, but as float in Python.
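As a rough illustration of this mapping on the Python side, the sketch below pairs each Avro primitive type name with the Python type that usually holds it, and adds range checks for int and long. The table and the function fits_primitive are hypothetical helpers written for this post, not part of any Avro library.

```python
# Hypothetical lookup table: Avro primitive type name -> Python type.
AVRO_TO_PYTHON = {
    "null": type(None),
    "boolean": bool,
    "int": int,
    "long": int,
    "float": float,   # Python has a single float type (64-bit)
    "double": float,
    "bytes": bytes,
    "string": str,
}

INT_RANGE = (-2**31, 2**31 - 1)
LONG_RANGE = (-2**63, 2**63 - 1)


def fits_primitive(value, avro_type: str) -> bool:
    """Return True if a Python value can be stored as the given Avro primitive."""
    py_type = AVRO_TO_PYTHON[avro_type]
    if isinstance(value, bool) and py_type is not bool:
        return False  # bool is a subclass of int in Python; keep them apart
    if not isinstance(value, py_type):
        return False
    if avro_type == "int":
        return INT_RANGE[0] <= value <= INT_RANGE[1]
    if avro_type == "long":
        return LONG_RANGE[0] <= value <= LONG_RANGE[1]
    return True


print(fits_primitive(3.14, "double"))  # True
print(fits_primitive(2**31, "int"))    # False
print(fits_primitive(2**31, "long"))   # True
```

Because Python's int is unbounded, the 32-bit and 64-bit limits have to be checked explicitly; in Java or C/C++ the language types already enforce them.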
1. Complex Types
Six complex types are supported: records, enums, arrays, maps, unions, and fixed. Complex types have attributes and can be built from primitive types or other complex types.
Record: the record is the most commonly used complex type. An example of a record defined in a schema file is shown below.
{
  "namespace": "avro",
  "type": "record",
  "name": "DemoRecord",
  "aliases": ["LinkedRecord"], // old name for this
  "fields": [
    {"name": "desp", "type": "string"},
    {"name": "value", "type": "long"}, // each element has a long
    {"name": "next", "type": ["DemoRecord", "null"]} // optional next element
  ]
}
The following attributes are required in a record schema.
- name: the name of the record
- type: a JSON string “record”
- fields: a JSON array describing the fields that make up the record. Each field is a JSON object with the following attributes.
  - name: required. The name of the field.
  - type: required. The type of the field. It can be a primitive or a complex type.
  - default: optional. A default value for the field, used when reading data that lacks this field (for example, data written with an older schema). Without a default, such data cannot be read with this schema.
  - order: optional. Specifies how this field affects the sort order of the record: “ascending” (the default), “descending”, or “ignore”.
  - doc: optional. A JSON string describing the field.
  - aliases: optional. A JSON array of strings providing alternative names for this field.
The following attributes are optional for a record schema.
- namespace: a JSON string that qualifies the name to avoid name conflicts. It corresponds to a Java package.
- doc: a JSON string describing the record.
- aliases: a JSON array of strings providing alternate names for the record.
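The required attributes listed above can be checked with a few lines of Python using only the standard json module. This is a minimal sketch, not a full schema validator: check_record_schema is a hypothetical helper, and the schema text is the DemoRecord example with its // comments removed, since Python's json module rejects comments.

```python
import json

SCHEMA_JSON = """
{
  "namespace": "avro",
  "type": "record",
  "name": "DemoRecord",
  "aliases": ["LinkedRecord"],
  "fields": [
    {"name": "desp", "type": "string"},
    {"name": "value", "type": "long"},
    {"name": "next", "type": ["DemoRecord", "null"]}
  ]
}
"""


def check_record_schema(schema: dict) -> list:
    """Return a list of problems; an empty list means the required attributes exist."""
    problems = []
    if schema.get("type") != "record":
        problems.append('type must be "record"')
    if not isinstance(schema.get("name"), str):
        problems.append("name is required")
    fields = schema.get("fields")
    if not isinstance(fields, list):
        problems.append("fields must be a JSON array")
        return problems
    for i, field in enumerate(fields):
        for attr in ("name", "type"):
            if not isinstance(field, dict) or attr not in field:
                problems.append(f"field {i} is missing {attr}")
    return problems


schema = json.loads(SCHEMA_JSON)
print(check_record_schema(schema))  # []
```

The optional attributes (namespace, doc, aliases, and the per-field default, order, doc, aliases) are deliberately left unchecked here.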
Enum: an example of an enum defined in a schema file is shown below.
{
  "namespace": "avro",
  "type": "enum",
  "name": "DemoEnum",
  "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"]
}
It has the following required attributes.
- name: the name of the enum
- type: a JSON string “enum”
- symbols: a JSON array of JSON strings, indicating all symbols defined for the enum. The symbols are required to be unique.
Similar to records, namespace, doc and aliases are optional.
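The uniqueness requirement on symbols is easy to check by hand. The Python sketch below uses a hypothetical helper, duplicate_symbols, that reports any symbol appearing more than once in an enum schema; it is an illustration only, since an Avro library would normally reject such a schema when parsing it.

```python
DEMO_ENUM = {
    "namespace": "avro",
    "type": "enum",
    "name": "DemoEnum",
    "symbols": ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"],
}


def duplicate_symbols(symbols) -> list:
    """Return the symbols that occur more than once, in order of first repeat."""
    seen = set()
    duplicates = []
    for symbol in symbols:
        if symbol in seen and symbol not in duplicates:
            duplicates.append(symbol)
        seen.add(symbol)
    return duplicates


print(duplicate_symbols(DEMO_ENUM["symbols"]))           # []
print(duplicate_symbols(["SPADES", "HEARTS", "SPADES"]))  # ['SPADES']
```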
Array: an example of a record containing an array as one of its fields is shown below.
{
  "namespace": "avro",
  "type": "record",
  "name": "DemoArray",
  "fields": [
    {"name": "anArr", "type": {"type": "array", "items": "string"}}
  ]
}
The following attributes are required.
- type: must be a JSON string “array”
- items: the schema of the array’s items. In the example above, the items are strings.
Map: an example of a schema file defining a record with a map as one of its fields is shown below.
{
  "namespace": "avro",
  "type": "record",
  "name": "DemoMap",
  "fields": [
    {"name": "aMap", "type": {"type": "map", "values": "long"}}
  ]
}
The following attributes are required.
- type: a JSON string “map”
- values: the schema of the map’s values. The keys are assumed to be strings. In the example above, the values are of long type.
Union: Below is a sample schema of a record containing a union as one of its fields.
{
  "namespace": "avro",
  "type": "record",
  "name": "DemoUnion",
  "fields": [
    {"name": "aUnion", "type": ["string", "null"]}
  ]
}
Unions are defined as a JSON array. In the example above, the field aUnion can either be a string or null.
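To make the array, map, and union rules concrete, the following Python sketch checks a value against the field types used in the DemoArray, DemoMap, and DemoUnion examples. The function conforms is a hypothetical helper that handles only the null, string, and long primitives plus arrays, maps, and unions; it is meant to show how the schemas nest, not to replace a real Avro validator.

```python
def conforms(value, schema) -> bool:
    """Check a Python value against a small subset of Avro schemas."""
    if isinstance(schema, list):
        # Union: the value may match any one of the listed branches.
        return any(conforms(value, branch) for branch in schema)
    if isinstance(schema, dict):
        if schema["type"] == "array":
            return isinstance(value, list) and all(
                conforms(item, schema["items"]) for item in value
            )
        if schema["type"] == "map":
            # Map keys are always strings; only the values have a schema.
            return isinstance(value, dict) and all(
                isinstance(key, str) and conforms(val, schema["values"])
                for key, val in value.items()
            )
        raise ValueError(f"unsupported schema: {schema!r}")
    if schema == "null":
        return value is None
    if schema == "string":
        return isinstance(value, str)
    if schema == "long":
        return (
            isinstance(value, int)
            and not isinstance(value, bool)
            and -2**63 <= value <= 2**63 - 1
        )
    raise ValueError(f"unsupported schema: {schema!r}")


print(conforms(["a", "b"], {"type": "array", "items": "string"}))  # True
print(conforms({"hits": 3}, {"type": "map", "values": "long"}))    # True
print(conforms(None, ["string", "null"]))                          # True
print(conforms(5, ["string", "null"]))                             # False
```

Because items, values, and union branches are themselves schemas, the same function recurses into them, so an array of maps or a union of arrays needs no extra code.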
Fixed: a fixed type defines data with a fixed number of bytes. Below is a sample schema of a fixed.
{"namespace": "avro", "type": "fixed", "size": 16, "name": "DemoFixed"}
It has the following required attributes.
- name: a JSON string containing the name of the fixed.
- type: a JSON string “fixed”
- size: the number of bytes in a value of the fixed type.
It has the following optional attributes.
- namespace: similar to record.
- aliases: similar to record.
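A value of a fixed type must contain exactly size bytes. The short Python sketch below checks that constraint against the DemoFixed schema; fits_fixed is a hypothetical helper written for illustration.

```python
DEMO_FIXED = {"namespace": "avro", "type": "fixed", "size": 16, "name": "DemoFixed"}


def fits_fixed(value, schema) -> bool:
    """Return True if value is a byte sequence of exactly schema['size'] bytes."""
    return isinstance(value, (bytes, bytearray)) and len(value) == schema["size"]


print(fits_fixed(bytes(16), DEMO_FIXED))  # True
print(fits_fixed(b"short", DEMO_FIXED))   # False
```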
References:
Apache Avro specification 1.7.1: http://avro.apache.org/docs/current/spec.html