Avro is an Apache™ open source project that provides data serialization and data exchange services for Hadoop®. These services can be used together or independently. Using Avro, big data can be exchanged between programs written in different languages.
Using the serialization service, programs can efficiently serialize data into files or messages. Avro stores the data definition together with the data in one message or file, making it easy for programs to dynamically understand the information stored in an Avro file or message. The data definition is stored in JSON format, which makes it easy to read and interpret; the data itself is stored in binary format, which makes it compact and efficient. Avro files include markers that can be used to split large data sets into subsets suitable for MapReduce processing. Some data exchange services use a code generator to interpret the data definition and produce code to access the data. Avro doesn't require this step, making it ideal for scripting languages.
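To make the split between a JSON definition and binary data concrete, here is a minimal sketch in plain Python. The `User` schema, its fields, and the helper names are illustrative assumptions. The encoding follows the Avro specification's rules for `long` (zigzag, variable-length) and `string` (length prefix plus UTF-8 bytes), but this is not the Avro library, and it omits the file container, the split markers, and most types.

```python
import json

# Hypothetical schema for illustration; an Avro schema is ordinary JSON.
USER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "long"}
  ]
}
""")


def encode_long(n):
    """Encode a 64-bit integer as an Avro long: zigzag, then 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Encode a string as its UTF-8 byte length (a long) followed by the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(schema, record):
    """Concatenate field encodings in schema order; field names are not written."""
    encoders = {"string": encode_string, "long": encode_long}
    return b"".join(
        encoders[field["type"]](record[field["name"]])
        for field in schema["fields"]
    )


payload = encode_record(USER_SCHEMA, {"name": "Ada", "age": 30})
```

The encoded record is five bytes long because the field names live only in the schema; a reader needs that schema to interpret the bytes.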
Avro supports a rich set of primitive data types, including numeric types, binary data and strings, and a number of complex types, including arrays, maps, enumerations and records. A sort order can also be defined for the data. A key feature of Avro is robust support for data schemas that change over time, often called schema evolution. Avro cleanly handles schema changes like missing fields, added fields and changed fields; as a result, old programs can read new data and new programs can read old data. Avro includes APIs for Java, Python, Ruby, C, C++ and more. Data stored using Avro can easily be passed from a program written in one language to a program written in another language, even from a compiled language like C to a scripting language like Pig.
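The resolution rules behind schema evolution can be sketched in a few lines of plain Python. This is a simplified illustration that works on already-decoded records, not the Avro library; the two schemas, the field names, and the `resolve` helper are hypothetical. It shows two rules from the Avro specification: a field present only in the writer's schema is ignored, and a field present only in the reader's schema takes its declared default.

```python
# Hypothetical writer (old) and reader (new) schemas for the same record.
WRITER_SCHEMA = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "fax", "type": "string"},
    ],
}
READER_SCHEMA = {
    "type": "record", "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": "unknown"},
    ],
}


def resolve(writer_schema, reader_schema, record):
    """Project a record written with writer_schema onto reader_schema."""
    written = {f["name"] for f in writer_schema["fields"]}
    result = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            result[name] = record[name]        # field known to both sides
        elif "default" in field:
            result[name] = field["default"]    # added field: use the default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return result                              # writer-only fields are dropped


old_record = {"name": "Ada", "fax": "555-0100"}
new_view = resolve(WRITER_SCHEMA, READER_SCHEMA, old_record)
```

Here a new program reads old data: `fax` is dropped because the reader no longer declares it, and `email` is filled in from its default.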
Using the data exchange service, programs can easily communicate data and information to other programs using Remote Procedure Calls (RPC). An Avro Remote Procedure Call interface is specified in JSON. An interface has two sections: a protocol declaration and a wire format. The protocol declaration defines the messages that will be exchanged; these are defined as Avro data schemas.
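A protocol declaration might look like the following sketch, loaded here with Python's standard `json` module. The `Mail` protocol, its namespace, the `Message` record, and the `send` message are hypothetical; the top-level keys (`protocol`, `namespace`, `types`, `messages`) and the per-message `request` and `response` keys follow the Avro specification.

```python
import json

# Hypothetical protocol declaration for illustration.
PROTOCOL_JSON = """
{
  "protocol": "Mail",
  "namespace": "example.avro",
  "types": [
    {"type": "record", "name": "Message",
     "fields": [
       {"name": "to",   "type": "string"},
       {"name": "body", "type": "string"}
     ]}
  ],
  "messages": {
    "send": {
      "request":  [{"name": "message", "type": "Message"}],
      "response": "string"
    }
  }
}
"""

protocol = json.loads(PROTOCOL_JSON)

# Map each message name to its parameter names and response type.
summary = {
    name: ([param["name"] for param in msg["request"]], msg["response"])
    for name, msg in protocol["messages"].items()
}
```

The request is declared as a list of named, typed parameters and the response as a single type, so both sides of a call are described with the same schema language used for stored data.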
The wire format defines three things:
- How request and response messages are sent, received and buffered
- A handshake protocol to establish communication
- The request and response message exchanges
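The first item, how messages are buffered, can be sketched as follows. In the Avro specification's message framing, a message is sent as a series of buffers, each prefixed with a four-byte big-endian length, and a zero-length buffer marks the end of the message. The helper names and the small buffer size are assumptions chosen for illustration; the handshake and the request/response call format are not shown.

```python
import struct


def frame_message(payload, buffer_size=8):
    """Split payload into length-prefixed buffers, ending with a zero-length one."""
    framed = bytearray()
    for start in range(0, len(payload), buffer_size):
        chunk = payload[start:start + buffer_size]
        framed += struct.pack(">I", len(chunk)) + chunk
    framed += struct.pack(">I", 0)  # zero-length buffer terminates the message
    return bytes(framed)


def unframe_message(framed):
    """Reassemble the payload from a framed message."""
    payload = bytearray()
    offset = 0
    while True:
        (length,) = struct.unpack_from(">I", framed, offset)
        offset += 4
        if length == 0:
            return bytes(payload)
        payload += framed[offset:offset + length]
        offset += length


framed = frame_message(b"hello, avro rpc")
restored = unframe_message(framed)
```

Because every buffer carries its own length, a receiver can read a message incrementally without knowing its total size in advance.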