Apache Avro

{{Short description|Open-source remote procedure call framework}}

{{Use mdy dates|date=April 2016}}

{{Infobox software

| name = Apache Avro

| logo = Apache Avro Logo 2023.svg

| logo size = 250px

| screenshot =

| caption =

| collapsible =

| developer = Apache Software Foundation

| released = {{Start date and age|2009|11|02|df=yes}}{{Cite web|url=https://blog.cloudera.com/blog/2009/11/avro-a-new-format-for-data-interchange/|title=Apache Avro: a New Format for Data Interchange|website=blog.cloudera.com|access-date=2019-03-10}}

| latest release version = 1.11.3

| latest release date = {{release date and age|2023|9|23}}{{Cite web|url=https://avro.apache.org/releases.html|title=Apache Avro Releases|website=avro.apache.org|access-date=2023-09-23}}

| latest preview version =

| latest preview date =

| operating_system =

| repo = {{URL|https://gitbox.apache.org/repos/asf?p{{=}}avro.git|Avro Repository}}

| programming language = Java, C, C++, C#, Perl, Python, PHP, Ruby

| genre = Remote procedure call framework

| license = Apache License 2.0

| website = {{url|//avro.apache.org/}}

}}

Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes and between client programs and the Hadoop services.

Avro uses a schema to structure the data that is being encoded. It has two schema languages: one intended for human editing (Avro IDL) and a more machine-readable one based on JSON.{{cite book |last1=Kleppmann |first1=Martin |title=Designing Data-Intensive Applications |date=2017 |publisher=O'Reilly |page=122 |edition= First}}

It is similar to Thrift and Protocol Buffers, but does not require running a code-generation program when a schema changes (unless desired for statically-typed languages).

Apache Spark SQL can access Avro as a data source.{{cite web|url=http://dataconomy.com/3-reasons-hadoop-analytics-big-deal/|title=3 Reasons Why In-Hadoop Analytics are a Big Deal - Dataconomy|date=April 21, 2016|website=dataconomy.com}}

Avro Object Container File

An Avro Object Container File consists of a file header, followed by one or more file data blocks.{{Cite web|url=https://avro.apache.org/docs/++version++/specification/#object-container-files|title=Apache Avro Specification: Object Container Files|website=avro.apache.org|access-date=2024-09-08}}

The file header consists of:

  • Four bytes, ASCII 'O', 'b', 'j', followed by the Avro version number which is 1 (0x01) (Binary values 0x4F 0x62 0x6A 0x01).
  • File metadata, including the schema definition.
  • The 16-byte, randomly-generated sync marker for this file.
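The header layout above can be sketched as a small parser. This is a minimal illustration using only the Python standard library, not the official Avro implementation: the helper names (`read_long`, `read_bytes`, `parse_header`) and the sample header bytes are hypothetical, and the sketch assumes the file metadata is encoded as an Avro map from string keys to byte values, with lengths and counts written as zigzag variable-length integers.

```python
import io

MAGIC = b"Obj\x01"   # ASCII 'O', 'b', 'j' followed by version byte 1
SYNC_SIZE = 16       # length of the sync marker in bytes


def read_long(stream):
    """Decode one zigzag, variable-length integer from a binary stream."""
    shift = 0
    accum = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("unexpected end of data")
        byte = raw[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:      # high bit clear: last byte of this number
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)   # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte sequence."""
    return stream.read(read_long(stream))


def parse_header(stream):
    """Return (metadata dict, sync marker) from the start of a container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = read_long(stream)
        if count == 0:           # a zero count ends the map
            break
        if count < 0:            # negative count: a byte size follows
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            metadata[key] = read_bytes(stream)
    return metadata, stream.read(SYNC_SIZE)


# Hypothetical header: one metadata entry and a made-up sync marker.
sample = MAGIC + b"\x02" + b"\x14avro.codec" + b"\x08null" + b"\x00" + bytes(range(16))
metadata, sync = parse_header(io.BytesIO(sample))
print(metadata)   # {'avro.codec': b'null'}
```

In a real file the `avro.schema` metadata entry would hold the JSON schema text, which is what allows a reader to decode the data blocks without any outside information.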

For data blocks, Avro specifies two serialization encodings:{{Cite web|url=https://avro.apache.org/docs/++version++/specification/#encodings|title=Apache Avro Specification: Encodings|website=avro.apache.org|access-date=2024-09-08}} binary and JSON. Most applications use the binary encoding, as it is smaller and faster. For debugging and web-based applications, the JSON encoding may sometimes be appropriate.
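To give a sense of why the binary encoding is compact, the following sketch encodes integers and strings the way the binary encoding is generally described: integers are zigzag-encoded and written as variable-length bytes, and strings are written as a length followed by UTF-8 data. The function names `encode_long` and `encode_string` are hypothetical helpers for illustration, not part of any Avro library.

```python
def encode_long(n):
    """Encode a 64-bit signed integer as zigzag plus variable-length bytes."""
    z = (n << 1) ^ (n >> 63)          # zigzag: small magnitudes become small codes
    out = bytearray()
    while z & ~0x7F:                  # more than 7 bits remain
        out.append((z & 0x7F) | 0x80) # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Encode a string as its byte length followed by UTF-8 data."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


print(encode_long(8).hex())           # 10
print(encode_long(256).hex())         # 8004
print(encode_long(-1).hex())          # 01
print(encode_string("Alyssa").hex())  # 0c416c79737361
```

Small values occupy a single byte, and no field names or type tags are written, since the reader is expected to have the schema.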

Schema definition

Avro schemas are defined using JSON. Schemas are composed of primitive types (null, boolean, int, long, float, double, bytes, and string) and complex types (record, enum, array, map, union, and fixed).{{Cite web|url=https://avro.apache.org/docs/current/gettingstartedpython.html#Defining+a+schema|title=Apache Avro Getting Started (Python)|website=avro.apache.org|access-date=2019-03-11|archive-date=June 5, 2016|archive-url=https://web.archive.org/web/20160605204052/http://avro.apache.org/docs/current/gettingstartedpython.html#Defining+a+schema|url-status=dead}}

Simple schema example:

{

"namespace": "example.avro",

"type": "record",

"name": "User",

"fields": [

{"name": "name", "type": "string"},

{"name": "favorite_number", "type": ["null", "int"]},

{"name": "favorite_color", "type": ["null", "string"]}

]

}
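Because a schema is ordinary JSON, it can be inspected with a generic JSON parser. The sketch below uses only Python's standard `json` module (not an Avro library) to list which fields of the example schema are nullable; the helper `is_nullable` is a hypothetical name introduced for illustration.

```python
import json

SCHEMA_JSON = """
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"]},
    {"name": "favorite_color", "type": ["null", "string"]}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)


def is_nullable(field_type):
    """A union is written as a JSON array; it is nullable if it lists "null"."""
    return isinstance(field_type, list) and "null" in field_type


full_name = f'{schema["namespace"]}.{schema["name"]}'
nullable = [f["name"] for f in schema["fields"] if is_nullable(f["type"])]

print(full_name)  # example.avro.User
print(nullable)   # ['favorite_number', 'favorite_color']
```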

Serializing and deserializing

Data in Avro can be stored together with its corresponding schema, so a serialized item can be read without knowing the schema ahead of time.

Example serialization and deserialization code in Python

Serialization:{{Cite web|url=https://avro.apache.org/docs/current/gettingstartedpython.html|title=Apache Avro Getting Started (Python)|website=avro.apache.org|access-date=2019-03-11|archive-date=June 5, 2016|archive-url=https://web.archive.org/web/20160605204052/http://avro.apache.org/docs/current/gettingstartedpython.html|url-status=dead}}

import avro.schema

from avro.datafile import DataFileReader, DataFileWriter

from avro.io import DatumReader, DatumWriter

# Need to know the schema to write. According to 1.8.2 of Apache Avro

schema = avro.schema.parse(open("user.avsc", "rb").read())

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)

writer.append({"name": "Alyssa", "favorite_number": 256})

writer.append({"name": "Ben", "favorite_number": 8, "favorite_color": "red"})

writer.close()

File "users.avro" will contain the schema in JSON and a compact binary representation{{Cite web|url=https://avro.apache.org/docs/++version++/specification/#data-serialization-and-deserialization|title=Apache Avro Specification: Data Serialization|website=avro.apache.org|access-date=2024-09-08}} of the data:

$ od -v -t x1z users.avro

0000000 4f 62 6a 01 04 14 61 76 72 6f 2e 63 6f 64 65 63 >Obj...avro.codec<

0000020 08 6e 75 6c 6c 16 61 76 72 6f 2e 73 63 68 65 6d >.null.avro.schem<

0000040 61 ba 03 7b 22 74 79 70 65 22 3a 20 22 72 65 63 >a..{"type": "rec<

0000060 6f 72 64 22 2c 20 22 6e 61 6d 65 22 3a 20 22 55 >ord", "name": "U<

0000100 73 65 72 22 2c 20 22 6e 61 6d 65 73 70 61 63 65 >ser", "namespace<

0000120 22 3a 20 22 65 78 61 6d 70 6c 65 2e 61 76 72 6f >": "example.avro<

0000140 22 2c 20 22 66 69 65 6c 64 73 22 3a 20 5b 7b 22 >", "fields": [{"<

0000160 74 79 70 65 22 3a 20 22 73 74 72 69 6e 67 22 2c >type": "string",<

0000200 20 22 6e 61 6d 65 22 3a 20 22 6e 61 6d 65 22 7d > "name": "name"}<

0000220 2c 20 7b 22 74 79 70 65 22 3a 20 5b 22 69 6e 74 >, {"type": ["int<

0000240 22 2c 20 22 6e 75 6c 6c 22 5d 2c 20 22 6e 61 6d >", "null"], "nam<

0000260 65 22 3a 20 22 66 61 76 6f 72 69 74 65 5f 6e 75 >e": "favorite_nu<

0000300 6d 62 65 72 22 7d 2c 20 7b 22 74 79 70 65 22 3a >mber"}, {"type":<

0000320 20 5b 22 73 74 72 69 6e 67 22 2c 20 22 6e 75 6c > ["string", "nul<

0000340 6c 22 5d 2c 20 22 6e 61 6d 65 22 3a 20 22 66 61 >l"], "name": "fa<

0000360 76 6f 72 69 74 65 5f 63 6f 6c 6f 72 22 7d 5d 7d >vorite_color"}]}<

0000400 00 05 f9 a3 80 98 47 54 62 bf 68 95 a2 ab 42 ef >......GTb.h...B.<

0000420 24 04 2c 0c 41 6c 79 73 73 61 00 80 04 02 06 42 >$.,.Alyssa.....B<

0000440 65 6e 00 10 00 06 72 65 64 05 f9 a3 80 98 47 54 >en....red.....GT<

0000460 62 bf 68 95 a2 ab 42 ef 24 >b.h...B.$<

0000471
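The data block near the end of the dump (starting at the bytes `04 2c`, just after the first sync marker) can be decoded by hand. The sketch below is a hypothetical standard-library decoder, not the Avro library's reader. It assumes the union branch order shown in the schema embedded in the dump, `["int", "null"]` and `["string", "null"]`, so a branch index of 0 selects the value and 1 selects null.

```python
import io


def read_long(stream):
    """Decode one zigzag, variable-length integer."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)[0]
        accum |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)


def read_string(stream):
    """Read a length-prefixed UTF-8 string."""
    return stream.read(read_long(stream)).decode("utf-8")


def read_user(stream):
    """Decode one User record; each union starts with its branch index."""
    name = read_string(stream)
    favorite_number = read_long(stream) if read_long(stream) == 0 else None
    favorite_color = read_string(stream) if read_long(stream) == 0 else None
    return {
        "name": name,
        "favorite_number": favorite_number,
        "favorite_color": favorite_color,
    }


# Bytes copied from the dump: object count, byte size, then two records.
block = bytes.fromhex(
    "04" "2c"
    "0c416c79737361" "00" "8004" "02"       # Alyssa, int 256, null
    "0642656e" "00" "10" "00" "06726564"    # Ben, int 8, string "red"
)
stream = io.BytesIO(block)
count = read_long(stream)       # number of records in the block
size = read_long(stream)        # size in bytes of the serialized records
users = [read_user(stream) for _ in range(count)]
print(count, size)              # 2 22
print(users)
```

The 22 bytes of record data carry no field names at all; every name and type comes from the schema stored in the file header.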

Deserialization:

# The schema is embedded in the data file

reader = DataFileReader(open("users.avro", "rb"), DatumReader())

for user in reader:

    print(user)

reader.close()

This outputs:

{'name': 'Alyssa', 'favorite_number': 256, 'favorite_color': None}

{'name': 'Ben', 'favorite_number': 8, 'favorite_color': 'red'}

Languages with APIs

Though theoretically any language could use Avro, the following languages have APIs written for them:{{cite web|url=https://github.com/phunt/avro-rpc-quickstart|title=GitHub - phunt/avro-rpc-quickstart: Apache Avro RPC Quick Start. Avro is a subproject of Apache Hadoop.|author=phunt|work=GitHub|access-date=April 13, 2016}}{{Cite web| title = Supported Languages - Apache Avro - Apache Software Foundation| access-date = 2016-04-21| url = https://cwiki.apache.org/confluence/display/AVRO/Supported+Languages}}

  • C
  • C++
  • C#{{cite web|url=https://issues.apache.org/jira/browse/AVRO/fixforversion/12316197|title=Avro: 1.5.1 - ASF JIRA|access-date=April 13, 2016}}{{cite web|url=https://issues.apache.org/jira/browse/AVRO-533|title=[AVRO-533] .NET implementation of Avro - ASF JIRA|access-date=April 13, 2016}}{{cite web|url=https://cwiki.apache.org/confluence/display/AVRO/Supported+Languages|title=Supported Languages|access-date=April 13, 2016}}
  • Elixir{{Cite web|title=AvroEx|url=https://hexdocs.pm/avro_ex/api-reference.html|access-date=October 18, 2017|website=hexdocs.pm}}{{Cite web|title=Avrora — avrora v0.21.1|url=https://hexdocs.pm/avrora/readme.html|access-date=2021-06-11|website=hexdocs.pm}}
  • Go{{Cite web |title=avro package - github.com/hamba/avro - Go Packages |url=https://pkg.go.dev/github.com/hamba/avro |access-date=2023-07-04 |website=pkg.go.dev}}{{Citation |title=goavro |date=2023-06-30 |url=https://github.com/linkedin/goavro |access-date=2023-07-04 |publisher=LinkedIn}}
  • Haskell{{cite web|url=https://github.com/GaloisInc/avro|title=Native Haskell implementation of Avro|publisher=Thomas M. DuBuisson, Galois, Inc.|access-date=August 8, 2016}}
  • Java
  • JavaScript{{cite web|url=https://github.com/mtth/avsc|title=Pure JavaScript implementation of the Avro specification|website=GitHub|access-date=May 4, 2020}}
  • Perl
  • PHP
  • Python{{Cite web |title=Getting Started (Python) |url=https://avro.apache.org/docs/1.11.1/getting-started-python/ |access-date=2023-07-04 |website=Apache Avro |language=en}}{{Citation |last=Avro |first=Apache |title=avro: Avro is a serialization and RPC framework. |url=https://avro.apache.org/ |access-date=2023-07-04}}
  • Ruby
  • Rust{{cite web|url=https://crates.io/crates/apache-avro|title=Apache Avro client library implementation in Rust|access-date=December 17, 2018}}

Avro IDL

In addition to supporting JSON for type and protocol definitions, Avro includes experimental{{cite web|url=http://avro.apache.org/docs/current/idl.html|title=Apache Avro 1.8.2 IDL|access-date=2019-03-11|archive-date=September 20, 2010|archive-url=https://web.archive.org/web/20100920001652/http://avro.apache.org/docs/current/idl.html|url-status=dead}} support for an alternative interface description language (IDL) syntax known as Avro IDL. Previously known as GenAvro, this format is designed to ease adoption by users familiar with more traditional IDLs and programming languages, with a syntax similar to C/C++, Protocol Buffers and others.

Logo

The original Apache Avro logo was from the defunct British aircraft manufacturer Avro (originally A.V. Roe and Company).{{Cite web|url=http://avroheritagemuseum.co.uk/the-avro-logo/|title=The Avro Logo|website=avroheritagemuseum.co.uk|access-date=2018-12-31}}

The Apache Avro logo was updated to an original design in late 2023.{{Cite web|url=https://issues.apache.org/jira/browse/AVRO-3908|title=[AVRO-3908] Update project logo everywhere - ASF JIRA|website=apache.org|access-date=2024-02-06}}


References

{{Reflist}}

Further reading

  • {{cite book | url = https://archive.org/details/hadoopdefinitive0000whit | url-access = registration | title = Hadoop: The Definitive Guide | isbn = 978-1-4493-8973-4 | date = November 2010 | first = Tom | last = White}}

{{Apache Software Foundation}}

{{Data Exchange}}

{{DEFAULTSORT:Avro}}

Category:Apache Software Foundation projects

Category:Inter-process communication

Category:Application layer protocols

Category:Remote procedure call

Category:Data serialization formats

Category:Articles with example Python (programming language) code