Why should you use Protocol Buffers for data serialization?
Overview
Protocol Buffers, or protos as they are sometimes called, is an open-source, cross-platform library used to serialize structured data. It is useful when developing programs that communicate with each other over a network, or for storing data.
Why Protobuf?
In an event-driven architecture, we need to send events across the network as well as store them, so data must be serialized and deserialized often. There are several popular and effective serialization solutions:
- Java serialization
- Protocol Buffers
- Kryo
- Avro
- Thrift
Key factors in deciding on a solution are the data format and schema evolution. We will use these to distinguish between two popular solutions, Protobuf and Avro. But before that, let us cover the other candidates and their pros and cons.
Java serialization has been around for a long time and has been used very extensively, but it suffers from security vulnerabilities and slow conversion. There is a white paper on fixing and evolving it, which you can read here: Towards Better Serialization. Second, it does not provide any schema evolution. Some will bat for JSON, but it is too verbose and slow to parse, has no binary support, and no schema standard. That leaves us with three candidates that fit the bill: Thrift, Protocol Buffers and Avro. All three support cross-language serialization of data using a schema, binary data transmission, and code-generation APIs.
The performance of a serialization format depends on many factors, such as the data set and the library's API implementation. For instance, Avro does not support extensions and nesting the way Protobuf does, and should not be used for very complex objects. Benchmarks of the various formats can be found on the web. Overall, I have found that Protobuf has better documentation, supports extensions, nesting and complex structures like maps, and has better schema-evolution support.
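To illustrate that last point, here is a small sketch of a proto definition using a nested message type and a map field (the Inventory message is hypothetical, purely to show the syntax):

syntax = "proto3";

message Inventory {
  // A message type nested inside its parent.
  message Item {
    string sku = 1;
    int32 quantity = 2;
  }

  // A map from warehouse name to stock count.
  map<string, int32> stock_by_warehouse = 1;
  repeated Item items = 2;
}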
Schema Evolution
In real life, data is always in flux. The moment we think we have finalised a schema, someone will come up with a use case that wasn’t anticipated and want to “just quickly add a field”. Fortunately, Thrift, Protobuf and Avro all support schema evolution: we can change the schema, or have producers and consumers running different versions of the schema at the same time, and everything continues to work. That is an extremely valuable feature in a big production system, because it allows us to update different components of the system independently, at different times, without worrying about compatibility.
Protocol Buffer Model
Protobuf is a typed Interface Definition Language (IDL) with many primitive data types. It also allows composite types and namespaces through packages. We define the message formats in a .proto schema IDL file, and from that file the provided compiler (protoc) generates data-access classes for the user's language of choice. The generated class provides field accessors and builder methods for the application to interact with the data. So, in order to use protocol buffers as the model for your events, we need three things:
Pre-requisites
- Schema — message formats in a .proto file
- Protoc — the protocol buffer compiler
- Java API — the generated Java API to read and write messages
How do we create and generate protos?
1. Define a proto schema in the project directory src/main/proto and add it to the classpath. This is the standard default directory for proto definitions; if we would like to put it anywhere else, we have to configure that explicitly in the protobuf Maven plugin. Here is the sample proto:
syntax = "proto3";package demo;option java_package = "com.demo";
option java_outer_classname = "CustomerProtos";
option optimize_for=SPEED;message Customer {
int32 id = 1;
string firstName = 2;
string lastName = 3;enum EmailType {
PRIVATE = 1;
PROFESSIONAL = 2;
}message EmailAddress {
string email = 1;
EmailType type = 2 [default = PROFESSIONAL];
}repeated EmailAddress email = 5;
}message Organization {
string name = 1;
repeated Customer customer = 2;
}message CustomerList {
repeated Customer customer = 1;
}
2. The Protobuf schema is language agnostic and can be used for cross-language data transfer. To work with it in Java, we need to generate the code API using the protoc compiler. There are two ways to run protoc: use the protoc executable directly, or use the protobuf-maven-plugin. We will see both here:
protoc executable
- Download the protoc Windows executable from the protobuf releases page.
- Run the below command to generate the code:
protoc -I=$SRC_DIR --java_out=$DST_DIR $SRC_DIR/customer.proto

For example (run from C:\BLKDeveloper\Tools\protoc-3.11.4-win64\bin):

protoc-3.5.0-windows-x86_64.exe -I=C:\BLKDeveloper\eclipse-ee-2020-03\libcii\src --java_out=C:\BLKDeveloper\eclipse-ee-2020-03\libcii\src\main\java C:\BLKDeveloper\eclipse-ee-2020-03\libcii\src\main\proto\customer.proto
3. This will generate the code in the src directory as per the package declaration in the schema file.
protobuf-maven-plugin
1. Add the below configuration to pom.xml to generate the Java code during the build. Google provides platform-specific protoc binaries in the form of Maven artifacts, so protoc is downloaded during the Maven build. We add these artifacts as a compile-time dependency to our Maven project and invoke the platform-dependent binary to compile .proto sources, as shown.
<properties>
  <protobuf.version>3.11.4</protobuf.version>
  <protobuf-maven-plugin.version>0.6.1</protobuf-maven-plugin.version>
</properties>

<dependency>
  <groupId>com.google.protobuf</groupId>
  <artifactId>protobuf-java</artifactId>
  <version>${protobuf.version}</version>
</dependency>

<build>
  <extensions>
    <extension>
      <groupId>kr.motd.maven</groupId>
      <artifactId>os-maven-plugin</artifactId>
      <version>1.6.0</version>
    </extension>
  </extensions>
  <plugins>
    <plugin>
      <groupId>org.xolstice.maven.plugins</groupId>
      <artifactId>protobuf-maven-plugin</artifactId>
      <version>${protobuf-maven-plugin.version}</version>
      <configuration>
        <protocArtifact>com.google.protobuf:protoc:${protobuf.version}:exe:${os.detected.classifier}</protocArtifact>
      </configuration>
      <executions>
        <execution>
          <goals>
            <goal>compile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
2. Kindly note that the protoc Maven artifact is provided with various platform-specific classifiers: linux-x86_32, linux-x86_64, osx-x86_32, osx-x86_64, windows-x86_32 and windows-x86_64. In order to pick the right artifact, we employ the os.detected.classifier property exposed by os-maven-plugin, as shown above.
3. This will generate the Java API for the proto schema, reading from the default input path src/main/proto, writing to the default output path target/generated-sources/protobuf/java, and adding that path to the build path. If we need custom directories, we add the below properties:
<!-- protobuf paths -->
<protobuf.input.directory>${project.basedir}/src/main/proto</protobuf.input.directory>
<protobuf.output.directory>${project.build.directory}/generated-sources</protobuf.output.directory>
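These properties are not picked up automatically; one way to wire them in is shown below (a sketch, assuming the plugin's protoSourceRoot and outputDirectory configuration parameters):

<configuration>
  <protocArtifact>com.google.protobuf:protoc:${protobuf.version}:exe:${os.detected.classifier}</protocArtifact>
  <!-- Point the plugin at the custom input and output directories. -->
  <protoSourceRoot>${protobuf.input.directory}</protoSourceRoot>
  <outputDirectory>${protobuf.output.directory}</outputDirectory>
</configuration>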
4. While iterating, when we continuously update the schema and want to regenerate code, we can run the below Maven command in the IDE, passing the location of the protoc executable:
mvn protobuf:compile -DprotocExecutable="C:/BLKDeveloper/Tools/protoc-3.11.4-win64/bin/protoc.exe"
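Once the classes are generated by either approach, the application works with them through the builder and accessor methods mentioned earlier. Here is a minimal sketch, assuming the Customer schema above was compiled into the com.demo.CustomerProtos outer class:

import com.demo.CustomerProtos.Customer;

public class CustomerDemo {

    public static void main(String[] args) throws Exception {
        // Build a message through the generated builder API.
        Customer customer = Customer.newBuilder()
                .setId(1)
                .setFirstName("Jane")
                .setLastName("Doe")
                .addEmail(Customer.EmailAddress.newBuilder()
                        .setEmail("jane.doe@example.com")
                        .setType(Customer.EmailType.PROFESSIONAL))
                .build();

        // Serialize to the compact binary wire format.
        byte[] bytes = customer.toByteArray();

        // Parse it back and read fields through the generated accessors.
        Customer parsed = Customer.parseFrom(bytes);
        System.out.println(parsed.getFirstName() + " " + parsed.getLastName());
    }
}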
Troubleshooting steps while setting up protobuf
If you are using IntelliJ IDEA, you should not have any problem.
If you are using Eclipse, kindly install an additional Eclipse plugin, because m2e does not evaluate the extension specified in pom.xml. Download os-maven-plugin-1.6.1.jar, put it into the <ECLIPSE_HOME>/dropins directory and restart Eclipse.
Schema Evolution (data versioning)
Versioning is controlled in the .proto IDL file through field numbers. Each field's numbered tag identifies that field in the binary wire format, so writers and readers can match fields without relying on field names. A message version is thus a function of the field numbering provided by Protobuf and of how those numbers change between different iterations of the data structure.
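As a sketch, evolving the Customer message above might look like this: existing tags stay untouched, and a new field takes a previously unused number, so an old reader simply skips the unknown field and a new reader falls back to the default when the field is absent. (The phoneNumber field is hypothetical, purely for illustration.)

// Version 2 of the Customer message (nested types omitted for brevity).
// Tags 1-3 and 5 mean exactly what they meant in version 1.
message Customer {
  int32 id = 1;
  string firstName = 2;
  string lastName = 3;
  repeated EmailAddress email = 5;
  string phoneNumber = 6; // new in v2; old readers ignore it safely
}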
General Rules for versioning
Below are guidelines on how .proto fields should be updated to ensure compatible versioned Protobuf data:
- Do not change the numbered tags for the fields in the messages. This will break the design considerations meant for backward and forward compatibility.
- Do not remove a field right away if it is not being used anymore. Mark it deprecated and set a timeline for its complete removal, giving the integrated applications time to remove their dependency on that field.
- Changing a default value for a field is allowed. (Default values are never sent on the wire; a reader applies the default when the field is not present.)
- Add new fields for newer implementations and deprecate older fields in a timely way.
- Adding fields is always a safe option as long as you manage them and don’t end up with too many of them.
Please note that the required/optional field labels from proto2 are not supported in proto3; they were dropped because required, in particular, violates protobuf compatibility semantics.
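Putting the deprecation guideline into practice, retiring a field could look like the sketch below: two iterations of the same message, first marking the field deprecated, then reserving its tag and name after removal so they can never be reassigned.

// Iteration 1: keep the field while consumers migrate away from it.
message Customer {
  int32 id = 1;
  string firstName = 2;
  string lastName = 3 [deprecated = true];
}

// Iteration 2 (a later release): the field is gone; reserving the tag
// and name prevents them from being reused for something else.
message Customer {
  int32 id = 1;
  string firstName = 2;
  reserved 3;
  reserved "lastName";
}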