Thursday, January 27, 2022

Java: Validating JSON against an AVRO schema

AVRO schemas are often used to serialize JSON data into a compact binary format so it can, for example, be transported efficiently over Kafka. When you want to validate your JSON against an AVRO schema in Java, you will encounter a challenge. The JSON that the Apache AVRO libraries require for validation against an AVRO schema is not standard JSON: it requires explicit typing of fields (see the example below). Also, when the validation fails, you will get errors like "Expected start-union. Got VALUE_STRING" or "Expected start-union. Got VALUE_NUMBER_INT" without a specific object, line number or indication of what is expected. Especially during development, this is insufficient.
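For example (using a hypothetical nullable faxNumber field), where standard JSON would contain

{"faxNumber": "1234567890"}

the AVRO JSON encoding expects the non-null branch of the union to be wrapped in an object keyed by its type:

{"faxNumber": {"string": "1234567890"}}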

In this blog post I'll describe a method (inspired by this) to check your JSON against an AVRO schema and get usable validation results. First you generate Java classes from your AVRO schema using the Apache AVRO Maven plugin (which needs to be configured differently than documented). Next you map a JSON object to these classes using libraries from the Jackson project and serialize it to the binary format. During these steps, you will get clear exceptions. See my sample code here.

Generate Java classes based on the AVRO schema

How to generate Java classes using Maven is described here. There is an error in the documentation though: if you follow the manual (I used the Apache AVRO 1.11.0 libraries), the configuration section of the plugin will be ignored. I did check whether there was an easy way to create an issue or mail someone about this but could not find one quickly enough. The configuration should be placed directly below the plugin tag and not below the execution tag, as shown in the sketch below.
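To illustrate, here is a sketch of how the plugin section of the pom.xml might look with this workaround applied (the version and directories are assumptions based on my project layout):

<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.11.0</version>
  <!-- Workaround: configuration directly below the plugin tag, not below the execution tag -->
  <configuration>
    <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
    <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
  </configuration>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
    </execution>
  </executions>
</plugin>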

Now you can generate Java classes by executing "mvn avro:schema". It expects to find schema files in the resources folder of your Maven project and will generate sources under the src/main/java folder.

Validate a JSON against generated classes

The Jackson ObjectMapper can map JSON to the generated classes. In my case, the JSON file was in the src/main/resources folder of my project.

import java.io.InputStream;
import com.fasterxml.jackson.databind.ObjectMapper;

// Validate a JSON file against the generated AVRO class
ClassLoader classLoader = this.getClass().getClassLoader();
InputStream is_json = classLoader.getResourceAsStream("file.json");
ObjectMapper mapper = new ObjectMapper();
try {
    // Deserialize the JSON to the AVRO class; this does not yet check schema restrictions
    YOURGENERATEDCLASS obj = mapper.readValue(is_json, YOURGENERATEDCLASS.class);
    // Serialize the class to the raw binary format; this step validates against the schema
    obj.toByteBuffer();
    System.out.println("No errors");
} catch (Exception e) {
    System.out.println(e.getMessage());
}

See it in action

You can find my sample code here.

I use the following AVRO schema:

{
  "namespace": "com.demo.avro",
  "type": "record",
  "name": "Customer",
  "fields": [
    {
      "name": "id",
      "type": "int"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "faxNumber",
      "type": [
        "null",
        "string"
      ],
      "default": null
    }
  ]
}

First I generated Java classes for the AVRO schema ("mvn avro:schema").

A valid JSON

Next I check whether this JSON is valid according to the schema:

{"id": 1677155554, "name": "Maarten", "faxNumber": "1234567890"}
I perform three different types of validation on the same JSON/AVRO combination, with the following results:

  • Directly against the schema using the Apache AVRO library:
    Expected start-union. Got VALUE_STRING
  • Against the generated classes using the Apache AVRO library:
    Expected start-union. Got VALUE_STRING
  • Against the generated classes using the Jackson ObjectMapper:
    No errors
Although this JSON would generally be considered valid, the Apache AVRO library rejects it. I did not manage to conjure up a JSON that would pass this validation. Even a random JSON generated by the Apache AVRO library itself based on the schema (using org.apache.avro.util.RandomData, see here) fails. The validation using Jackson, however, works as expected.
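For completeness, below is a minimal sketch of how the first two validations can be performed with the Apache AVRO library. It assumes the schema file is called customer.avsc and that jsonString holds the JSON to validate; these are the calls that throw the AvroTypeException with the "Expected start-union" message:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import com.demo.avro.Customer;

// Validation directly against the schema file (assumed to be customer.avsc)
Schema schema = new Schema.Parser().parse(new File("src/main/resources/customer.avsc"));
Decoder jsonDecoder = DecoderFactory.get().jsonDecoder(schema, jsonString);
GenericRecord record = new GenericDatumReader<GenericRecord>(schema).read(null, jsonDecoder);

// Validation against the generated Customer class
Decoder classDecoder = DecoderFactory.get().jsonDecoder(Customer.getClassSchema(), jsonString);
Customer customer = new SpecificDatumReader<>(Customer.class).read(null, classDecoder);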

A different field name

If I make the JSON deliberately invalid by using a "naam" field instead of a "name" field:

{"id": 1677155554, "naam": "Maarten", "faxNumber": "1234567890"}
  • Directly against the schema using the Apache AVRO library:
    Expected field name not found: name
  • Against the generated classes using the Apache AVRO library:
    Expected field name not found: name
  • Against the generated classes using the Jackson ObjectMapper:
    Unrecognized field "naam" (class com.demo.avro.Customer), not marked as ignorable (3 known properties: "id", "faxNumber", "name"]) at [Source: (BufferedInputStream); line: 1, column: 29] (through reference chain: com.demo.avro.Customer["naam"])
In all cases, the "name" field is mentioned. However, Jackson also mentions the "naam" field, which helps when determining why the JSON is invalid.

Field to object

If I change the type of name to an object in the JSON:
{"id": 1677155554, "name": {"name": "Maarten"}, "faxNumber": "1234567890"}
  • Directly against the schema using the Apache AVRO library:
    Expected string. Got START_OBJECT
  • Against the generated classes using the Apache AVRO library:
    Expected string. Got START_OBJECT
  • Against the generated classes using the Jackson ObjectMapper:
    Cannot deserialize value of type `java.lang.String` from Object value (token `JsonToken.START_OBJECT`) at [Source: (BufferedInputStream); line: 1, column: 28] (through reference chain: com.demo.avro.Customer["name"])

The last example in particular makes clear that the Apache AVRO library does not always indicate where (in which field) a validation error occurs. When working with large JSON files and AVRO schemas, this makes it hard to figure out why a message is invalid.
