Saturday, March 26, 2022

Apache NiFi: Avoid these common pitfalls

Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. It has a powerful UI which can be used for both development and operations. In addition, the NiFi Registry is available to make promoting software from one environment to the next easier. To help you use NiFi efficiently, I'd like to point out some common pitfalls.

Manually execute deployment tasks

You might think Apache NiFi is a low-code solution and that you can avoid coding related to NiFi altogether. I'm afraid this is not completely the case.

Suppose you want to upgrade a process group to a new version. For this, all queues need to be empty. It helps to disable controller services and process groups to prevent new messages from being processed (and queues filling up again). If you want to do this manually, it takes quite some time (especially for the controller services and queues) and it's boring. In order to make releases more reproducible, predictable and less error-prone, it helps to automate such tasks, for example with a script like this. Deployments will go a lot faster this way (meaning less downtime and less boring work to do).

NiFi provides a powerful API to help you automate tasks. If you want to efficiently use the API, several SDKs are available. One of those SDKs is NiPyAPI. NiPyAPI makes it easier to use the API from Python code.
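To illustrate, the steps described above (stop the process group, disable its controller services, confirm the queues are empty, then change the version) could be sketched with NiPyAPI roughly as follows. This is a minimal sketch, not the sample script mentioned earlier. The function names prepare_and_upgrade and run are hypothetical, and the NiPyAPI calls are written as they are understood to exist in recent releases, so check them against the version you have installed.

```python
def prepare_and_upgrade(canvas, versioning, pg_name, target_version=None):
    """Stop a process group, disable its controller services, verify that
    its queues are empty, then change the flow version.

    `canvas` and `versioning` are expected to be the nipyapi.canvas and
    nipyapi.versioning modules (or test doubles offering the same functions).
    """
    pg = canvas.get_process_group(pg_name, identifier_type="name", greedy=False)
    if pg is None:
        raise LookupError(f"Process group {pg_name!r} not found")

    # 1. Stop the processors so no new flowfiles are picked up.
    canvas.schedule_process_group(pg.id, scheduled=False)

    # 2. Disable the controller services in this group and its children.
    for controller in canvas.list_all_controllers(pg.id, descendants=True):
        canvas.schedule_controller(controller, scheduled=False)

    # 3. Refuse to continue while flowfiles are still queued.
    queued = sum(
        conn.status.aggregate_snapshot.flow_files_queued
        for conn in canvas.list_all_connections(pg.id, descendants=True)
    )
    if queued:
        raise RuntimeError(f"{queued} flowfile(s) still queued in {pg_name!r}")

    # 4. Change to the target version (None is assumed to mean "latest").
    versioning.update_flow_ver(pg, target_version=target_version)
    return pg


def run(host, pg_name, target_version=None):
    """Wire the sketch to a real NiFi instance (needs `pip install nipyapi`)."""
    import nipyapi  # imported lazily so prepare_and_upgrade stays testable

    nipyapi.config.nifi_config.host = host  # e.g. "http://localhost:8080/nifi-api"
    return prepare_and_upgrade(
        nipyapi.canvas, nipyapi.versioning, pg_name, target_version
    )
```

The sketch raises an error instead of purging queues, because purging drops data. A real script would probably wait for the queues to drain before stopping everything, and would enable the controller services and start the process group again after the upgrade.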

You can also quite easily automate finding unused parameters or empty sensitive values (also see the above sample script). Dealing with those helps to improve the quality of the environment.
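As a rough idea of what such a check could look like, the sketch below walks over parameter context entities and reports parameters that no component references, as well as sensitive parameters without a value. The function name audit_parameter_contexts is hypothetical. It assumes the entities have the shape returned by nipyapi.parameters.list_all_parameter_contexts() (component.parameters[].parameter with name, sensitive, value and referencing_components), and that NiFi returns no value for a sensitive parameter that has not been set.

```python
def audit_parameter_contexts(contexts):
    """Return (context name, parameter name, finding) tuples.

    `contexts` is assumed to be a list of parameter context entities, for
    example the result of nipyapi.parameters.list_all_parameter_contexts().
    """
    findings = []
    for ctx in contexts:
        for entity in ctx.component.parameters or []:
            param = entity.parameter
            # No referencing components: nothing in the flow uses it.
            if not param.referencing_components:
                findings.append((ctx.component.name, param.name, "unused"))
            # Sensitive values are masked when set, so an empty value is
            # taken here to mean the parameter was never filled in.
            if param.sensitive and not param.value:
                findings.append(
                    (ctx.component.name, param.name, "empty sensitive value")
                )
    return findings
```

Against a live instance this could be called as audit_parameter_contexts(nipyapi.parameters.list_all_parameter_contexts()) after setting nipyapi.config.nifi_config.host. Parameters that are only used through inherited contexts may need extra handling.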

Avoid scripts for processing of flowfiles

NiFi is easy to extend and has scripting capabilities in several processors, for example the ExecuteScript processor. Often, however, with some creativity you can achieve very similar functionality using the standard available processors. Using the standard functionality is (most of the time) more secure and easier to maintain. If you cannot achieve something with the standard functionality, you can also consider creating an external service for this or using a source or target system.

Scripting introduces a security liability

In order to use the scripting capabilities, you need the "execute code" permission (see here). This permission allows you to "execute arbitrary code assuming all permissions that NiFi has". This is very powerful and allows you to do just about anything the user that runs the NiFi server can do. This can be abused. In addition, the ExecuteScript processor is marked as experimental and the impact of sustained usage has not been verified. In theory it could have a memory leak somewhere and make your system less stable and reliable.

Scripts are difficult to maintain

Using scripts in NiFi can be done in two ways.

  • You can copy/paste the code directly in a property of a NiFi processor (the Script Body property)
  • You can refer to a script on the filesystem on which the NiFi server is running (the Script File property)


If you copy/paste the code directly in a property of a NiFi processor, the code will end up in the NiFi Registry when you commit a new version of a process group. The NiFi Registry, however, lacks several features of products which are specifically made for version control of code, such as Git. You only get to see who committed which process group; you cannot, for example, use branching/merging or see who is to blame for which line of code (this might be possible when using Git to back the NiFi Registry, but I have not checked this).

When using a file on the filesystem, version control of the process group and of the script are detached. If you want to upgrade the script, you cannot do so from the NiFi API or web interface; you need access to the filesystem of every NiFi node. This is less secure, and it probably requires you to build a separate deployment mechanism for scripts next to the NiFi Registry. You will probably also need to coordinate the deployment of the flow and the script in a single release, which adds complexity. Better to avoid this altogether.

If you use external dependencies, you need to put them on the filesystem of the server and configure them explicitly. Python in NiFi is actually Jython, so you can run into the situation where a Python module is available but there is no Jython alternative. In such a case you can use ExecuteProcess to execute a native Python interpreter (which has to be installed on the NiFi server). The challenge here is how to make the flowfile available to this process. This will probably mean first persisting the flowfile somewhere and passing its location as an argument to a script. Again, this is not straightforward, not so secure, and difficult to maintain.
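To make the workaround concrete, the external script could look something like the sketch below: it receives the path of the persisted flowfile content as its only argument and writes the result to standard output. The file name process_flowfile.py and the transform function are hypothetical placeholders; the upper-casing simply stands in for whatever needs the native Python module.

```python
import sys
from pathlib import Path


def transform(text):
    """Placeholder for the logic that needs a native Python module."""
    return text.upper()


def main(argv):
    """Read the file named in argv[1], transform it, write it to stdout.

    Returns a process exit code: 0 on success, 2 on wrong usage.
    """
    if len(argv) != 2:
        print("usage: process_flowfile.py <path-to-content>", file=sys.stderr)
        return 2
    content = Path(argv[1]).read_text(encoding="utf-8")
    sys.stdout.write(transform(content))
    return 0
```

Saved as a script, the file would end with sys.exit(main(sys.argv)) under the usual __main__ guard, and ExecuteProcess would run the interpreter with the script and the file path as arguments. Something still has to write that file first and clean it up afterwards, which is exactly the extra moving part that makes this approach hard to maintain.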

Do not use templates for deployment of flows

Templates


You can create templates of process groups. These templates can be exported in one environment and imported in the next. You might be tempted to use them, for example, to promote changes from one environment to the next or to set up a local environment in a place where you do not have access to the NiFi Registry.

A template contains the process groups, processors, connections and controller services which are scoped to the process group (or child process groups). The process groups in the template lose connection to version control (are not tracked anymore). Also the explicit assignment of parameter contexts to process groups is lost. In effect this means that if parameters or parameter contexts are missing, processors will become invalid when imported in another environment.

When the connection to version control is lost, I have not found an option to re-establish it. You can start version control on unversioned process groups, but it will complain if the process group is already present in the Registry. I have not seen the option to indicate that a process group is already present in the Registry and that it should assume a specific version (i.e. not add it to version control as a new process group but re-establish the link with an existing process group which is already present in the Registry).

Templates can be a big help during development if you have many flows which are relatively similar. You can create a template in which as much as possible is parametrized and assign a parameter context to make it specific. Do mind though that when a template is applied, the link to the original template is lost. When the original template is updated, the flows which were based on it will not be updated.

The NiFi Registry


When using the NiFi Registry to promote flows, the reference to the process group in version control remains. Even when child process groups are version controlled, those also remain linked to version control. The link means you can see if the local version has been changed when compared to the version in the Registry, if there is a new version available in the Registry and if local modifications cause a conflict with the new version.
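The state of this link can also be read from a script. The sketch below translates the state reported for a process group into a short description. The function name describe_version_state and the descriptions are hypothetical; the state names are the ones the NiFi REST API is understood to report, and the input is assumed to be what nipyapi.versioning.get_version_info(process_group) returns.

```python
STATE_MEANING = {
    "UP_TO_DATE": "matches the version in the Registry",
    "LOCALLY_MODIFIED": "has local changes that are not committed",
    "STALE": "a newer version is available in the Registry",
    "LOCALLY_MODIFIED_AND_STALE": "has local changes and a newer version exists",
    "SYNC_FAILURE": "state could not be synchronized with the Registry",
}


def describe_version_state(version_info):
    """Describe the version control state of a process group.

    `version_info` is assumed to be the entity returned by
    nipyapi.versioning.get_version_info(process_group); its
    version_control_information attribute is empty for groups that are
    not tracked (for example, groups imported from a template).
    """
    vci = getattr(version_info, "version_control_information", None)
    if vci is None:
        return "not under version control"
    return STATE_MEANING.get(vci.state, f"unknown state {vci.state}")
```

Running this over all process groups before a release gives a quick overview of which groups have uncommitted changes or are behind the Registry.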

Parameter contexts, the assignment of parameter contexts to process groups, and parameters (but not sensitive values) are part of what is committed to the Registry. This makes it easier to promote them to different environments, since it requires less scripting and fewer manual tasks.
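Because sensitive values are not committed, they still have to be set in each environment after a flow is promoted. A minimal sketch of how that could be scripted is shown below. The function name fill_sensitive_values and the secrets dictionary are hypothetical, and the NiPyAPI calls (get_parameter_context, prepare_parameter, upsert_parameter_to_context) are written as they are understood to exist in nipyapi.parameters; verify them against your installed version.

```python
def fill_sensitive_values(parameters, context_name, secrets):
    """Set sensitive parameters that have no value yet.

    `parameters` is expected to be the nipyapi.parameters module (or a test
    double), `secrets` a dict of parameter name to value, for example read
    from a vault or from environment variables.
    Returns the names of the parameters that were updated.
    """
    ctx = parameters.get_parameter_context(
        context_name, identifier_type="name", greedy=False
    )
    if ctx is None:
        raise LookupError(f"Parameter context {context_name!r} not found")

    # Collect first, so the context is not modified while iterating over it.
    missing = [
        entity.parameter.name
        for entity in ctx.component.parameters or []
        if entity.parameter.sensitive
        and not entity.parameter.value
        and entity.parameter.name in secrets
    ]

    for name in missing:
        new_param = parameters.prepare_parameter(name, secrets[name], sensitive=True)
        # Each upsert returns the updated context, which carries the new revision.
        ctx = parameters.upsert_parameter_to_context(ctx, new_param)
    return missing
```

Sensitive parameters that already have a value are left alone, so the function can be run repeatedly as part of a deployment.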
