Oracle SOA / Java blog: Graceful shutdown of forked workers in Python and JavaScript running in Docker containers

You might encounter a situation where you want to fork a script during execution, for example when the number of forks depends on user input or some other runtime condition. I ran into this when I wanted to put load on a service using multiple concurrent processes. In addition, when running in a Docker container, only the process with PID=1 receives the SIGTERM signal sent by docker stop. Once that process has terminated, the remaining worker processes receive a SIGKILL and get no chance to shut down gracefully. To shut the workers down gracefully, the main process needs to manage them and exit only after they have terminated. Why would you want processes to terminate gracefully? In my case because I keep performance data in memory (disk is too slow) and only write it to disk when the test has completed.

This seems relatively straightforward, but there are some challenges. I implemented this both in JavaScript running on Node and in Python, and the two handle forking differently.

Docker and the main process

There are different ways to start a main process in a Dockerfile. A best practice (e.g. here) is to use the ENTRYPOINT exec syntax, which accepts a JSON array to specify a main executable and fixed parameters. Because it is JSON, the strings must use double quotes. A CMD can be used to supply default parameters. The ENTRYPOINT exec syntax can look like:

ENTRYPOINT ["/bin/sh"]

This will start sh with PID=1.

ENTRYPOINT also has a shell syntax. For example:

ENTRYPOINT /bin/sh

This does something totally different! It actually executes '/bin/sh -c /bin/sh', in which the first /bin/sh has PID=1. The second /bin/sh will not receive a SIGTERM when 'docker stop' is called. Also, a CMD and any arguments passed to 'docker run' are ignored with the shell syntax, while with the exec syntax they are appended to the ENTRYPOINT. A benefit of the shell variant is that variable substitution takes place. The examples below can be tried by putting the code in a Dockerfile and running:

docker build -t test .
docker run test

Thus

FROM registry.fedoraproject.org/fedora-minimal
ENV greeting hello
ENTRYPOINT [ "/bin/echo", "$greeting" ]
CMD [ "and some more" ]

will display '$greeting and some more', since the exec syntax does no variable substitution, while /bin/echo will have PID=1.

While

FROM registry.fedoraproject.org/fedora-minimal
ENV greeting hello
ENTRYPOINT /bin/echo $greeting
CMD [ " and some more" ]

will display 'hello' (the CMD is ignored), and you cannot be sure of the PID of /bin/echo.
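The difference between the two variants can be reproduced outside Docker. The sketch below is only an analogy, not what Docker runs internally: it uses Python's subprocess module to start /bin/echo once directly (like the exec syntax, with the CMD appended as extra arguments) and once through /bin/sh -c (like the shell syntax). The names ENV, exec_form and shell_form are made up for this illustration.

```python
import subprocess

# Plays the role of "ENV greeting hello" in the Dockerfile.
ENV = {"greeting": "hello"}


def exec_form(entrypoint, cmd):
    """Exec syntax: ENTRYPOINT and CMD are joined and started directly, no shell."""
    result = subprocess.run(entrypoint + cmd, env=ENV,
                            capture_output=True, text=True)
    return result.stdout.strip()


def shell_form(entrypoint_line):
    """Shell syntax: the line is handed to /bin/sh -c, a CMD is not used."""
    result = subprocess.run(["/bin/sh", "-c", entrypoint_line], env=ENV,
                            capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # No shell is involved, so $greeting is passed to echo literally.
    print(exec_form(["/bin/echo", "$greeting"], ["and some more"]))
    # The shell substitutes $greeting before echo is started.
    print(shell_form("/bin/echo $greeting"))
```

In the second case echo is a child of the shell, which is the reason the shell process and not your own executable ends up being the one that is started first.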

You can use build arguments (ARG) in a Dockerfile. If you want to use them but do not want to use the shell variant of ENTRYPOINT, here's a trick you can use:

FROM azul/zulu-openjdk:8u202
VOLUME /tmp
ARG JAR_FILE
COPY ${JAR_FILE} app.jar
ENTRYPOINT ["java","-jar","/app.jar"]

You can use the argument in a COPY statement and make sure the target file is always the same. This way the ENTRYPOINT line stays the same, while the jar file that ends up as app.jar is determined by the JAR_FILE argument supplied at build time (docker build --build-arg JAR_FILE=<path to jar>).

Forking and signal handling

Python

See the complete sample code here

In Python you need only two imports to do signal handling: os and signal

I can fork a process by calling

pid=os.fork()

What this does is split the code execution from that point forward, with one important difference between the master and the worker process. In the worker, the value of pid is 0, while in the master it is the process ID of the newly created worker, which is greater than 0. You can base logic which is specific to master or worker on the value of pid. Do not confuse the pid variable with the result of os.getpid(): os.getpid() returns the process's own ID, which is greater than 0 in both master and worker.
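To see the return values of os.fork() for yourself, here is a minimal sketch. It is not taken from the sample code; the function name fork_once and the pipe used to report back to the master are assumptions made for this illustration. The worker sends its own os.getpid() to the master, which can then compare it with the value os.fork() returned.

```python
import os


def fork_once():
    """Fork one worker; return (pid seen in the master, pid reported by the worker)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Worker: os.fork() returned 0 here, os.getpid() gives the real PID.
        os.close(read_fd)
        os.write(write_fd, str(os.getpid()).encode())
        os._exit(0)  # end the worker here so it never runs the master's code
    # Master: os.fork() returned the PID of the new worker.
    os.close(write_fd)
    reported = int(os.read(read_fd, 32).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)  # wait for the worker so it does not linger as a zombie
    return pid, reported


if __name__ == "__main__":
    worker_pid, reported_pid = fork_once()
    print("master", os.getpid(), "forked worker", worker_pid)
    print("worker reported os.getpid() =", reported_pid)
```

Both numbers printed for the worker are the same, and both differ from the PID of the master.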

If you want the master to be able to signal the workers, save the pids of the workers in a variable in the master. You can register a signal handler using signal.signal(signal.SIGINT, exit_signal_handler). In this case the function exit_signal_handler is called when SIGINT is received. In the cleanup procedure of the master you can signal a worker with os.kill(worker_pid, signal.SIGINT). Despite its name, os.kill only sends the signal; what happens next is up to the handler in the worker. Do not forget to wait until the worker is finished with finished = os.waitpid(worker_pid, 0), or else the master might finish before the worker, causing the worker to be killed in a not so graceful manner.
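The calls mentioned above fit together roughly as in the following sketch for a single worker. It is a simplified illustration, not the sample code itself. The names run_worker and start_and_stop_worker, the result file, and the pipe the worker uses to report that its handler is installed are all assumptions of this sketch; the pipe is there so the master cannot send SIGINT before the worker is able to handle it.

```python
import os
import signal
import time


def run_worker(ready_fd, result_path):
    """Worker: keep data in memory and write it to disk only on SIGINT."""
    samples = []  # stands in for the in-memory performance data

    def exit_signal_handler(signum, frame):
        with open(result_path, "w") as handle:
            handle.write("samples=%d\n" % len(samples))
        os._exit(0)  # graceful exit once the data is on disk

    signal.signal(signal.SIGINT, exit_signal_handler)
    os.write(ready_fd, b"r")  # tell the master the handler is in place
    while True:
        samples.append(time.time())
        time.sleep(0.01)


def start_and_stop_worker(result_path):
    """Master: fork a worker, signal it, and wait until it has finished."""
    ready_read, ready_write = os.pipe()
    worker_pid = os.fork()
    if worker_pid == 0:
        os.close(ready_read)
        try:
            run_worker(ready_write, result_path)
        finally:
            os._exit(1)  # only reached if the worker failed unexpectedly
    os.close(ready_write)
    os.read(ready_read, 1)  # block until the worker reports it is ready
    os.close(ready_read)
    os.kill(worker_pid, signal.SIGINT)    # sends the signal, does not wait
    finished = os.waitpid(worker_pid, 0)  # (pid, exit status) of the worker
    return finished
```

Calling start_and_stop_worker("/tmp/result.txt") (the path is just an example) returns only after the worker has written its file, which is exactly what the os.waitpid call is for.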

JavaScript

See the complete sample code here

In JavaScript, forking might be a less obvious thing to do than in Python, since it is good practice in JavaScript to write code that is as non-blocking as possible. The event loop and the (Node.js) worker threads which pick up tasks take care of threading for you. It is a common misconception that JavaScript running on Node is single threaded. Your own JavaScript code runs on a single thread per process, but the Node process is not single threaded: a pool of worker threads handles tasks such as file I/O in the background. Every fork in this example has its own thread pool and worker threads, so the total number of threads JavaScript uses when forking is (much) higher than with Python.

A drawback of relying on the workers which pull events from the event loop is that it is difficult to obtain fine-grained control and thus predictability. If I put something on the event loop, I have no guarantee about when it will be picked up. I also can't be sure the callback handler is called immediately after execution. In my case I needed that control, so I eventually gave up on Node. I did however create an implementation similar to the Python one above.

In JavaScript you can use cluster and process to provide you with forking and signal handling capabilities:

var cluster = require('cluster');
var process = require('process');

Forking can be done with:

cluster.fork();

This works differently than in Python, though. A new process with a new PID is created, but this new process runs the script from the start, with some differences: cluster.isMaster is true in the master and false in the workers, and in the master cluster.workers holds the workers, keyed by worker id. The master can use this to signal the workers and wait for them to shut down gracefully. Also mind that master and workers do not share variable values, since a worker is started as an entirely new process instead of splitting execution at the fork call as in Python.

Signal handling can be done like this:

process.on('SIGTERM', () => {
    // pid is assumed to hold process.pid; mylogger and cleanup are defined in the sample code
    mylogger("INFO\t"+pid+"\tSIGTERM received");
    cleanup();
});

Signalling the workers can be done with:

for (const id in cluster.workers) {
    mylogger("INFO\t"+pid+"\tSending SIGTERM to worker with id: "+String(id));
    // kill() without an argument sends SIGTERM
    cluster.workers[id].process.kill();
}

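The kill() call on the process object of the worker only sends the signal (SIGTERM unless you pass another one) and returns immediately; it does not wait for the worker. The master nevertheless stays alive until its workers have shut down, because Node keeps a parent process running while its child processes are still attached, as long as the master does not call process.exit() first. Do mind that in the cleanup function you need to do different things for the master and the workers. Also mind that the id of the worker in the master is not the PID. The PID of a process is process.pid.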

Putting it together

In order to put everything together and make sure that even the workers get a chance at a graceful shutdown when a docker stop is issued, several things are needed:

The master needs to have PID=1 so it can receive the SIGTERM which is issued by docker stop. This can be achieved by using the ENTRYPOINT exec syntax.
The master needs a signal handler for SIGTERM in order to respond and inform the workers.
The master needs to know how to signal the workers (by pid for Python and by id for JavaScript). In JavaScript the workers are available by default in cluster.workers. In Python you need to keep track of them yourself.
The master needs to signal the workers.
The master needs to wait for the workers to finish their graceful shutdown before exiting itself. Otherwise the workers are still killed in a not so graceful manner. This works out of the box in JavaScript. In Python, it needs an explicit os.waitpid.
The workers need signal handlers to know when to initiate a graceful shutdown.
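For Python, the list above could be sketched as follows. This is a condensed illustration under a few assumptions, not the sample code: the names worker_main, start_workers, shutdown_workers and master_main are made up, the worker cleanup is only a placeholder, and SIGINT is temporarily blocked around the fork (signal.pthread_sigmask) as one possible way to make sure a worker cannot receive the signal before its handler is installed.

```python
import os
import signal
import time

worker_pids = []  # Python does not track workers for you, so the master does


def worker_main():
    """Worker: run until SIGINT arrives, then shut down gracefully."""
    def exit_signal_handler(signum, frame):
        # Placeholder for real cleanup, such as writing results to disk.
        os._exit(0)

    signal.signal(signal.SIGINT, exit_signal_handler)
    # SIGINT was blocked before the fork; allow it now the handler exists.
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGINT})
    while True:
        time.sleep(0.05)  # the actual work of the worker would go here


def start_workers(count):
    """Master: fork the workers and remember their pids."""
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGINT})
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            try:
                worker_main()
            finally:
                os._exit(1)  # only reached if the worker failed unexpectedly
        worker_pids.append(pid)
    signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


def shutdown_workers():
    """Master: signal every worker, then wait until each one has exited."""
    for pid in worker_pids:
        os.kill(pid, signal.SIGINT)
    exit_codes = []
    for pid in worker_pids:
        _, status = os.waitpid(pid, 0)
        exit_codes.append(os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1)
    return exit_codes


def master_main(worker_count=2):
    """Master (PID=1 in the container): react to the SIGTERM from docker stop."""
    stop_requested = []
    signal.signal(signal.SIGTERM,
                  lambda signum, frame: stop_requested.append(signum))
    start_workers(worker_count)
    while not stop_requested:
        time.sleep(0.05)  # the master's own work would go here
    return shutdown_workers()
```

With a Dockerfile that starts the script through the exec syntax, a docker stop would reach the SIGTERM handler registered in master_main, and the master would return only after every worker has exited. Outside Docker you can try it by calling master_main() and sending the process a SIGTERM with kill from another terminal.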

You now know how to do all of this in Python and JavaScript with available sample code to experiment with. Have fun!

Thursday, June 6, 2019
