Gearman provides a basic command line tool that helps to distribute work from clients to worker processes/machines. By default the worker process can only receive data on its standard input. With the
xargs tool we can work around this and route data to the worker process's arguments.
At work, I regularly have to run several long running processes, like neural network training. Most processes are independent, and running them one after the other on the same core of the same machine is obviously a big waste of time, as we have several number-crunching machines with several cores here at the lab.
I'm investigating some job queue/load balancing tools to parallelize my work. Currently I'm looking into Gearman, a generic framework for distributing work to other machines or processes, which seems to fit most of my needs. Its core is written in C, but it provides APIs for several languages, so you can write your client processes (which define the work to do) and worker processes (which do the real work) in, for example, Python, PHP or Perl.
However, I don't want to depend heavily on Gearman as a framework, because I want to be able to run my applications and scripts without it too. Therefore, I'm trying to limit myself to the Gearman command line tool to run clients and workers, so my source code does not depend on Gearman; I only use it when launching scripts and jobs.
Gearman command line tool example
The Gearman manual describes how to use the command line tool
gearman. For the sake of discussion I'll adapt the example given there (counting words with the
wc tool). The worker, which registers the function 'wordcount', is started with
$ gearman -w -f wordcount wc
The worker runs in the foreground here (and keeps running until we kill it, with CTRL-C for example), so for the remainder of the experiment we have to open a new shell.
You can then consume the 'wordcount' service as a client with
$ gearman -f wordcount < /etc/passwd
62 114 3306
Note that the gearman command line workflow communicates through standard input and output: the worker process receives its input on standard input and writes its "results" to standard output, which is redirected to the client. In the example above we also used the client's standard input to provide the data. We can also pass data to the client as command line arguments, but note that this launches a separate job for each argument (and also note that the jobs will not be run in parallel this way):
$ gearman -f wordcount foo barr "a string with spaces"
0 1 3
0 1 4
0 4 20
Can I haz procezz argumentz?
In my case, however, I typically want to run jobs with commands like
some_script.py --option=A /path/to/file
some_script.py --option=B /path/to/other/file
But with the gearman command line tool I can't set the arguments (like
/path/to/other/file) of the worker process, as I only have access to its standard input. I could of course rewrite
some_script.py to read its settings from standard input, but that breaks my workflow when I don't want or need to use the Gearman framework.
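For the record, such a stdin-driven rewrite would look roughly like this. This is only a sketch of the alternative I'd rather avoid; the option/path format and the function name `parse_job` are made up for illustration:

```python
# Sketch of the stdin-driven rewrite I want to avoid: parse each job's
# settings (e.g. "--option=A /path/to/file") from a line of text instead
# of from sys.argv. A real worker would loop over sys.stdin and feed
# each line to parse_job.
import shlex

def parse_job(line):
    """Split one job line into an options dict and a list of paths."""
    options, paths = {}, []
    for token in shlex.split(line):
        if token.startswith("--") and "=" in token:
            key, _, value = token.partition("=")
            options[key[2:]] = value
        else:
            paths.append(token)
    return options, paths

print(parse_job("--option=A /path/to/file"))
# → ({'option': 'A'}, ['/path/to/file'])
```

This works, but it forces every script to grow a second, stdin-based way of receiving its settings, which is exactly the kind of framework lock-in I want to avoid.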
Xargs to the rescue!
A workaround is to use xargs for routing the standard input to process arguments. To illustrate this, consider this small Python script
show_input.py, which shows the input data it received from the process arguments and the standard input:
#!/usr/bin/env python
import sys
print "sys.argv:", sys.argv
print "sys.stdin:", sys.stdin.readlines()
$ echo foo bar | ./show_input.py baz
sys.argv: ['./show_input.py', 'baz']
sys.stdin: ['foo bar\n']
$ echo foo bar | xargs ./show_input.py baz
sys.argv: ['./show_input.py', 'baz', 'foo', 'bar']
sys.stdin: []
Now, let's start a Gearman worker with
xargs and this script:
$ gearman -w -f show_input xargs ./show_input.py additional_argument
And consume the service:
$ gearman -f show_input foo bar
sys.argv: ['./show_input.py', 'additional_argument', 'foo']
sys.stdin: []
sys.argv: ['./show_input.py', 'additional_argument', 'bar']
sys.stdin: []
Note again how a separate job is created for each argument. To group arguments, we can use quotes:
$ gearman -f show_input "foo bar baz"
sys.argv: ['./show_input.py', 'additional_argument', 'foo', 'bar', 'baz']
sys.stdin: []
When the arguments themselves contain spaces, we have to escape those too, or use another level of quotes (this works for me in Bash at least). Also, when the first argument starts with a dash, add a
-- to the mix, so gearman does not try to consume it itself.
$ gearman -f show_input -- "-a foo bar\ baz 'one two three'"
sys.argv: ['./show_input.py', 'additional_argument', '-a', 'foo', 'bar baz', 'one two three']
sys.stdin: []
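As a sanity check on the quoting: xargs's tokenization rules resemble POSIX shell quoting, which Python's shlex.split follows, so you can preview how a gearman payload will be split into worker arguments without starting a worker at all. A sketch, independent of Gearman (and note that xargs's rules only resemble, not exactly match, shell quoting):

```python
# Preview how a payload string will be tokenized: shlex.split applies
# POSIX-shell-style quoting rules (double/single quotes, backslash
# escapes), similar to what xargs does with its input.
import shlex

payload = r"-a foo bar\ baz 'one two three'"
print(shlex.split(payload))
# → ['-a', 'foo', 'bar baz', 'one two three']
```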
To summarize: Gearman provides a basic command line tool that helps to distribute work from clients to worker processes/machines, but by default the worker process can only receive data on its standard input. If we want to set the worker process's arguments instead, e.g. as if we invoked the worker directly with
# Without Gearman: directly running the job.
$ worker_command -a foo "bar baz"
we can use an
xargs workaround when launching the worker:
# With Gearman: launch worker.
$ gearman -w -f name xargs worker_command
and consume it as a client with
# With Gearman: client consumes service.
$ gearman -f name -- '-a foo "bar baz"'