Gearman provides a basic command line tool that helps to distribute work from clients to worker processes/machines. By default the worker process can only receive data on its standard input. With the
xargs tool we can work around this and route data to the worker process's arguments.
At work, I regularly have to run several long running processes, like neural network training. Most processes are independent, and running them one after the other on the same core of the same machine is obviously a big waste of time, as we have several number-crunching machines with several cores here at the lab.
I'm investigating some job queue/load balancing tools to parallelize my work. Currently I'm looking into Gearman, a generic framework for distributing work to other machines or processes, which seems to fit most of my needs. Its core is written in C, but it provides APIs for several languages, so you can write your client processes (which define the work to do) and worker processes (which do the real work) in, for example, Python, PHP or Perl.
However, I don't want to depend heavily on Gearman as a framework, because I want to be able to run my applications and scripts without it too. Therefore, I'm trying to limit myself to the Gearman command line tool to run clients and workers, so my source code does not depend on Gearman; I only use it when launching scripts and jobs.
Gearman command line tool example
The Gearman manual describes how to use the command line tool
gearman. For the sake of discussion I'll adapt the example given there (counting words with the
wc tool). The worker, which registers the function 'wordcount', is started with
$ gearman -w -f wordcount wc
The worker runs in the foreground here (and keeps running until we kill it, with CTRL-C for example), so for the remainder of the experiment we have to open a new shell.
You can then consume the 'wordcount' service as a client with
$ gearman -f wordcount < /etc/passwd
62 114 3306
Note that the gearman command line workflow communicates through standard input and output: the worker process receives its input on standard input and writes its "results" to standard output, which is redirected to the client. In the example above we also used the client's standard input to provide the data. We can also pass data to the client as command line arguments, but note that this launches a separate job for each argument (and also note that the jobs will not be run in parallel this way):
$ gearman -f wordcount foo barr "a string with spaces"
0 1 3
0 1 4
0 4 20
Can I haz procezz argumentz?
In my case, however, I typically want to run jobs with commands like
some_script.py --option=A /path/to/file
some_script.py --option=B /path/to/other/file
But with the gearman command line tool I can't set the arguments (like
/path/to/other/file) of the worker process, as I only have access to its standard input. I could of course rewrite
some_script.py to read its settings from standard input, but that breaks my workflow when I don't want or need to use the Gearman framework.
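For the record, such a stdin-driven rewrite would look roughly like this. This is only a sketch of the alternative I'd rather avoid; the option/path format and the function name `parse_job` are made up for illustration:

```python
# Sketch of the stdin-driven rewrite I want to avoid: parse each job's
# settings (e.g. "--option=A /path/to/file") from a line of text instead
# of from sys.argv. A real worker would loop over sys.stdin and feed
# each line to parse_job.
import shlex

def parse_job(line):
    """Split one job line into an options dict and a list of paths."""
    options, paths = {}, []
    for token in shlex.split(line):
        if token.startswith("--") and "=" in token:
            key, _, value = token.partition("=")
            options[key[2:]] = value
        else:
            paths.append(token)
    return options, paths

print(parse_job("--option=A /path/to/file"))
# → ({'option': 'A'}, ['/path/to/file'])
```

This works, but it forces every script to grow a second, stdin-based way of receiving its settings, which is exactly the kind of framework lock-in I want to avoid.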
Xargs to the rescue!
A workaround is to use xargs for routing the standard input to process arguments. To illustrate this, consider this small Python script
show_input.py, which shows the input data it received from the process arguments and the standard input:
#!/usr/bin/env python
import sys
print "sys.argv:", sys.argv
print "sys.stdin:", sys.stdin.readlines()
$ echo foo bar | ./show_input.py baz
sys.argv: ['./show_input.py', 'baz']
sys.stdin: ['foo bar\n']
$ echo foo bar | xargs ./show_input.py baz
sys.argv: ['./show_input.py', 'baz', 'foo', 'bar']
sys.stdin: []
Now, let's start a Gearman worker with
xargs and this script:
$ gearman -w -f show_input xargs ./show_input.py additional_argument
And consume the service:
$ gearman -f show_input foo bar
sys.argv: ['./show_input.py', 'additional_argument', 'foo']
sys.stdin: []
sys.argv: ['./show_input.py', 'additional_argument', 'bar']
sys.stdin: []
Note again how a separate job is created for each argument. To group arguments, we can use quotes:
$ gearman -f show_input "foo bar baz"
sys.argv: ['./show_input.py', 'additional_argument', 'foo', 'bar', 'baz']
sys.stdin: []
When the arguments themselves contain spaces, we have to escape those too, or use another level of quotes (this works for me in Bash at least). Also, when the first argument starts with a dash, add a
-- to the mix, so gearman does not try to consume it itself.
$ gearman -f show_input -- "-a foo bar\ baz 'one two three'"
sys.argv: ['./show_input.py', 'additional_argument', '-a', 'foo', 'bar baz', 'one two three']
sys.stdin: []
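As a sanity check on the quoting: xargs's tokenization rules resemble POSIX shell quoting, which Python's shlex.split follows, so you can preview how a gearman payload will be split into worker arguments without starting a worker at all. A sketch, independent of Gearman (and note that xargs's rules only resemble, not exactly match, shell quoting):

```python
# Preview how a payload string will be tokenized: shlex.split applies
# POSIX-shell-style quoting rules (double/single quotes, backslash
# escapes), similar to what xargs does with its input.
import shlex

payload = r"-a foo bar\ baz 'one two three'"
print(shlex.split(payload))
# → ['-a', 'foo', 'bar baz', 'one two three']
```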
To summarize: Gearman provides a basic command line tool that helps to distribute work from clients to worker processes/machines, but by default the worker process can only receive data on its standard input. If we want to set the worker process's arguments instead, e.g. as if we invoked the worker directly with
# Without Gearman: directly running the job.
$ worker_command -a foo "bar baz"
we can use an
xargs workaround when launching the worker:
# With Gearman: launch worker.
$ gearman -w -f name xargs worker_command
and consume it as a client with
# With Gearman: client consumes service.
$ gearman -f name -- '-a foo "bar baz"'