
parallel shell scripting


PopcornKing
10th March 2010, 03:07 PM
So I was running some bash scripts on a directory of data (~600 files) last night.
My processor usage was limited to 1 core, and I could barely see it on a graph.
I have dual quad-core processors and thought it was a shame not to use the other cores.
So I threw together a little bash script to run my jobs in parallel.
My total processor usage now looks like a sawtooth function.

To make it a rectangular/uniform function, I think I need a more sophisticated wait/spawn subshell scheme.
I currently wait for all subshells to finish, but ideally, once one is done, I should start another.
What's the best way to implement that?
Thoughts? Comments in general are welcome.


#already have created file_list containing filenames
num_files=${#file_list[@]}
ncpu=$(grep -c '^processor' /proc/cpuinfo)
#determine remainder when dividing file_list size by the number of your cpus/cores
n_start_files=`expr ${num_files} % ${ncpu}`

#process the remainder in parallel
echo "Processing initial num of files"
for(( i=0; i < ${n_start_files}; i++))
do
#spawn all in subshells
( call_my_script "${file_list[${i}]}" ) &
done
#wait for the subshells/scripts to finish
wait
echo "Done processing initial num of files"


i=${n_start_files}
#number of full parallel loops required to run your data
num_loops=$(( num_files / ncpu ))
echo "Begin parallel processing of files"
for (( j=0; j < ${num_loops}; j++))
do
#this spawns $ncpu subshells executing the script
for ((k=0; k < ${ncpu}; k++))
do
( call_my_script "${file_list[${i}+${j}*${ncpu}+${k}]}" ) &
done
#wait for subshells/script to finish
wait
echo "finished ${j}th parallel iteration"
done
echo "Done parallel processing of files"

echo "Leaving ${0}"

exit 0

droidhacker
10th March 2010, 08:48 PM
Yes, something with a little more synchronization would help you.

There is the "suspend" command, which might be helpful here. One idea is to fork in a loop until some predetermined number of child processes exist, and then suspend. Have each child process send a SIGCONT to the parent process upon completion.

i.e.:


for FILE in WHATEVER
do
# $$ is the script's own PID, even inside ( ... ); $PPID there would still be the script's parent
(child_process_here "$FILE"; kill -CONT $$) &
if [ `jobs | wc -l` -gt 15 ]
then
suspend -f
fi
done


Note the final command of the child process; that gets the parent to resume and spawn a new child (until the for loop is exhausted, obviously).

Probably could use a little tweaking, but should be enough to get you going.

You see the main difference here: rather than waiting for the children to all finish, it just waits for ANY child to give it a kick.
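
A minimal sketch of the same "wait for ANY child" idea, using bash's `wait -n` builtin (bash 4.3 or later) instead of suspend/SIGCONT. This is a swapped-in technique, not the scheme above. The names `process_file`, `run_pool` and the `file_1..file_10` arguments are hypothetical stand-ins; `process_file` takes the place of call_my_script.

```bash
#!/usr/bin/env bash
# Sketch only: keeps at most max_jobs children running at a time.
# Requires bash >= 4.3 for `wait -n`.

results=$(mktemp)   # scratch file where the stand-in job records its work

# Hypothetical stand-in for call_my_script: sleep a bit, then log the name.
process_file() {
    sleep "0.$(( RANDOM % 3 ))"
    echo "$1" >> "$results"
}

# run_pool MAX_JOBS FILE...
run_pool() {
    local max_jobs=$1
    shift
    local running=0 f
    for f in "$@"; do
        if (( running >= max_jobs )); then
            wait -n                      # block until any one child exits
            running=$(( running - 1 ))
        fi
        process_file "$f" &
        running=$(( running + 1 ))
    done
    wait                                 # drain the remaining children
}

run_pool 4 file_{1..10}
processed=$(( $(wc -l < "$results") ))
echo "processed $processed files"
```

Because the parent blocks inside `wait -n` rather than testing and then suspending, there is no window between the test and the suspend for a signal to get lost.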

This isn't foolproof though... I can tell you one spot where it could fail (result of timing)... if one of the child processes finishes execution and signals the parent AFTER the test, but BEFORE the suspend, then you will be down one process.

A simple solution is to run more processes than you have CPU cores. You say you have 8 cores on your machine, so run 16 child processes in parallel. The chance that NINE or more processes will ALL finish after the test and before the suspend is negligible.

Note:
What you are looking for to solve this last timing problem is "mutual exclusion" and/or "semaphore".
You can achieve this using atomic operations, like "mkdir" or "ln -s" -- the return code of these indicates whether the command succeeded or failed. So right before your child process signals the parent, you can do something really simple like this:

while ! ln -s somefile 2>/dev/null
do
usleep 10
done
sleep 1
# $$ is the PID of the main script, also inside a ( ... ) subshell
kill -CONT $$
** this will put the child into a sleeping loop until it is able to actually create a symlink to some file, i.e. while that symlink ***already exists*** due to having been created by some OTHER child process.

and throw in an "rm -f somefile" to the parent immediately before it suspends.
The "sleep 1" there will ensure that the parent is able to delete the symlink and suspend before any waiting child will have a chance to signal it again.

And yes, there are ways of getting rid of all the sleeps (except the usleep, which is needed just to keep the child process from being overly aggressive in sucking up your CPU).
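
As a rough illustration of the mutual-exclusion point, here is a small sketch of a lock built on "mkdir", which either creates the directory or fails. It is not the exact symlink scheme above. The names `lock_dir`, `counter_file`, `acquire_lock`, `release_lock` and `increment_counter` are made up for the example, and the fractional `sleep` assumes a sleep that accepts sub-second values (GNU coreutils does).

```bash
#!/usr/bin/env bash
# Sketch only: a mkdir-based mutex protecting a shared counter file.

workdir=$(mktemp -d)
lock_dir="$workdir/lock"          # hypothetical lock location
counter_file="$workdir/counter"   # hypothetical shared resource
echo 0 > "$counter_file"

acquire_lock() {
    # mkdir is atomic: of several callers, exactly one creates the directory.
    until mkdir "$lock_dir" 2>/dev/null; do
        sleep 0.01                # back off so we do not spin on the CPU
    done
}

release_lock() {
    rmdir "$lock_dir"
}

increment_counter() {
    acquire_lock
    local n
    n=$(< "$counter_file")        # read-modify-write, safe only under the lock
    echo $(( n + 1 )) > "$counter_file"
    release_lock
}

for i in {1..20}; do
    increment_counter &
done
wait

final_count=$(< "$counter_file")
echo "final count: $final_count"
```

Without the lock, two children could read the same value and one increment would be lost, which is the same kind of timing failure described for the test/suspend window.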

bmvbab
11th March 2010, 10:11 AM

Try this out: http://code.google.com/p/ppss/

Louwrentius
25th March 2010, 11:32 PM
The main problem with many 'simple' parallel scripts is that there are often race conditions as described above. It makes your script unreliable.

I am the author of PPSS and specifically made PPSS atomic. It is not affected (to my knowledge) by any race conditions. PPSS uses atomic operations such as 'mkdir' and locking mechanisms to prevent race conditions.

It depends of course on what you're planning to do, but PPSS may be of help to you.

OleTange
7th June 2010, 10:58 PM
With GNU Parallel http://www.gnu.org/software/parallel/ you can do:

find datadir -type f | parallel -j+0 call_my_script

This would run call_my_script for each of the files, keeping number-of-cores jobs running in parallel, even if some jobs are slower than others.
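
Where GNU Parallel is not installed, a similar effect can be approximated with `xargs -P` (a GNU/BSD extension, not POSIX). This is a different tool, shown only as a sketch. The temporary `datadir`, the `sample_*.dat` files and the `call_my_script` function body are invented for the demonstration; the real call_my_script would take its place.

```bash
#!/usr/bin/env bash
# Sketch only: run one job per file, at most $ncpu at a time, via xargs -P.

datadir=$(mktemp -d)
touch "$datadir"/sample_{1..12}.dat      # made-up input files

# Hypothetical stand-in for call_my_script: writes "<file>.output".
call_my_script() {
    basename "$1" > "$1.output"
}
export -f call_my_script                 # make it visible to child bash processes

ncpu=$(getconf _NPROCESSORS_ONLN)        # number of online cores

# -print0 / -0 keep filenames with spaces intact; -n 1 passes one file per job.
find "$datadir" -type f -name '*.dat' -print0 |
    xargs -0 -n 1 -P "$ncpu" bash -c 'call_my_script "$1"' _

output_count=$(( $(find "$datadir" -type f -name '*.output' | wc -l) ))
echo "created $output_count output files"
```

Like `parallel -j+0`, xargs starts a new job as soon as a slot frees up, so slow files do not hold back the rest of a batch.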

If your jobs are CPU heavy you may want to let remote machines help you. Here we assume 'call_my_script filename' generates a file called 'filename.output' which we will have to get back from the remote servers:

find datadir -type f | parallel -j+0 -S server1.example.com -S server2.example.net -S : --trc {}.output call_my_script