Thursday, May 21, 2009

find -exec vs xargs

If you want to execute a command on lots of files found by the find command, there are a few different ways this can be achieved (some more efficient than others):

-exec command {} \;
This is the traditional way. The end of the command must be marked by a semicolon, which has to be escaped (or quoted) so that the shell passes it through to find instead of interpreting it. The command argument {} is replaced by the current path name found by find. Here is a simple command which echoes file paths.

sharfah@starship:~> find . -type f -exec echo {} \;
./1.txt
./2.txt
This is very inefficient, because each time find finds a file it forks a process for your command and waits for that child process to complete before searching for the next file. In this example, you will get the following child processes: echo ./1.txt; echo ./2.txt. So if there are 1000 files, there are 1000 child processes, and find waits for each one in turn.
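The one-process-per-file behaviour can be checked with a small sketch. The temporary directory and the three file names are made up for illustration; each child sh prints one line, so counting the lines counts the child processes.

```shell
#!/bin/sh
# Hypothetical demo directory with three files (names are illustrative).
demo_dir=$(mktemp -d)
touch "$demo_dir/1.txt" "$demo_dir/2.txt" "$demo_dir/3.txt"

# With \; find starts one child per file. Each child prints one line,
# so the line count equals the number of processes started.
semicolon_runs=$(find "$demo_dir" -type f -exec sh -c 'echo run' sh {} \; | wc -l | tr -d ' ')
printf '%s\n' "child processes: $semicolon_runs"

rm -rf "$demo_dir"
```

With three files this reports three child processes, one per path.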

-exec command {} +
If you use a plus (+) instead of the escaped semicolon, find collects the path names and passes as many of them as possible to each invocation of the command. The {} must be the last argument of the command, immediately before the +.

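sharfah@starship:~> find . -type f -exec echo {} +
./1.txt ./2.txt
In this case, only one child process is created: echo ./1.txt ./2.txt, which is much more efficient, because it avoids a fork/exec for every single file. If there are too many path names to fit on one command line, find splits them across as few invocations as the system's limit allows.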
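A minimal sketch of the grouping, again using a made-up temporary directory: the child sh reports how many path names it received in a single invocation.

```shell
#!/bin/sh
# Hypothetical demo directory with three files (names are illustrative).
demo_dir=$(mktemp -d)
touch "$demo_dir/1.txt" "$demo_dir/2.txt" "$demo_dir/3.txt"

# With + the paths are appended to one command line, so sh runs once
# and $# holds the number of paths it was given.
plus_output=$(find "$demo_dir" -type f -exec sh -c 'echo "got $# paths"' sh {} +)
printf '%s\n' "$plus_output"

rm -rf "$demo_dir"
```

For a handful of files like this, the output is a single line, which shows that the command ran only once.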

xargs
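This is similar to the approach above, in that the files found are bundled into batches (by default, as many names per batch as the command-line size limit allows) and sent to the command as few times as possible. Because find writes the names to a pipe, it doesn't wait for your command to finish and can keep searching while xargs runs it.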

sharfah@starship:~> find . -type f | xargs echo
./1.txt ./2.txt
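To make the batching visible, the sketch below caps the batch size with the standard -n option, so five made-up names are spread over three echo invocations. Without -n, xargs chooses the batch size itself, which is why a short list ends up on one line.

```shell
#!/bin/sh
# Five made-up names on standard input, one per line.
# -n 2 limits each batch to two arguments, so echo runs three times.
batches=$(printf '%s\n' a.txt b.txt c.txt d.txt e.txt | xargs -n 2 echo)
printf '%s\n' "$batches"
```

Each output line corresponds to one invocation of echo.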
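This approach is efficient and works well as long as you do not have funny characters (e.g. spaces, quotes or newlines) in your filenames. xargs splits its input on whitespace and nothing escapes the names for it, so such a filename reaches your command as several broken arguments.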
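The splitting problem is easy to reproduce. The sketch below assumes a find and xargs that support -print0 and -0 (GNU and BSD versions do; older POSIX systems may not) and a temporary directory whose own path contains no whitespace. The file names are made up.

```shell
#!/bin/sh
# Hypothetical demo directory: one plain name, one name with a space.
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.txt" "$demo_dir/with space.txt"

# Default xargs splits on whitespace, so two files arrive as three arguments.
naive_count=$(find "$demo_dir" -type f | xargs sh -c 'echo "$#"' sh)

# NUL-delimited names survive intact: two files, two arguments.
safe_count=$(find "$demo_dir" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)

printf '%s\n' "plain pipe: $naive_count arguments, -print0/-0: $safe_count arguments"

rm -rf "$demo_dir"
```

The NUL byte works as a separator because it is one of the few characters that cannot appear in a path name.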

Performance Testing
So which one of the above approaches is fastest? I ran a test across a directory with 10,000 files out of which 5,600 matched my find pattern. I ran the test 10 times, changing the order of the finds each time, but the results were always the same. xargs and + were very close, with \; always finishing last. Here is one result:

time find . -name "*20090430*" -exec touch {} +
real    0m31.98s
user    0m0.06s
sys     0m0.49s

time find . -name "*20090430*" | xargs touch
real    1m8.81s
user    0m0.13s
sys     0m1.07s

time find . -name "*20090430*" -exec touch {} \;
real    1m42.53s
user    0m0.17s
sys     0m2.42s
I'm going to be using the -exec command {} + method, because it is faster and can handle my funny filenames.
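For anyone who wants to repeat the comparison, here is a rough sketch of a harness. Everything in it is an assumption made for illustration: the fixture size, the file names, the run_timed helper, and the use of date +%s (widely available, but only accurate to whole seconds, so a realistic test needs far more files than this). It is not the setup that produced the numbers above.

```shell
#!/bin/sh
# Hypothetical fixture: 500 empty files, half of which match the pattern.
bench_dir=$(mktemp -d)
i=0
while [ "$i" -lt 250 ]; do
    touch "$bench_dir/match_$i.log" "$bench_dir/other_$i.dat"
    i=$((i + 1))
done

# run_timed LABEL COMMAND...: run the command, report whole seconds elapsed.
run_timed() {
    label=$1
    shift
    start=$(date +%s)
    "$@"
    end=$(date +%s)
    echo "$label: $((end - start))s"
}

results=$(
    run_timed 'exec +' find "$bench_dir" -name 'match_*' -exec touch {} +
    run_timed 'xargs' sh -c 'find "$1" -name "match_*" | xargs touch' sh "$bench_dir"
    run_timed 'exec ;' find "$bench_dir" -name 'match_*' -exec touch {} \;
)
printf '%s\n' "$results"

rm -rf "$bench_dir"
```

Running the three variants in a different order on each pass, as in the test described above, helps to separate the effect of the method from the effect of the filesystem cache.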

1 comment:

  1. Aaron Peschel, 10:44 PM

    You can use 'find ... -print0 | xargs -0 ...' to resolve the issue with filenames with spaces.
