Saturday, October 20, 2012

Joining Two Files with the Unix join Command

The join command is a useful tool for joining two files on a common field. It allows you to join two files, similar to the way you would join two tables in a SQL database.

The following example illustrates the power of the join command. You have two files, one containing a list of employees with their department ids and the other containing departments and their ids. You want to find out the names of the departments for each employee. You MUST first sort the files on the department id column (using the sort command) and then join them on that column.

$ cat employees.txt
Jones,33
Steinberg,33
Robinson,34
Smith,34
Rafferty,31
John,

$ cat departments.txt
31,Sales
33,Engineering
34,Clerical
35,Marketing

$ join -a 1 -t, -1 2 -2 1 -o 1.1 2.2 <(sort -t, -k2 employees.txt) <(sort -t, -k1 departments.txt)
John,
Rafferty,Sales
Jones,Engineering
Steinberg,Engineering
Robinson,Clerical
Smith,Clerical

Joining on multiple columns
The join command joins on a single field. What do you do if you want to join on multiple fields? You create a composite field by combining the multiple fields together! This can be done using awk. For example:
$ cat employees2.txt
Jones,33,50
Steinberg,33,51
Robinson,34,50
Smith,34,50
Rafferty,31,51

$ awk -F, '{print $2"_"$3","$0}' employees2.txt
33_50,Jones,33,50
33_51,Steinberg,33,51
34_50,Robinson,34,50
34_50,Smith,34,50
31_51,Rafferty,31,51

As you can see, an additional field has been created by concatenating the second and third fields of the file. Now you can join the files on the new composite field.

(File data courtesy of Wikipedia.)

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.