# comm -12 /tmp/src /tmp/txt | wc -l
10338
# join /tmp/src /tmp/txt | wc -l
10355
Both files are single columns of alphanumeric strings and are sorted. Shouldn't they be the same?
Updated following @Kevin's answer below:
cat /tmp/txt | sed 's/^[:space:]*//' > /tmp/stxt
cat /tmp/src | sed 's/^[:space:]*//' > /tmp/ssrc
and the result:
#join /tmp/ssrc /tmp/stxt | wc -l
516
# comm -12 /tmp/ssrc /tmp/stxt | wc -l
513
On manual inspection of the diffs, the results differ because of some whitespace that the sed command did not remove.
There are a couple of differences between comm and join:
comm compares whole lines; join compares fields within lines.
comm prints whole lines; join can print selected parts of lines.
When you have a single column of data in each file, there is relatively little difference. When you have multiple columns, there can be a lot of difference.
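To make the multi-column case concrete, here is a minimal sketch. The files left and right and their contents are made up for illustration and are created in a scratch directory. It assumes a POSIX shell with mktemp available.

```shell
# Hypothetical two-column files, already sorted on the first field.
tmp=$(mktemp -d)
printf 'k1 a\nk2 b\n' > "$tmp/left"
printf 'k1 x\nk2 b\n' > "$tmp/right"

# comm compares whole lines: only "k2 b" is identical in both files.
comm_out=$(comm -12 "$tmp/left" "$tmp/right")

# join compares the first field only, so both keys match.
join_out=$(join "$tmp/left" "$tmp/right")

# -o selects fields: field 2 of file 1 and field 2 of file 2.
cols_out=$(join -o 1.2,2.2 "$tmp/left" "$tmp/right")

printf 'comm -12:\n%s\n\njoin:\n%s\n\njoin -o 1.2,2.2:\n%s\n' \
    "$comm_out" "$join_out" "$cols_out"
rm -r "$tmp"
```

Here comm -12 reports one line, while join reports two, because the k1 lines share a key but are not identical.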
Also note that under the right circumstances, join can output multiple copies of the data from one file while joining with different lines from the other file. This looks to me like your problem; you probably have some duplicate values in one of the files. Suppose you have:
src   txt
123   123
123
123
If you do comm -12 src txt, you will get one line of output; if you do join src txt, you will get three lines of output. This is expected.
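The duplicate case can be reproduced with a short sketch. It recreates the src and txt files from the example in a scratch directory (the paths are illustrative, not the asker's real files) and uses uniq -d to list values that appear more than once in a sorted file.

```shell
# Recreate the example: three copies of 123 in src, one in txt.
tmp=$(mktemp -d)
printf '123\n123\n123\n' > "$tmp/src"
printf '123\n' > "$tmp/txt"

# Count the lines each command reports (tr strips padding some wc versions add).
comm_count=$(comm -12 "$tmp/src" "$tmp/txt" | wc -l | tr -d ' ')
join_count=$(join "$tmp/src" "$tmp/txt" | wc -l | tr -d ' ')

# uniq -d prints one copy of each line that is repeated in sorted input.
dups=$(uniq -d "$tmp/src")

echo "comm -12: $comm_count line(s), join: $join_count line(s)"
echo "duplicated in src: $dups"
rm -r "$tmp"
```

Running uniq -d on each of the real input files is one way to check whether duplicates explain the different counts.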
The join command can also handle 'outer joins' where data is missing from the second file for a line in the first file (a LEFT OUTER JOIN in terms of SQL) or vice versa (a RIGHT OUTER JOIN), or both at once (a FULL OUTER JOIN).
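As a rough illustration of the outer joins, the sketch below uses the standard -a option, which also prints unpairable lines from the named file. The files left and right are hypothetical and exist only for this example.

```shell
# Hypothetical inputs: key "a" is only in left, key "c" is only in right.
tmp=$(mktemp -d)
printf 'a 1\nb 2\n' > "$tmp/left"
printf 'b x\nc y\n' > "$tmp/right"

# -a 1 keeps unpairable lines from file 1 (like a LEFT OUTER JOIN).
left_outer=$(join -a 1 "$tmp/left" "$tmp/right")

# -a 2 keeps unpairable lines from file 2 (like a RIGHT OUTER JOIN).
right_outer=$(join -a 2 "$tmp/left" "$tmp/right")

# Both options together behave like a FULL OUTER JOIN.
full_outer=$(join -a 1 -a 2 "$tmp/left" "$tmp/right")

printf 'left:\n%s\n\nright:\n%s\n\nfull:\n%s\n' \
    "$left_outer" "$right_outer" "$full_outer"
rm -r "$tmp"
```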
All in all, join is a more complex command, but it is attempting to do a more complex job. Both are useful, but they are useful in different places.
The main utility of join is to select lines which share one field, like you can do in a database. Say you have the following files:
File A
Alice 24
Bill 16
Claire 31
John 10
John -14
File B
Bill Copenhagen
John Adelaide
... you can select the "John" and "Bill" lines from File A by giving File B as the file to join with, and the first field of both as the field to join on. The requirement that both files have to be sorted on that field is rather cumbersome in practice, though.
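A minimal sketch of that join, using the File A and File B contents above. The scratch paths are illustrative, and File B is written out of order on purpose to show one way of meeting the sorting requirement with sort -k 1,1.

```shell
tmp=$(mktemp -d)

# File A is already sorted on its first field.
printf 'Alice 24\nBill 16\nClaire 31\nJohn 10\nJohn -14\n' > "$tmp/A"

# File B is deliberately out of order, so sort it on the join field first.
printf 'John Adelaide\nBill Copenhagen\n' > "$tmp/B"
sort -k 1,1 "$tmp/B" > "$tmp/B.sorted"

# Join on the first field of both files (the default).
joined=$(join "$tmp/A" "$tmp/B.sorted")
printf '%s\n' "$joined"
rm -r "$tmp"
```

Both John lines from File A are paired with the single John line from File B, so the result has three lines.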
I haven't used either extensively, but from a quick look at the man pages and test input, it seems that if the two files differ, comm prints the lines of both files (in separate columns) while join only prints matching lines. The -12 option suppresses the first two columns of comm, leaving only the common lines. You could store the output of the two commands in files and run diff on them to see how they differ.
$ echo -e '1\n2\n3\n5' > a
$ echo -e '1\n2\n4\n5' > b
$ comm a b
                1
                2
3
        4
                5
$ join a b
1
2
5
$
Edit: By default, join compares only the first whitespace-separated field, whereas comm compares the whole line. Any stray whitespace on a line can therefore make the outputs differ.
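The effect of stray whitespace can be seen in a small sketch. The files a and b here are hypothetical single-line files that differ only by a trailing space.

```shell
tmp=$(mktemp -d)
printf 'abc\n' > "$tmp/a"
printf 'abc \n' > "$tmp/b"   # note the trailing space

# comm sees two different lines; join sees the same first field.
comm_count=$(comm -12 "$tmp/a" "$tmp/b" | wc -l | tr -d ' ')
join_count=$(join "$tmp/a" "$tmp/b" | wc -l | tr -d ' ')

# Stripping leading and trailing whitespace makes the lines identical.
sed 's/^[[:space:]]*//; s/[[:space:]]*$//' "$tmp/b" > "$tmp/b.clean"
clean_count=$(comm -12 "$tmp/a" "$tmp/b.clean" | wc -l | tr -d ' ')

echo "comm -12: $comm_count, join: $join_count, comm -12 after cleanup: $clean_count"
rm -r "$tmp"
```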
Use [[:space:]] (instead of [:space:]) to strip whitespace with sed; the POSIX character class is only recognized inside a bracket expression.
# compare
{
echo ' abc' | sed 's/^[:space:]*//'
echo ' abc' | sed 's/^[[:space:]]*//'
}