bash: Difference between join and comm

开发者 https://www.devze.com 2023-04-01 02:09 Source: Internet
# comm -12 /tmp/src /tmp/txt | wc -l
  10338
# join /tmp/src /tmp/txt | wc -l
  10355

Both files are single columns of alphanumeric strings and are sorted. Shouldn't they be the same?


Update, following @Kevin's answer below:

cat /tmp/txt | sed 's/^[:space:]*//' > /tmp/stxt
cat /tmp/src | sed 's/^[:space:]*//' > /tmp/ssrc

and the result:

# join /tmp/ssrc /tmp/stxt | wc -l
516
# comm -12 /tmp/ssrc /tmp/stxt | wc -l
513

On manual inspection of the diffs, the results differ because of some whitespace that the sed commands did not remove.


There are a couple of differences between comm and join:

  1. comm compares whole lines; join compares fields within lines.
  2. comm prints whole lines; join can print selected parts of lines.

When you have a single column of data in each file, there is relatively little difference. When you have multiple columns, there can be a lot of difference.

Also note that under the right circumstances, join can output multiple copies of the data from one file while joining with different lines from the other file. This looks to me like your problem; you probably have some duplicate values in one of the files. Suppose you have:

src           txt
123           123
              123
              123

If you do comm -12 src txt, you will get one line of output; if you do join src txt, you will get three lines of output. This is expected.
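
If you want to check this yourself, here is a minimal sketch of the duplicate case. The scratch directory and the use of uniq -d to list repeated keys are my own additions, not something from the question:

```shell
# Hypothetical scratch directory; src and txt mirror the example above.
dir=$(mktemp -d)
printf '123\n' > "$dir/src"
printf '123\n123\n123\n' > "$dir/txt"

# comm pairs lines off one-to-one, so only one 123 counts as common.
# (tr strips the padding that some wc implementations add.)
comm_count=$(comm -12 "$dir/src" "$dir/txt" | wc -l | tr -d ' ')

# join emits one output line per matching pair: 1 x 3 = 3.
join_count=$(join "$dir/src" "$dir/txt" | wc -l | tr -d ' ')

echo "comm: $comm_count  join: $join_count"

# List the keys that occur more than once in a sorted file.
dups=$(uniq -d "$dir/txt")
echo "duplicated in txt: $dups"

rm -rf "$dir"
```

Running uniq -d on each of your real files should tell you quickly whether duplicates explain the extra lines.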

The join command can also handle 'outer joins' where data is missing from the second file for a line in the first file (a LEFT OUTER JOIN in terms of SQL) or vice versa (a RIGHT OUTER JOIN), or both at once (a FULL OUTER JOIN).
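
As a rough illustration of the outer joins, assuming two small made-up files called left and right (hypothetical names and data), the -a option adds the unpairable lines from the named file and -v prints only those lines:

```shell
# Hypothetical scratch files; both are sorted on the first field.
dir=$(mktemp -d)
printf 'a 1\nb 2\n' > "$dir/left"
printf 'b x\nc y\n' > "$dir/right"

# Inner join: only the key present in both files (b).
inner=$(join "$dir/left" "$dir/right")

# Left outer join: also keep unpairable lines from file 1.
left_outer=$(join -a 1 "$dir/left" "$dir/right")

# Right outer join: also keep unpairable lines from file 2.
right_outer=$(join -a 2 "$dir/left" "$dir/right")

# Full outer join: keep unpairable lines from both files.
full_outer=$(join -a 1 -a 2 "$dir/left" "$dir/right")

# Only the lines of file 1 that have no partner in file 2.
only_left=$(join -v 1 "$dir/left" "$dir/right")

printf 'inner:\n%s\nfull outer:\n%s\n' "$inner" "$full_outer"

rm -rf "$dir"
```

With this data the inner join has one line (b 2 x) and the full outer join has three (a 1, b 2 x, c y).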

All in all, join is a more complex command, but it is attempting to do a more complex job. Both are useful, but in different places.


The main utility of join is to select lines that share one field, as you can do in a database. Say you have the following files:

File A
Alice  24
Bill   16
Claire 31
John   10
John  -14

File B
Bill   Copenhagen
John   Adelaide

... you can select the "John" and "Bill" lines from File A by giving File B as the file to join with, and the first field of both as the field to join on. The requirement that both files have to be sorted on that field is rather cumbersome in practice, though.
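
A sketch of that join, including the sorting step. The scratch directory, the A.sorted and B.sorted names, and the choice to pin LC_ALL=C are my assumptions; the point is only that sort and join should agree on the collation order:

```shell
dir=$(mktemp -d)   # hypothetical scratch directory

# File A and File B from above, deliberately written out of order.
printf 'John   10\nClaire 31\nAlice  24\nJohn  -14\nBill   16\n' > "$dir/A"
printf 'John   Adelaide\nBill   Copenhagen\n' > "$dir/B"

# Sort both files on field 1, using the same locale that join will use.
LC_ALL=C sort -k 1,1 "$dir/A" > "$dir/A.sorted"
LC_ALL=C sort -k 1,1 "$dir/B" > "$dir/B.sorted"

# Join on the first field of each file (the default).
joined=$(LC_ALL=C join "$dir/A.sorted" "$dir/B.sorted")
echo "$joined"

rm -rf "$dir"
```

This prints three lines: one for Bill and two for John, because both John lines in File A pair with the single John line in File B.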


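I haven't used either extensively, but from a quick look at the man pages and some test input, it seems that when the two files differ, comm prints three columns (lines only in the first file, lines only in the second, and lines in both), while join prints only the lines whose join field matches. The -12 suppresses the first two columns, so that part is taken care of. You could store the output of the two commands in files and run diff on them to see how they differ.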

$ echo -e '1\n2\n3\n5' > a
$ echo -e '1\n2\n4\n5' > b
$ comm a b
                1
                2
3
        4
                5
$ join a b
1
2
5
$

Edit: join compares only the join field (by default the first whitespace-separated field), whereas comm compares the whole line. Any stray whitespace on a line can therefore make the outputs differ.
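
Here is a small sketch of that effect, using made-up files a and b where one line in b carries a trailing space. The sed -n l at the end is just one way to make the stray blank visible:

```shell
LC_ALL=C; export LC_ALL   # pin the locale so ordering is byte-wise
dir=$(mktemp -d)          # hypothetical scratch directory

# Same two keys in both files, but abc has a trailing space in b.
printf 'abc\ndef\n' > "$dir/a"
printf 'abc \ndef\n' > "$dir/b"

# comm compares whole lines, so "abc" and "abc " are different lines.
comm_count=$(comm -12 "$dir/a" "$dir/b" | wc -l | tr -d ' ')

# join compares only the first field, which is abc in both files.
join_count=$(join "$dir/a" "$dir/b" | wc -l | tr -d ' ')

echo "comm: $comm_count  join: $join_count"

# sed's l command marks the end of each line with $, exposing trailing blanks.
visible=$(sed -n l "$dir/b")
echo "$visible"

rm -rf "$dir"
```

Here comm reports one common line and join reports two, which is the same kind of mismatch as in the question.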


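Use [[:space:]] (instead of [:space:]) to strip whitespace with sed. A character class such as [:space:] is only recognized inside a bracket expression; written on its own, [:space:] is an ordinary bracket expression that matches any one of the characters ':', 's', 'p', 'a', 'c' and 'e', and some sed versions reject it with an error instead.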

# compare
{
echo '   abc' | sed 's/^[:space:]*//'
echo '   abc' | sed 's/^[[:space:]]*//'
}
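
Applied to the files in the question, a corrected clean-up could look roughly like this. The sample input is made up, and trimming trailing whitespace and re-sorting afterwards are my own suggestions rather than part of the original commands:

```shell
dir=$(mktemp -d)   # hypothetical scratch directory
printf '  xyz\nabc  \n' > "$dir/txt"   # made-up input with stray blanks

# Strip leading and trailing whitespace, then sort again,
# because trimming can change the order of the lines.
sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$dir/txt" |
    LC_ALL=C sort > "$dir/stxt"

cleaned=$(cat "$dir/stxt")
echo "$cleaned"

rm -rf "$dir"
```

Both comm and join expect sorted input, so the sort after the sed matters as much as the sed itself.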