Is SQL GROUP BY a design flaw? [closed]

Developer (https://www.devze.com) 2022-12-21 01:38 Source: web
Closed. This question is opinion-based. It is not currently accepting answers.

Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.

Closed 2 years ago.

Why does SQL require that I specify on which attributes to group? Why can't it just use all non-aggregates?

If an attribute is neither aggregated nor listed in the GROUP BY clause, then, assuming tuples are unordered, a nondeterministic choice is the only option (MySQL more or less does this), and that is a huge gotcha. As far as I know, PostgreSQL requires every attribute that does not appear in the GROUP BY to be aggregated, which reinforces the point that the clause is superfluous.

  • Am I missing something or is this a language design flaw that promotes loose implementations and makes queries harder to write?
  • If I am missing something, what is an example query where group attributes cannot be inferred?
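To make the "nondeterministic choice" gotcha concrete, here is a minimal sketch in Python using the standard sqlite3 module. SQLite, like the MySQL behaviour mentioned above, accepts a bare column that is neither aggregated nor grouped and takes its value from an arbitrary row of the group. The emp table and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (dept TEXT, name TEXT, salary INTEGER);
    INSERT INTO emp VALUES ('dev', 'cy', 30);
    INSERT INTO emp VALUES ('ops', 'ann', 10);
    INSERT INTO emp VALUES ('ops', 'bob', 20);
""")

# name is neither aggregated nor grouped: SQLite does not raise an error,
# it takes name from an arbitrary row of each dept group.
rows = conn.execute(
    "SELECT dept, name, SUM(salary) FROM emp GROUP BY dept ORDER BY dept"
).fetchall()

for dept, name, total in rows:
    print(dept, name, total)
```

For the ops group the name may come back as either ann or bob; nothing in the query says which.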


You don't have to group by exactly the same thing you're selecting, e.g.:

SQL:select priority,count(*) from rule_class
group by priority

PRIORITY   COUNT(*)
      70          1
      50          4
      30          1
      90          2
      10          4

SQL:select decode(priority,50,'Norm','Odd'),count(*) from rule_class group by priority

DECO   COUNT(*)
Odd           1
Norm          4
Odd           1
Odd           2
Odd           4

SQL:select decode(priority,50,'Norm','Odd'),count(*) from rule_class group by decode(priority,50,'Norm','Odd')

DECO   COUNT(*)
Norm          4
Odd           8
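The same three queries can be reproduced outside Oracle. The sketch below uses Python's sqlite3 module and swaps DECODE for an equivalent CASE expression, since DECODE is Oracle-specific. The rule_class rows are reconstructed from the counts shown above, and the ORDER BY clauses are added only to make the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rule_class (priority INTEGER)")

# Row counts per priority, taken from the first result set above.
counts = {70: 1, 50: 4, 30: 1, 90: 2, 10: 4}
conn.executemany(
    "INSERT INTO rule_class VALUES (?)",
    [(p,) for p, n in counts.items() for _ in range(n)],
)

# CASE stands in for decode(priority, 50, 'Norm', 'Odd').
label = "CASE priority WHEN 50 THEN 'Norm' ELSE 'Odd' END"

# Select the label but group by the raw column: one row per priority.
by_column = conn.execute(
    f"SELECT {label}, COUNT(*) FROM rule_class GROUP BY priority ORDER BY priority"
).fetchall()

# Select the label and group by the label: one row per label.
by_label = conn.execute(
    f"SELECT {label}, COUNT(*) FROM rule_class GROUP BY {label} ORDER BY 1"
).fetchall()

print(by_column)
print(by_label)
```

The select list is identical in both queries, so the grouping cannot be inferred from it: only the GROUP BY clause tells the engine whether five groups or two are wanted.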


There is one more reason why SQL requires that I specify which attributes to group on.

Let's say we have two simple tables, friend and car, where we store info about our friends and their cars.

And let's say we want to show all our friends' data (from table friend) and, for each friend, how many cars they own now, have sold, have crashed, and the total number. Oh, and we want the eldest first and the youngest last.

We'd do something like:

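SELECT f.id
     , f.firstname
     , f.lastname
     , f.birthdate
       -- COUNT(expr) counts every non-NULL value, false included,
       -- so count only the rows where the flag is true
       -- (assumes sold and crashed are boolean columns)
     , COUNT(CASE WHEN NOT c.sold AND NOT c.crashed THEN 1 END) AS owned
     , COUNT(CASE WHEN c.sold THEN 1 END) AS sold
     , COUNT(CASE WHEN c.crashed THEN 1 END) AS crashed
     , COUNT(c.friendid) AS totalcars
FROM friend f
LEFT JOIN car c     -- to catch (shame!) those friends who have never had a car
  ON f.id = c.friendid
GROUP BY f.id
       , f.firstname
       , f.lastname
       , f.birthdate
ORDER BY f.birthdate     -- ascending: eldest first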
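A side note on counting flag columns such as sold and crashed: COUNT(expr) counts every row where expr is non-NULL, so a stored false still counts. Counting only the true flags needs a CASE (or a SUM). A minimal sketch in Python with sqlite3, assuming the flags are stored as 0/1 and using invented rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friend (id INTEGER PRIMARY KEY, firstname TEXT);
    CREATE TABLE car (friendid INTEGER, sold INTEGER, crashed INTEGER);
    INSERT INTO friend VALUES (1, 'Ann');
    INSERT INTO friend VALUES (2, 'Bob');   -- Bob has never had a car
    INSERT INTO car VALUES (1, 1, 0);       -- sold
    INSERT INTO car VALUES (1, 0, 0);       -- still owned
    INSERT INTO car VALUES (1, 0, 1);       -- crashed
""")

rows = conn.execute("""
    SELECT f.id
         , COUNT(c.sold)                      AS naive_sold  -- non-NULL values
         , COUNT(CASE WHEN c.sold THEN 1 END) AS sold        -- true values only
         , COUNT(c.friendid)                  AS totalcars
    FROM friend f
    LEFT JOIN car c ON f.id = c.friendid
    GROUP BY f.id
    ORDER BY f.id
""").fetchall()

for row in rows:
    print(row)
```

For Ann the naive count equals her total number of cars, while the CASE version reports the single car she actually sold. Bob gets zeros throughout because the LEFT JOIN leaves his car columns NULL.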

But do we really need all those fields in the GROUP BY? Isn't every friend uniquely determined by their id? In other words, aren't firstname, lastname and birthdate functionally dependent on f.id? Why not just do this (as we can in MySQL):

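SELECT f.id
     , f.firstname
     , f.lastname
     , f.birthdate
       -- COUNT(expr) counts every non-NULL value, false included,
       -- so count only the rows where the flag is true
       -- (assumes sold and crashed are boolean columns)
     , COUNT(CASE WHEN NOT c.sold AND NOT c.crashed THEN 1 END) AS owned
     , COUNT(CASE WHEN c.sold THEN 1 END) AS sold
     , COUNT(CASE WHEN c.crashed THEN 1 END) AS crashed
     , COUNT(c.friendid) AS totalcars
FROM friend f
LEFT JOIN car c     -- to catch (shame!) those friends who have never had a car
  ON f.id = c.friendid
GROUP BY f.id
ORDER BY f.birthdate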

And what if we had 20 fields in the SELECT (plus ORDER BY) parts? Isn't the second query shorter, clearer and probably faster (in the RDBMSs that accept it)?

I say yes. So do the SQL:1999 and SQL:2003 specs, if this article is correct: Debunking GROUP BY myths
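A sketch of the long and short forms side by side, in Python with sqlite3. One assumption to flag loudly: SQLite accepts the short form because it tolerates any bare column, not because it checks functional dependence, so this only shows that the two results coincide when id is the primary key. The table contents are invented, and only the totalcars count is kept to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friend (id INTEGER PRIMARY KEY, firstname TEXT,
                         lastname TEXT, birthdate TEXT);
    CREATE TABLE car (friendid INTEGER REFERENCES friend(id));
    INSERT INTO friend VALUES (1, 'Ann', 'Lee', '1950-03-01');
    INSERT INTO friend VALUES (2, 'Bob', 'Ray', '1990-07-15');
    INSERT INTO car VALUES (1);
    INSERT INTO car VALUES (1);
""")

select_part = """
    SELECT f.id, f.firstname, f.lastname, f.birthdate,
           COUNT(c.friendid) AS totalcars
    FROM friend f
    LEFT JOIN car c ON f.id = c.friendid
"""

# Every non-aggregated column listed in the GROUP BY.
long_form = conn.execute(
    select_part
    + "GROUP BY f.id, f.firstname, f.lastname, f.birthdate ORDER BY f.birthdate"
).fetchall()

# Only the primary key; the other columns are determined by it.
short_form = conn.execute(
    select_part + "GROUP BY f.id ORDER BY f.birthdate"
).fetchall()

print(long_form)
print(short_form)
```

Because every row in a group belongs to the same friend, the "arbitrary row" SQLite picks for firstname, lastname and birthdate is always the right one here.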


I would say that if you have a large number of items in the GROUP BY clause, then perhaps the core info should be pulled out into a tabular sub-query that you inner join to.

There is probably a performance hit, but it makes for neater code.

select  id, count(a), b, c, d
from    table
group by
        id, b, c, d

becomes

select  t.id, myCountTable.myCount, t.b, t.c, t.d
from    table t
        inner join (
            select id, count(a) as myCount
            from table
            group by id
        ) as myCountTable on myCountTable.id = t.id

That said, I'm interested to hear counter-arguments for doing this as opposed to having a large group by clause.
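One thing worth checking with this rewrite: joined back to the base table, every source row appears once, so an id with several rows comes back duplicated unless the outer query adds DISTINCT. A minimal sketch in Python with sqlite3; the items table and its rows are hypothetical stand-ins for the placeholder table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER, a TEXT, b TEXT, c TEXT, d TEXT);
    INSERT INTO items VALUES (1, 'x', 'b1', 'c1', 'd1');
    INSERT INTO items VALUES (1, 'y', 'b1', 'c1', 'd1');
    INSERT INTO items VALUES (2, 'z', 'b2', 'c2', 'd2');
""")

# The original form: one row per (id, b, c, d) group.
grouped = conn.execute("""
    SELECT id, COUNT(a), b, c, d
    FROM items
    GROUP BY id, b, c, d
    ORDER BY id
""").fetchall()

# The sub-query form, with and without DISTINCT on the outer query.
join_sql = """
    SELECT {distinct} t.id, m.myCount, t.b, t.c, t.d
    FROM items t
    INNER JOIN (
        SELECT id, COUNT(a) AS myCount FROM items GROUP BY id
    ) AS m ON m.id = t.id
    ORDER BY t.id
"""
joined = conn.execute(join_sql.format(distinct="")).fetchall()
joined_distinct = conn.execute(join_sql.format(distinct="DISTINCT")).fetchall()

print(len(grouped), len(joined), len(joined_distinct))
```

With these rows the plain join returns three rows where the grouped query returns two; the DISTINCT variant matches the grouped result.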


I agree it's verbose that the GROUP BY list isn't implicitly the same as the non-aggregated select columns. In SAS there are data aggregation operations that are more succinct.

Also, it's hard to come up with an example where it would be useful to have a longer list of columns in the group list than in the select list. The best I can come up with is ...

create table people
(  Nam char(10)
  ,Adr char(10)
);

insert into people values ('Peter', 'Tibet');
insert into people values ('Peter', 'OZ');
insert into people values ('Peter', 'OZ');

insert into people values ('Joe', 'NY');
insert into people values ('Joe', 'Texas');
insert into people values ('Joe', 'France');

-- Give me people where there is a duplicate address record

select * from people where nam in
(
select nam
from people
group by nam, adr        -- group list different from select list
having count(*) > 1
);
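To see why the longer group list matters here, the sketch below (Python with sqlite3, same people rows) runs the inner query twice: once with the explicit group list, and once with the grouping a "use all non-aggregates" rule would infer from the select list, which is nam alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (nam CHAR(10), adr CHAR(10));
    INSERT INTO people VALUES ('Peter', 'Tibet');
    INSERT INTO people VALUES ('Peter', 'OZ');
    INSERT INTO people VALUES ('Peter', 'OZ');
    INSERT INTO people VALUES ('Joe', 'NY');
    INSERT INTO people VALUES ('Joe', 'Texas');
    INSERT INTO people VALUES ('Joe', 'France');
""")

# Explicit group list: names that have the same address more than once.
explicit = conn.execute("""
    SELECT nam FROM people
    GROUP BY nam, adr
    HAVING COUNT(*) > 1
    ORDER BY nam
""").fetchall()

# Grouping inferred from the select list: names with more than one row.
inferred = conn.execute("""
    SELECT nam FROM people
    GROUP BY nam
    HAVING COUNT(*) > 1
    ORDER BY nam
""").fetchall()

print(explicit)
print(inferred)
```

The explicit grouping finds only Peter (two OZ records), while the inferred grouping also returns Joe, whose three addresses are all different.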


If your concern is just about an easier way to write scripts, here is one tip:

In SQL Server Management Studio, write your query as text, something like select * from my_table, then select the text, right-click and choose "Design Query in Editor...". Management Studio opens a new editor with all the fields filled in; right-click again and select "Add Group By", and Management Studio adds the code for you.

I find this method extremely useful for insert statements. When I need to write a script that inserts a lot of fields into a table, I just write select * from table_where_want_to_insert and then change the type of the select statement to insert.


I Agree

I quite agree with the question. I asked the same one here.

I honestly think it's a language flaw.

I realise that there are arguments against that, but I have yet to use a GROUP BY clause containing anything other than all the non-aggregated fields from the SELECT clause in the real world.


This thread provides some useful explanations.

http://social.msdn.microsoft.com/Forums/en/transactsql/thread/52482614-bfc8-47db-b1b6-deec7363bd1a


I'd say it is more likely to be a language design choice that decisions be explicit, not implicit. For instance, what if I wish to group the data in a different order than that in which I output the columns? Or if I want to group by columns that aren't included in the columns selected? Or if I want to output grouped columns only and not use aggregate functions? Only by explicitly stating my preferences in the group by clause are my intentions clear.
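The second case, grouping by a column that is not selected, is one where the grouping cannot be inferred from the select list at all. A small sketch in Python with sqlite3, using a made-up emp table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (dept TEXT, salary INTEGER);
    INSERT INTO emp VALUES ('dev', 30);
    INSERT INTO emp VALUES ('dev', 50);
    INSERT INTO emp VALUES ('ops', 10);
    INSERT INTO emp VALUES ('ops', 20);
""")

# dept is grouped on but never selected: one maximum per department.
per_dept = conn.execute(
    "SELECT MAX(salary) FROM emp GROUP BY dept ORDER BY MAX(salary)"
).fetchall()

# The select list has no non-aggregated column, so an inferred
# grouping would be empty: a single maximum over the whole table.
overall = conn.execute("SELECT MAX(salary) FROM emp").fetchall()

print(per_dept)
print(overall)
```

Both statements have the same select list, yet they answer different questions; only the explicit GROUP BY distinguishes them.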

You also have to remember that SQL is a very old language (it dates from the 1970s). Look at how LINQ flipped everything around in order to make IntelliSense work; it looks obvious to us now, but SQL predates IDEs and so couldn't have taken such issues into account.


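The "superfluous" attributes can influence the ordering of the result, at least in implementations that group by sorting (without an ORDER BY, no particular order is guaranteed).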

Consider:

create table gb (
  a number,
  b varchar(3),
  c varchar(3)
);

insert into gb values (   3, 'foo', 'foo');
insert into gb values (   1, 'foo', 'foo');
insert into gb values (   0, 'foo', 'foo');

insert into gb values (  20, 'foo', 'bar');
insert into gb values (  11, 'foo', 'bar');
insert into gb values (  13, 'foo', 'bar');

insert into gb values ( 170, 'bar', 'foo');
insert into gb values ( 144, 'bar', 'foo');
insert into gb values ( 130, 'bar', 'foo');

insert into gb values (2002, 'bar', 'bar');
insert into gb values (1111, 'bar', 'bar');
insert into gb values (1331, 'bar', 'bar');

This statement

select sum(a), b, c
  from gb
group by b, c;

results in

    44 foo bar
   444 bar foo
     4 foo foo
  4444 bar bar

while this one

select sum(a), b, c
  from gb
group by c, b;

results in

   444 bar foo
    44 foo bar
     4 foo foo
  4444 bar bar
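Whether the two statements return their rows in different orders depends on the engine and the plan it picks; the set of rows is the same either way, and only an explicit ORDER BY pins the order down. A minimal check in Python with sqlite3, using the same gb data (column types adapted to SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gb (a INTEGER, b TEXT, c TEXT)")
conn.executemany(
    "INSERT INTO gb VALUES (?, ?, ?)",
    [
        (3, "foo", "foo"), (1, "foo", "foo"), (0, "foo", "foo"),
        (20, "foo", "bar"), (11, "foo", "bar"), (13, "foo", "bar"),
        (170, "bar", "foo"), (144, "bar", "foo"), (130, "bar", "foo"),
        (2002, "bar", "bar"), (1111, "bar", "bar"), (1331, "bar", "bar"),
    ],
)

# Same groups either way; the row order is left to the engine.
rows_bc = conn.execute("SELECT SUM(a), b, c FROM gb GROUP BY b, c").fetchall()
rows_cb = conn.execute("SELECT SUM(a), b, c FROM gb GROUP BY c, b").fetchall()

# An explicit ORDER BY makes the order independent of the GROUP BY list.
ordered = conn.execute(
    "SELECT SUM(a), b, c FROM gb GROUP BY c, b ORDER BY SUM(a)"
).fetchall()

print(sorted(rows_bc) == sorted(rows_cb))
print(ordered)
```

So the order of the GROUP BY list is better treated as a hint the engine may or may not follow than as a way to sort the output.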