Table of Contents
Selecting fields that are not in GROUP BY
Task Description
Let`s assume that we have some table where each record describes the person, person's phone and that phone's priority among the other phones belonged to that person.
CREATE TABLE Person ( name String, phone long, priority long ); INSERT INTO Person VALUES ( 'John Smith', 555111, 1 ); INSERT INTO Person VALUES ( 'Ivan Ivanov', 555222, 1 ); INSERT INTO Person VALUES ( 'John Smith', 555333, 2 );
And we need to get the phones with max priority for each person. Also, let the priority be in range 1 .. n.
Problem
The first approach could be following:
SELECT name, MAX(priority) FROM Person GROUP BY name;
It is ok until we don't want to see “phone” column in the result. But if we need to show such additional columns (i.e. “show me the top-priority phone of each person”), then we get a problem.
We can NOT do it simply adding “phone” into the select clause, because according to the SQL standard, SELECT clause may contain columns mentioned in the GROUP BY and expressions based on aggregative functions only.
Why? Because of the ambiguity - “group by” produces a single record for each group of records. “Group By” columns contain the same value for all records in the group and any aggregative function produces the single value for a group, but “phone” can be 555111 or 555333 for 'John Smith' group - which one should we choose as a single value for 'John Smith' group?
!!! **WRONG** !!! SELECT name, phone, MAX(priority) FROM Person GROUP BY name;
Also, we can NOT apply some aggregative function for “phone” (like we did for “priority”) because we may get values from different records:
!!! **WRONG** !!! SELECT name, FIRST(phone), MAX(priority) FROM Person GROUP BY name; -- 'John Smith' 555111 2 'Ivan Ivanov' 555222 1 Instead OF: -- 'John Smith' 555333 2 'Ivan Ivanov' 555222 1
Solutions
There are few tricks to solve such a problem:
Correlated subquery
SELECT * FROM Person t1 WHERE t1.priority = (SELECT MAX(t2.priority) FROM Person t2 WHERE t1.name = t2.name); -- 'John Smith' 555333 2 'Ivan Ivanov' 555222 1
This way is workable but can take a lot of time because of its correlation nature - we have to find max priority for a particular person checking each record in the table.
ANY predicate
SELECT * FROM Person t1 WHERE (t1.name, t1.priority) = ANY(SELECT t2.name, MAX(t2.priority) FROM Person t2 GROUP BY t2.name); -- 'John Smith' 555333 2 'Ivan Ivanov' 555222 1
Here we will find all pairs (name, max(priority)) first (and we do it once!). Then we just check all candidates from t1 for being present in that set of pairs.
For particular test data (~5500 records in Person, ~1500 groups) this way is 8 times faster than subqueries.
ORDER BY in aggregative functions First()/Last()
New for v.5.0
SELECT * FROM Person t1 WHERE t1.recid IN (SELECT FIRST(t2.recid ORDER BY t2.priority) FROM Person t2 GROUP BY t2.name); -- 'John Smith' 555333 2 'Ivan Ivanov' 555222 1
Here we find all “first 'recid' sorted out by 'priority'” inside of each group of records. Then we just check all candidates from t1 for being present in that set of pairs. Moreover, this way lets us get rid of the records with equal priority inside each group.
For the particular test data (~5500 records in Person, ~1500 groups) this way is 400 times faster than correlated subqueries.
Note, that this solution is also more powerful, because since you have list of RecIDs, you can do SELECT *, i.e. select all/many fields of that records.