Interview question: SQL recursion in one statement_问答_开发者

Someone I know went to an interview and was given the following problem to solve. I've thought about it for a few hours and believe that it's not possible to do without using some database-specific extensions or features from recent standards that don't have wide support yet.

I don't remember the story behind what is being represented, but it's not relevant. In simple terms, you're trying to represent chains of unique numbers:

chain 1: 1  -> 2  -> 3
chain 2: 42 -> 78
chain 3: 4
chain 4: 7  -> 8  -> 9
...

This information is already stored for you in the following table structure:

id | parent
---+-------
1  | NULL
2  | 1
3  | 2
42 | NULL
78 | 42
4  | NULL
7  | NULL
8  | 7
9  | 8

There could be millions of such chains and each chain can have an unlimited number of entries. The goal is to create a second table that would contain the exact same information, but with a third column that contains the starting point of the chain:

id | parent | start
---+--------+------
1  | NULL   | 1
2  | 1      | 1
3  | 2      | 1
42 | NULL   | 42
78 | 42     | 42
4  | NULL   | 4
7  | NULL   | 7
8  | 7      | 7
9  | 8      | 7

The claim (made by the interviewers) is that this can be achieved with just 2 SQL queries. The hint they provide is to first populate the destination table (I'll call it dst) with the root elements, like so:

INSERT INTO dst SELECT id, parent, id FROM src WHERE parent IS NULL

This will give you the following content:

id | parent | start
---+--------+------
1  | NULL   | 1
42 | NULL   | 42
4  | NULL   | 4
7  | NULL   | 7

They say that you can now execute just one more query to get to the goal shown above.

In my opinion, you can do one of two things. Use recursion in the source table to get to the front of each chain, or continuously execute some version of SELECT start FROM dst WHERE dst.id = src.parent after each update to dst (i.e. can't cache the results).

I don't think either of these situations is supported by common databases like MySQL, PostgreSQL, SQLite, etc. I do know that in PostgreSQL 8.4 you can achieve recursion using WITH RECURSIVE query, and in Oracle you have START WITH and CONNECT BY clauses. The point is that these things are specific to database type and version.

Is there any way to achieve the desired result using regular SQL92 in just one query? The best I could do is fill-in the start column for the first child with the following (can also use a LEFT JOIN to achieve the same result):

INSERT INTO dst
    SELECT s.id, s.parent,
        (SELECT start FROM d开发者_运维百科st AS d WHERE d.id = s.parent) AS start 
    FROM src AS s
    WHERE s.parent IS NOT NULL

If there was some way to re-execute the inner select statement after each insert into dst, then the problem would be solved.

It can not be implemented in any static SQL that follows ANSI SQL 92.

But as you said it can be easy implemented with oracle's CONNECT BY

    SELECT id,
           parent,
           CONNECT_BY_ROOT id
      FROM table
START WITH parent IS NULL
CONNECT BY PRIOR id = parent

In SQL Server you would use a Common Table Expression (CTE).

To replicate the stored data I've created a temporary table

-- Create a temporary table
CREATE TABLE #SourceData
(
    ID INT
    , Parent INT
)

-- Setup data (ID, Parent, KeyField)
INSERT INTO #SourceData VALUES (1,  NULL);
INSERT INTO #SourceData VALUES (2,  1);
INSERT INTO #SourceData VALUES (3,  2);
INSERT INTO #SourceData VALUES (42, NULL);
INSERT INTO #SourceData VALUES (78, 42);
INSERT INTO #SourceData VALUES (4,  NULL);
INSERT INTO #SourceData VALUES (7,  NULL);
INSERT INTO #SourceData VALUES (8,  7);
INSERT INTO #SourceData VALUES (9,  8);

Then I create the CTE to compile the data result:

-- Perform CTE
WITH RecursiveData (ID, Parent, Start) AS
(
    -- Base query
    SELECT  ID, Parent, ID AS Start
    FROM    #SourceData
    WHERE   Parent IS NULL

    UNION ALL

    -- Recursive query
    SELECT  s.ID, s.Parent, rd.Start
    FROM    #SourceData AS s
            INNER JOIN RecursiveData AS rd ON s.Parent = rd.ID
)
SELECT * FROM RecursiveData WHERE Parent IS NULL

Which will output the following:

id | parent | start
---+--------+------
1  | NULL   | 1
42 | NULL   | 42
4  | NULL   | 4
7  | NULL   | 7

Then I clean up :)

-- Clean up
DROP TABLE #SourceData

There is no recursive query support in ANSI-92, because it was added in ANSI-99. Oracle has had it's own recursive query syntax (CONNECT BY) since v2. While Oracle supported the WITH clause since 9i, SQL Server is the first I knew of to support the recursive WITH/CTE syntax -- Oracle didn't start until 11gR2. PostgreSQL added support in 8.4+. MySQL has had a request in for WITH support since 2006, and I highly doubt you'll see it in SQLite.

The example you gave is only two levels deep, so you could use:

INSERT INTO dst
   SELECT a.id,
          a.parent,
          COALESCE(c.id, b.id) AS start
     FROM SRC a
LEFT JOIN SRC b ON b.id = a.parent
LEFT JOIN SRC c ON c.id = b.parent
    WHERE a.parent IS NOT NULL

You'd have to add a LEFT JOIN for the number of levels deep, and add them in proper sequence to the COALESCE function.