开发者

SQL: eliminate the selection of redundant records that occur within X minutes

开发者 https://www.devze.com 2023-04-11 02:32 出处:网络
DB in use is Firebird 2.1,in case not familiar with here is select statement sql ref: http://ibexpert.net/ibe/index.php?n=Doc.DataRetrieval

DB in use is Firebird 2.1,in case not familiar with here is select statement sql ref:

http://ibexpert.net/ibe/index.php?n=Doc.DataRetrieval

functions ref: http://www.firebirdsql.org/file/documentation/reference_manuals/reference_material/html/langrefupd21.html

i will be happy with any sql slang [i'll convert it].

table schema:

CREATE TABLE EVENT_MASTER (
EVENT_ID                BIGINT NOT NULL,
EVENT_TIME              BIGINT NOT NULL,
DATA_F1                 VARCHAR(40),
DATA_F2                 VARCHAR(40),
PRIMARY KEY (EVENT_ID)
);

the bad news is EVENT_TIME stored as seconds elapsed since the Epoch.

data samples:

"EVENT_ID","EVENT_TIME","DATA_F1","DATA_F2"
25327,1297824698,"8604","A"
25328,1297824770,"8604","I"
25329,1297824773,"8604","A"
25330,1297824793,"8604","A"
25331,1297824809,"8604","1"
25332,1297824811,"8604","GREY"
25333,1297824812,"8604","A"
25334,1297824825,"8604","GREY"
25335,1297824831,"8604","A"
25336,1297824833,"8604","GREY"
25337,1297824838,"8604","A"
25338,1297824840,"8604","1"
25339,1297824850,"8604","A"
25340,1297824864,"8604","A"
25341,1297824875,"8804","GREY" //notice DATA_F1 is different
25342,1297824876,"8604","G"
25343,1297824877,"8604","A"
25344,1297824880,"8604","GREY"
25345,1297824895,"8604","1"
25346,1297824899,"8604","A"
25347,1297824918,"8604","GREY"
25348,1297824930,"8604","YELLOW"
25349,1297824939,"8604","GREY"
25350,1297824940,"8604",""
25351,1297824944,"8604","A"
25352,1297824945,"8604","1"
25353,1297824954,"8604","B"
25354,1297824958,"8604",""
25355,1297824964,"8604","1"
25356,1297824966,"8604","GREY"
25357,1297824974,"8604","1"
25358,1297824981,"8604","GREY"
25359,1297824983,"8604",""
25360,1297824998,"8604","GREY"
25361,1297825003,"8604","2"
25362,1297825009,"8604","G"
25363,1297825018,"8604","GREY"
25364,1297825026,"8604","F"
25365,1297825045,"8604","GREY"
25366,1297825046,"8604","1"

expected output:

distinct "DATA_F1","DATA_F2" rows within X minutes according to EVENT_TIME: like:

25341,1297824875,"8804","GREY"
25327,1297824698,"8604","A"
25328,1297824770,"8604","I"
25332,1297824811,"8604","GREY"
25348,1297824930,"8604","YELLOW"
..etc

Requirements: eliminate the selection of redundant records that occur within 5 minutes [range calcs according to EVENT_TIME column].

last i'm trying follow this pattern:

SELECT * FROM EVENT_MASTER inner join (
SELECT distinct  DATA_F1, DATA_F2 FROM EVENT_MASTER where /*the hard stuff that i need help with: (EVENT_TIME difference within X minutes)*/
) as RemovedDup ON /*EVENT_MASTER.EVENT_ID = problem is i cant select RemovedDup ID otherwise distinct becomes useless!!*/

please help ASAP.

thanks,

EDIT

adding output based on Andrei K. answer:

25331,1297824809,"8604","1"
25327,1297824698,"8604","A"
25342,1297824876,"8604","G"
25332,1297824811,"8604","GREY"
25328,1297824770,"8604","I"
25341,1297824875,"8804","GREY"
25350,1297824940,"8604",""
25352,1297824945,"8604","1" /*bug: time still within 300 seconds, this same as first record*/
25361,1297825003,"8604","2"
25351,1297824944,"8604","A"
25353,1297824954,"8604","B"
25364,1297825026,"8604","F"
25362,1297825009,"8604","G"
25347,1297824918,"8604","GREY"
25372,1297825087,"8604","ORANGE"
25348,1297824930,"8604","YELLOW"
25382,1297825216,"8604","1"
25387,1297825270,"8604","B"
25394,1297825355,"8604","BLUE"
25381,1297825211,"8604","GREY"

EDIT 2: Russell query output: not good output and its very very slow.

1297824698,"8604","A"
1297824770,"8604","I"
1297824809,"8604","1"
1297824811,"8604","GREY"
1297824825,"8604","GREY"
1297824840,"8604","1"
1297824875,"8804","GREY"
1297824876,"8604","G"
1297824880,"8604","GREY"
1297824918,"8604","GREY"
1297824930,"8604","YELLOW"
1297824939,"8604","GREY"
1297824940,"8604",""
1297824945,"8604","1"
1297824954,"8604","B"
1297824964,"8604","1"
1297824998,"8604","GREY"
1297825003,"8604","2"
1297825018,"8604","GREY"
1297825026,"8604","F"
1297825045,"8604","GREY"
1297825046,"8604","1"
1297825063,"8604","1"
1297825079,"8604","GREY"
1297825087,"8604","ORANGE"
1297825094,"8604","GREY"
1297825100,"8604","1"
1297825133,"8604","GREY"
1297825176,"8604","GREY"
1297825216,"8604","1"

EDIT 3:

based on Russell request here is: all rows WHERE DATA_F1 = '8604' AND DATA_F2 = 'GREY'

25332,1297824811,"8604","GREY"
25334,1297824825,"8604","GREY"
25336,1297824833,"8604","GREY"
25344,1297824880,"8604","GREY"
25347,1297824918,"8604","GREY"
25349,1297824939,"8604","GREY"
25356,1297824966,"8604","GREY"
25358,1297824981,"8604","GREY"
25360,1297824998,"8604","GREY"
25363,1297825018,"8604","GREY"
25365,1297825045,"8604","GREY"
25367,1297825059,"8604","GREY"
25371,1297825079,"8604","GREY"
25373,1297825094,开发者_如何转开发"8604","GREY"
25376,1297825116,"8604","GREY"
25378,1297825133,"8604","GREY"
25380,1297825176,"8604","GREY"
25381,1297825211,"8604","GREY"
25384,1297825234,"8604","GREY"
25389,1297825286,"8604","GREY"
25390,1297825314,"8604","GREY"
25391,1297825323,"8604","GREY"
25393,1297825343,"8604","GREY"
25396,1297825370,"8604","GREY"
25397,1297825387,"8604","GREY"
25399,1297825416,"8604","GREY"
25401,1297825436,"8604","GREY"
25402,1297825445,"8604","GREY"
25404,1297825454,"8604","GREY"
50282,1299137344,"8604","GREY"
380151,1309849420,"8604","GREY"

As of this moment [Oct 11,2011 5:00 AM GMT] no absolute correct answer posted, and Andrei K. answer still the best try among others. so sql experts please help me find the solution otherwise i'll start to think that sql is not capable of handling the question requirements! Is it??

remark: event_time is not unique so multiple events can occur at the same second.


If by redundant rows you mean rows registered within 5 minutes and having the same data_f1, data_f2 then try something like this:

SELECT
  e2.event_id,
  e2.event_time,
  e2.data_f1,
  e2.data_f2
FROM
  (SELECT trunc(event_time / 300), data_f1, data_f2, min(event_id) as e_id
   FROM event_master
   GROUP BY 1, 2, 3) e1 
  JOIN 
    event_master e2 ON e1.e_id = e2.event_id


U can try this :::

SELECT * FROM EVENT_MASTER group by (DATAF1, DATAF2) where 
event_time >(SELECT TIME_TO_SEC(now())-300)

Hope this will help you..


I am not familiar with Firebird but I am using the documentation so if that is correct then this should work.

SELECT DISTINCT MIN(A.EVENT_TIME) as MINEVENT_TIME, B.DATA_F1, B.DATA_F2 
FROM EVENT_MASTER as A 
JOIN EVENT_MASTER as B ON A.EVENT_TIME BETWEEN B.EVENT_TIME-299 AND B.EVENT_TIME 
AND B.DATA_F1 = A.DATA_F1 AND B.DATA_F2 = A.DATA_F2 
GROUP BY B.DATA_F1, B.DATA_F2, B.EVENT_TIME 

This is syntax checked but untested.


This assumes all records have different values in event_time (or they will exclude each other).

SELECT
  *
FROM
  event_master AS data
WHERE
  NOT EXISTS (
    SELECT * FROM event_master
    WHERE event_time >  data.event_time - 300
      AND event_time <= data.event_time
  )

If mutliple events CAN happen with the same value in event_time, can we assume that an event with a higher event_id CAN NOT happen before an event with a lower event_id? If so, you can modify the above as follows

SELECT
  *
FROM
  event_master AS data
WHERE
  NOT EXISTS (
    SELECT * FROM event_master
    WHERE event_time >  data.event_time - 300
      AND event_time <= data.event_time
      AND event_id   <  data.event_id
  )

In the cases of multiple events happening at the same time, the one with the lowest event_id will be chosen.


In terms of performance, ensure that the data has an index where event_time is the first indexed field.


As I understand this, you want to get distinct values for DATA_F1 and DATA_F2 but only for a 'Window' of 5 minutes; after that the values may occur again, right? (Sorry if I missunderstood the question, it's been a long day...) I don't know much about Firebird, but here's how you would do that in MS SQL server:

SELECT a.EVENT_ID, a.DATA_F1, a.DATA_F2, a.EVENT_TIME FROM
  EVENT_MASTER AS a LEFT JOIN EVENT_MASTER AS b 
    ON a.DATA_F1=b.DATA_F1 AND 
      a.DATA_F2=b.DATA_F2 AND 
      a.EVENT_TIME<b.EVENT_TIME AND 
      b.EVENT_TIME-a.EVENT_TIME<=5*60
WHERE
  b.EVENT_ID IS NULL

Also, while you test this, also try the modified version below: Hope this helps!

SELECT a.EVENT_ID, a.DATA_F1, a.DATA_F2, a.EVENT_TIME FROM
  EVENT_MASTER AS a LEFT JOIN EVENT_MASTER AS b 
    ON a.DATA_F1=b.DATA_F1 AND 
      a.DATA_F2=b.DATA_F2 AND
      a.EVENT_ID<b.EVENT_ID AND 
      a.EVENT_TIME<=b.EVENT_TIME AND 
      b.EVENT_TIME-a.EVENT_TIME<=5*60
WHERE
  b.EVENT_ID IS NULL

ADDED: Ok, It seems like we have the correct results. Here's my suggestion to optimize this baby (as I've seen that Firebird support the EXISTS keyword, I've rewritten the query below):

SELECT a.EVENT_ID, a.DATA_F1, a.DATA_F2, a.EVENT_TIME FROM EVENT_MASTER AS a 
WHERE NOT EXISTS (SELECT * FROM EVENT_MASTER AS b 
    WHERE a.DATA_F1=b.DATA_F1 AND 
      a.DATA_F2=b.DATA_F2 AND
      a.EVENT_ID<b.EVENT_ID AND 
      a.EVENT_TIME<=b.EVENT_TIME AND 
      b.EVENT_TIME-a.EVENT_TIME<=5*60)

Also, please add the following index:

CREATE INDEX IX_SPEED ON EVENT_MASTER (EVENT_ID DESC, EVENT_TIME ASC, DATA_F1 ASC, DATA_F2 ASC)

Hope this helps!


You'd need a very nasty recursive query to accomplish this in purely "functional" way. I don't profess to be clever enough to construct such a query, let alone make it performant.

On the other hand, allowing side-effects (i.e. temporary table) significantly simplifies things. You should even be able to make it fairly quick by adding appropriate indexes on the temporary table (not shown here). Here is the actual SQL:

CREATE GLOBAL TEMPORARY TABLE EVENT_MASTER_TMP (
    EVENT_ID                BIGINT NOT NULL,
    EVENT_TIME              BIGINT NOT NULL,
    DATA_F1                 VARCHAR(40),
    DATA_F2                 VARCHAR(40),
    PRIMARY KEY (EVENT_ID)
);

INSERT INTO EVENT_MASTER_TMP
SELECT * FROM
    (SELECT * FROM EVENT_MASTER ORDER BY EVENT_TIME) E
WHERE
    NOT EXISTS (
        SELECT *
        FROM EVENT_MASTER_TMP T
        WHERE
            E.DATA_F1 = T.DATA_F1
            AND E.DATA_F2 = T.DATA_F2
            AND E.EVENT_TIME - T.EVENT_TIME <= 5*60
    );

SELECT * FROM EVENT_MASTER_TMP;

In plain English:

  • Go through events from older to newer,
  • for each event, check if it is redundant relative to some row that is already in the temporary table
  • and if not, insert it in the temporary table, so it can be used as a criteria for the remaining events.

Executing this on your test data yields:

25327   1297824698  8604    A
25328   1297824770  8604    I
25331   1297824809  8604    1
25332   1297824811  8604    GREY
25341   1297824875  8804    GREY
25342   1297824876  8604    G
25348   1297824930  8604    YELLOW
25350   1297824940  8604    
25353   1297824954  8604    B
25361   1297825003  8604    2
25364   1297825026  8604    F

Lowering the time threshold from 5*60 to, say, 233, yield this:

25327   1297824698  8604    A
25328   1297824770  8604    I
25331   1297824809  8604    1
25332   1297824811  8604    GREY
25341   1297824875  8804    GREY
25342   1297824876  8604    G
25348   1297824930  8604    YELLOW
25350   1297824940  8604    
25351   1297824944  8604    A       <-- 246s difference
25353   1297824954  8604    B
25361   1297825003  8604    2
25364   1297825026  8604    F
25365   1297825045  8604    GREY    <-- 234s difference
25366   1297825046  8604    1       <-- 237s difference


try:

SELECT T1.* FROM EVENT_MASTER T1 WHERE EXISTS (
    SELECT * FROM EVENT_MASTER T2 
    WHERE T2.DATA_F1=T1.DATA_F1 
    AND T2.DATA_F2=T1.DATA_F2 
    AND (T2.EVENT_TIME-T1.EVENT_TIME)<300
)
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号