开发者

Get duplicates for two columns with LINQ

开发者 https://www.devze.com 2023-04-06 19:47 出处:网络
LINQ drives me crazy. Why does following query not return the duplicates, whereas it works with only one identifier? Where is my error?

LINQ drives me crazy. Why does following query not return the duplicates, whereas it works with only one identifier? Where is my error?

' generate some test-data '
Dim source As New DataTable
source.Columns.Add(New DataColumn("RowNumber", GetType(Int32)))
source.Columns.Add(New DataColumn("Value1", GetType(Int32)))
source.Columns.Add(New DataColumn("Value2", GetType(Int32)))
source.Columns.Add(New DataColumn("Text", GetType(String)))
Dim rnd As New Random()
For i As Int32 = 1 To 100
    Dim newRow = source.NewRow
    Dim value = rnd.Next(1, 20)
    newRow("RowNumber") = i
    newRow("Value1") = value
    newRow("Value2") = (value + 1)
    newRow("Text") = String.Format("RowNumber{0}-Text", i)
    source.Rows.Add(newRow)
Next
' following query does not work, it always has Count=0 '
' although it works with only one identifier '
Dim dupIdentifiers = From row In source
         Group row By grp = New With {.Val1 = row("Value1"), .Val2 = row("Value2")}
         Into Group
         Where Group.Count > 1
         Select idGroup = New With {grp.Val1, grp.Val2, Group.Count}

Edit: Following is the complete solution, thanks to @Jon Skeet's answer :)

Dim dupKeys = From row In source
        Group row By grp = New With {Key .Val1 = CInt(row("Value1")), Key .Val2 = CInt(row("Value2"))}
        Into Group Where Group.Count > 1
        Select RowNumber = CInt(Group.FirstOrDefault.Item("RowNumber"))

Dim dupRows = From row In source
        Join dupKey In dupKeys 
        On row("RowNumber") Equals dupKey 
        Select row

If dupRows.Any Then
    ' create a new DataTable from the first duplicate rows '
    Dim dest = dupRows.CopyToDataTable
End If

The main problem with grouping was that i must make them key properties. The next problem in my above code was to get the duplicate rows from the original table. Because nearly every row has a duplicate(according to two fields), the result DataTable contained 99 of 100 rows and not only the 19 duplicate values. I needed to select only the first duplicate row and join them with the original table on the P开发者_Python百科K.

Select RowNumber = CInt(Group.FirstOrDefault.Item("RowNumber"))

Although this works in my case, maybe someone can explain me how to select only the duplicates from the original table if i would have had only composite keys.


Edit: I'v answered the last part of the question myself, so here is all i need:

Dim dups = From row In source
         Group By grp = New With {Key .Value1 = CInt(row("Value1")), Key .Value2 = CInt(row("Value2"))}
         Into Group Where Group.Count > 1
         Let Text = Group.First.Item("Text")
         Select Group.First

If dups.Any Then
      Dim dest = dups.CopyToDataTable
End If

I needed the Let-Keyword in order to keep the other column(s) into the same context and return only the first row of the grouped dups. On this way i can use CopyToDataTable to create a DataTable from the duplicate rows.

Only a few lines of code overall (i can save the second query to find the rows in the original table) to find duplicates on multiple columns and create a DataTable of them.


The problem is the way anonymous types work in VB - they're mutable by default; only Key properties are included for hashing and equality. Try this:

Group row By grp = New With {Key .Val1 = row("Value1"), Key .Val2 = row("Value2")}

(In C# this wouldn't be a problem - anonymous types in C# are always immutable in all properties.)


What I use to get duplicate rows across two columns in a EF table to show up as duplicates using Lin-q with C Sharp:

            var DuplicatesFoundInTable =
            entities.LocationDatas
           .Where(c => c.TrailerNumber != null && c.CarrierName != null && (c.TrailerNumber ?? string.Empty) != string.Empty && (c.CarrierName ?? string.Empty) != string.Empty)
           .GroupBy(o => new { o.TrailerNumber, o.CarrierName }, l => new { customer.TrailerNumber, customer.CarrierName })
           .Where(g => g.Count() > 1)
           .Select(y => y.Key)
           .ToList();

When I want to see if it is a duplicate on inputs (if the entry already exists in two columns):

          //Check to see if any rows are the same values on TrailerNumber and CarrierName for inputs. 
            bool AlreadyInTableComparer = entities.LocationDatas.Any(l => String.Compare(l.TrailerNumber, customer.TrailerNumber, StringComparison.InvariantCulture) == 0 && String.Compare(l.CarrierName, customer.CarrierName, StringComparison.InvariantCulture) == 0);
            bool AlreadyInTable = entities.LocationDatas.Any(t => t.TrailerNumber == customer.TrailerNumber && t.CarrierName == customer.CarrierName);

SQL Server Checking for duplicates (commented out delete duplicates):

 WITH CTE
 AS
  (
 SELECT [TrailerNumber], [CarrierName]
 ,ROW_NUMBER() OVER(Partition BY TrailerNumber Order by TrailerNumber, 
    CarrierName) AS NumRows, ROW_NUMBER() OVER(Partition BY TrailerNumber, 
   CarrierName Order by CarrierName) AS NumRows2
   FROM [dbo].[LocationData] --Please note, duplicates are shown in this 
   table.
   WHERE  TrailerNumber != '' AND CarrierName != '' 
   )
   SELECT [TrailerNumber], [CarrierName], [NumRows2] FROM CTE WHERE NumRows2 > 1
   --DELETE FROM CTE WHERE NumRows2 > 1  --Delete Duplicates.

Validate SQL Server to prove correct against CTE filtering:

    SELECT TrailerNumber, CarrierName, COUNT(*) AS Duplicates
    FROM [dbo].[LocationData]
    WHERE TrailerNumber IS NOT NULL OR CarrierName IS NOT NULL 
    GROUP BY TrailerNumber, CarrierName
    HAVING COUNT(*) >1 AND TrailerNumber != '' AND CarrierName != ''    
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号