Pages

Wednesday, 16 October 2013

Deleting Duplicate Rows

There are several methods to clean up duplicate data. The following sections include the windowing, surrogate key, and SELECT DISTINCT INTO methods.

To experiment with duplicate data, the following script sets up a poorly designed (no primary key) table and inserts some duplicate data:

USE tempdb ;
go

CREATE TABLE DupsNoPK (
Col1 INT NULL,
Col2 CHAR (5) NULL
);
Go

-- Insert multiple dup rows (can be executed mulitple times)

INSERT DupsNoPK (Col1, Col2)
VALUES (1, ‘abc’),
(2, ‘abc’),
(2, ‘abc’),
(2, ‘abc’),
(7, ‘xyz’),
(7, ‘xyz’)

To verify that the table does in fact have duplicate data, the following query uses a GROUP BY and HAVING clause to return only the duplicated rows, with a count of the number of duplicates:

SELECT Col1, Col2, COUNT(*) AS DupCount
FROM DupsNoPK
GROUP BY Col1, Col2
HAVING COUNT (*) > 1;

Result:
Col1 Col2 DupCount
----------- ----- -----------
2 abc 3
7 xyz 2


Deleting duplicate rows using windowing:

Of the three methods to remove duplicate rows, this method is the most straightforward because it doesn’t need to alter the table or generate a second table.

The key to this method is using the windowing’s OVER() clause with a ROW_NUMBER() function and a partition. The partition will begin renumbering with every new partition. Set the OVER() clause to PARTITION BY every column to be checked for duplicate data. In this case, every column is being checked.

Running the windowing query first shows how it applies the row number:

SELECT Col1, Col2,
ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col1) AS rn
FROM DupsNoPK

Result:
Col1 Col2 rn
----------- ----- --------------------
1 abc 1
2 abc 1
2 abc 2
2 abc 3
7 xyz 1
7 xyz 2

Every duplicate row has an rn value of greater than 1, so it’s now easy to delete the duplicates:

WITH DupsNumbered
AS (
SELECT Col1, Col2,
ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col1) AS rn
FROM DupsNoPK
)

DELETE DupsNumbered
WHERE rn > 1;

The next SELECT tests the effect of the windowing remove duplicates query:

SELECT Col1, Col2
FROM DupsNoPK;

Result:
Col1 Col2
----------- -----
1 abc
2 abc
7 xyz


Deleting duplicate rows using a surrogate key:

A traditional method of removing duplicate rows uses a surrogate key to uniquely identify each row.
This means the table itself must be altered to add the surrogate key column.

Assuming the DupsNoPK table is reset with the original rows, the following script applies an IDENTITY surrogate key and looks at the altered table:

ALTER TABLE dbo.DupsNoPK
ADD PK INT IDENTITY NOT NULL
CONSTRAINT PK_DupsNoPK PRIMARY KEY;

SELECT *
FROM DupsNoPK;

Result:
Col1 Col2 PK
----------- ----- -----------
1 abc 74
2 abc 75
2 abc 76
2 abc 77
7 xyz 78
7 xyz 79

To search and destroy the duplicate data, the next query finds and deletes all the rows with matching Col1 and Col2 data but higher primary key values:

DELETE DupsNoPK
WHERE EXISTS ( SELECT *
FROM DupsNoPK AS D1
WHERE D1.Col1 = DupsNoPK.Col1
AND D1.Col2 = DupsNoPK.Col2
AND D1.PK > DupsNoPK.PK );

SELECT *
FROM DupsNoPK;

Result:
Col1 Col2 PK
----------- ----- -----------
1 abc 74
2 abc 77
7 xyz 79

Approach 2:

DELETE
FROM
DupsNoPK
WHERE PK NOT IN
(
SELECT MAX(PK)
FROM
DupsNoPK
GROUP BY col1,col2)

Approach 3:

DELETE
FROM
DupsNoPK
WHERE PK NOT IN
(
SELECT MIN(PK)
FROM
DupsNoPK
GROUP BY col1,col2)



Deleting duplicate rows using select distinct into:

The third method of removing duplicate data may seem crude, but if the goal is to remove duplicates while creating a new table — perhaps as part of an ETL process — it may be the best choice.

A SELECT DISTINCT will automatically pass the data through a filter that eliminates duplicate rows.

The INTO option causes the results of the select to be placed into a new table instead of going to the client application. Mix the two options together and you have an instant duplicate row remover.Again, assuming the DupsNoPK table is reset with its original duplicate data, the following query generates a new table without duplicates and then examines the contents of the new table:

SELECT distinct Col1, Col2 INTO NoDups
FROM DupsNoPK;

SELECT Col1, Col2
FROM NoDups;
Result:
Col1 Col2
----------- -----
1 abc
2 abc
7 xyz

Using ROWCOUNT and TOP:
To delete the duplicate record with SQL Server 2000 and 2005 we can use the SET ROWCOUNT command to limit the number of rows affected by a query. 
By setting it to 1 we can just delete one of these rows in the table. 
Note: The select commands are just used to show the data prior and after the delete occurs.
Note: If there are only a few sets of duplicate record values, the best procedure is to delete these manually on an individual basis. For example:

SELECT FROM dbo.duplicateTest
SET ROWCOUNT 1
DELETE FROM dbo.duplicateTest WHERE ID 1
SET ROWCOUNT 0
SELECT FROM dbo.duplicateTest

The ROWCOUNT 1 will change depends on number of duplicate record this table contain, this syntax applicable only for table has two duplicate records.

With SQL Server 2005 we can also use the TOP command when we issue the delete, such as the following.
Note: the select commands are just used to show the data prior and after the delete occurs.
SELECT FROM dbo.duplicateTest

DELETE TOP (1FROM dbo.duplicateTest WHERE ID 1

SELECT FROM dbo.duplicateTest

The TOP(1) will change depends on number of duplicate record this table contain, this syntax applicable only for table has two duplicate records.


COMPLETE APPROACH USING TOP:

CREATE TABLE #Tempnodups
(   Col1 VARCHAR(20),
    Col2 VARCHAR(20),
    Count INT
)


INSERT INTO # Tempnodups (col1,col2, Count)
SELECT col1,col2, Count(*) FROM DupsNoPK
GROUP BY col1,col2

DECLARE @col1 VARCHAR(20), @col2 VARCHAR(20), @Count INT
  
DECLARE DuplicateCursor CURSOR
FOR SELECT clo1,col2, Count FROM #Tempnodups

OPEN DuplicateCursor
FETCH NEXT FROM DuplicateCursor INTO @col1, @col2, @Count



WHILE @@FETCH_STATUS = 0
BEGIN
    DELETE TOP (@Count-1) FROM DupsNoPK
    WHERE col1 = @col1 AND col2 = @col2
  
FETCH NEXT FROM DuplicateCursor INTO @col1, @col2, @Count
END

CLOSE DuplicateCursor
DEALLOCATE DuplicateCursor

DROP TABLE #Tempnodups


SELECT * FROM DupsNoPK

No comments:

Post a Comment