My Sql Note: Deleting Duplicate Rows

There are several methods to clean up duplicate data. The following sections include the windowing, surrogate key, and SELECT DISTINCT INTO methods.

To experiment with duplicate data, the following script sets up a poorly designed (no primary key) table and inserts some duplicate data:

USE tempdb ;

CREATE TABLE DupsNoPK (

Col1 INT NULL,

Col2 CHAR (5) NULL

);

-- Insert multiple dup rows (can be executed mulitple times)

INSERT DupsNoPK (Col1, Col2)

VALUES (1, ‘abc’),

(2, ‘abc’),

(7, ‘xyz’),

(7, ‘xyz’)

To verify that the table does in fact have duplicate data, the following query uses a GROUP BY and HAVING clause to return only the duplicated rows, with a count of the number of duplicates:

SELECT Col1, Col2, COUNT(*) AS DupCount

FROM DupsNoPK

GROUP BY Col1, Col2

HAVING COUNT (*) > 1;

Result:

Col1 Col2 DupCount

----------- ----- -----------

2 abc 3

7 xyz 2

Deleting duplicate rows using windowing:

Of the three methods to remove duplicate rows, this method is the most straightforward because it doesn’t need to alter the table or generate a second table.

The key to this method is using the windowing’s OVER() clause with a ROW_NUMBER() function and a partition. The partition will begin renumbering with every new partition. Set the OVER() clause to PARTITION BY every column to be checked for duplicate data. In this case, every column is being checked.

Running the windowing query first shows how it applies the row number:

SELECT Col1, Col2,

ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col1) AS rn

FROM DupsNoPK

Result:

Col1 Col2 rn

----------- ----- --------------------

1 abc 1

2 abc 1

2 abc 2

2 abc 3

7 xyz 1

7 xyz 2

Every duplicate row has an rn value of greater than 1, so it’s now easy to delete the duplicates:

WITH DupsNumbered

AS (

SELECT Col1, Col2,

ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col1) AS rn

FROM DupsNoPK

)

DELETE DupsNumbered

WHERE rn > 1;

The next SELECT tests the effect of the windowing remove duplicates query:

SELECT Col1, Col2

FROM DupsNoPK;

Result:

Col1 Col2

----------- -----

1 abc

2 abc

7 xyz

Deleting duplicate rows using a surrogate key:

A traditional method of removing duplicate rows uses a surrogate key to uniquely identify each row.

This means the table itself must be altered to add the surrogate key column.

Assuming the DupsNoPK table is reset with the original rows, the following script applies an IDENTITY surrogate key and looks at the altered table:

ALTER TABLE dbo.DupsNoPK

ADD PK INT IDENTITY NOT NULL

CONSTRAINT PK_DupsNoPK PRIMARY KEY;

SELECT *

FROM DupsNoPK;

Result:

Col1 Col2 PK

----------- ----- -----------

1 abc 74

2 abc 75

2 abc 76

2 abc 77

7 xyz 78

7 xyz 79

To search and destroy the duplicate data, the next query finds and deletes all the rows with matching Col1 and Col2 data but higher primary key values:

DELETE DupsNoPK

WHERE EXISTS ( SELECT *

FROM DupsNoPK AS D1

WHERE D1.Col1 = DupsNoPK.Col1

AND D1.Col2 = DupsNoPK.Col2

AND D1.PK > DupsNoPK.PK );

SELECT *

FROM DupsNoPK;

Result:

Col1 Col2 PK

----------- ----- -----------

1 abc 74

2 abc 77

7 xyz 79

Approach 2:

DELETE
FROM DupsNoPK
WHERE PK NOT IN
(
SELECT MAX(PK)
FROM DupsNoPK
GROUP BY col1,col2)

Approach 3:

DELETE
FROM DupsNoPK
WHERE PK NOT IN
(
SELECT MIN(PK)
FROM DupsNoPK
GROUP BY col1,col2)

Deleting duplicate rows using select distinct into:

The third method of removing duplicate data may seem crude, but if the goal is to remove duplicates while creating a new table — perhaps as part of an ETL process — it may be the best choice.

A SELECT DISTINCT will automatically pass the data through a filter that eliminates duplicate rows.

The INTO option causes the results of the select to be placed into a new table instead of going to the client application. Mix the two options together and you have an instant duplicate row remover.Again, assuming the DupsNoPK table is reset with its original duplicate data, the following query generates a new table without duplicates and then examines the contents of the new table:

SELECT distinct Col1, Col2 INTO NoDups

FROM DupsNoPK;

SELECT Col1, Col2

FROM NoDups;

Result:

Col1 Col2

----------- -----

1 abc

2 abc

7 xyz

Using ROWCOUNT and TOP:

To delete the duplicate record with SQL Server 2000 and 2005 we can use the SET ROWCOUNT command to limit the number of rows affected by a query.

By setting it to 1 we can just delete one of these rows in the table.

Note: The select commands are just used to show the data prior and after the delete occurs.

Note: If there are only a few sets of duplicate record values, the best procedure is to delete these manually on an individual basis. For example:

SELECT * FROM dbo.duplicateTest
SET ROWCOUNT 1
DELETE FROM dbo.duplicateTest WHERE ID = 1
SET ROWCOUNT 0
SELECT * FROM dbo.duplicateTest

The ROWCOUNT 1 will change depends on number of duplicate record this table contain, this syntax applicable only for table has two duplicate records.

With SQL Server 2005 we can also use the TOP command when we issue the delete, such as the following.

Note: the select commands are just used to show the data prior and after the delete occurs.

SELECT * FROM dbo.duplicateTest

DELETE TOP (1) FROM dbo.duplicateTest WHERE ID = 1

SELECT * FROM dbo.duplicateTest

The TOP(1) will change depends on number of duplicate record this table contain, this syntax applicable only for table has two duplicate records.

COMPLETE APPROACH USING TOP:

CREATE TABLE #Tempnodups

( Col1 VARCHAR(20),

Col2 VARCHAR(20),

Count INT

)

INSERT INTO # Tempnodups (col1,col2, Count)

SELECT col1,col2, Count(*) FROM DupsNoPK

GROUP BY col1,col2

DECLARE @col1 VARCHAR(20), @col2 VARCHAR(20), @Count INT

DECLARE DuplicateCursor CURSOR

FOR SELECT clo1,col2, Count FROM #Tempnodups

OPEN DuplicateCursor

FETCH NEXT FROM DuplicateCursor INTO @col1, @col2, @Count

WHILE @@FETCH_STATUS = 0

BEGIN

DELETE TOP (@Count-1) FROM DupsNoPK

WHERE col1 = @col1 AND col2 = @col2

FETCH NEXT FROM DuplicateCursor INTO @col1, @col2, @Count

END

CLOSE DuplicateCursor

DEALLOCATE DuplicateCursor

DROP TABLE #Tempnodups

SELECT * FROM DupsNoPK

My Sql Note

Pages

Wednesday, 16 October 2013

Deleting Duplicate Rows

No comments:

Post a Comment