This seems to be a self-explanatory title and before you start raising your eyes about the need for a blog post on this, let me set the context. I am not going to teach how to do some thing which is mentioned in a KB Article:
How to remove duplicate rows from a table in SQL Server
What I am going to cover in a post is “when you have inconsistencies in a SQL Server database table regarding duplicate entries in a column which has a Primary Key constraint defined on it, you need to be aware of a gotcha before you start tracking down the duplicate rows”.
For this I have table with the following definition:
create table t (c1 int primary key)
When I do a SELECT on the table, I get the following output:
Yes, there are duplicate rows in the Primary Key column! And no, I am not going to show how I made that happen. There are some pretty smart people out there who already know this! And for the rest who are still in the dark as to how this can be accomplished, I am not covering any undocumented stuff here!
A DBCC CHECKTABLE on the table yields the following error:
DBCC results for ‘t’.
Msg 2512, Level 16, State 2, Line 1
Table error: Object ID 85575343, index ID 1, partition ID 72057594038976512, alloc unit ID 72057594039894016 (type In-row data). Duplicate keys on page (1:192) slot 0 and page (1:192) slot 1.
There are 2 rows in 1 pages for object “t”.
CHECKTABLE found 0 allocation errors and 1 consistency errors in table ‘t’ (object ID 85575343).
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (dbCorrupt.dbo.t).
When I use the query from the above KB Article to find the duplicate rows, I get back no records:
select c1,COUNT(*) as counts
group by c1
having COUNT(*) > 1
In the above query, if you remove the HAVING clause, you will find that both the rows are returned with a counts column value of 1 each. This is because one of the values was changed to 100 without the knowledge of the Storage Engine.
When I execute the query below, I get back a single row:
select c1 from t where c1 = 100
If you look at the Scan Count from the Statistics IO output of the query, you will find that the Scan Count is 0. The logic behind this is explained on a previous post that I had written on the MSDN SQLServerFAQ blog some time back. This is because SQL Server doesn’t expect to retrieve more than 1 row when searching for a value belonging to a primary key column.
Now, I put all the contents of the table “t” into another table “t1” and the above query to find out duplicates returned the c1 = 100 having 2 records. Also, when I drop the primary key column on the table, I was able to retrieve the duplicate rows. The catch now is to determine which duplicate row needs to be deleted if the non-primary key columns of the table have different values. This requires understanding of the application business logic and tracing down the prior inserts done in the table which have been logged at a different layer. NEVER delete duplicate rows without understanding the repercussions of deleting a row from a production database table unless and until the duplicate rows have the same data. Eg. In the above table, I can use a DELETE TOP 1 command to delete from the table “t” because I only have one column in the table and both the duplicate rows have the same values.
To summarize, if your Primary Key column has duplicate values, then you need to either put all the rows into another table or remove the Primary Key constraint and then find out the duplicates. Just thought I would put this out for everyone in case you weren’t able to identify why duplicate rows were not showing up using a SELECT..GROUP BY.. HAVING query where as DBCC CHECK commands were reporting errors.