Removing duplicate rows from a table


This seems to be a self-explanatory title and before you start raising your eyes about the need for a blog post on this, let me set the context. I am not going to teach how to do some thing which is mentioned in a KB Article:

How to remove duplicate rows from a table in SQL Server
http://support.microsoft.com/kb/139444

What I am going to cover in a post is “when you have inconsistencies in a SQL Server database table regarding duplicate entries in a column which has a Primary Key constraint defined on it, you need to be aware of a gotcha before you start tracking down the duplicate rows”.

For this I have table with the following definition:

create table t (c1 int primary key)

When I do a SELECT on the table, I get the following output:

c1


100

100

Yes, there are duplicate rows in the Primary Key column! And no, I am not going to show how I made that happen. There are some pretty smart people out there who already know this! Smile And for the rest who are still in the dark as to how this can be accomplished, I am not covering any undocumented stuff here!

A DBCC CHECKTABLE on the table yields the following error:

DBCC results for ‘t’.

Msg 2512, Level 16, State 2, Line 1

Table error: Object ID 85575343, index ID 1, partition ID 72057594038976512, alloc unit ID 72057594039894016 (type In-row data). Duplicate keys on page (1:192) slot 0 and page (1:192) slot 1.

There are 2 rows in 1 pages for object “t”.

CHECKTABLE found 0 allocation errors and 1 consistency errors in table ‘t’ (object ID 85575343).

repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKTABLE (dbCorrupt.dbo.t).

 

When I use the query from the above KB Article to find the duplicate rows, I get back no records:

select c1,COUNT(*) as counts
from t
group by c1
having COUNT(*) > 1

In the above query, if you remove the HAVING clause, you will find that both the rows are returned with a counts column value of 1 each. This is because one of the values was changed to 100 without the knowledge of the Storage Engine.

When I execute the query below, I get back a single row:

select c1 from t where c1 = 100

If you look at the Scan Count from the Statistics IO output of the query, you will find that the Scan Count is 0. The logic behind this is explained on a previous post that I had written on the MSDN SQLServerFAQ blog some time back. This is because SQL Server doesn’t expect to retrieve more than 1 row when searching for a value belonging to a primary key column.

Now, I put all the contents of the table “t” into another table “t1” and the above query to find out duplicates returned the c1 = 100 having 2 records. Also, when I drop the primary key column on the table, I was able to retrieve the duplicate rows. The catch now is to determine which duplicate row needs to be deleted if the non-primary key columns of the table have different values. This requires understanding of the application business logic and tracing down the prior inserts done in the table which have been logged at a different layer. NEVER delete duplicate rows without understanding the repercussions of deleting a row from a production database table unless and until the duplicate rows have the same data. Eg. In the above table, I can use a DELETE TOP 1 command to delete from the table “t” because I only have one column in the table and both the duplicate rows have the same values.

To summarize, if your Primary Key column has duplicate values, then you need to either put all the rows into another table or remove the Primary Key constraint and then find out the duplicates. Just thought I would put this out for everyone in case you weren’t able to identify why duplicate rows were not showing up using a SELECT..GROUP BY.. HAVING query where as DBCC CHECK commands were reporting errors.

Find FullText Search settings information for all catalogs & databases (in a jiffy)


Good morning folks – Ever worked on full text search in SQL Server and found it frustrating due to lack of information when anything goes bad & you are left poking around on what information to collect. Here is a script that you can use to collect Full-text (FTS) information for catalogs in ALL your databases. This will tell you the following,

  • Catalog Name & catalog ID
  • Item count
  • Population Status
  • Is Change Tracking enabled or not
  • Last Crawl – was it FULL, Incremental or UPDATE. Also gives you time of last crawl
  • Path of catalog storage location
EXEC sp_msforeachdb 'IF EXISTS (select * from ?.sys.fulltext_catalogs) BEGIN PRINT ''In Database: ?'' SELECT cat.name AS [CatalogName], cat.fulltext_catalog_id AS [CatalogID], FULLTEXTCATALOGPROPERTY(cat.name,''LogSize'') AS [ErrorLogSize], FULLTEXTCATALOGPROPERTY(cat.name,''IndexSize'') AS [FullTextIndexSize], FULLTEXTCATALOGPROPERTY(cat.name,''ItemCount'') AS [ItemCount], FULLTEXTCATALOGPROPERTY(cat.name,''UniqueKeyCount'') AS [UniqueKeyCount], FULLTEXTCATALOGPROPERTY(cat.name,''PopulateStatus'') AS [PopulationStatus], tbl.change_tracking_state_desc AS [ChangeTracking], tbl.crawl_type_desc AS [LastCrawlType], tbl.crawl_start_date AS [LastCrawlSTARTDate], tbl.crawl_end_date AS [LastCrawlENDDate], ISNULL(cat.path,N'''') AS [RootPath], CAST((select(casewhen exists(select distinct object_id from sys.fulltext_indexes fti where cat.fulltext_catalog_id = fti.fulltext_catalog_id and OBJECTPROPERTY(object_id, ''IsTable'')=1)then 1 else 0 end)) AS bit) AS [HasFullTextIndexedTables] FROM ?.sys.fulltext_catalogs AS catLEFT OUTER JOIN ?.sys.filegroups AS fg ON cat.data_space_id = fg.data_space_id LEFT OUTER JOIN ?.sys.database_principals AS dp ON cat.principal_id=dp.principal_idLEFT OUTER JOIN ?.sys.fulltext_indexes AS tbl ON cat.fulltext_catalog_id = tbl.fulltext_catalog_id PRINT '''' END'

How to check database statistics last updated date/time & Percent Rows modified in SQL Server


Well, here goes my first post – I will keep it simple. As a DBA you’ve probably heard people talking about statistics ALL the time. You probably have jobs/maintenance plans setup to update statistics. But if you ever ran into an issue with slow query performance and wanted to do a quick check on the statistics for ALL indexes in the database, you would use either dbcc show_statistics or sp_autostats. But they work per index/table. Here is a way to do this for the entire database using TSQL query,

— For SQL 2005/SQL 2008

select a.id as 'ObjectID', isnull(a.name,'Heap') as 'IndexName', b.name as 'TableName',
stats_date (id,indid) as stats_last_updated_time
from sys.sysindexes as a
inner join sys.objects as b
on a.id = b.object_id
where b.type = 'U'

Note: Rowmodctr values displayed above are not entirely reliable starting with SQL 2005 onwards. But, you can rely on last updated time to make a decision if the statistics need to be updated.
Read here for more

Many of you might have lot of tables or have large databases where you only want to update statistics using ‘UPDATE STATISTICS’ or ‘sp_updatestats’, but only for those tables that have large row modifications. Here is a script that you can use to get the output of each index that has significant row modifications. You can pipe this to a temp table and choose to update statistics on only these tables. You can do this by looking at the “ModifiedPercent” column.

select
schemas.name as table_schema,
tbls.name as table_name,
i.name as index_name,
i.id as table_id,
i.indid as index_id,
i.rowmodctr as modifiedRows,
(select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2) as rowcnt,
convert(DECIMAL(18,8), convert(DECIMAL(18,8),i.rowmodctr) / convert(DECIMAL(18,8),(select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2))) as ModifiedPercent,
stats_date( i.id, i.indid ) as lastStatsUpdate
from sysindexes i
inner join sysobjects tbls on i.id = tbls.id
inner join sysusers schemas on tbls.uid = schemas.uid
inner join information_schema.tables tl
on tbls.name = tl.table_name
and schemas.name = tl.table_schema
and tl.table_type='BASE TABLE'
where 0 < i.indid and i.indid < 255
and table_schema <> 'sys'
and i.rowmodctr <> 0
and i.status not in (8388704,8388672)
and (select max(rowcnt) from sysindexes i2 where i.id = i2.id and i2.indid < 2) > 0

— For SQL 2000

SELECT 'Index Name' = i.name, 'Table Name' = o.name,
'Statistics Date' = STATS_DATE(i.id, i.indid)
FROM sysobjects o, sysindexes i
where o.id = i.id

You can put this in a cursor for every database if you want to and play around with it. Will post back another script which will help you decide what criteria you want to pick when updating indexes. Stay tuned….

Addition: February 4, 2012: An updated version of the script is available here.