
Monday, 30 September 2013

Primary Key

The PRIMARY KEY constraint uniquely identifies each record in a database table.

Every table should have a primary key. If the primary key is the same data used by humans to identify the item in the real world, then it is a natural key, e.g., SSN, vehicle VIN, aircraft tail number, part serial number. A natural key helps bridge between the real world and the database.

The alternative to the natural key is the surrogate key, surrogate meaning artificial or a stand-in replacement. For databases, a surrogate key means an artificial, computer-generated value is used to uniquely identify the row. SQL Server supports identity columns and globally unique identifiers (GUIDs) as surrogate keys.
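As a minimal sketch (the table and column names here are purely illustrative, not from any real schema), the two surrogate-key options mentioned above look like this:

-- Identity column as the surrogate key: SQL Server generates 1, 2, 3, ...
CREATE TABLE OrdersIdentity
(
OrderID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
OrderDate datetime NOT NULL
)

-- GUID as the surrogate key: SQL Server generates a uniqueidentifier value
CREATE TABLE OrdersGuid
(
OrderID uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY,
OrderDate datetime NOT NULL
)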

When a PRIMARY KEY constraint is added to an existing column or columns in the table, the Database Engine examines the existing column data and metadata to make sure that the following rules for primary keys are met:

The columns cannot allow null values: Columns of PRIMARY KEY constraints that are specified when a table is created are implicitly converted to NOT NULL. A sparse column cannot be used as part of a primary key because sparse columns must allow null values.
create table id (id int NULL)
alter table id add constraint pk_id primary key(id)

Msg 8111, Level 16, State 1, Line 2
Cannot define PRIMARY KEY constraint on nullable column in table 'id'.
Msg 1750, Level 16, State 0, Line 2
Could not create constraint. See previous errors.

If you attempt to add a primary key constraint on a column that allows NULL values, the statement will immediately fail, even if the column contains no NULL values.
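One way around the error, assuming the column contains no NULL values, is to change the column to NOT NULL first and then add the constraint:

alter table id alter column id int NOT NULL   -- remove the nullability first
alter table id add constraint pk_id primary key(id)   -- now the constraint can be added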

There can be no duplicate values: If a PRIMARY KEY constraint is added to a column that has duplicate values or allows null values, the Database Engine returns an error and does not add the constraint.

You cannot add a PRIMARY KEY constraint that violates the above rules.
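For example, attempting to add a primary key to a column that already contains duplicate values (using a throwaway table purely to demonstrate the error) fails in the same way:

create table dup (id int NOT NULL)
insert into dup (id) values (1)
insert into dup (id) values (1)
-- fails with a duplicate key error; the constraint is not added
alter table dup add constraint pk_dup primary key(id)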

The Database Engine automatically creates a unique index to enforce the uniqueness requirement of the PRIMARY KEY constraint.

If a clustered index does not already exist on the table or a nonclustered index is not explicitly specified, a unique, clustered index is created to enforce the PRIMARY KEY constraint.

If a clustered index already exists on the table, a unique, nonclustered index is created to enforce the PRIMARY KEY constraint.
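You can confirm which kind of index was created for a primary key by querying sys.indexes (the table name here is just an example):

SELECT name, type_desc, is_primary_key
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.Persons')
-- type_desc shows CLUSTERED or NONCLUSTERED for the index that enforces the primary key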

If a PRIMARY KEY constraint is defined on more than one column, values may be duplicated within one column, but each combination of values from all the columns in the PRIMARY KEY constraint definition must be unique.

If a PRIMARY KEY constraint already exists, you can modify or delete it. To modify a PRIMARY KEY constraint, you must first delete the existing PRIMARY KEY constraint and then re-create it with the new definition.

For example, you may want the PRIMARY KEY constraint of the table to reference other columns, or you may want to change the column order, index name, clustered option, or fill factor of the PRIMARY KEY constraint. However, you cannot change the length of a column defined with a PRIMARY KEY constraint.
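A sketch of that drop-and-recreate sequence, assuming a table dbo.Person whose existing constraint is named PK_Person (both names are illustrative):

ALTER TABLE dbo.Person DROP CONSTRAINT PK_Person;

-- re-create the constraint with a new definition, e.g. nonclustered with a different fill factor
ALTER TABLE dbo.Person ADD CONSTRAINT PK_Person
PRIMARY KEY NONCLUSTERED (PersonID) WITH (FILLFACTOR = 90);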

A PRIMARY KEY constraint cannot be deleted if it is referenced by a FOREIGN KEY constraint in another table; the FOREIGN KEY constraint must be deleted first.
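For example, assuming a child table dbo.Orders with a foreign key named FK_Orders_Person that references dbo.Person (names are hypothetical), the constraints have to be dropped in this order:

ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Person;   -- drop the referencing FOREIGN KEY first
ALTER TABLE dbo.Person DROP CONSTRAINT PK_Person;          -- now the PRIMARY KEY can be dropped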

Limitations and Restrictions:
A table can contain only one PRIMARY KEY constraint.

All columns defined within a PRIMARY KEY constraint must be defined as NOT NULL. If nullability is not specified, all columns participating in a PRIMARY KEY constraint have their nullability set to NOT NULL.

For both unique key and primary key constraints, the columns chosen for the key must be of a data type that is allowed for an index. This precludes columns that cannot be compared, such as ntext, text, image, varchar(max), nvarchar(max), varbinary(max), xml, geography, geometry, and CLR user-defined data types that do not support binary ordering.

Permissions:
Creating a new table with a primary key requires CREATE TABLE permission in the database and ALTER permission on the schema in which the table is being created.

Creating a primary key in an existing table requires ALTER permission on the table.

DISABLE PRIMARY KEY

You can disable a primary key in SQL Server (Transact-SQL) by disabling the index that enforces it, using the ALTER INDEX statement.

The syntax to disable a primary key using the ALTER INDEX statement in SQL Server (Transact-SQL) is: 
          ALTER INDEX constraint_name ON table_name DISABLE;
ENABLE PRIMARY KEY
You can enable a primary key using the ALTER INDEX statement in SQL Server (Transact-SQL).
Syntax
          ALTER INDEX constraint_name ON table_name REBUILD;
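As a concrete sketch, disabling and later rebuilding the primary key index on the Persons table created later in this post (the constraint name pk_PersonID is the one used in those examples) would look like:

ALTER INDEX pk_PersonID ON Persons DISABLE;
-- the primary key is not enforced while the index is disabled
ALTER INDEX pk_PersonID ON Persons REBUILD;

Note that if the index being disabled is the clustered index, the table's data cannot be accessed until the index is rebuilt.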


Column-level constraint creation:

CREATE TABLE Persons(
P_Id int NOT NULL PRIMARY KEY,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Address varchar(255),
City varchar(255)
)

CREATE TABLE Persons
(
P_Id int NOT NULL PRIMARY KEY NONCLUSTERED,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Address varchar(255),        
City varchar(255)
)

Table-level constraint creation:

CREATE TABLE Persons
(
P_Id int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Address varchar(255),
City varchar(255),
CONSTRAINT pk_PersonID PRIMARY KEY (P_Id,LastName)
)
CREATE TABLE Persons
(
P_Id int NOT NULL,
LastName varchar(255) NOT NULL,
FirstName varchar(255),
Address varchar(255),
City varchar(255),
CONSTRAINT pk_PersonID PRIMARY KEY NONCLUSTERED (P_Id,LastName)
)

Constraint creation on an existing table:

ALTER TABLE Persons ADD CONSTRAINT pk_PersonID PRIMARY KEY (P_Id,LastName)

It is important to note that, by default, a primary key is clustered. This may or may not be the preferred method of creation. Either way, you can specify the CLUSTERED / NONCLUSTERED option when creating a primary key.

To create a clustered primary key on an existing table:
ALTER TABLE dbo.Person ADD CONSTRAINT PK_Person
PRIMARY KEY CLUSTERED (PersonID);

To create a nonclustered primary key on an existing table:
ALTER TABLE dbo.Person ADD CONSTRAINT PK_Person
PRIMARY KEY NONCLUSTERED (PersonID);

Drop a constraint:
ALTER TABLE Persons
DROP CONSTRAINT pk_PersonID




Recommendation for the Clustering Key:
·         Unique
·         Narrow
·         Static

Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes.

Take, for example, an index in the back of a book: if you need to find the data that an index entry points to, that entry (the index entry) must be unique; otherwise, which index entry would be the one you're looking for? So, when you create the clustered index, it must be unique. But SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will "uniquify" it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique, then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustered index rebuild (which, because of the poor choice for the clustering key, is now more likely).

Although SQL Server assigns a row identifier (RID) to each record in a file, users and user programs can't use an RID as a unique identifier, because you can't guarantee that an RID will remain constant over time. Therefore, an RID doesn't meet the criterion that a primary key's value must never change. An RID is composed of a file number, a page number, and a page's row number. As the position of each record shifts in the file, the record's associated RID changes. The primary key you choose must have a constant and unchanging set of values. When you assign a primary key value to a record, the value must not change for the life of that record, and you can't reuse the value, even after the record is deleted from the table. In addition to these logical reasons for not using an RID as a primary key, you can't access it through any supported programming interface.

Why Narrow?
A clustering key should be narrow for some of the same reasons it should be unique. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. If the clustering key is really wide, then all of the non-clustered indexes will be unnecessarily wide. This will waste disk space, create additional costs on INSERTs and UPDATEs, and require more time (because of size) when rebuilding these index structures. So, what does narrow mean? As few bytes as possible to help uniquely define your rows; a narrow numeric when possible.
Why Static?
A clustering key should be static for some of the same reasons it should be unique and narrow. If the clustering key is used as the lookup key from all non-clustered indexes, then the clustering key is duplicated in all non-clustered indexes. In fact, for a given table the clustering key will be the most duplicated data. If this data changes, then SQL Server needs to update the value in the base table as well as in EVERY non-clustered index. And, if the key changes, it will cause the record to move. When a record moves, it creates fragmentation. This will waste disk space, create additional costs on INSERTs and UPDATEs, require more time (because of record relocation and the likely subsequent splits), and require more maintenance.
OK, so it sounds like I want a narrow, unique, and static value... What about a GUID?
Typically, I recommend a numeric IDENTITY column as the clustering key, but I always get this question. In fact, I often wait to see how long it's going to take before I get this question ;). Anyway, a GUID does meet the criteria fairly well: it's certainly unique, it's usually static, and it's relatively narrow. So, what's wrong with it? In SQL Server 2000, the GUID function (newid()) is built using a value that does not create an ever-increasing pattern (an IDENTITY column would). But wait, I didn't say that you needed to have an ever-increasing pattern...
OK, so the final criterion I look for in a clustering key is: an ever-increasing pattern!
If the clustering key is ever-increasing, then new rows have a specific location where they can be placed. If that location is at the end of the table, then the new row needs space allocated to it, but it doesn't have to make space in the middle of the table. If a row is inserted into a location that doesn't have any room, then room needs to be made (e.g., if you insert based on last name, then as rows come in, space will need to be made where each name should be placed). If room needs to be made, it's made by SQL Server doing something called a split. Splits in SQL Server are 50/50 splits: simply put, 50% of the data stays and 50% of the data is moved. This keeps the index logically intact (the lowest level of an index, called the leaf level, is a doubly-linked list) but not physically intact. When an index has a lot of splits, the index is said to be fragmented. Good examples of an ever-increasing index key are IDENTITY columns (which are also naturally unique, naturally static, and naturally narrow) or something that follows as many of these criteria as possible, like a datetime column (or, since a datetime is not very likely to be unique by itself, a combination of datetime and identity). But wait, what about a GUID?
Well, in SQL Server 2000 the only SQL Server function for GUIDs is newid(), which does not create an ever-increasing pattern.
In SQL Server 2005, you can use a new GUID function called newsequentialid() to populate your uniqueidentifier column. Here's an example of how you can use it:
CREATE TABLE Test
(
TestID uniqueidentifier CONSTRAINT Test_TestID_Default DEFAULT newsequentialid(),
Inserted datetime CONSTRAINT Test_Inserted_Default DEFAULT getdate()
)
go
INSERT Test DEFAULT VALUES
go
SELECT * FROM Test
go
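To see how fragmented the table ends up after a load, one option (a sketch, assuming you also make TestID the clustered primary key of Test) is to check sys.dm_db_index_physical_stats:

SELECT index_id, index_type_desc, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('dbo.Test'), NULL, NULL, 'LIMITED')
go

A NEWID()-based clustering key will typically show much higher fragmentation than a NEWSEQUENTIALID()-based one after the same number of inserts, because the random values force page splits throughout the index.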


Need Further Study:
  1. Is there a way to create a sequential GUID in SQL Server 2000?
  2. Should I use a natural or surrogate key?
There are two strategies for assigning keys to tables:
Natural keys: A natural key is one or more existing data attributes that are unique to the business concept. For the Customer table there were two candidate keys, in this case CustomerNumber and SocialSecurityNumber. Other examples are invoice numbers, tax IDs, SSNs, etc.
Surrogate key: Surrogate keys are keys that have no “business” meaning and are solely used to identify a record in the table. Such keys are either database generated (example: Identity in SQL Server, Sequence in Oracle, Sequence/Identity in DB2 UDB etc.) or system generated values (like generated via a table in the schema).
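A quick sketch contrasting the two strategies for the Customer table mentioned above (all column names are illustrative):

-- Natural key: an existing business attribute identifies the row
CREATE TABLE CustomerNatural
(
CustomerNumber varchar(20) NOT NULL PRIMARY KEY,
CustomerName varchar(100) NOT NULL
)

-- Surrogate key: a database-generated value identifies the row,
-- while the business key is still enforced with a UNIQUE constraint
CREATE TABLE CustomerSurrogate
(
CustomerID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
CustomerNumber varchar(20) NOT NULL UNIQUE,
CustomerName varchar(100) NOT NULL
)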
Surrogate Key
I prefer surrogate keys to be DB controlled rather than being controlled via a next-up table in the schema since that is a more scalable approach.
Pros:
Business Logic is not in the keys.
Small 4-byte key (the surrogate key will most likely be an integer; SQL Server, for example, requires only 4 bytes to store an int, or 8 bytes for a bigint).
Joins are very fast.
No locking contentions because of the unique constraint (this refers to the waits that develop when two sessions try to insert the same unique business key), as the surrogates get generated by the DB and are cached, which is very scalable.
Cons:
An additional index is needed. In SQL Server, the PK constraint will always create a unique index; in Oracle, if an index already exists, PK creation will use that index for uniqueness enforcement (so this is not a con in Oracle).
Cannot be used as a search key.
If it is database controlled, then for products that support multiple databases, different implementations are needed, for example: identity in SQL Server 2000, before triggers and sequences in Oracle, identity/sequence in DB2 UDB.
Always requires a join when browsing the child table(s).
Natural Key
Pros:
No additional Index.
Can be used as a search key.
 Cons:
If not chosen wisely (i.e., if there is business meaning in the key(s)), then over a period of time additions may be required to the PK, and modifications to the PK can occur.
If using strings, joins are a bit slower compared to int data-type joins, and storage is higher as well. Since storage is higher, fewer data values get stored per index page. Also, reading strings is a two-step process in some RDBMSs: one step to get the actual length of the string and a second to actually perform the read operation to get the value.
Locking contentions can arise if using application driven generation mechanism for the key.
A record can't be entered until the key value is known, since the value has business meaning.
Choosing Surrogate vs. Natural Keys:
There is no rule of thumb in this case. It has to be evaluated table by table:
If we can identify an appropriate natural key that meets the criteria for it to be a PK column, we should use it. Look-up tables and configuration tables are typically OK candidates for natural keys.

Data type for the PK: the smaller the better; choose an integer or a short character data type. This also ensures that joins will be faster. It becomes even more important if you are going to make the PK a clustered index, since non-clustered indexes are built off the clustered index. An RDBMS processes integer data values faster than character data values because it converts characters to ASCII-equivalent values before processing, which is an extra step.
