Surrogate
Key Implementation Strategies:
There are several common options for implementing
surrogate keys:
Key
values assigned by the database:
Most of the leading database vendors – companies such as Oracle, Sybase, and
Informix – implement a surrogate key strategy called incremental keys. The
basic idea is that they maintain a counter within the database server, writing
the current value to a hidden system table to maintain consistency, which they
use to assign a value to newly created table rows. Every time a row is created the counter is
incremented and that value is assigned as the key value for that row. The implementation strategies vary from
vendor to vendor, sometimes the values assigned are unique across all tables
whereas sometimes values are unique only within a single table, but the general
concept is the same.
MAX() +
1: A common strategy is to
use an integer column, start the value for the first record at 1, then for a
new row set the value to the maximum value in this column plus one using the
SQL MAX function. Although this approach
is simple it suffers from performance problems with large tables and only
guarantees a unique key value within the table.
Universally
unique identifiers (UUIDs): UUIDs
are 128-bit values that are created from a hash of the ID of your Ethernet
card, or an equivalent software representation, and the current datetime of
your computer system. The algorithm for doing this is defined by the Open
Software Foundation.
Globally
unique identifiers (GUIDs): GUIDs
are a Microsoft standard that extend UUIDs, following the same strategy if an
Ethernet card exists and if not then they hash a software ID and the current
datetime to produce a value that is guaranteed unique to the machine that
creates it.
High-low
strategy: The basic idea is that
your key value, often called a persistent object identifier (POID) or simply an
object identified (OID), is in two logical parts: A unique HIGH value that you
obtain from a defined source and an N-digit LOW value that your application
assigns itself. Each time that a HIGH
value is obtained the LOW value will be set to zero.
For example, if the application that you're
running requests a value for HIGH it will be assigned the value 1701. Assuming
that N, the number of digits for LOW, is four then all persistent object
identifiers that the application assigns to objects will be combination of
17010000,17010001, 17010002, and so on until 17019999. At this point a new value for HIGH is
obtained, LOW is reset to zero, and you continue again. If another application
requests a value for HIGH immediately after you it will given the value of
1702, and the OIDs that will be assigned to objects that it creates will be
17020000, 17020001, and so on. As you
can see, as long as HIGH is unique then all POID values will be unique.
The fundamental issue is that keys are a
significant source of coupling within a relational schema, and as a result they
prove difficult to refactor. The implication is that you want to avoid keys
with business meaning because business meaning changes. However, at the same
time you need to remember that some data is commonly accessed by unique
identifiers, for example customer via their customer number and American
employees via their Social Security Number (SSN). In these cases you may want to use the
natural key instead of a surrogate key such as a UUID or POID.
Tips for Effective Keys:
How can you be effective
at assigning keys? Consider the
following tips:
Avoid "smart" keys.
A "smart" key is one that contains one or more subparts which
provide meaning. For example the first
two digits of an U.S. zip code indicate the state that the zip code is in. The
first problem with smart keys is that have business meaning. The second problem is that their use often
becomes convoluted over time. For example some large states have several codes,
California has zip codes beginning with 90 and 91, making queries based on state
codes more complex. Third, they often
increase the chance that the strategy will need to be expanded. Considering that zip codes are nine digits in
length (the following four digits are used at the discretion of owners of
buildings uniquely identified by zip codes) it's far less likely that you'd run
out of nine-digit numbers before running out of two digit codes assigned to
individual states.
Consider assigning natural keys for simple "look
up" tables. A "look up" table is one that is used to relate codes
to detailed information. For example, you might have a look up table listing
color codes to the names of colors. For example the code 127 represents
"Tulip Yellow". Simple look up tables typically consist of a code
column and a description/name column whereas complex look up tables consist of
a code column and several informational columns.
Natural keys don't always work for "look up"
tables. Another example of a look up table is one that contains a row for each
state, province, or territory in North America.
For example there would be a row for California, a US state, and for
Ontario, a Canadian province. The primary goal of this table is to provide an
official list of these geographical entities, a list that is reasonably static
over time (the last change to it would have been in the late 1990s when the
Northwest Territories, a territory of Canada, was split into Nunavut and
Northwest Territories). A valid natural key for this table would be the state
code, a unique two character code – e.g. CA for California and ON for
Ontario. Unfortunately this approach
doesn't work because Canadian government decided to keep the same state code,
NW, for the two territories.
Your applications must still support "natural key
searches". If you choose to take a surrogate key approach to your database
design you mustn't forget that your applications must still support searches on
the domain columns that still uniquely identify rows. For example, your Customer table may have a
Customer_POID column used as a surrogate key as well as a Customer_Number
column and a Social_Security_Number column.
You would likely need to support searches based on both the customer
number and the social security number.
Searching is discussed in detail in Best Practices for Retrieving Objects
from a Relational Database.
Don't naturalize surrogate keys. As soon as you display the
value of a surrogate key to your end users, or worse yet allow them to work
with the value (perhaps to search), you have effectively given the key business
meaning. This in effect naturalizes the key and thereby negates some of the
advantages of surrogate keys.
The advantage of natural keys is that they exist already, you
don't need to introduce a new "unnatural" value to your data schema.
However, the disadvantage of natural keys is that because
they have business meaning they are effectively coupled to your business: you
may need to rework your key when your business requirements change. For
example, if your users decide to make CustomerNumber alphanumeric instead of
numeric then in addition to updating the schema for the Customer table (which
is unavoidable) you would have to change every single table where
CustomerNumber is used as a foreign key.
There are several advantages to surrogate keys. First, they
aren't coupled to your business and therefore will be easier to maintain
(assuming you pick a good implementation strategy).
For example, if the Customer table instead used a surrogate key then the change would
have been localized to just the Customer table itself (CustomerNumber in this
case would just be a non-key column of the table). Of course, if you needed to make a similar
change to your surrogate key strategy, perhaps adding a couple of extra digits
to your key values because you've run out of values, then you would have the
exact same problem. Second, a common key
strategy across most, or better yet all, tables can reduce the amount of source
code that you need to write, reducing the total cost of ownership (TCO) of the
systems that you build. The fundamental disadvantage of surrogate keys is that
they're often not "human readable", making them difficult for end
users to work with. The implication is that you might still need to implement
alternate keys for searching, editing, and so on.
No comments:
Post a Comment