From SQLskills.com, by Paul S. Randal
In one of the sessions Kimberly and I taught this week at SQL Connections, we discussed how to choose efficient data-types – I'd like to share the discussion here with an example. The crux of the matter is the amount of space required to store the data-types in the schema, and ensuring that the data-type choice matches the data being stored without incurring any unnecessary overhead. Here are a few examples:
· A field that stores a person's age will usually be able to use a TINYINT type, which can store a maximum value of 255 – unless it's an archeological or historical database, for instance. Without putting a lot of thought into it, however, someone may choose to use an INTEGER type. What's the difference? A TINYINT takes a single byte to store, whereas an INTEGER takes *4* bytes to store – wasting 3 bytes per record. (There's a quick sketch after this list that shows these sizes.)
· A field that stores a person's city of residence in the USA needs to be able to hold 24 characters (see my previous post on how long fields have to be) – so what data-type should you use? A CHAR (24) will be able to store all the possible values, but will *always* take up 24 bytes in the record as it's a fixed-length column. A VARCHAR (24), on the other hand, will only store the number of bytes equal to the number of characters in the city name, so using the fixed-length type will waste a varying number of bytes per record.
· In the USA, the names of the various States are commonly abbreviated to two characters – for instance, Washington = WA and California = CA. So is the best type to use CHAR (2) or VARCHAR (2)? Given that the abbreviations are always 2 characters, they'll both store 2 bytes all the time. BUT, a variable-length column has a two-byte overhead (the two-byte pointer for the column that's stored in the variable-length column offset array in the record) – so in this case the best data-type to use is CHAR (2).
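If you want to see these sizes for yourself, here's a minimal sketch using DATALENGTH (the variable names and values are just illustrations; note that DATALENGTH reports the bytes used by the data itself, so it won't show the extra two-byte overhead a variable-length column carries in the record):
DECLARE @AgeTinyint TINYINT;
DECLARE @AgeInt INTEGER;
DECLARE @CityChar CHAR (24);
DECLARE @CityVarchar VARCHAR (24);
SET @AgeTinyint = 34;
SET @AgeInt = 34;
SET @CityChar = 'Redmond';
SET @CityVarchar = 'Redmond';
SELECT
DATALENGTH (@AgeTinyint) AS [AgeAsTinyint], -- 1 byte
DATALENGTH (@AgeInt) AS [AgeAsInt], -- 4 bytes
DATALENGTH (@CityChar) AS [CityAsChar], -- always 24 bytes, space-padded
DATALENGTH (@CityVarchar) AS [CityAsVarchar]; -- 7 bytes for 'Redmond'
GO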
You can see how choosing data-types without considering whether each is the best fit can lead to wasted space. Let's look at an example schema to support a population census of the USA:
CREATE TABLE CensusSchema1 (
SSN CHAR (256),
StateName CHAR (256),
Sex INTEGER,
Employed INTEGER,
SelfEmployed INTEGER,
EthnicOrigin INTEGER,
MaritalStatus INTEGER,
NumberOfDependents INTEGER,
Age INTEGER,
CountryOfBirth INTEGER,
HouseholdIncomeGroup INTEGER,
ZipCode5 INTEGER);
GO
At first glance this may look reasonable, but digging in you'll see that many of the columns are over-sized. Here's a cleaned-up schema to compare against, with notes on the per-column savings:
CREATE TABLE CensusSchema2 (
SSN CHAR (9), -- saving 247 bytes
StateName VARCHAR (256), -- saving at least 240 bytes (longest state name is 14 characters + 2-byte varchar overhead)
Sex BIT,
Employed BIT,
SelfEmployed BIT, -- saving 11 bytes altogether over these three fields
EthnicOrigin TINYINT, -- saving 3 bytes
MaritalStatus TINYINT, -- saving 3 bytes
NumberOfDependents TINYINT, -- saving 3 bytes
Age TINYINT, -- saving 3 bytes
CountryOfBirth TINYINT, -- saving 3 bytes
HouseholdIncomeGroup TINYINT, -- saving 3 bytes
ZipCode5 INTEGER); -- no saving
GO
The bad schema gives a per-row size of 574 bytes and the cleaned-up schema is no more than 48 bytes per row. I designed these two schemas to support a US census. The population of the USA is approx. 300 million, so the bad schema would take around 160GB to store all the info and the cleaned-up schema only takes around 13.5GB – roughly 12 times more efficient! Now we're starting to see how poor data-type choice can lead to poor performance.
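You don't have to take these row sizes on faith. After creating both tables and loading some test rows, a query like the sketch below (assuming both tables live in the current database) reports the average record size and page count for each table, using the DETAILED mode of sys.dm_db_index_physical_stats:
SELECT
OBJECT_NAME (object_id) AS [TableName],
avg_record_size_in_bytes,
page_count
FROM sys.dm_db_index_physical_stats (
DB_ID (), OBJECT_ID (N'CensusSchema1'), NULL, NULL, 'DETAILED')
WHERE index_level = 0 -- just the data pages
UNION ALL
SELECT
OBJECT_NAME (object_id),
avg_record_size_in_bytes,
page_count
FROM sys.dm_db_index_physical_stats (
DB_ID (), OBJECT_ID (N'CensusSchema2'), NULL, NULL, 'DETAILED')
WHERE index_level = 0;
GO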
Wider rows means:
· Fewer rows can fit on an 8KB page.
· More CPU is necessary to crack open a record, due to the number of CPU data cache line invalidations necessary to read the record into the CPU's various caches (every time a cache line is invalidated it takes a bunch of CPU cycles – see here for an explanation of CPUs and cache lines).
Fewer rows per page means:
· More pages are needed to store the data.
· Indexes could have a smaller fan-out (if the index keys are wider than they need to be), leading to more levels in the index and less efficient navigation from the index root page down to the index leaf level. (There's a quick way to check index depth sketched below.)
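One way to see the effect on index depth is the index_depth column returned by sys.dm_db_index_physical_stats. The sketch below assumes you've created one or more indexes on CensusSchema2 – the index names will be whatever you chose when creating them:
SELECT
i.[name] AS [IndexName],
ips.index_depth, -- number of levels from the root page down to the leaf level
ips.page_count
FROM sys.dm_db_index_physical_stats (
DB_ID (), OBJECT_ID (N'CensusSchema2'), NULL, NULL, 'DETAILED') AS ips
JOIN sys.indexes AS i
ON i.[object_id] = ips.[object_id] AND i.index_id = ips.index_id
WHERE ips.index_level = 0 -- one row per index
AND ips.index_id > 0; -- skip the heap itself
GO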
More pages means:
· More I/Os are necessary to read the same amount of actual data.
· More buffer pool memory is necessary to hold the same amount of actual data.
· More disk space is necessary to hold the same amount of actual data (a quick way to compare the space each table uses is sketched below).
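Once both tables are loaded with the same test data, a simple way to compare their footprints is sp_spaceused – just a sketch, run against whichever table names you actually used:
EXEC sp_spaceused N'CensusSchema1'; -- reports rows, reserved, data, index_size, unused
EXEC sp_spaceused N'CensusSchema2';
GO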
Clearly there's a link between various aspects of workload performance and badly chosen data-types.