Defination : Normalization is the
process of efficiently organizing data in a database. There are two goals of
the normalization process: eliminating
redundant data (for example, storing the same data in more than one table)
and ensuring data dependencies make
sense (only storing related data in a table). Both of these are worthy
goals as they reduce the amount of space a database consumes and ensure that
data is logically stored. There are several benefits for using Normalization in
Database.
Benefits:
Eliminate data redundancy
Improve performance
Query optimization
Faster update due to less number of columns in
one table
Index improvement
First normal form (1NF):
We begin with a
definition of the first normal form:
A relation is in the first normal form
(1NF) if all its attributes are atomic (i.e., each attribute is defined on a
single domain), and each record in the relation is characterized by a unique
primary key value.
In defining atomic attributes, Codd was
deliberate about eliminating repeating groups (the scenario where a
column (attribute) could store an array or list of possible values); this was
prevalent in CODASYL and COBOL. Atomicity means that each attribute is defined
on a single domain and stores a single value for any tuple in the relation.
Moreover, implicit in this definition is the enforcement of the entity
integrity constraint (section 4.1).
As such, tuples that have null values in their primary key will not be allowed
to exist; neither will there be tuples with duplicate primary key values in the
relation. (These are problems that persisted and perhaps still do outside of
the context of 1NF.)
By this definition, all relations are in 1NF.
This is by no means coincidental, but by design: we define a relation to
consist of atomic attributes, and subject to the entity integrity constraint
and the referential integrity constraint. However, as you will soon see, having
relations in 1NF only is often not good enough.
Example 4-4 provides
an example of a relation that is in 1NF but is poorly designed. It may surprise
you to learn that this problem was once very widespread in accounting software
systems. As an exercise, try proposing a more efficient design for the relation
described in the example.
4.5.1 Problems with Relations in 1NF Only
In the previous
section, R0 was presented as a non-normalized file to be replaced with three
other relations. Let's revisit that discussion for further insight. For ease of
reference, R0 is repeated here:
R0 {Suppl#, SupplName, Item#, ItemName,
Quantity, SupplStatus, Location} PK [Suppl#, Item#]
Relation R0 is in
1NF only. However, it is undesirable to store it in its current form due to a
number of problems. For ease of reference, the functional dependencies of R0 as
illustrated in Figure 4-2 are repeated here:
FD1:
Suppl# → {SupplName, SupplStatus, Location}
FD2:
Item# → ItemName
FD3:
[Suppl#, Item#] → {Quantity, SupplName, SupplStatus, Location, ItemName}
Figure 4-3 shows
some sample data depicting various shipments of inventory items for R0. A
cursory analysis of the data will amplify the following data anomalies:
Replication
of Data: Every time we record a supplier-item
pair, we also have to record the supplier name and item name.
Insertion
Anomaly: We cannot insert a new item until it
is supplied; neither can we insert a new supplier until that supplier supplies
some item.
Deletion
Anomaly: We cannot delete an item or a
supplier without destroying an entire shipment, as well as information about a
supplier's location.
Update
Anomaly: If we desire to update a supplier's
location or item name, we have to update several records (in fact, an entire
shipment) due to the duplication problem.
Insertion,
deletion, and update anomalies constitute modification anomalies, caused by
duplication of data due to improper database design. As stated earlier, these
problems are exacerbated as more data are added, thus leading to other problems
of data access and database performance — a tangled hellish web.
Second Normal Form: The second normal
form draws in the concept of functional dependence to shape an elevated
benchmark beyond mere 1NF requirement. Here is the definition:
A
relation is in the second normal form (2NF) if it is in 1NF and every non-key
attribute is fully functionally dependent on the primary key.
By
non-key attribute, we mean that the attribute is not part of the primary key.
Relation R0 (of the previous section), though in 1NF, is not in 2NF, due to FD1
and FD2. Using Heath's theorem, we may decompose relation R0 as follows (note
that the abbreviation PK is used to denote the primary key):
R1{Suppl#, SuppName, Location,
SupplStatus} PK[Suppl#]
R2{Item#, ItemName} PK[Item#]
R3{Suppl#, Item#, Quantity} PK[Suppl#,
Item#]
We
then check to ensure that the resulting relations are in 2NF. Relation R1 has a
single attribute as its primary key, and so does R2; there is therefore no
possibility of either relation being in violation of 2NF. As for relation R3,
there is only one non-key attribute and it is dependent on the primary key. We
may therefore conclude with confidence that all three relations (R1, R2, and
R3) are in 2NF.
So
based on the definition of 2NF, and on the authority of Heath's theorem, we
would replace R0 with R1, R2, and R3. Please note the consequences of our
treatment of R0 so far:
The
problems with relations in 1NF only have been addressed.
By
decomposing, we have introduced foreign keys in relation R3.
JOINing
is the opposite of PROJecting. We can rebuild relation R0 by simply JOINing R3
with R1 and R3 with R2, on the respective foreign keys.
From
the definition of 2NF, two observations should be obvious. Firstly, if you have
a relation with a single attribute as the primary key, it is automatically in
2NF. Secondly, if you have a relation with n attributes and n-1 of them form
the primary key, the relation may very well be in 2NF but you must first verify
this.
4.6.1
Problems with Relations in 2NF Only
In
this example, relations R2 and R3 are in 2NF (in fact, they are in 3NF), but we
still have potential problems with R1. What if we have a situation where there are several suppliers from a
given location? Or what if we want to keep track of locations of interest?
In either case, we would have modification anomalies as described below:
Insertion Anomaly: We cannot record
information about a location until we have at least one supplier from that
location.
Deletion Anomaly: We cannot delete
a particular location without also deleting supplier(s) from that location.
Update Anomaly: If we wish to
update information on a location, we have to update all supplier records from
that location.
These
problems can be addressed if we take the necessary steps to bring R1 into the
third normal form (3NF). But first, we must define 3NF.
Third Normal Form:
The third normal
form extends constraints related to functional dependence to non-key attributes
(i.e. attributes that are not part of the primary key). The following are three
alternate definitions:
A relation is in the third normal form (3NF)
if it is in 2NF and no non-key attribute is fully functionally dependent on
other non-key attribute(s).
Put another way, a relation is in 3NF if
non-key attributes are mutually independent and fully functionally dependent on
the primary key. (Two or more attributes are mutually independent if none of
them is functionally dependent on any combination of the others.)
Put another way, a relation is in 3NF if it is
in 2NF and every non-key attribute is non-transitively dependent on the primary
key. (Non-transitivity implies mutual independence.)
Transitive
dependence refers to dependence among non-key attributes. In particular, if A →
B and B → C, then C is transitively dependent on A (i.e., A → C transitively).
Let's revisit
relations R1, R2, and R3 of the previous section (and recall that these are
decompositions of R0). Relation R1 is problematic because it is not in 3NF. If
it is desirable to store additional information about the locations as
indicated in the previous section, then we must be smart enough to discern that
location is to be treated as an entity with attributes such as location code,
location name, and perhaps others. We may therefore rewrite R1 as follows:
R1{Suppl#,
SupplName, LocationCode, LocationName, SupplStatus} PK[Suppl#]
Now observe that
LocationCode → LocationName! R1 therefore violates 3NF and must be decomposed.
Using Heath's theorem, we may therefore decompose R1 as follows:
R4{Suppl#, SupplName, LocationCode} PK[Suppl#]
R5{LocationCode, LocationName}
PK[LocationCode]
We now check to
ensure that the relations are in 3NF (and they are). Again, please take careful
notice of the consequences of our actions to this point:
The problems with relations in 2NF only have
been addressed.
Again, by
decomposing, we have introduced a foreign key in relation R4.
We can obtain the
information that relation R1 represented by simply joining R4 with R5 on the
foreign key.
From the definition
of 3NF, it should be obvious that if you have a 2NF relation with one candidate
key and n mutually independent non-key attributes, or only one non-key
attribute, that relation is in 3NF.
4.7.1 Problems with Relations in 3NF Only
Relations R2, R3,
R4, and R5 above are all in 3NF. However, it has been found that 3NF-only
relations suffer from certain inadequacies. It is well known that 3NF does not
deal satisfactorily with cases where the following circumstances hold:
There are
multiple composite candidate keys in a relation.
The candidate
keys overlap (they have at least one attribute in common).
For these
situations, the Boyce-Codd normal form (BCNF) provides the perfect solution. As
you shall soon see, the BCNF is really a refinement of 3NF. In fact, where the
above-mentioned conditions do not hold, BCNF reduces to 3NF.
Boyce-Codd Normal Form:
The Boyce-Codd
normal form (BCNF) was developed thanks to the effort of Raymond F. Boyce and
Edgar F. Codd. This normal form is really a refinement of the third normal
form. Simply, the BCNF requirement states
A relation is in
BCNF iff every determinant in the relation is a candidate key.
Drawing from
subsection 4.4.1, a determinant is an attribute (or group of attributes) on
which some other attribute(s) is (are) fully functionally dependent.
Examination of R2, R3, R4, and R5 above will quickly reveal that they are in
BCNF (hence 3NF). We therefore need to find a different example that
illustrates the importance of BCNF.
Consider the
situation where it is desirous to keep track of animals in various zoos,
multiple keepers, and the assigned keepers for these animals. Let's tentatively
construct the relation R6 as
R6{Zoo,
Animal, Keeper}
Assume further
that that a keeper works at one and only one zoo. We can therefore identify the
following FDs:
[Zoo, Animal] → Keeper
Keeper → Zoo
Given the above,
we conclude that [Zoo, Animal] is the primary key. Observe that R6 is in 3NF
but not in BCNF, since Keeper is not a candidate key but is clearly a
determinant. Using Heath's theorem, we may decompose R6 as follows:
R7{Animal, Keeper} PK[Animal]
R8{Keeper, Zoo} PK[Keeper]
Finally, let's
clean up by making the observation that Zoo, Keeper, and Animal are really
disguised entities, not mere attributes (similar to Location in the previous
section), and should be treated as such. Minimally, each would require a code
and a name. Below is the revised list of normalized relations for the case:
R7{ZooID, ZooName, …} PK[ZooID]
R8{KeeperID, KeeperName, …ZooID} PK[KeeperID]
R9{AnimalCode, AnimalName, …KeeperID}
PK[AnimalCode]
Notice that the
test for BCNF is quite straightforward: you simply check to ensure that the
only determinant in each relation is a candidate key. As on previous occasions,
let's examine the consequences of our action:
By achieving
BCNF, we benefit from further reduction in data duplication and modification
anomalies.
A further
advantage is that we can now store dangling tuples. In our example, a keeper
can be assigned to a zoo even before he/she is assigned an animal.
One drawback of
this schema is that we cannot store an animal that has not been assigned a
keeper or zoo. The schema can be further refined but that requires a discussion
of the fourth normal form (4NF); this will be done in the upcoming section.
One possible
drawback with BCNF is that more relations have to be accessed (joined) in order
to obtain useful information. Again referring to the example, R9 must be joined
with R8 in order to derive Zoo-Animal pairs; and to obtain the name of the zoo
(and no doubt other information), a second join must be made with R7. In light
of the fact that logical views can facilitate a seamless end user perspectives
that obscure these underlying joins (more on this in Chapter 13), and
processing power continues on its upswing (according to Moore's Law), the
benefits of achieving BCNF tend to outweigh the drawbacks by a large margin.
Observe:
The principle of
BCNF is very simple but profound. By being guided by it, you can, in many
circumstances, actually bypass obtaining 2NF and 3NF relations, and move
directly into a set of BCNF relations. Adopting this approach will significantly
simplify the analysis process. Moreover, in most practical situations, you will
not be required to normalize beyond BCNF. This approach will be further
clarified in the next chapter.
Fourth Normal Form:
The fourth normal
form (4NF) relates to the situation where mutually independent but related
attributes form a relation and the inefficient arrangement causes duplication
and hence modification anomalies. Consider the relation CTT-Schedule,
representing course-teacher-text combinations in an educational institution.
Assume the following
business rules:
A course can be taught by several teachers.
A course can require any number of texts.
Teachers and texts are independent of each
other; in other words, the same texts are used irrespective of who teaches the
course.
A teacher can teach several courses.
Figure 4-4
provides some sample data for the purpose of illustration. Notice the high
level of duplication due to the structure of the relation and the prevailing
business rules.
Figure 4-4: CTT
schedule
Note that the
theory so far does not provide a method of treating such a situation, except
flattening the structure (by making each attribute part of the primary key) as
shown below:
R10{Course,
Teacher, Text} PK[Course, Teacher, Text]
Since R10 is
keyed on all its attributes, it is in BCNF. Yet, two potential problems are
data redundancy and modification anomalies (the former leading to the latter).
In our example, in order to record that Calculus II is taught by both Professor
Cross and Professor Maitland, four records are required. In fact, if a course
is taught by p professors and requires n texts, the number of records required
to represent this situation is p*n. This is extraordinary, and could prove to
be very demanding on storage space, especially as the number of records increases.
Relation R10,
though in BCNF, is not in 4NF, because it has a peculiar dependency called a
multi-valued dependency (MVD). In order to state the 4NF, we must first define
MVD.
4.9.1 Multi-Valued Dependency
A multi-valued
dependency (MVD) is defined as follows:
Given a relation
R(A, B, C), the MVD A -» B (read "A multi-determines B") holds iff
every B-value matching a given (A-value, C-value) pair in R depends only on the
A-value and is independent of the C-value.
Further, given
R(A B C), A-» B holds iff A -» C also holds. MVDs always go together in pairs
like this. We may therefore write A -» B/C.
MVDs do not occur
as frequently as FDs. Moreover, as a consequence of the definition of MVD, it
is important to note the following points:
For MVD, at least
three attributes must exist.
All FDs are MVDs
but all MVDs are not necessarily FDs.
A -» B reads
"A multi-determines B" or "B is multi-dependent on A."
Let's get back to
R10: Course -» Text/Teacher. Note that Course is the pivot of the MVD. Course
-» Teacher since Teacher depends on Course, independent of Text. Course -» Text
since Text depends on Course, independent of Teacher. So how do we resolve
this? Fagin's theorem provides the answer.
4.9.2 Fagin's Theorem:
Fagin's theorem
(named after Ronald Fagin, who proposed it) may be stated as follows:
Relation R{A, B,
C} can be non-loss decomposed into projections R1{A, B} and R2{A, C} iff the
MVDs A -» B/C both hold.
Note that like
Heath's theorem, which prescribes how to treat FDs, Fagin's theorem states
exactly how to treat MVDs. With this background, we can proceed to defining the
4NF:
A relation is in
4NF iff whenever there exists an MVD, say A -» B, then all attributes of R are
also functionally dependent on A.
Put another way,
R{A, B, C…} is in 4NF iff every MVD satisfied by R is implied by the candidate
key of R.
Put another way,
R{A, B, C…} is in 4NF iff the only dependencies are of the form [candidate key]
→ [other non-key attribute(s)].
Put another way,
R{A, B, C…} is in 4NF iff it is in BCNF and there are no MVDs (that are not
FDs).
The last
formulation is particularly instructive. To paraphrase, whenever you encounter
a relation that has an unresolved MVD, use Fagin's theorem to resolve by
replacing the MVD with equivalent FDs. In the current example, R10 is not in
4NF. This is so because although it is in BCNF, an MVD exists. Using Fagin's
theorem, we may decompose it as follows:
R11{Course, Text}
PK[Course, Text]
R12{Course,
Teacher} PK[Course, Teacher]
Note
Fagin's theorem
prescribes a method of decomposing a relation containing an MVD that is
slightly different from the decomposition of an FD as prescribed by Heath's
theorem. Figure 4-5 clarifies this.
Figure 4-5:
Treating MVDs
As an additional
step to resolving the CTT-Schedule problem, it is a good idea to refine the
solution by applying the following strategies: replace euphemistic relation
names (R11 and R12) with more meaningful names; recognize that Course, Text,
and Teacher are actually entities, not mere attributes; return to the use of
unique attribute names by using the prefixing strategy that was introduced
toward the end of Chapter 3. A more refined solution is as follows:
Course{Crs#,
CrsName, CrsCredits, …} PK[Crs#]
Teacher{TeacherEmp#,
Emp_FName, Emp_LName, …} PK[TeacherEmp#]
Text{TextCode,
TextTitle, …} PK[TextCode]
CourseTextMap{CT_Crs#,
CT_TextCode} PK[CT_Crs#, CT_TextCode]
CourseTeacherMap{CR_Crs#,
CR_TeacherEmp#} PK[CR_Crs#, CR_TeacherEmp#]
From this
proposed solution, it should be obvious to you that the attributes of
CourseTextMap and CourseTeacherMap are foreign keys. You could then construct a
RAL (review the latter part of Chapter 3) that clearly outlines the database
conceptual schema for the stated problem; this is left as an exercise for you.
4.9.3 The Zoo
Revisited
Let's revisit the
zoo problem of the previous section. It was mentioned that further refinement
was needed to allow for storing animals not yet assigned to keepers or zoos. To
facilitate this, we need to recognize the presence of an MVD:
Animal -»
Keeper/Zoo
The partial
solution given in section 4.8 is repeated here, with meaningful relation names
introduced to replace the more cryptic (euphemistic) names used up to this
point in the chapter.
ZoojZooID,
ZooName, …} PK[ZooID]
KeeperjKeeperID,
KeeperName, …ZooID} PK[KeeperID]
AnimaljAnimalCode,
AnimalName,…KeeperID} PK[AnimalCode]
Given the
presence of the MVD Animal → Keeper/Zoo, we need to refine the three relations
by removing prematurely inserted FK from Animal, and then apply Fagin's theorem
to introduce two new relations. The revised set of normalized relations follows
(note the use of continued use of meaning relation names and unique attribute
names):
ZoojZooID,
ZooName, …} PK[ZooID]
KeeperjKeeperID,
KeeperName, …KeeperZooID} PK[KeeperID]
AnimaljAnimalCode,
AnimalName, …} PK[AnimalCode]
AnimalKeeperMapjAK_AnimalCode,
AK_KeeperID} PK [AK_AnimalCode, AK_KeeperID]
AnimalZooMap
jAZ_AnimalCode, AZ_ZooID} PK [AZ_AnimalCode, AZ_ZooID]
As in the
previous subsection (with the CTT-Schedule problem), it should be obvious to
you what the foreign keys are in entities Keeper, AnimalKeeperMap, and
AnimalZooMap; construction of a detailed RAL for the stated problem is left as
an exercise for you.
Fifth Normal Form
So far we have
been treating relations that are decomposable into two other relations. In
fact, there are relations that cannot be so decomposed, but can be decomposed
into n other relations where n > 2. They are said to be n-decomposable
relations (n > 2). The fifth normal form (5NF) is also commonly referred to
as the projection-join normal form (PJNF) because it relates to these (n >
2) projections (of a relation not in 5NF) into decompositions that can be
rejoined to yield the original relation.
Recall the
SupplierSchedule relationship (linking suppliers, inventory items, and
projects) mentioned in Chapter 3 (Figures 3-3, 3-5, 3-13, and 3-17);it is
represented here:
SupplierSchedulejSuppl#,
Item#, Proj#} PK[Suppl#, Item#, Proj#]
Let's assume that
specific suppliers can supply specific items for specific projects. The
relation represents a M:M relationship involving Suppliers, Items, and
Projects. Observe the following features about the relation:
SupplierSchedule
is keyed on all attributes and therefore by definition is in BCNF. By
inspection, it is also in 4NF. There is no unresolved MVD as were the cases for
the CTT-Schedule problem and the zoo case of earlier discussions. Here, the
attributes are dependent on each other: suppliers supply inventory items for
various projects.
It is not
possible to decompose this relation into two other relations without losing
critical information.
If there are S
suppliers, N items, and J projects, then theoretically, there may be up to
S*N*J records. Not all of these may be valid: a supplier may supply specific
item(s) for specific project(s); not every item may be applicable to every
project; and a supplier does not necessarily support every project.
If we consider S
suppliers, each supplying N items to J projects, then it does not take much
imagination to see that a fair amount of duplication will take place, despite
the fact that the relation is in 4NF.
Let's examine a
possible decomposition of SupplierSchedule as shown in Figure 4-6. If we employ
the first two decompositions only, this will not result in a situation that
will guarantee us the original SupplierSchedule. In fact, if we were to join
these two decompositions (SI and IP), we would obtain a false representation of
the original relation. The third projection (PS) is absolutely necessary if we
are to have any guarantee of obtaining the original relation after joining the
projections.
Figure 4-6:
Illustrating possible decompositions of SupplierSchedule
As you examine
the figure, observe that first join produces SupplierSchedule plus additional
spurious tuples. The effect of the second join is to eliminate the spurious
tuples. To put it into perspective, SupplierSchedule is subject to a (time
independent) 3-decomposable (3D) constraint, namely
If (s, i) is in
SI and (i, p) is in IP and (p, s) is in PS, then (s, i, p) is in
SupplierSchedule.
This is an
example of a join dependency (JD) constraint. This constraint exists because of
the assumption made at the outset. If the prevailing business rules dictate
otherwise, then the constraint would likely be different.
4.10.1 Definition
of Join Dependency
A join dependency
(JD) constraint may be defined as follows:
Relation R
satisfies the JD P1, P2,… Pn iff R = P1 JOIN P2 JOIN… JOIN Pn where the
attributes of P1 … Pn are subsets of the attributes of R.
Relations that
are in 4NF, but not in 5NF (such as SupplierSchedule) suffer from duplication,
which in turn leads to modification anomalies. These problems are directly
related to the presence of the JD constraint(s) in such relations. Fagin's
theorem for 5NF relations provides the solution.
4.10.2 Fagin's
Theorem
Here now is
Fagin's theorem for the fifth normal form (5NF):
A relation R is
in 5NF (also called PJNF) iff every JD in R is a consequence of the candidate
keys of R.
In layman's
terms, if a relation R is in 4NF and it contains a JD that renders it
n-decomposable into P1, P2… Pn, such that R = P1 JOIN P2 … JOIN Pn where n >
2, then such relation is not in 5NF. It may therefore be decomposed to achieve
5NF relations.
Put another way,
a relation R is in 5NF iff it is in 4NF and it is not decomposable, except the
decompositions are based on a candidate key of R, and the minimum number of
projections is 3.
Now examine
relation SupplierSchedule. SupplierSchedule is not in 5NF because it has a JD
(i.e., the JD constraint) that is not a consequence of its candidate key. In
other words, SupplierSchedule can be decomposed, but this is not implied by its
candidate key [Supl#, Item#, Proj#]. We should therefore proceed with the
decomposition represented in Figure 4-6 in order to achieve 5NF. Finally, we
can refine the solution by observing the following strategies: assigning
meaningful relation names; treating suppliers, projects, and items as entities
instead of mere attributes; and observing the principle of unique attribute
names. Accordingly, a revised list of relations is provided below:
SupplierjSuppl#,
SupplName, …} PK [Suppl#]
InventoryItemjItem#,
ItemName, …} PK [Item#]
ProjectjProj#,
ProjName, …} PK [Proj#]
SuppItemMapjSI_Supp#,
SI_Item#, SI_Ref#} PK [SI_Ref#]
ItemProjMapjIP_Item#,
IP_Proj#, IP_Ref#} PK[IP_Ref#]
ProjSuppMapjPS_Supp#,
PS_Proj#, PS_Ref#} PK [PS_Ref#]
As for the
previous cases (for instance sections 4.9.2 and 4.9.3), you should be able to
easily identify the foreign keys; you should also be able to construct a
detailed RAL for the case (review Figure 3-17 of Chapter 3). Also note that for
relations SuppItemMap, ItemProjMap, and ProjSuppMap, a surrogate primary key
has been introduced in each case as an alternative to using a composite
candidate key consisting of the foreign keys.
Note
For most
practical purposes, you only have to worry about 5NF if you are trying to
implement an M:M relationship involving more than two relations. Once in 5NF,
further decompositions would share candidate keys and are therefore to no avail
(recall the corollary of Heath's theorem). Notwithstanding this, other normal
forms have been proposed, as will be discussed in section 4.12.




No comments:
Post a Comment