My Sql Note: normalisation

Defination : Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored. There are several benefits for using Normalization in Database.

Benefits:

Eliminate data redundancy

Improve performance

Query optimization

Faster update due to less number of columns in one table

Index improvement

First normal form (1NF):

We begin with a definition of the first normal form:

A relation is in the first normal form (1NF) if all its attributes are atomic (i.e., each attribute is defined on a single domain), and each record in the relation is characterized by a unique primary key value.

In defining atomic attributes, Codd was deliberate about eliminating repeating groups (the scenario where a column (attribute) could store an array or list of possible values); this was prevalent in CODASYL and COBOL. Atomicity means that each attribute is defined on a single domain and stores a single value for any tuple in the relation. Moreover, implicit in this definition is the enforcement of the entity integrity constraint (section 4.1). As such, tuples that have null values in their primary key will not be allowed to exist; neither will there be tuples with duplicate primary key values in the relation. (These are problems that persisted and perhaps still do outside of the context of 1NF.)

By this definition, all relations are in 1NF. This is by no means coincidental, but by design: we define a relation to consist of atomic attributes, and subject to the entity integrity constraint and the referential integrity constraint. However, as you will soon see, having relations in 1NF only is often not good enough.

Example 4-4 provides an example of a relation that is in 1NF but is poorly designed. It may surprise you to learn that this problem was once very widespread in accounting software systems. As an exercise, try proposing a more efficient design for the relation described in the example.

4.5.1 Problems with Relations in 1NF Only

In the previous section, R0 was presented as a non-normalized file to be replaced with three other relations. Let's revisit that discussion for further insight. For ease of reference, R0 is repeated here:

R0 {Suppl#, SupplName, Item#, ItemName, Quantity, SupplStatus, Location} PK [Suppl#, Item#]

Relation R0 is in 1NF only. However, it is undesirable to store it in its current form due to a number of problems. For ease of reference, the functional dependencies of R0 as illustrated in Figure 4-2 are repeated here:

FD1: Suppl# → {SupplName, SupplStatus, Location}

FD2: Item# → ItemName

FD3: [Suppl#, Item#] → {Quantity, SupplName, SupplStatus, Location, ItemName}

Figure 4-3 shows some sample data depicting various shipments of inventory items for R0. A cursory analysis of the data will amplify the following data anomalies:

Replication of Data: Every time we record a supplier-item pair, we also have to record the supplier name and item name.

Insertion Anomaly: We cannot insert a new item until it is supplied; neither can we insert a new supplier until that supplier supplies some item.

Deletion Anomaly: We cannot delete an item or a supplier without destroying an entire shipment, as well as information about a supplier's location.

Update Anomaly: If we desire to update a supplier's location or item name, we have to update several records (in fact, an entire shipment) due to the duplication problem.

Insertion, deletion, and update anomalies constitute modification anomalies, caused by duplication of data due to improper database design. As stated earlier, these problems are exacerbated as more data are added, thus leading to other problems of data access and database performance — a tangled hellish web.

Second Normal Form: The second normal form draws in the concept of functional dependence to shape an elevated benchmark beyond mere 1NF requirement. Here is the definition:

A relation is in the second normal form (2NF) if it is in 1NF and every non-key attribute is fully functionally dependent on the primary key.

By non-key attribute, we mean that the attribute is not part of the primary key. Relation R0 (of the previous section), though in 1NF, is not in 2NF, due to FD1 and FD2. Using Heath's theorem, we may decompose relation R0 as follows (note that the abbreviation PK is used to denote the primary key):

R1{Suppl#, SuppName, Location, SupplStatus} PK[Suppl#]

R2{Item#, ItemName} PK[Item#]

R3{Suppl#, Item#, Quantity} PK[Suppl#, Item#]

We then check to ensure that the resulting relations are in 2NF. Relation R1 has a single attribute as its primary key, and so does R2; there is therefore no possibility of either relation being in violation of 2NF. As for relation R3, there is only one non-key attribute and it is dependent on the primary key. We may therefore conclude with confidence that all three relations (R1, R2, and R3) are in 2NF.

So based on the definition of 2NF, and on the authority of Heath's theorem, we would replace R0 with R1, R2, and R3. Please note the consequences of our treatment of R0 so far:

The problems with relations in 1NF only have been addressed.

By decomposing, we have introduced foreign keys in relation R3.

JOINing is the opposite of PROJecting. We can rebuild relation R0 by simply JOINing R3 with R1 and R3 with R2, on the respective foreign keys.

From the definition of 2NF, two observations should be obvious. Firstly, if you have a relation with a single attribute as the primary key, it is automatically in 2NF. Secondly, if you have a relation with n attributes and n-1 of them form the primary key, the relation may very well be in 2NF but you must first verify this.

4.6.1 Problems with Relations in 2NF Only

In this example, relations R2 and R3 are in 2NF (in fact, they are in 3NF), but we still have potential problems with R1. What if we have a situation where there are several suppliers from a given location? Or what if we want to keep track of locations of interest? In either case, we would have modification anomalies as described below:

Insertion Anomaly: We cannot record information about a location until we have at least one supplier from that location.

Deletion Anomaly: We cannot delete a particular location without also deleting supplier(s) from that location.

Update Anomaly: If we wish to update information on a location, we have to update all supplier records from that location.

These problems can be addressed if we take the necessary steps to bring R1 into the third normal form (3NF). But first, we must define 3NF.

Third Normal Form:

The third normal form extends constraints related to functional dependence to non-key attributes (i.e. attributes that are not part of the primary key). The following are three alternate definitions:

A relation is in the third normal form (3NF) if it is in 2NF and no non-key attribute is fully functionally dependent on other non-key attribute(s).

Put another way, a relation is in 3NF if non-key attributes are mutually independent and fully functionally dependent on the primary key. (Two or more attributes are mutually independent if none of them is functionally dependent on any combination of the others.)

Put another way, a relation is in 3NF if it is in 2NF and every non-key attribute is non-transitively dependent on the primary key. (Non-transitivity implies mutual independence.)

Transitive dependence refers to dependence among non-key attributes. In particular, if A → B and B → C, then C is transitively dependent on A (i.e., A → C transitively).

Let's revisit relations R1, R2, and R3 of the previous section (and recall that these are decompositions of R0). Relation R1 is problematic because it is not in 3NF. If it is desirable to store additional information about the locations as indicated in the previous section, then we must be smart enough to discern that location is to be treated as an entity with attributes such as location code, location name, and perhaps others. We may therefore rewrite R1 as follows:

R1{Suppl#, SupplName, LocationCode, LocationName, SupplStatus} PK[Suppl#]

Now observe that LocationCode → LocationName! R1 therefore violates 3NF and must be decomposed. Using Heath's theorem, we may therefore decompose R1 as follows:

R4{Suppl#, SupplName, LocationCode} PK[Suppl#]

R5{LocationCode, LocationName} PK[LocationCode]

We now check to ensure that the relations are in 3NF (and they are). Again, please take careful notice of the consequences of our actions to this point:

The problems with relations in 2NF only have been addressed.

Again, by decomposing, we have introduced a foreign key in relation R4.

We can obtain the information that relation R1 represented by simply joining R4 with R5 on the foreign key.

From the definition of 3NF, it should be obvious that if you have a 2NF relation with one candidate key and n mutually independent non-key attributes, or only one non-key attribute, that relation is in 3NF.

4.7.1 Problems with Relations in 3NF Only

Relations R2, R3, R4, and R5 above are all in 3NF. However, it has been found that 3NF-only relations suffer from certain inadequacies. It is well known that 3NF does not deal satisfactorily with cases where the following circumstances hold:

There are multiple composite candidate keys in a relation.

The candidate keys overlap (they have at least one attribute in common).

For these situations, the Boyce-Codd normal form (BCNF) provides the perfect solution. As you shall soon see, the BCNF is really a refinement of 3NF. In fact, where the above-mentioned conditions do not hold, BCNF reduces to 3NF.

Boyce-Codd Normal Form:

The Boyce-Codd normal form (BCNF) was developed thanks to the effort of Raymond F. Boyce and Edgar F. Codd. This normal form is really a refinement of the third normal form. Simply, the BCNF requirement states

A relation is in BCNF iff every determinant in the relation is a candidate key.

Drawing from subsection 4.4.1, a determinant is an attribute (or group of attributes) on which some other attribute(s) is (are) fully functionally dependent. Examination of R2, R3, R4, and R5 above will quickly reveal that they are in BCNF (hence 3NF). We therefore need to find a different example that illustrates the importance of BCNF.

Consider the situation where it is desirous to keep track of animals in various zoos, multiple keepers, and the assigned keepers for these animals. Let's tentatively construct the relation R6 as

R6{Zoo, Animal, Keeper}

Assume further that that a keeper works at one and only one zoo. We can therefore identify the following FDs:

[Zoo, Animal] → Keeper

Keeper → Zoo

Given the above, we conclude that [Zoo, Animal] is the primary key. Observe that R6 is in 3NF but not in BCNF, since Keeper is not a candidate key but is clearly a determinant. Using Heath's theorem, we may decompose R6 as follows:

R7{Animal, Keeper} PK[Animal]

R8{Keeper, Zoo} PK[Keeper]

Finally, let's clean up by making the observation that Zoo, Keeper, and Animal are really disguised entities, not mere attributes (similar to Location in the previous section), and should be treated as such. Minimally, each would require a code and a name. Below is the revised list of normalized relations for the case:

R7{ZooID, ZooName, …} PK[ZooID]

R8{KeeperID, KeeperName, …ZooID} PK[KeeperID]

R9{AnimalCode, AnimalName, …KeeperID} PK[AnimalCode]

Notice that the test for BCNF is quite straightforward: you simply check to ensure that the only determinant in each relation is a candidate key. As on previous occasions, let's examine the consequences of our action:

By achieving BCNF, we benefit from further reduction in data duplication and modification anomalies.

A further advantage is that we can now store dangling tuples. In our example, a keeper can be assigned to a zoo even before he/she is assigned an animal.

One drawback of this schema is that we cannot store an animal that has not been assigned a keeper or zoo. The schema can be further refined but that requires a discussion of the fourth normal form (4NF); this will be done in the upcoming section.

One possible drawback with BCNF is that more relations have to be accessed (joined) in order to obtain useful information. Again referring to the example, R9 must be joined with R8 in order to derive Zoo-Animal pairs; and to obtain the name of the zoo (and no doubt other information), a second join must be made with R7. In light of the fact that logical views can facilitate a seamless end user perspectives that obscure these underlying joins (more on this in Chapter 13), and processing power continues on its upswing (according to Moore's Law), the benefits of achieving BCNF tend to outweigh the drawbacks by a large margin.

Observe:

The principle of BCNF is very simple but profound. By being guided by it, you can, in many circumstances, actually bypass obtaining 2NF and 3NF relations, and move directly into a set of BCNF relations. Adopting this approach will significantly simplify the analysis process. Moreover, in most practical situations, you will not be required to normalize beyond BCNF. This approach will be further clarified in the next chapter.

Fourth Normal Form:

The fourth normal form (4NF) relates to the situation where mutually independent but related attributes form a relation and the inefficient arrangement causes duplication and hence modification anomalies. Consider the relation CTT-Schedule, representing course-teacher-text combinations in an educational institution. Assume the following

business rules:

A course can be taught by several teachers.

A course can require any number of texts.

Teachers and texts are independent of each other; in other words, the same texts are used irrespective of who teaches the course.

A teacher can teach several courses.

Figure 4-4 provides some sample data for the purpose of illustration. Notice the high level of duplication due to the structure of the relation and the prevailing business rules.

Figure 4-4: CTT schedule

Note that the theory so far does not provide a method of treating such a situation, except flattening the structure (by making each attribute part of the primary key) as shown below:

R10{Course, Teacher, Text} PK[Course, Teacher, Text]

Since R10 is keyed on all its attributes, it is in BCNF. Yet, two potential problems are data redundancy and modification anomalies (the former leading to the latter). In our example, in order to record that Calculus II is taught by both Professor Cross and Professor Maitland, four records are required. In fact, if a course is taught by p professors and requires n texts, the number of records required to represent this situation is p*n. This is extraordinary, and could prove to be very demanding on storage space, especially as the number of records increases.

Relation R10, though in BCNF, is not in 4NF, because it has a peculiar dependency called a multi-valued dependency (MVD). In order to state the 4NF, we must first define MVD.

4.9.1 Multi-Valued Dependency

A multi-valued dependency (MVD) is defined as follows:

Given a relation R(A, B, C), the MVD A -» B (read "A multi-determines B") holds iff every B-value matching a given (A-value, C-value) pair in R depends only on the A-value and is independent of the C-value.

Further, given R(A B C), A-» B holds iff A -» C also holds. MVDs always go together in pairs like this. We may therefore write A -» B/C.

MVDs do not occur as frequently as FDs. Moreover, as a consequence of the definition of MVD, it is important to note the following points:

For MVD, at least three attributes must exist.

All FDs are MVDs but all MVDs are not necessarily FDs.

A -» B reads "A multi-determines B" or "B is multi-dependent on A."

Let's get back to R10: Course -» Text/Teacher. Note that Course is the pivot of the MVD. Course -» Teacher since Teacher depends on Course, independent of Text. Course -» Text since Text depends on Course, independent of Teacher. So how do we resolve this? Fagin's theorem provides the answer.

4.9.2 Fagin's Theorem:

Fagin's theorem (named after Ronald Fagin, who proposed it) may be stated as follows:

Relation R{A, B, C} can be non-loss decomposed into projections R1{A, B} and R2{A, C} iff the MVDs A -» B/C both hold.

Note that like Heath's theorem, which prescribes how to treat FDs, Fagin's theorem states exactly how to treat MVDs. With this background, we can proceed to defining the 4NF:

A relation is in 4NF iff whenever there exists an MVD, say A -» B, then all attributes of R are also functionally dependent on A.

Put another way, R{A, B, C…} is in 4NF iff every MVD satisfied by R is implied by the candidate key of R.

Put another way, R{A, B, C…} is in 4NF iff the only dependencies are of the form [candidate key] → [other non-key attribute(s)].

Put another way, R{A, B, C…} is in 4NF iff it is in BCNF and there are no MVDs (that are not FDs).

The last formulation is particularly instructive. To paraphrase, whenever you encounter a relation that has an unresolved MVD, use Fagin's theorem to resolve by replacing the MVD with equivalent FDs. In the current example, R10 is not in 4NF. This is so because although it is in BCNF, an MVD exists. Using Fagin's theorem, we may decompose it as follows:

R11{Course, Text} PK[Course, Text]

R12{Course, Teacher} PK[Course, Teacher]

Note

Fagin's theorem prescribes a method of decomposing a relation containing an MVD that is slightly different from the decomposition of an FD as prescribed by Heath's theorem. Figure 4-5 clarifies this.

Figure 4-5: Treating MVDs

As an additional step to resolving the CTT-Schedule problem, it is a good idea to refine the solution by applying the following strategies: replace euphemistic relation names (R11 and R12) with more meaningful names; recognize that Course, Text, and Teacher are actually entities, not mere attributes; return to the use of unique attribute names by using the prefixing strategy that was introduced toward the end of Chapter 3. A more refined solution is as follows:

Course{Crs#, CrsName, CrsCredits, …} PK[Crs#]

Teacher{TeacherEmp#, Emp_FName, Emp_LName, …} PK[TeacherEmp#]

Text{TextCode, TextTitle, …} PK[TextCode]

CourseTextMap{CT_Crs#, CT_TextCode} PK[CT_Crs#, CT_TextCode]

CourseTeacherMap{CR_Crs#, CR_TeacherEmp#} PK[CR_Crs#, CR_TeacherEmp#]

From this proposed solution, it should be obvious to you that the attributes of CourseTextMap and CourseTeacherMap are foreign keys. You could then construct a RAL (review the latter part of Chapter 3) that clearly outlines the database conceptual schema for the stated problem; this is left as an exercise for you.

4.9.3 The Zoo Revisited

Let's revisit the zoo problem of the previous section. It was mentioned that further refinement was needed to allow for storing animals not yet assigned to keepers or zoos. To facilitate this, we need to recognize the presence of an MVD:

Animal -» Keeper/Zoo

The partial solution given in section 4.8 is repeated here, with meaningful relation names introduced to replace the more cryptic (euphemistic) names used up to this point in the chapter.

ZoojZooID, ZooName, …} PK[ZooID]

KeeperjKeeperID, KeeperName, …ZooID} PK[KeeperID]

AnimaljAnimalCode, AnimalName,…KeeperID} PK[AnimalCode]

Given the presence of the MVD Animal → Keeper/Zoo, we need to refine the three relations by removing prematurely inserted FK from Animal, and then apply Fagin's theorem to introduce two new relations. The revised set of normalized relations follows (note the use of continued use of meaning relation names and unique attribute names):

ZoojZooID, ZooName, …} PK[ZooID]

KeeperjKeeperID, KeeperName, …KeeperZooID} PK[KeeperID]

AnimaljAnimalCode, AnimalName, …} PK[AnimalCode]

AnimalKeeperMapjAK_AnimalCode, AK_KeeperID} PK [AK_AnimalCode, AK_KeeperID]

AnimalZooMap jAZ_AnimalCode, AZ_ZooID} PK [AZ_AnimalCode, AZ_ZooID]

As in the previous subsection (with the CTT-Schedule problem), it should be obvious to you what the foreign keys are in entities Keeper, AnimalKeeperMap, and AnimalZooMap; construction of a detailed RAL for the stated problem is left as an exercise for you.

Fifth Normal Form

So far we have been treating relations that are decomposable into two other relations. In fact, there are relations that cannot be so decomposed, but can be decomposed into n other relations where n > 2. They are said to be n-decomposable relations (n > 2). The fifth normal form (5NF) is also commonly referred to as the projection-join normal form (PJNF) because it relates to these (n > 2) projections (of a relation not in 5NF) into decompositions that can be rejoined to yield the original relation.

Recall the SupplierSchedule relationship (linking suppliers, inventory items, and projects) mentioned in Chapter 3 (Figures 3-3, 3-5, 3-13, and 3-17);it is represented here:

SupplierSchedulejSuppl#, Item#, Proj#} PK[Suppl#, Item#, Proj#]

Let's assume that specific suppliers can supply specific items for specific projects. The relation represents a M:M relationship involving Suppliers, Items, and Projects. Observe the following features about the relation:

SupplierSchedule is keyed on all attributes and therefore by definition is in BCNF. By inspection, it is also in 4NF. There is no unresolved MVD as were the cases for the CTT-Schedule problem and the zoo case of earlier discussions. Here, the attributes are dependent on each other: suppliers supply inventory items for various projects.

It is not possible to decompose this relation into two other relations without losing critical information.

If there are S suppliers, N items, and J projects, then theoretically, there may be up to S*N*J records. Not all of these may be valid: a supplier may supply specific item(s) for specific project(s); not every item may be applicable to every project; and a supplier does not necessarily support every project.

If we consider S suppliers, each supplying N items to J projects, then it does not take much imagination to see that a fair amount of duplication will take place, despite the fact that the relation is in 4NF.

Let's examine a possible decomposition of SupplierSchedule as shown in Figure 4-6. If we employ the first two decompositions only, this will not result in a situation that will guarantee us the original SupplierSchedule. In fact, if we were to join these two decompositions (SI and IP), we would obtain a false representation of the original relation. The third projection (PS) is absolutely necessary if we are to have any guarantee of obtaining the original relation after joining the projections.

Figure 4-6: Illustrating possible decompositions of SupplierSchedule

As you examine the figure, observe that first join produces SupplierSchedule plus additional spurious tuples. The effect of the second join is to eliminate the spurious tuples. To put it into perspective, SupplierSchedule is subject to a (time independent) 3-decomposable (3D) constraint, namely

If (s, i) is in SI and (i, p) is in IP and (p, s) is in PS, then (s, i, p) is in SupplierSchedule.

This is an example of a join dependency (JD) constraint. This constraint exists because of the assumption made at the outset. If the prevailing business rules dictate otherwise, then the constraint would likely be different.

4.10.1 Definition of Join Dependency

A join dependency (JD) constraint may be defined as follows:

Relation R satisfies the JD P1, P2,… Pn iff R = P1 JOIN P2 JOIN… JOIN Pn where the attributes of P1 … Pn are subsets of the attributes of R.

Relations that are in 4NF, but not in 5NF (such as SupplierSchedule) suffer from duplication, which in turn leads to modification anomalies. These problems are directly related to the presence of the JD constraint(s) in such relations. Fagin's theorem for 5NF relations provides the solution.

4.10.2 Fagin's Theorem

Here now is Fagin's theorem for the fifth normal form (5NF):

A relation R is in 5NF (also called PJNF) iff every JD in R is a consequence of the candidate keys of R.

In layman's terms, if a relation R is in 4NF and it contains a JD that renders it n-decomposable into P1, P2… Pn, such that R = P1 JOIN P2 … JOIN Pn where n > 2, then such relation is not in 5NF. It may therefore be decomposed to achieve 5NF relations.

Put another way, a relation R is in 5NF iff it is in 4NF and it is not decomposable, except the decompositions are based on a candidate key of R, and the minimum number of projections is 3.

Now examine relation SupplierSchedule. SupplierSchedule is not in 5NF because it has a JD (i.e., the JD constraint) that is not a consequence of its candidate key. In other words, SupplierSchedule can be decomposed, but this is not implied by its candidate key [Supl#, Item#, Proj#]. We should therefore proceed with the decomposition represented in Figure 4-6 in order to achieve 5NF. Finally, we can refine the solution by observing the following strategies: assigning meaningful relation names; treating suppliers, projects, and items as entities instead of mere attributes; and observing the principle of unique attribute names. Accordingly, a revised list of relations is provided below:

SupplierjSuppl#, SupplName, …} PK [Suppl#]

InventoryItemjItem#, ItemName, …} PK [Item#]

ProjectjProj#, ProjName, …} PK [Proj#]

SuppItemMapjSI_Supp#, SI_Item#, SI_Ref#} PK [SI_Ref#]

ItemProjMapjIP_Item#, IP_Proj#, IP_Ref#} PK[IP_Ref#]

ProjSuppMapjPS_Supp#, PS_Proj#, PS_Ref#} PK [PS_Ref#]

As for the previous cases (for instance sections 4.9.2 and 4.9.3), you should be able to easily identify the foreign keys; you should also be able to construct a detailed RAL for the case (review Figure 3-17 of Chapter 3). Also note that for relations SuppItemMap, ItemProjMap, and ProjSuppMap, a surrogate primary key has been introduced in each case as an alternative to using a composite candidate key consisting of the foreign keys.

Note

For most practical purposes, you only have to worry about 5NF if you are trying to implement an M:M relationship involving more than two relations. Once in 5NF, further decompositions would share candidate keys and are therefore to no avail (recall the corollary of Heath's theorem). Notwithstanding this, other normal forms have been proposed, as will be discussed in section 4.12.

My Sql Note

Pages

Thursday, 2 March 2017

normalisation

No comments:

Post a Comment