Pages

Wednesday, 8 July 2015

Joins

Joins:

In relational algebra, a join is the multiplication of two data sets followed by a restriction of the result so that only the intersection of the two data sets is returned.

The whole purpose of the join is to horizontally merge two data sets and produce a new result set from the combination by matching rows in one data source to rows in the other data source.
By merging the data using the join, the rest of the SQL SELECT statement, including the column expressions, aggregate groupings, and WHERE clause conditions, can access any of the columns or rows from the joined tables. These capabilities are the core and power of SQL.

Inner join:                                   Includes only matching rows

Left outer join:                           Includes all rows from the left table regardless of whether a match exists, and matching rows from the right table

Right outer join:                        Includes all the rows from the right table regardless of whether a match exists, and matching rows from the left table

Full outer join:                            Includes all the rows from both tables regardless of whether a match exists

Theta join:                                    Matches rows using a non-equal condition — the symbol shows the actual theta condition         (<,>,<=,>=,<>)

Cross join:                                     Produces a Cartesian product — a match between each row in data source one with each row from data source two without any conditions or restrictions



Inner Joins:

The inner join is by far the most common join. In fact, it’s also referred to as a common join, and was originally called a natural join by E. F. Codd. The inner join returns only those rows that represent a match between the two data sets. An inner join is well named because it extracts only data from the inner portion of the intersection of the two overlapping data sets.

Creating inner joins within SQL code:

Using T- SQL code, joins are specified within the FROM portion of the SELECT statement. The keyword JOIN identifies the second table, and the ON clause defines the common ground between the two tables.
The default type of join is an inner join, so the keyword INNER is optional. For clarity, however, I recommend always including it:

SELECT *
FROM Table1
[INNER] JOIN Table2
ON Table1.column = Table2.column;

Because joins pull together data from two data sets, it makes sense that SQL needs to know how to match up rows from those sets. SQL Server merges the rows by matching a value common to both tables. Typically, a primary key value from one table is being matched with a foreign key value from the secondary table. Whenever a row from the first table matches a row from the second table, the two rows are merged into a new row containing data from both tables.

Number of rows returned:

create table id1 (id int)
create table id2 (id int)

insert into id1 values(1),(1),(1),(1),(1),(2)

insert into id2 values(1),(1),(1),(1),(1),(3)

select COUNT(*) 'No of Rows Returned'
from id1 inner join id2 ON id1.id = id2.id

No of Rows Returned
-------------------
25

ANSI SQL 89 joins

A join is really nothing more than the act of selecting data from two tables for which a condition of equality exists between common columns. Join conditions in the ON clause are similar to WHERE clauses. In fact, before ANSI SQL 92 standardized the JOIN...ON syntax, ANSI SQL 89 joins (also called legacy style joins, old style joins, or even grandpa joins) accomplished the same task by listing the tables within the FROM clause and specifying the join condition in the WHERE clause.
The previous sample join between could be written as an ANSI 89 join as follows:
SELECT *
FROM Table1 , Table2
WHERE Table1.column = Table2.column;

 Best Practice
Always code joins using the ANSI 92 style. ANSI 92 joins are cleaner, easier to read, and easier to debug than ANSI 89 style joins, which leads to improved data integrity and decreases maintenance costs. With ANSI 89 style joins it’s possible to get the wrong result unless it’s coded very carefully. ANSI 89 style outer joins are deprecated in SQL Server 2008, so any ANSI 89 outer joins will generate an error.

Multiple data source joins

As some of the examples have already demonstrated, a SELECT statement isn’t limited to one or two data sources (tables, views, CTEs, subqueries, etc.); a SQL Server SELECT statement may refer to up to 256 data sources. That’s a lot of joins.



 Outer Joins:

Whereas an inner join contains only the intersection of the two data sets, an outer join extends the inner join by adding the nonmatching data from the left or right data set.
Some of the data in the result set produced by an outer join will look just like the data from an inner join. There will be data in columns that come from each of the data sources, but any rows from the outer-join table that do not have a match in the other side of the join will return data only from the outer-join table. In this case, columns from the other data source will have null values.

T-SQL code and outer joins

In SQL code, an outer join is declared by the keywords LEFT OUTER or RIGHT OUTER before the JOIN (technically, the keyword OUTER is optional):
SELECT *
FROM Table1
LEFT|RIGHT [OUTER] JOIN Table2
ON Table1.column = Table2.column;

Several keywords (such as INNER, OUTER, or AS) in SQL are optional or may be abbreviated (such as PROC for PROCEDURE). Although most developers (including me) omit the optional syntax, explicitly stating the intent by spelling out the full syntax improves the readability of the code.
There’s no trick to telling the difference between left and right outer joins. In code, left or right refers to the table that will be included regardless of the match. The outer-join table (sometimes called the driving table) is typically listed first, so left outer joins are more common than right outer joins. I suspect any confusion between left and right outer joins is caused by the use of graphical-query tools to build joins, because left and right refers to the table’s listing in the SQL text, and the tables’ positions in the graphical-query tool are moot.

Earlier versions of SQL Server extended the ANSI SQL 89 legacy join syntax with outer joins by adding an asterisk to the left or right of the equals sign in the WHERE clause condition. While this syntax worked through SQL Server 2000, it has been deprecated since SQL Server 2005. ANSI SQL 89 inner joins will still work, but outer joins require ANSI SQL 92 syntax.

Having said that, SQL Server supports backward compatibility, so if the database compatibility level is set to 80 (SQL Server 2000), then the ANSI 82 style outer joins still work.

Best Practice:

When coding outer joins, always order your data sources so you can write left outer joins. Don’t use right outer joins, and never mix left outer joins and right outer joins.
Note: Depending on the nullability of the keys and the presence of rows on both sides of the join, it’s easy to write a query that misses rows from one side or the other of the join.

Number of rows returned: 

create table id1 (id int)
create table id2 (id int)

insert into id1 values(1),(1),(1),(1),(1),(2)

insert into id2 values(1),(1),(1),(1),(1),(3)


Left Outer Join
select COUNT(*) 'No of Rows Returned'
from id1 left outer join id2 ON id1.id = id2.id

No of Rows Returned
-------------------
26

Right Outer Join:

select COUNT(*) 'No of Rows Returned'
from id1 right outer join id2 ON id1.id = id2.id

No of Rows Returned
-------------------
26

Placing the conditions within outer joins (ON Vs WHERE)

When working with inner joins, a condition has the same effect whether it’s in the JOIN clause or the WHERE clause, but that’s not the case with outer joins:
When the condition is in the JOIN clause, SQL Server includes all rows from the outer table and then uses the condition to include rows from the second table.
When the restriction is placed in the WHERE clause, the join is performed and then the WHERE clause is applied to the joined rows.
The following two queries demonstrate the effect of the placement of the condition.
In the first query, the left outer join includes all rows from table One and then joins those rows from table Two where OnePK is equal in both tables and Thing1’s value is New Thing. The result is all the rows from table One, and rows from table Two that meet both join restrictions:
SELECT Thing1, Thing2
FROM dbo.One
LEFT OUTER JOIN dbo.Two 
ON One.OnePK = Two.OnePK
AND One.Thing1 = ‘New Thing’;

Result:
Thing1 Thing2
--------------- ---------------
Old Thing NULL
New Thing Train
Red Thing NULL
Blue Thing NULL

The second query first performs the left outer join, producing the same four rows as the previous query but without the AND condition. The WHERE clause then restricts that result to those rows where Thing1 is equal to New Thing1. The net effect is the same as when an inner join was used (but it might take more execution time):
SELECT Thing1, Thing2
FROM dbo.One
LEFT OUTER JOIN dbo.Two
ON One.OnePK = Two.OnePK
WHERE One.Thing1 = ‘New Thing’;



Result:
Thing1 Thing2
--------------- ---------------
New Thing Train


Full outer joins

A full outer join returns all the data from both data sets regardless of the intersection. It is functionally the same as taking the results from a left outer join and the results from a right outer join, and unioning them together.


 Example:
The following example is a mock-up of such a situation and compares the full outer join with an inner and a left outer join. Table one is the primary table. Table Two is a secondary table with a foreign key that refers to table One. There’s no foreign-key constraint, so there may be some non matches for the outer join to find:
CREATE TABLE dbo.One (
OnePK INT,
Thing1 VARCHAR(15)
);

CREATE TABLE dbo.Two (
TwoPK INT,
OnePK INT,
Thing2 VARCHAR(15)
);
The sample data includes rows that would normally break referential integrity. The foreign key (OnePK) for the plane and the cycle in table Two do not have a match in table One; and two of the rows in table One do not have related secondary rows in table Two. The following batch inserts the eight sample data rows:
INSERT dbo.One(OnePK, Thing1)
VALUES (1, 'Old Thing'),
 (2, 'New Thing'),
 (3, 'Red Thing'),
 (4, 'Blue Thing');

INSERT dbo.Two(TwoPK, OnePK, Thing2)
VALUES(1,0, 'Plane'),
(2,2, 'Train'),
(3,3, 'Car'),
(4,NULL, 'Cycle');

An inner join between table One and table Two will return only the two matching rows:
SELECT COUNT(*) 'Count'
FROM dbo.One
INNER JOIN dbo.Two
ON One.OnePK = Two.OnePK;

Count
-----------
2

A left outer join will extend the inner join and include the rows from table One without a match:
SELECT COUNT(*) 'count'
FROM dbo.One
LEFT OUTER JOIN dbo.Two
ON One.OnePK = Two.OnePK;

count
-----------
4

A full outer join will retrieve every row from both tables, regardless of a match between the tables:
SELECT COUNT(*) 'Count'
FROM dbo.One
FULL OUTER JOIN dbo.Two
ON One.OnePK = Two.OnePK;

Count
-----------
6

Self-Joins

A self-join is a join that refers back to the same table. This type of unary relationship is often used to extract data from a reflexive (also called a recursive) relationship, such as organizational charts (employee to boss). Think of a self-join as a table being joined with a temporary copy of itself.


 Cross (Unrestricted) Joins

The cross join, also called an unrestricted join, is a pure relational algebra multiplication of the two source tables. Without a join condition restricting the result set, the result set includes every possible combination of rows from the data sources. Each row in data set one is matched with every row in data set two — for example, if the first data source has five rows and the second data source has four rows, a cross join between them would result in 20 rows. This type of result set is referred to as a Cartesian product.

(Theta) joins

A theta join (depicted throughout as [1]) is a join based on a non-equal on condition. In relational theory,conditional operators (=, >, <, >=, <=, <>) are called [1] operators. While the equals condition is technically a [1] operator, it is commonly used, so only joins with conditions other than equal are referred to as [1] joins.

Non-key joins:

Joins are not limited to primary and foreign keys. The join can match a row in one data source with a row in another data source using any column, as long as the columns share compatible data types and the data match.


 Set Difference Queries
A query type that’s useful for analyzing the correlation between two data sets is a set difference query, sometimes called a left (or right) anti-semi join, which finds the difference between the two data sets based on the conditions of the join. In relational algebra terms, it removes the divisor from the dividend, leaving the difference. This type of query is the inverse of an inner join. Informally, it’s called a find unmatched rows query.

Left set difference query
A left set difference query finds all the rows on the left side of the join without a match on the right side of the joins.

Using the One and Two sample tables, the following query locates all rows in table One without a match in table Two, removing set two (the divisor) from set one (the dividend). The result will be the rows from set one that do not have a match in set two.
The outer join already includes the rows outside the intersection, so to construct a set difference query use an OUTER JOIN with an IS NULL restriction on the second data set’s primary key. This will return all the rows from table One that do not have a match in table Two:

USE tempdb;
SELECT Thing1, Thing2
FROM dbo.One
LEFT OUTER JOIN dbo.Two
ON One.OnePK = Two.OnePK
WHERE Two.TwoPK IS NULL;

Table One’s difference is as follows:

Thing1 Thing2
--------------- ---------------
Old Thing NULL
Blue Thing NULL

Full set difference queries
I often use a modified version of this technique to clean up bad data during conversions. A full set difference query is the logical opposite of an inner join. It identifies all rows outside the intersection from either data set by combining a full outer join with a WHERE restriction that accepts only nulls in either primary key:

SELECT Thing1, Thing2
FROM One
FULL OUTER JOIN Two
ON One.OnePK = Two.OnePK
WHERE Two.TwoPK IS NULL
OR One.OnePK IS NULL;

  
Migration 2000 Sql Server To 2008 Sql Server:
Joins and unions are at the heart of SQL, so change here occurs slowly. The only item to watch for with joins and unions is the ANSI 89 style outer joins.
If you’re upgrading from SQL Server 2000 directly to SQL Server 2008, you should be warned that ANSI 89 style outer joins (*=, =*) were removed from SQL Server with version 2005. ANSI 89 style inner joins may be a legitimate syntax, but I still don’t recommend using them.

Additional Notes related to JOIN:

·         The following are three classic examples to display where Outer Join is useful. You will notice several instances where developers write query as given below.
SELECT t1.*
FROM Table1 t1
WHERE t1.ID NOT IN (SELECT t2.ID FROM Table2 t2)
GO
The query demonstrated above can be easily replaced by Outer Join. Indeed, replacing it by Outer Join is the best practice. The query that gives same result as above is displayed here using Outer Join and WHERE clause in join.
/* LEFT JOIN - WHERE NULL */
SELECT t1.*,t2.*
FROM Table1 t1
LEFT JOIN Table2 t2 ON t1.ID = t2.ID
WHERE t2.ID IS NULL


·         Tables cannot be joined directly on ntext, text, or image columns. However, tables can be joined indirectly on ntext, text, or image columns by using SUBSTRING. For example, SELECT * FROM t1 JOIN t2 ON SUBSTRING(t1.textcolumn, 1, 20) = SUBSTRING(t2.textcolumn, 1, 20) performs a two-table inner join on the first 20 characters of each text column in tables t1 and t2. In addition, another possibility for comparing ntext ortext columns from two tables is to compare the lengths of the columns with a WHERE clause, for example: WHERE DATALENGTH(p1.pr_info) = DATALENGTH(p2.pr_info)



No comments:

Post a Comment