Blog.jooq.org

A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL

2017-01-12

Perhaps the most powerful SQL feature is the JOIN operation. It is the envy of all non-relational databases, because the concept is so simple, yet so universally applicable, when you want to “combine” two data sets.

Put simply, when joining two tables, you’re combining every row from one table with every row from another table, for which a certain predicate is true. An illustration from our SQL Masterclass shows it.

See also our recent article about using Venn diagrams to illustrate JOINs. The above illustration compares INNER JOIN with different OUTER JOIN operations, but those are far from all possibilities that we have. Let’s look at things from a more systematic angle.

Do note that whenever in this article I say “X happens before Y”, I mean that “X happens logically before Y”. The database optimiser may still choose to execute Y before X, because that is faster without changing the outcome. More information about the syntactic / logical order of operations here.

But let’s look into all the join types, individually!

CROSS JOIN

The most basic JOIN operation is really a cartesian product. It just combines every row from one table with every row from another table. The best example for cartesian products is given by Wikipedia, using a deck of cards for illustration where we’re “cross joining” the ranks table with the suits table:

In real-world scenarios, CROSS JOIN can be very useful when doing reports, for instance you could generate a set of dates (e.g. days in a month) and cross join that with all the departments in the database, to create a complete days/departments table. Using PostgreSQL syntax:

Imagine we have the following data:

The result would now look like this:

Now, for each days/departments combination, you could calculate the daily revenue for that department, or something similar.

Characteristics

A CROSS JOIN is a cartesian product, i.e. a product as in “multiplication”. The mathematical notation uses the multiplication sign to express this operation: A × B, or in our case: days × departments.

Just like with “ordinary” arithmetic multiplication, if one of the two tables is empty (size zero), then the result will also be empty (size zero). This makes total sense. If we combine the previous 31 days with 0 departments, we’ll get 0 days/departments combinations. Likewise, if we combine an empty date range with any number of departments, we also get 0 days/departments combinations.

In other words:

size(result) = size(days) * size(departments)

Alternative syntaxes

In the old days, before the ANSI JOIN syntax was introduced to SQL, people just wrote comma-separated lists of tables in the FROM clause to write a CROSS JOIN. The above query is equivalent to this one:

In general, I strongly recommend to use the CROSS JOIN keyword instead of comma-separated lists of tables, because if you intentionally want to do a CROSS JOIN, nothing communicates this intent (to the next developer!) better than using the actual keywords. So many things can go wrong, accidentally, with comma-separated lists of tables, including accidental cross joins. You don’t want those!

INNER JOIN (Theta-JOIN)

Building on top of the previous CROSS JOIN operation, INNER JOIN (or just simply JOIN, sometimes also called "THETA" JOIN) allows for filtering the outcome of the cartesian product by some predicate. Most of the times, we’re putting that predicate in an ON clause, and it could be something like this:

In most databases, the INNER keyword is optional, so I’m just omitting it in this article.

Notice how the INNER JOIN operation allows for putting arbitrary predicates in the ON clause, which is again very useful when doing reporting. Just like in the previous CROSS JOIN example, we’re combining all days with all departments, but then we retain only those days/departments combinations for which the department already existed, i.e. for which the department creation preceded the day.

Again, using this data:

The result would now look like this:

So, we’d get no results prior to January 10, those rows were filtered out.

Characteristics

INNER JOIN operations are filtered CROSS JOIN operations. This means that if one of the two tables is empty, then the result is also guaranteed to be empty. But unlike with CROSS JOIN, because of the predicate, we can always get less results than the CROSS JOIN would deliver.

In other words:

size(result) <= size(days) * size(departments)

Alternative syntaxes

While the ON clause is mandatory with the INNER JOIN operation, you’re not required to place JOIN predicates in there (although it is highly recommended from a readability perspective). Most databases will optimise the following, equivalent query in just the same way:

Of course, again, that is just obfuscating the query for the reader, but you may have your reasons, right? If we take this one step further, the following query is also equivalent, because most optimisers can figure out the equivalence and execute an INNER JOIN instead:

… and, as seen before, CROSS JOIN is just syntax sugar for comma-separated table lists. In this case, we retain the WHERE clause to get what people have frequently done prior to the introduction of ANSI JOIN syntax:

All of these syntaxes do the same thing, usually without performance penalty, but clearly, they’re all less readable than the original INNER JOIN syntax.

EQUI JOIN

Sometimes, e.g. in literature, you will hear the term EQUI JOIN where "EQUI" isn’t really meant as a SQL keyword, but just as a specific way of writing a special kind of INNER JOIN.

In fact, it is weird that "EQUI" JOIN is the special case, because it’s what we’re doing most in SQL, also in OLTP applications where we simply JOIN by primary key / foreign key relationship. For instance:

The above query selects all actors and their films. There are two INNER JOIN operations, one connecting actors to the corresponding entries in the film_actor relationship table (because actors can have many films and films can have many actors), and the other connecting each film_actor relationship with additional information about the film itself.

Characteristics

The characteristics of this operation are the same as those of the “general” INNER JOIN operation. An "EQUI" JOIN is still a cartesian product (CROSS JOIN) with a reduced set of results, i.e. containing only those actor/film combinations for which a given actor actually played in the given film.

So again, in other words:

size(result) <= size(actor) * size(film)

The result size can only be equal to the actor size multiplied by the film size, if every actor played in every film, which is unlikely.

Alternative syntaxes: USING

Again, as before, we could write the INNER JOIN operation without an actual INNER JOIN syntax, but CROSS JOIN or comma-separated table lists instead. That’s boring, but much more interesting are the following two alternative syntaxes, one of which is very very useful:

The USING clause replaces the ON clause and allows for listing a set of columns that have to be present in both sides of the JOIN operation. If you carefully design your database in the same way as the Sakila database, namely where every foreign key column(s) have the same names as their referenced primary key column(s) (e.g. actor.actor_id = film_actor.actor_id), then you can use USING for "EQUI" JOIN in at least these databases:

Derby

Firebird

HSQLDB

Ingres

MariaDB

MySQL

Oracle

PostgreSQL

SQLite

Vertica

These databases, unfortunately, don’t support the syntax:

Access

Cubrid

DB2

H2

HANA

Informix

SQL Server

Sybase ASE

Sybase SQL Anywhere

While this produces exactly (well almost) the same result as the ON clause, it’s much quicker to read and write. I said almost because some databases (and the SQL standard) specify that any column appearing in the USING clause loses its qualification. For instance:

Also, of course, this syntax is a bit limited. Sometimes, you have several foreign keys in your tables, not all of which can have the primary key column name(s). For instance:

If you want to join by ORIGINAL_LANGUAGE_ID, you’ll have to resort to the ON clause.

Alternative syntaxes: NATURAL JOIN

An more extreme and much less useful form of "EQUI" JOIN is the NATURAL JOIN clause. The previous example could be further “improved” by replacing USING by NATURAL JOIN like this:

Notice how we no longer need to specify any JOIN criteria, because a NATURAL JOIN will automatically take all the columns that share the same name from both tables that it joins and place them in a “hidden” USING clause. As we’ve seen before, as primary keys and foreign keys have the same column name, this appears quite useful.

Truth is: It isn’t useful. In the Sakila database, every table also has a LAST_UPDATE column, which is automatically being taken into consideration by NATURAL JOIN. The NATURAL JOIN query is thus equivalent to this one:

… which of course makes absolutely no sense at all. So, forget about NATURAL JOIN immediately (except for some very rare cases, e.g. when joining Oracle’s diagnostics views like v$sql NATURAL JOIN v$sql_plan, etc., for ad-hoc analytics)

OUTER JOIN

We’ve seen INNER JOIN before, which returns results only for combinations of the left / right table, for which the ON predicate yields true.

OUTER JOIN allows us to retain rowson the left / rigth side, for which we didn’t find a matching combination. Let’s go back to the days and departments example:

Again, the OUTER keyword is optional, so I’m omitting it in the examples.

This query is very subtly different from its INNER JOIN counter part in that it will always return at least one row per day, even if for a given day, there is no department that has already been created at that day. For instance: All departments were created on January 10. The above query will still return January 1-9:

In addition to the rows we got before in the INNER JOIN example, we now also have all the days from January 1-9 with NULL departments:

In other words, each day appears at least once in the result. LEFT JOIN does this for the left table, i.e. it retains all rows from the left table in the result.

Formally, a LEFT OUTER JOIN is an INNER JOIN with a UNION like this:

We’ll talk about NOT EXISTS further down, when talking about "SEMI" JOIN

RIGHT OUTER JOIN

RIGHT OUTER JOIN does exactly the opposite. It retains all the rows from the right table in the result. Let’s add more departments

The new departments 4 and 5 will not be in the previous results at all, because they were created on a day after January 31. But it will appear in a RIGHT JOIN result, because departments is the right table of the join operation, and all the rows from the right table will be retained.

Running this query:

Will yield something like this:

In most cases (I have to yet run into a situation where this isn’t the case), Each LEFT OUTER JOIN expression can be transformed into an equivalent RIGHT OUTER JOIN expression, and vice-versa. Because RIGHT OUTER JOIN is usually a bit less readable, most people use LEFT OUTER JOIN only.

FULL OUTER JOIN

Finally, there is also FULL OUTER JOIN, which retains all rows from both sides of the JOIN operation. In our example, this means that each day appears at least once in the result, just like each department appears at least once in the result.

Let’s take again this data:

And now, let’s run this query:

The result will now look something like this:

If you insist, formally, a LEFT OUTER JOIN is an INNER JOIN with a UNION like this:

Alternative syntaxes: “EQUI” OUTER JOIN

The above examples again used the “cartesian product with filter” kind of JOIN. Much more common, however, is the "EQUI" OUTER JOIN approach, where we join on a primary key / foreign key relationship. Let’s go back to the Sakila database example. Some actors haven’t played in any film, and we might want to query them like this:

This query will return all actors at least once, regardless if they played in a film. If we also want all films that do not have any actors (yet), we could use a FULL OUTER JOIN to combine the results:

And of course, this also works with NATURAL LEFT JOIN, NATURAL RIGHT JOIN, NATURAL FULL JOIN, but again, these aren’t useful at all, as we’d be joining again USING (..., LAST_UPDATE), which makes no sense at all.

Alternative syntaxes: Oracle and SQL Server style OUTER JOIN

These two databases had OUTER JOIN before the ANSI syntax was established. It looked like this:

That’s nice, given that at some point in time (in the 80s??), ANSI didn’t specify OUTER JOIN yet. But the 80s are more than 30 years ago. So, it’s safe to say this stuff is obsolete.

SQL Server did the right thing and deprecated (and later removed) that syntax a long time ago. Oracle still supports it for backwards compatibility reasons.

But nothing about this syntax is reasonable or readable. Don’t use it. Replace it with ANSI JOIN.

PARTITIONED OUTER JOIN

This is Oracle specific, but I must say, it’s a real shame none of the other databases have stolen the feature yet. Remember the CROSS JOIN operation that we used to combine each day with each department? Because, sometimes, that’s what we want in the result: All combinations, and if there is a match put also the matching values in the row.

That’s hard to describe in words, much more easy with an example. Here’s the query with Oracle syntax:

Unfortunately, PARTITION BY is used in various contexts with different meanings (e.g. for window functions). In this case, it means that we “partition” our data by the departments.department column, creating a “partition” for each department. Now, each partition will get a copy for each day, regardless if there’s a match in our predicate (unlike in the ordinary LEFT JOIN case, where we had a bunch of “department-less” days). The result of the above query is now this:

As you can see, I’ve visually created 5 partitions for the 5 departments. Each partition combines the department with each day, but unlike when doing a CROSS JOIN, we’re now getting actual LEFT JOIN .. ON .. results in case there is a match for the predicate. This is a really nice feature for reporting in Oracle!

SEMI JOIN

In relational algebra, there is a notion of a semi join operation, which unfortunately doesn’t have a syntax representation in SQL. If it did, the syntax would probably be LEFT SEMI JOIN and RIGHT SEMI JOIN, just like the Cloudera Impala syntax extension offers.

What is a "SEMI" JOIN?

When writing something like the fictional query below:

What we really mean is we want all actors that played in films. But we don’t want any films in the results, just the actors. More specifically, we don’t want each actor several times, once per film. We want each actor only once (or zero times) in the result.

Semi is latin for “half”, i.e. we implement only “half the join”, in this case, the left half.

In SQL, there are two alternative syntaxes that can emulate "SEMI" JOIN

Alternative syntax: EXISTS

This is the more powerful and a bit more verbose syntax

We’re looking for all the actors for which there exists a film, i.e. actors that played in a film. With this syntax (i.e. "SEMI" JOIN placed in the WHERE clause), it’s immediately clear that we can get each actor at most once in the result. There’s no actual JOIN in the syntax.

Nonetheless, most databases will be able to recognise that what’s going on here is really a "SEMI" JOIN, and not just an ordinary EXISTS() predicate. For instance, consider the Oracle execution plan for the above query:

Note how Oracle calls the operation “HASH JOIN (SEMI)” – the SEMI keyword is present here. The same is true for PostgreSQL:

Or SQL Server:

Apart from being the optimal solution in terms of correctness, there are also some performance benefits when using "SEMI" JOIN rather than INNER JOIN, as the database can stop looking for matches as soon as it found the first!

Alternative syntax: IN

IN and EXISTS are exactly equivalent "SEMI" JOIN emulations. The following query will produce the same plan in most databases (not MySQL), as the previous EXISTS query:

If your database supports both syntaxes for the "SEMI" JOIN operation, you may choose whichever you prefer from a stylistic point of view.

This is different with …

ANTI JOIN

In principle, "ANTI" JOIN is just the opposite of "SEMI" JOIN. When writing something like the fictional query below:

… what we’re doing is we take all the actors that didn’t play in any film. Unfortunately, again, SQL doesn’t have a built-in syntax for this operation, but we can emulate it with EXISTS:

Alternative syntax: NOT EXISTS

The following query has exactly the expected semantics:

(Dangerous) alternative syntax: NOT IN

Watch out! While EXISTS and IN are equivalent, NOT EXISTS and NOT IN are not equivalent. This is because of NULL values.

In this particular case, the following NOT IN query is equivalent to the previous NOT EXISTS query, because our film_actor table has a NOT NULL constraint on film_actor.actor_id

If, however, actor_id became nullable, then the query would be wrong. Don’t believe it? Try running this:

It will not return any records. Why? Because NULL is the UNKNOWN value in SQL. So, the predicate above is the same as this:

And because we cannot be sure whether actor_id is in a set of values of which one value is UNKNOWN (is it 4? Or 5? Or -1?), the whole predicate becomes UNKNOWN

Here’s a nice article about three-valued logic by Joe Celko, if you want to learn more.

So, I cannot say this enough:

Do not use NOT IN predicates in SQL, ever, unless you put constant, non-null values there.

— Lukas Eder. Just now.

Don’t even gamble on the presence of a NOT NULL constraint. Perhaps, some DBA might temporarily turn off the constraint to load some data and your query will be wrong for the time being. Just use NOT EXISTS. Or, in some cases…

(Dangerous) alternative syntax: LEFT JOIN / IS NULL

Strangely enough, some people prefer the following syntax:

It’s correct as we’re:

Joining films to their actors

Retaining all actors without films (LEFT JOIN)

Retaining only actors without films (film_id IS NULL)

Now, I personally don’t like this syntax, as it doesn’t communicate the intent of an "ANTI" JOIN at all. And chances are, it is slower, because your optimiser doesn’t recognise this as an "ANTI" JOIN operation (or in fact, it cannot formally prove that it probably is). So, again, use NOT EXISTS instead.

An interesting (but a bit outdated) blog post comparing the three syntaxes is this one here:
https://explainextended.com/2009/09/15/not-in-vs-not-exists-vs-left-join-is-null-sql-server

LATERAL JOIN

LATERAL is a relatively new keyword in the SQL standard, and it is supported in PostgreSQL and Oracle. SQL Server folks have had a vendor-specific alternative syntax using the APPLY keyword forever (which I personally much prefer). Let’s look at an example, using the PostgreSQL / Oracle LATERAL keyword:

Indeed, instead of a CROSS JOIN between all departments and all days, why not just generate the necessary days for each department directly? That’s what LATERAL does. It is a prefix to the right-hand side of any JOIN operation (including INNER JOIN, LEFT OUTER JOIN, etc.) that allows the right-hand side to access columns from the left hand side.

This of course has nothing to do with relational algebra anymore, because it imposes a JOIN order (from left to right). But sometimes, that’s OK and sometimes, your table-valued function (or subquery) is so complex, that’s the only way you can actually use it.

Characteristics:

The LATERAL keyword doesn’t really change the semantics of the JOIN type that it is applied to. If you’re running a CROSS JOIN LATERAL, the result size is still

size(result) = size(left table) * size(right table)

Even if the right table is produced on a per-row-of-the-left-table basis.

You can use LATERAL also with OUTER JOIN, in case of which the left rows will be retained even if your table function doesn’t return any rows on the right side.

Alternative syntax: APPLY

SQL Server didn’t choose the confusing LATERAL keyword, they introduced the APPLY keyword (more specifically: CROSS APPLY and OUTER APPLY) a long time ago, which makes more sense, because we’re applying a function to each row of a table. Let’s assume for a moment that we have a generate_series() function in SQL Server:

Then, we could use CROSS APPLY to call that function for each department:

What’s so nice about this syntax is that – again – we’re applying a function to each row of a table, and that function produces rows. Does it ring a bell? In Java 8, we would use Stream.flatMap() for that! Consider the following stream usage:

What’s going on here?

The DEPARTMENTS table is just a Java departments stream

We flatMap the departments stream using a function that produces tuples for each department

Those tuples include the department itself, and a day generated from a series of days starting with the department’s createdAt day

Same story! SQL CROSS APPLY / CROSS JOIN LATERAL is the same thing as Java’s Stream.flatMap(). In fact, SQL and streams aren’t too different anyway. For more info, read this blog post here.

MULTISET

Few databases implement this (in fact, only Oracle), but if you think about it, it’s an awesome JOIN type. One that creates nested collections. If all databases implemented it, we wouldn’t need ORMs!

A hypothetical example (that uses SQL standard syntax, not Oracle’s) looks like this:

The MULTISET operator takes a correlated subquery argument, and aggregates all its resulting rows in a nested collection. This works in a similar fashion as a LEFT OUTER JOIN (we’re getting all actors, and if they have films, we’re getting all their films too), but instead of duplicating all the actors in the result set, we’re collecting them into a nested collection.

Just like what we would be doing in an ORM, when fetching stuff into this structure:

Ignore the incompleteness of used JPA annotations, I just want to show the strength of nested collections. Unlike as in ORMs, the SQL MULTISET operator allows for collecting arbitrary results from correlated subqueries into nested collections – not just actual entities. This is million times more powerful than what any ORM could ever dream of.

Alternative syntaxes: Oracle

As I said, Oracle actually supports MULTISET, but you cannot create ad-hoc nested collections. For some reason, Oracle chose to implement nominal typing for these nested collections, rather than the usual SQL-style structural typing. So you have to declare your types in advance:

A bit more verbose, but still does the trick! Excellent!

Alternative syntaxes: PostgreSQL

For once, the awesome PostgreSQL is missing out on an excellent SQL standard feature, but there’s a workaround: arrays! And this time, we can use structural types, yay! So the following query will return a nested array of rows in PostgreSQL:

The result is everyone’s ORDBMS dream! Nested records and collections everywhere (in only two columns):

Tell me that you don’t find this exciting and I don’t know what is.

Conclusion

This was A Probably Incomplete, Comprehensive Guide to the Many Different Ways to JOIN Tables in SQL. I hope you’ve found 1-2 new tricks in this article. JOIN is only one of many very interesting SQL operations. If you’ve found this information useful, you will also like:

Our SQL Masterclass – where we go into these features (and many other features) in depth!

You Probably don’t Use SQL INTERSECT or EXCEPT Often Enough

Add LATERAL Joins or CROSS APPLY to Your SQL Tool Chain

Are You Using SQL PIVOT Yet? You Should!

Impress Your Coworkers by Using SQL UNPIVOT!

Filed under: sql Tagged: cross join, INNER JOIN, JOIN, JOIN USING, NATURAL JOIN, outer join, sql