Working with SQL
The course is part of this learning path
In this course, we first learn how to use SQL to alter and delete tables and how to use primary and foreign keys. We examine common joins between tables in a relational database. We then delve into inner, outer, left and right joins before learning to manipulate data using the INSERT, UPDATE and DELETE functions.
This course is built from content created and delivered by Ian Wallington of QA.
Hello and welcome to this course where we'll introduce and explore data manipulation using joins in SQL. Now let's imagine somebody asked us to report on sales people. They would like to include the sales managers in the report which is not immediately possible from just the sales person table. Departments have managers, managers manage departments and sales people are in a department. That's how they are associated with a manager. So what we need to do is launch a query to retrieve data not just from the sales person table, but also from the department table. And a way to do this is as follows. There is a sub clause of the from clause which is called join. Join is part of the from clause because the from clause is needed to join the department table into this query. However, it is insufficient just to say that we need to specify what the link is between those two tables. The question is, what is the link between a sales person and a department tables? Every sales person is associated with a department number so we need to specify that as a condition. Now in this particular case there's a second join key word and it is "on". It is easiest to think about it as a join on condition that, so think about the on as an on condition that this join happens. It is a slightly strange wording again, a bit like the "in" operator. Now the condition is that the department numbers from the two tables must match or be equal, however, we have a new issue here, we're using the same name in each table. If I simply refer to dept number in here, it's going to be what's known as ambiguous. Its not a problem we've hit before because so far we've only been using one table. Any column in a query can be prefixed or qualified with it's table name. This particular case is utterly unnecessary but that is syntactically correct, dept dot table name. It's possible to fully qualify the name in Microsoft Sequel Server, I.e. database name dot table name, dot column name dot and so on. For my SQL, the database name dot table name dot column name applies. In Oracle it would be username because that would be the name of the schemer, so username dot table name dot column name. It is better to use this notation sooner rather than later. I can use the shorthand for this if I defined a table alias for that column. Now just noting that the as keyword here is optional. You generally don't use as in Oracle for example. Because this is Microsoft Sequel Server we will put it in. This notation is fine in a small schemer. A database with five tables with very different names. You still need to come up with a short fall and you might end up having to use an abstract type name. Some people like to use A, B, C, D or E or perhaps T1 for table one and T2 for table two. But here I have two tables so I'm going to use D as my table alias for department and SP as my table alias for sales person. Then in my join condition here, I can say that one of the department numbers is equal to the other one. It is unfortunately all too easy to put the same column on both sides of the equality operator which ends up in a rubbish result. So what we're saying here is as you work through the sales person table for each sales person, go and get the corresponding row from the department table and then you can access column values from that. So first of all I'm going to put dept number into this column list. Just to show you what happens if you refer to an ambiguous column. You will get an ambiguous column name error here, the wording will vary between the database management systems, but basically they say the same thing. It's not just in here where I have to qualify the department number, it's anywhere where I refer to any column. It means as long as you get the right table name you won't get errors. If you get to the wrong table name it will be an invalid column name.
The beauty of this is that without reference to any form of documentation, you can see which table is giving you which columns. Now strictly speaking, you could do that with the full table names for every column and not even have a table alias, but that's a lot more typing. One thing I might substitute by the way is to start using a more vertical format. People argue whether SQL is a programming language. It's not a programming language in the same sense as general purpose programming languages but it is a language and so that is a form of programming, and that is what I would call self documenting code. It helps us understand where things are coming from and this is telling me that it's coming from table called D. D is the dept table, that's listing the sales people and information from the department. I can now pick any column from either of these two tables for each of these sales people we want to get their sales. Now I should point out here that we have a one to many relationship. A department has potentially many sales people and then there's going to be a one to many relationship between sales person and sale as each sales person can potentially make many sales. So when I include a sale table, the number of rows in the result set is going to be the number of sales for those sales people. Department three is turning up three times because there are three sales people in department three. This is because SQL is known as a declarative language, it is not a series of instructions, do this, do this, do this, it's kind of a specification of what you want therefore it needs to be read as a whole, you need to read the whole thing. As an example, take the casserole out of the oven using the gloves provided. You need to read the entire statement if you want to avoid pain. You actually start with a select clause to make it clear that it is a query which is bringing data back. Logically, what is going to do is join the tables together and then say, which columns do you want? As a select clause, it's actually going to be evaluated after the from clause. So next step is to include the sales table. I'm going to start by including something from the sales table and clearly heading towards using SA as a table alias. The database system is complaining because I haven't included the sale table. The join subclause and from subclause can be repeated unlike other clauses. Occasionally, someone will try and put a where clause into a query and it doesn't work. I've now included the sale table and the sale table needs to be linked to one of the tables that are already in this from clause. As a rule about the order in which you put tables in the from clause, I could equally have written that as from dept as D, join sales person as SP. The order in which we specify the tables when joining two together is irrelevant. So strictly speaking in my SQL you can force it to access the tables in the the order that you specify. It's what's known as a natural join. In the simplest form, the optimizer in the database will choose an approach based on heuristics or heuristic data, the number of rows et cetera. But at this point I need to be able to join the sale to one or other of those tables with another "on" clause, so I can sort. So sale would be related to or linked to department items. There's nothing there that's direct related to sale. On the sale table on the other hand, there is an employee number and that is the primary key of the sales person table. That part of a diagram has the sale and contact tables in it the question is, what do we need to do to link sale to contact? Contact code is part of the answer. One way you can think of it is this is going to be a one to many relationship. A sale is made to one contact. Contact may have many orders so there may be many sales for one contact, a one to many relationship. One contact, many sales. At one end of the relationship you're looking for a unique key, probably the primary key, and at the other end you're looking for a foreign key. In the sale table, a sales person makes many sales so the employee number, the primary key of the sales person table, is a foreign key in the sale table. What we probably want now is to be able to accumulate the many sales to the contact as a foreign key. That represents the columns in the primary key of the contact table. So in order to identify a contact you need to specify both the company number and the contact code so the condition is ON CT dot company number is equal to SA dot company number and CT dot contact code is equal to SA dot contact code, and it is both and therefore and, you have to join the company table. The condition is that the company numbers are equal. For each table that you join, you have to specify the join criteria. Now we could choose to put department number into the sale table then we could join sales to dept but it would be a duplication that it could lead to inconsistencies where the sale person said it was one department but actually the sales person department was different. If a sales person was to move from one department to another you have to go through the sales, you would historically have information that would get rather complicated. What's happening here is that the contact key is computed and includes company number.
So we can link sale to company directly because of this computed key. The general recommendation is if you're writing queries that are going to be repeatedly run, especially if you are going to embed those queries in some kind of tool or program, then you want a specific list of columns anyway and that's what happens if you don't correctly specify the join condition in this composite key. Lets try just contact. First of all I want to highlight the number of rows in the result set, it is eight. There are eight rows in the sale table. Each sale appears once, as I think and hope you will expect from the query on sales. If I remove the company number equating condition, I end up with nine rows. Apparently order number 800, was a sale for Both Robber Red and Ricky Rambo. This is known as a partial cartesion product. Basically what it's done here is because there are two contracts with RR as the initials it's failed around that and linked things together that aren't actually related. This is what happens when you have composite keys and you don't fully specify the composite key in the relationship. If I joined on company number alone. Okay to help us summarize, lets just review some of our slides. In our scenario, somebody asked for the manager table to be included in our query, so we could see the sales manager name. We couldn't get that from the sales table alone so we joined the department table. But that didn't work alone either as we needed to specify what the criteria or criterion is on that join. That didn't work initially because we now had dept no and it appears in both tables. One workaround was that we could qualify the select with the table name, qualify or prefixed with the table name. That is a good practice to get into, particularly if you work in an environment where there are many people and lots of tables, coming up with meaningful yet short names is not easy, it's easy for us with our little five table schemer but that is a challenge with aliases. For the from clause think join as part of the from clause. It doesn't matter which way around you put them when there's two tables, when joining three or more tables you have to do it in some kind of logical sequence. You couldn't start with department and then join sale. The on clause relates one row of one table to one another one or more columns. That is effectively de-normalizing the data. Real time de-normalization. And yes, when you're relating tables together when you're linking tables together you are looking for a one to many relationship. You're looking for a foreign key in the table at the many end of the relationship, which will ultimately refer to a unique key. Ideally the primary key on the table at one end of the relationship. You end up with duplication in your result set but not in your base data. You can join multiple tables together starting with sales and then joining sales person, then department, then company. Each table occurs once and therefore each table you need to specify what the join condition is. Composite joins need specified keys otherwise you will end up with extra unwanted rows. Okay, that brings us to end of our introduction to joins. Close this lecture, I'll see you in the next one.
About the Author
Head of Content
Andrew is an AWS certified professional who is passionate about helping others learn how to use and gain benefit from AWS technologies. Andrew has worked for AWS and for AWS technology partners Ooyala and Adobe. His favorite Amazon leadership principle is "Customer Obsession" as everything AWS starts with the customer. Passions around work are cycling and surfing, and having a laugh about the lessons learnt trying to launch two daughters and a few start ups.