This is an intermediate level course suited to DBAs, system analysts, programmers, or anyone wanting to learn how to use SQL to solve business scenarios. It will help to have an understanding of Structured Query Language (SQL) syntax, functions, and regular expressions, or some experience with script-based programming languages. Our learning objectives for this course are to recognize and explain use cases where SQL can be applied in business situations, to explain and apply SQL procedures to import and parse text files, and to define and apply automation processes to parse and sort data into specific databases. We welcome all comments and feedback, so please feel free to reach out to us at support@cloudacademy.com with any questions or comments.
Data Source
The data used in this course can be found in this GitHub repository: https://github.com/cloudacademy/using-sql-data-files.
Transcript
Let's have a look at the database structure. There are four databases in this example. ATM_Central is where all the SQL will be executed and contains our reference data, like ATMs and their configurations, as well as master account data. ATM_AKC, ATM_CHC, and ATM_WNC are regional databases that represent remote data centers where order transactions for one of our customers will be created. In a production system these would be on remote linked servers. You will see how we will use dynamic SQL to specify which database tables to work with.

And now we need to get our data from the files. So the first thing we need to do is get the files to import. We will have a source directory where the files have been placed, maybe from FTP or attachments saved from emails. We will use xp_cmdshell to run a directory listing command in our source directory, looking for files with the CSV extension. Before we can run xp_cmdshell, we need to make sure that it's enabled.
For security reasons it's disabled by default. Let's run sp_configure to enable xp_cmdshell. There we go. Now, let's declare a varchar variable, SourceDir, to hold our source directory path, and a table variable called Files, with one field, FileName, to hold the output from our directory listing. We will use the /B, for bare, option in the DIR command to return just the file names, that is, the bare minimum. So if we run that and then select from our Files table variable, here we have the files that we are interested in.

Now that we have a list of files, we want to process them one at a time. We will need a variable for holding a file name and a way of looping through the Files table, one record at a time. SQL, and not just Microsoft SQL, has a construct called a CURSOR. This is essentially a while loop for stepping one record at a time through a data set produced by a select statement. For each field in the select statement, you must declare a variable to hold its value in the loop where you work with that value. The variables are read-only; any changes you make to a CURSOR variable are not magically written back to the data set. On each iteration of the while loop, the select fields are copied into the variables by the FETCH INTO clause. The while loop condition can be any boolean condition but typically relies on @@FETCH_STATUS, which will return a non-zero value when the loop has reached the end of the data set. When the while loop is finished, you must close the CURSOR and then deallocate it.

If the file is not already known, we import it. We can use the file name to determine its origin, that is, which customer sent it and what its format is, and then use the appropriate method for importing it. So if we now go and add our while loop with the CURSOR and, just to prove that it works, print out the file name variable, there we go: we're just printing the varchar variable.
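As a rough sketch of that pattern (the directory path here is purely illustrative, and the variable and cursor names are assumptions rather than the exact ones used in the course code), it might look something like this:

-- Enable xp_cmdshell, which is disabled by default for security reasons
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'xp_cmdshell', 1;
RECONFIGURE;

-- Collect the CSV file names from the source directory into a table variable
DECLARE @SourceDir varchar(255) = 'C:\Import\';      -- illustrative path
DECLARE @Files TABLE (FileName varchar(255));
DECLARE @Cmd varchar(500) = 'dir /B ' + @SourceDir + '*.csv';

INSERT INTO @Files (FileName)
EXEC master..xp_cmdshell @Cmd;

-- Step through the files one at a time with a cursor
DECLARE @FileName varchar(255);
DECLARE FileCursor CURSOR FOR
    SELECT FileName FROM @Files WHERE FileName IS NOT NULL;

OPEN FileCursor;
FETCH NEXT FROM FileCursor INTO @FileName;
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @FileName;    -- just proving the loop works for now
    FETCH NEXT FROM FileCursor INTO @FileName;
END;
CLOSE FileCursor;
DEALLOCATE FileCursor;

Note the WHERE FileName IS NOT NULL filter: xp_cmdshell returns a trailing NULL row, which we don't want to treat as a file.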
Now, there are two file formats we need to deal with. The first file is from the Bank of Mum and Dad and is comma separated. It's a CSV file with a carriage return line feed as its row separator. Here are some screenshots from Notepad++ where you can view the row ending characters. It has one line per ATM, with the denomination amounts in 50, 20, and 10 dollar columns. The second file is from World Bank and has one line per ATM and denomination combination, so if an ATM needs to have 20s and 50s, there will be two lines representing that information. Because the files are quite different, we will need two different processes to import them and, obviously, two different tables. So for our CSV files, where it's one line per ATM with the different denomination columns, we've got this table, and for our World Bank files we've got this table. To make life even easier, we've made the tables analogous to the files, so they have the same columns, and we've also made all the fields varchars, essentially text fields, so no data conversion has to go on during the initial import process.

We will use the built-in BULK INSERT command to import our file data. At a bare minimum, the BULK INSERT statement requires us to specify a destination table, a source file, a field delimiter, a row delimiter, and which row our data starts on. As our data file has a header row for the columns, the data we're interested in starts on row two, hence the FIRSTROW parameter equals two. Unfortunately, we won't be able to supply the file name as a parameter to BULK INSERT, so we will have to use dynamic SQL to build the command using the source directory and file name variables. Dynamic SQL in its simplest form is just building the SQL command as a string of text and executing that text with the sp_executesql function. We shall see later that sp_executesql can also accept parameters for input and output variables. So in its simplest form, we've got our source directory and file name, and an SQL text variable declared as nvarchar, which sp_executesql requires, and we're just going to run that. So it's been imported into our ATM_Schedule_Import_CSV table.

As we have two file formats, we will need to generalize our BULK INSERT statement to cope with both of them. The best way to do this is to create a stored procedure which takes the necessary parameters to handle both files. So I'm going to put the BULK INSERT into a stored procedure and, to make the procedure as generic as possible, give it five parameters. First, the file name, which is the full path of the file, that is, a combination of SourceDir and the file name. Second, the table name, the destination table. Third, the file type, which can be char or widechar: char is what you'd normally use for an ordinary ANSI text file, and widechar is for unicode files. Fourth, a FIELDTERMINATOR, because we've got two different types, either a comma or a tab character. And fifth, as we saw before, a ROWTERMINATOR, which is either going to be a carriage return line feed or just a line feed. So if we go and have a look at what this looks like, we can see here all our parameters concatenated as text into our SQL text to be executed by the sp_executesql function. We also wrap this in a TRY...CATCH block, so if there's any problem with the file, like the file being completely different to what we expect, and the import fails, then we catch the exception and log the error into our Error table along with the file name, so you can easily tell which file it was, and then return a result of zero. A stored procedure can return a single integer value, and here zero says that the import has failed.
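As a hedged sketch of what such a stored procedure could look like (the procedure name, the ImportError table, and its columns are assumptions made for this example, not necessarily the exact names used in the course code):

CREATE PROCEDURE ImportScheduleFile           -- illustrative name
    @FileName        varchar(500),            -- full path: SourceDir + file name
    @TableName       varchar(255),            -- destination staging table
    @FileType        varchar(20),             -- 'char' or 'widechar'
    @FieldTerminator varchar(10),             -- ',' or a tab character
    @RowTerminator   varchar(10)              -- '\r\n' or '\n'
AS
BEGIN
    DECLARE @SqlText nvarchar(max);

    -- BULK INSERT will not take the file name as a variable,
    -- so the whole statement is built as dynamic SQL
    SET @SqlText = N'BULK INSERT ' + @TableName +
                   N' FROM ''' + @FileName + N'''' +
                   N' WITH (DATAFILETYPE = ''' + @FileType + N''',' +
                   N' FIELDTERMINATOR = ''' + @FieldTerminator + N''',' +
                   N' ROWTERMINATOR = ''' + @RowTerminator + N''',' +
                   N' FIRSTROW = 2)';              -- row 1 is the header row

    BEGIN TRY
        EXEC sp_executesql @SqlText;
        RETURN 1;                                  -- import succeeded
    END TRY
    BEGIN CATCH
        -- Log the failure against the file name so the bad file is easy to find
        INSERT INTO ImportError (FileName, ErrorMessage, ErrorDate)
        VALUES (@FileName, ERROR_MESSAGE(), GETDATE());
        RETURN 0;                                  -- import failed
    END CATCH
END;

The caller checks the return value, so one bad file is logged and skipped rather than stopping the whole import run.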
Before we start manipulating the data, we need to make sure that it is what we assume it to be, i.e. the dates are dates and the numbers are numbers, and we need to get rid of blank lines. This means carrying out a few basic tests, logging any records that fail the tests, and removing them from the data set. First, delete lines without an ATM ID; the assumption is that they are blank lines. If the file was produced manually with Excel and someone deleted some lines at the end, you could end up with lines of just commas. Next, record and delete lines where the date can't be converted into a date type. We will use the ISDATE function to let SQL Server tell us if it recognizes the date text as a date. This is partly determined by the date format, which you can set for a session with the SET DATEFORMAT command. As the dates are all year, month, day, I have used SET DATEFORMAT ymd. You can run SET DATEFORMAT again at any time in your code if you need to change the format for different files. And then we need to record and delete lines where the numeric values can't be converted into a number type.

If we look at this integrity check, you can see the ISDATE check: if it returns zero, i.e. false, not a date, we log the record into our error table and then delete it. We do the same for the amounts. So we're checking the 50s column to see if it's a number. Obviously, if there's nothing there, we want to coalesce it: if it's a null value, we make it a zero before we submit it to the ISNUMERIC function. Once again, we log it if it fails, and then delete it. The same goes for the $20 and $10 columns, and we also want to check the total column.
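A minimal sketch of those checks against the CSV staging table might look like the following. The ATM_ID, ScheduleDate, and Amount50 column names, the ImportError table, and the sample file name are assumptions made for illustration; the real names come from the course files.

SET DATEFORMAT ymd;   -- the file dates are year, month, day

DECLARE @FileName varchar(255) = 'example.csv';   -- in practice this comes from the file cursor

-- 1. Blank lines: anything without an ATM ID is assumed to be a blank line
DELETE FROM ATM_Schedule_Import_CSV
WHERE COALESCE(ATM_ID, '') = '';

-- 2. Bad dates: log, then delete, rows whose date text is not recognized as a date
INSERT INTO ImportError (FileName, ErrorMessage, ErrorDate)
SELECT @FileName, 'Invalid date: ' + COALESCE(ScheduleDate, ''), GETDATE()
FROM ATM_Schedule_Import_CSV
WHERE ISDATE(ScheduleDate) = 0;

DELETE FROM ATM_Schedule_Import_CSV
WHERE ISDATE(ScheduleDate) = 0;

-- 3. Bad numbers: treat a missing value as zero, then log and delete anything non-numeric
INSERT INTO ImportError (FileName, ErrorMessage, ErrorDate)
SELECT @FileName, 'Invalid amount: ' + COALESCE(Amount50, ''), GETDATE()
FROM ATM_Schedule_Import_CSV
WHERE ISNUMERIC(COALESCE(Amount50, '0')) = 0;

DELETE FROM ATM_Schedule_Import_CSV
WHERE ISNUMERIC(COALESCE(Amount50, '0')) = 0;

-- ...and the same pattern is repeated for the 20, 10, and total columns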
The World Bank files are slightly different in that they don't have a column for each denomination, but have a denomination column and an amount column. So in some ways they are a little easier to check: you just need to check that the denomination and the amount are numeric, and that the date fields are dates. However, before we can check that the denomination has a valid numerical value, we need to pull the dollar symbol off the front of it. If you recall, this is our denomination field, so we will use the SUBSTRING function, start from character two, and take the next three characters, because the denomination is always going to be less than $1,000.

Next, we want to move our data into a generic staging table. We delete any previous records for this file first, so if we've seen the file before, its old records are removed. We also want to start casting our varchar values as the correct type of field they should be. We also need to take into account the different file structures, in that a World Bank ATM is spread over several lines depending on the denominations, so we will need an IF block to do the World Bank transformation. In this code block we will flatten out the World Bank rows into a single row per ATM using update statements. As the World Bank ATMs come in files that are specific to their geographical destination, we can determine which vault they belong to, that is, AKC, CHC, or WNC, by stripping that out of the file name, because we want to store that information separately. So the first thing we do is forget about the denominations and just insert a record for each ATM, one for each distinct file name and ATM reference, with the 100, 50, 20, and 10 columns as zeros. Then we move through our World Bank staging table, checking each denomination type and updating the appropriate column. For the Bank of Mum and Dad it's much simpler, as that initial staging table conforms to the generic staging table, so we just do an insert from that.

I just want to talk about the load schedule table structure. Instead of inserting schedule records, we will be updating existing blank records. A load schedule record represents one week. The week is denoted by a load start period, which is a date, and a week number; there are four weeks to a load period. Each schedule record is uniquely identified by an ATM number, a load start date, and a week number. Each record has canister fields for each day of the week, for example Monday 1 amount 1 through Monday 1 amount 8, through to Sunday 1 amount 1 through Sunday 1 amount 8. These amount fields are varchar and hold a canister denomination plus the amount to put into the canister. So if we just have a quick look here and select on that, we can see the ATM number, the load start period, the week number, and then we have these canister amounts. So here we have a schedule for a Tuesday. We have the denomination and the amount: three canisters of $20 and one canister of $50.
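To round out the World Bank transformation described a moment ago, here is a rough sketch of the dollar-sign stripping and the flattening step. The staging table names and columns here are illustrative assumptions, not the course's exact schema:

-- Strip the leading '$' so the denomination can be treated as a number
UPDATE ATM_Schedule_Import_WorldBank
SET Denomination = SUBSTRING(Denomination, 2, 3)
WHERE Denomination LIKE '$%';

-- 1. One row per ATM and file, with every denomination column zeroed
INSERT INTO ATM_Schedule_Staging (ATM_ID, ScheduleDate, FileName,
                                  Amount100, Amount50, Amount20, Amount10)
SELECT DISTINCT ATM_ID, ScheduleDate, FileName, 0, 0, 0, 0
FROM ATM_Schedule_Import_WorldBank;

-- 2. Fold each denomination row back into the matching column
UPDATE s
SET s.Amount50 = w.Amount
FROM ATM_Schedule_Staging AS s
JOIN ATM_Schedule_Import_WorldBank AS w
    ON  w.ATM_ID = s.ATM_ID
    AND w.ScheduleDate = s.ScheduleDate
    AND w.FileName = s.FileName
WHERE w.Denomination = '50';
-- ...repeated with Amount100, Amount20, and Amount10 for the other denominations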
Hallam is a software architect with over 20 years' experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile, and desktop technologies, he sees good quality, reliable data as the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how to leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.