This course explores geographic information system (GIS) topics and how to query and analyze GIS data within the database environment of BigQuery GIS. We'll start off by defining what GIS is, discussing some key GIS concepts, and introducing some common GIS data types.
We'll then move on to common types of maps and introduce the concept of map projections. You'll get an introduction to Google's BigQuery tool, and finally, we'll put all these topics together and look at how you can perform analysis and visualization using SQL and Python in conjunction with BigQuery GIS.
If you have a use case analyzing and mapping geospatial data or anticipate one in the future, especially if your data is in a relational format (or already in BigQuery), then this course is ideal for you!
Please feel free to contact us at support@cloudacademy.com with any feedback you may have related to this course.
Learning Objectives
- Learn about Google BigQuery GIS and its concepts, as well as common GIS data formats
- Understand the common types of maps and the concept of map projections
- Learn about spatial query functionality and spatial visualization
- Understand how BigQuery GIS can be integrated with Python
Intended Audience
This course is intended for anyone who wants to:
- Leverage BigQuery GIS for their geospatial analytics needs
- Learn how to visualize data on maps
Prerequisites
To get the most out of this course, you should have basic familiarity with SQL, Python, and cloud computing, ideally Google Cloud Platform.
Now let's move on to spatial measurement functions, which is where we start to get into some of the analysis that you can do with these geography types within BigQuery GIS. These functions allow you to compute measurements of one or more geography objects. So, for example, you can compute the distance between two point geography values, or you can compute the area or perimeter of a polygon. You can also compute the length of a geography object such as a linestring.
So if we return for a minute to our Amtrak train route example, we could compute the length of the entire route from Boston to Washington. One important note about these spatial measurement functions is that the units returned are metric, so the values are given in meters or square meters, depending on the type of measurement being computed.
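To make that concrete, here's a minimal sketch of a length calculation. The routes table and its route_geo column are hypothetical stand-ins, not from the course materials:

```sql
-- Hypothetical table with one GEOGRAPHY linestring per route
-- (`my_project.gis_demo.routes` and its fields are assumptions).
SELECT
  route_name,
  ST_LENGTH(route_geo) AS route_length_meters,    -- length in meters
  ST_LENGTH(route_geo) / 1000 AS route_length_km  -- converted to kilometers
FROM `my_project.gis_demo.routes`
WHERE route_name = 'Boston to Washington';
```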
Let's return once again to our New York landmarks example. We now have both the Empire State Building and the Statue of Liberty in our table, and we can compute the distance between these two points. The ST_DISTANCE function takes two geography values as its arguments, separated by a comma.
In the SQL query, which is shown in the query editor portion of the Web UI, you can see that what we're actually doing is using subqueries to get the two geography values. The first subquery gets the landmark_geo field for the Empire State Building, and the second subquery gets the landmark_geo field for the Statue of Liberty.
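A sketch of what that query might look like is shown below; the landmarks table name is a placeholder, and the landmark_name and landmark_geo fields are inferred from the example as described rather than copied from the course:

```sql
-- Distance between two landmarks, each fetched with its own scalar subquery.
SELECT ST_DISTANCE(
  (SELECT landmark_geo
   FROM `my_project.gis_demo.landmarks`  -- assumed table name
   WHERE landmark_name = 'Empire State Building'),
  (SELECT landmark_geo
   FROM `my_project.gis_demo.landmarks`
   WHERE landmark_name = 'Statue of Liberty')
) AS distance_in_meters;
```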
These subqueries are almost identical; the only difference is which landmark we're pulling. As you can see from the results, the distance between these two landmarks is 8,239 meters, which is equivalent to about 8.24 kilometers. Of course, this is as the crow flies, or should we say as the pigeon flies, since it's New York? If you're actually traveling on subways, taxis, or ferries to get from one place to the other, the traveled distance will likely be different.
For the next example, we're gonna look at how you would use the ST_AREA function to calculate the relative size of ZIP codes within Los Angeles County. Again, we're using the table of ZIP code information from Google's public data. The zip_code_geom field is a geography polygon field, which is the input to the area function. I've filtered the ZIP code table for just Los Angeles County using a WHERE clause, and I have sorted the results in descending order of ZIP code area.
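The query might look something like the sketch below. The zip_code_geom field name matches the public zip_codes table as I understand it, but treat the county field and its exact filter value as assumptions:

```sql
-- ZIP code areas in Los Angeles County, largest first.
SELECT
  zip_code,
  ST_AREA(zip_code_geom) AS zip_area_sq_meters  -- ST_AREA returns square meters
FROM `bigquery-public-data.geo_us_boundaries.zip_codes`
WHERE county = 'Los Angeles County'             -- assumed county name format
ORDER BY zip_area_sq_meters DESC;
```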
As you can see in the results pane, ZIP code 90704 is the largest ZIP code by area in Los Angeles County, at 7.6 square meters. Now, this seems kind of small to me, but I do know that ZIP codes tend to be much smaller in high-population-density areas than they are in areas where the population is more spread out. And since Los Angeles can be quite dense population-wise, this could perhaps be correct. But it is good to use your intuition, look at these results, and ask: does this make sense? Does this pass the sniff test?
Our next type of spatial function is spatial joins, which are also known as predicates. These functions determine the relationship between two or more geographies and return Boolean output, which means that if the relationship exists, the function returns true, and if the relationship doesn't exist, the function returns false. And because the output of these functions is a set of true or false values, these functions are often used in the WHERE clause of a SQL statement. Don't worry if this seems a bit confusing now; it will make more sense when we start diving into examples.
This next slide lists the predicate functions, and hopefully just from looking at the names, it is a little more apparent what these functions do. For example, ST_EQUALS returns true if the geographies being compared are the same and false otherwise. Similarly, ST_INTERSECTS returns true if the point set intersection of the two geographies being compared is not empty. This means that true is returned if there is at least one point that appears in both input geographies. ST_DISJOINT is the opposite of ST_INTERSECTS, so it returns true if there are no points that appear in both of the geographies being compared and false otherwise. This means that, by definition, if ST_INTERSECTS is true, ST_DISJOINT must be false.
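Here's a small self-contained sketch you can run to see these predicates in action, using literal geographies built with constructor functions rather than any course table:

```sql
-- Predicates on literal geographies: a small square and two points.
WITH shapes AS (
  SELECT
    ST_GEOGFROMTEXT('POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))') AS square,
    ST_GEOGPOINT(1, 1) AS inside_point,   -- falls within the square
    ST_GEOGPOINT(5, 5) AS outside_point   -- falls outside the square
)
SELECT
  ST_EQUALS(square, square)            AS equals_self,        -- true
  ST_INTERSECTS(square, inside_point)  AS intersects_inside,  -- true
  ST_DISJOINT(square, inside_point)    AS disjoint_inside,    -- false
  ST_INTERSECTS(square, outside_point) AS intersects_outside  -- false
FROM shapes;
```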
Diving right into an example, let's use the ST_CONTAINS function to determine how many crimes occur in each ZIP code in each year in Chicago. We can do this by using the public dataset of Chicago crimes and combining that with our now familiar public dataset of ZIP codes. The first thing to notice about this query is that the ST_CONTAINS function is in the WHERE clause. The inputs for this function are two geographies. A value of true will be returned if all points of the second geography listed are inside of the first geography, and false otherwise.
So in our case, the first geography is the polygon that represents each ZIP code, and the second geography is the point location of each crime. If a crime occurred within the boundary of a particular ZIP code, a true is returned. A false is returned for that crime for all other ZIP codes.
Something worth noting here is that we're using a constructor function to create the geography point values for each crime. Because our crimes dataset doesn't have a geography field, we have to construct one from the longitude and latitude values using the ST_GEOGPOINT function. This is a good example of how different types of spatial analysis functions can be used together to build out a complex query. This query counts the true values for each ZIP code while also grouping on year, which gives us our final result: the number of crimes in each ZIP code for each year, sorted descending first by year and then by crime count.
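A sketch of the full query follows; the Chicago crime table path and its longitude, latitude, and year fields match the public dataset as I recall it, but verify the names before running:

```sql
-- Crime counts per ZIP code per year; ST_CONTAINS does the spatial join.
SELECT
  z.zip_code,
  c.year,
  COUNT(*) AS crime_count
FROM `bigquery-public-data.chicago_crime.crime` AS c,
     `bigquery-public-data.geo_us_boundaries.zip_codes` AS z
WHERE ST_CONTAINS(z.zip_code_geom,                        -- ZIP polygon
                  ST_GEOGPOINT(c.longitude, c.latitude))  -- constructed crime point
  AND c.longitude IS NOT NULL                             -- skip unlocated crimes
GROUP BY z.zip_code, c.year
ORDER BY c.year DESC, crime_count DESC;
```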
So we have the most recent year, with the ZIP code with the most crimes for that year, at the top of our results table. And as you can see in the results pane, for the year 2020, the ZIP code in Chicago with the most crimes according to this dataset was ZIP code 60620, which had almost 7,000 crimes as of October 2020.
Next, we're gonna talk about spatial transformation functions, which generate a new geography value based on their input. ST_BOUNDARY and ST_CENTROID are functions that take a single geography object as input, while some of the other functions on this list, such as ST_UNION, take multiple geography objects as input.
We're gonna do a pretty simple example for this class of functions, which is to input a set of polygons and have the centroid of each polygon returned. We're going to use a new public dataset, which is the census tract data for the state of Hawaii. It's a little bit aspirational, because Hawaii is a fun place to visit, and a fun place to think about visiting, especially when you're sitting at your computer watching a course on BigQuery GIS.
So in case you don't know, census tracts are subdivisions of counties and are designed to contain roughly the same number of inhabitants per census tract. But of course, the relative size can vary significantly based on population density.
So within this dataset, we have the tract_geom field, which is the geography polygon for each census tract. We can use the ST_CENTROID function to get the geography point of the centroid for each tract. This is a bit of a cheat, because this field is actually already in the public data as the internal_point_geo field. But it's a good way to check that the ST_CENTROID function is doing what we expect it to do.
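A sketch of that check might look like this, assuming the Hawaii census tract table lives at the path below (verify the dataset and field names against the public data):

```sql
-- Computed centroid next to the point already published in the table.
SELECT
  geo_id,
  ST_CENTROID(tract_geom) AS computed_centroid,  -- centroid we compute
  internal_point_geo      AS published_point     -- point already in the data
FROM `bigquery-public-data.geo_census_tracts.census_tracts_hawaii`;
```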
Our final spatial analysis example will be a clustering analysis using geography data. Clustering is a machine learning methodology that groups data based on common attributes. This is considered an unsupervised algorithm because there is no, quote, "right answer" you are trying to predict. The clusters are discovered in the course of running the analysis, with no guidance on how they should form beyond the algorithm itself.
The built-in algorithm for clustering in BigQuery GIS is called DBSCAN, which is a density-based clustering algorithm. And this is actually where it gets its name: the DB stands for density-based (DBSCAN is short for Density-Based Spatial Clustering of Applications with Noise), not database, as you might think since we've been talking about database queries. This means that while data points in high-density areas get clustered together, data points in low-density areas are classified as noise.
One nice thing about the DBSCAN algorithm is that you don't need to specify the number of clusters in advance. This is in contrast to k-means clustering, a method you may have heard of, where k is the number of clusters and is a required input for the analysis.
So, on to our example: the public data has a table of street trees in San Francisco, which we will use to create our clusters. First, we need to specify a few parameters that control how trees are assigned to clusters. This is a bit of a complex query, so let's take a minute or so to break it down. The query that's actually creating the clusters is in a subquery that we've called tree_clusters, using the WITH tree_clusters AS construction. This is something that you may not have seen before, but it allows you to define a subquery and then refer to it later on by that name.
So here we're selecting the tree IDs, the species, and some other information, including the legal status and caretaker. Then we're creating the geography point value using the ST_GEOGPOINT constructor function, taking the longitude and latitude of the tree, because in this example, we don't have a geography field in our tree table, so we have to construct it. That geography is then the first input into our ST_CLUSTERDBSCAN function.
The second input is 100, which is the 100-meter radius that we're using for each of our clusters. And the third input is 10, which specifies that we want a minimum of 10 trees in each cluster. Any grouping that would be smaller than 10 trees doesn't form a cluster; those trees are gonna be considered noise.
The output of that subquery is that every tree either gets assigned to a cluster or gets marked as noise. Then we take that subquery and group the trees together, counting how many trees we have in each cluster. We want to look at the clusters that are greater than zero, because the zero cluster, I believe, is the noise cluster, although I'm not 100% sure about that. Then we group on the cluster ID and order by the tree count, descending, so you see the cluster with the most trees, which has 524 trees, at the top of our results table.
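Putting the whole thing together, a sketch of the clustering query might look like this. The street_trees table path and field names follow the public dataset as I recall it, and the cluster_num > 0 filter mirrors the demo (on the assumption that cluster zero holds the noise points):

```sql
-- DBSCAN clustering of San Francisco street trees.
WITH tree_clusters AS (
  SELECT
    tree_id,
    species,
    legal_status,
    care_taker,
    ST_CLUSTERDBSCAN(
      ST_GEOGPOINT(longitude, latitude),  -- constructed point geography
      100,                                -- epsilon: 100-meter radius
      10                                  -- minimum of 10 trees per cluster
    ) OVER () AS cluster_num
  FROM `bigquery-public-data.san_francisco_trees.street_trees`
  WHERE longitude IS NOT NULL AND latitude IS NOT NULL
)
SELECT
  cluster_num,
  COUNT(*) AS tree_count
FROM tree_clusters
WHERE cluster_num > 0   -- drop the presumed noise cluster, as in the demo
GROUP BY cluster_num
ORDER BY tree_count DESC;
```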
Calculated Systems was founded by experts in Hadoop, Google Cloud, and AWS. Calculated Systems enables code-free capture, mapping, and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education on these complex technologies.