This course focuses on the skills required to manage and maintain the indexing process for an Azure Cognitive Search solution. As data changes within a given data source, knowing how to rebuild an index or set up an indexing schedule becomes very important. Understanding all of the functions related to the indexing process is essential when you know there will be periodic updates to the underlying data source, and this course will teach you the skills to perform all of those functions.
Learning Objectives
- Manage re-indexing
- Rebuild indexes
- Schedule and monitor indexing
- Implement incremental indexing
- Manage concurrency
- Push data to an index
- Troubleshoot indexing for a pipeline
Intended Audience
- Developers who will be including full-text search in their applications
- Data Engineers focused on providing better accessibility to organizational data
- AI Engineers who will be providing AI combined with search functionality in their solutions
Prerequisites
Candidates for this course should have a strong understanding of data sources and the operational requirements around changes to those data sources. Candidates should also be able to use REST-based APIs and SDKs to build knowledge mining solutions on Azure.
Hi there. In this first video, we're gonna start by talking about how to get data into your index. Now there are two different methods: a push model as well as a pull model. In the push model, you're primarily handling data being pushed into your index via programmatic SDK calls, where you have complete control over what data goes into your index and when. Any data that's composed of JSON documents can be pushed, and there are no restrictions on how often you push your data into the index.
Now on the reverse side, your pull model uses what's called an indexer. Now this is a resource object that lives in your Azure search service, and it can pull data from a defined data source. A defined data source is going to be a standard SQL-based or NoSQL-based database that either exists inside of Azure as a PaaS service or exists as an IaaS deployment that you have clearly defined. There are gonna be restrictions based on the indexer definition. For example, your indexer cannot run for more than 24 hours at a time if your dataset happens to be large enough.
Now for a push model example, I've given you this GitHub repository, which was created by Microsoft and contains numerous different .NET-based solutions. You're gonna wanna go to the DotNetHowTo folder or solution, then go to the program file, and in that program file, go down to line number 91. Take a look at the IndexDocumentsBatch class. That's what actually takes care of uploading a set of JSON documents into the index of your choosing. Now this is a very, very simplified description of what's actually going on. The sample also creates its JSON documents inside of a code file, and obviously that is not gonna be the way that you would want to create your own solution. But this gives you a starting point for how to create your own programmatic push model.
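To make that concrete, here's a minimal sketch of what a push looks like with the Azure.Search.Documents SDK. The Hotel type, index name, endpoint, and key below are placeholders of my own, not values taken from the sample repository:

```csharp
using System;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

// Hypothetical document type; your index schema defines the real fields.
public class Hotel
{
    public string HotelId { get; set; }
    public string HotelName { get; set; }
}

public class PushExample
{
    public static void Main()
    {
        // Placeholder endpoint, index name, and admin key -- substitute your own.
        var client = new SearchClient(
            new Uri("https://<your-service>.search.windows.net"),
            "hotels-sample-index",
            new AzureKeyCredential("<admin-key>"));

        // Batch up documents and push them into the index in one call.
        var batch = IndexDocumentsBatch.Upload(new[]
        {
            new Hotel { HotelId = "1", HotelName = "Stay-Kay City Hotel" },
            new Hotel { HotelId = "2", HotelName = "Old Century Hotel" }
        });

        IndexDocumentsResult result = client.IndexDocuments(batch);
        Console.WriteLine($"Indexed {result.Results.Count} documents.");
    }
}
```

In a real solution, the documents would come from your own data layer rather than being constructed inline like this, which is exactly the simplification the sample repository makes as well.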
Now when it comes to a pull model, we're gonna wanna take a look at how it's handled inside of the Azure portal. I'm gonna start by showing you the Import data wizard, which is the easiest way to understand the pull model. And once you have an understanding of that wizard, we can actually talk about how to create the individual resources so that you have more control. So here I am inside of my Azure portal, and I'm already in my CloudAcademy search service. If we look up at the top, we have a button called Import data. This will kick off the import data wizard.
The very first thing it's gonna ask for is a data source. Now, you can either choose an existing data source or you can go through the process of creating a new one based on this dropdown of available options. And actually, since the last course that I created, there's a new one called SharePoint Online. That's a brand new data source that was not available three months ago. I already have a data source here, which is a SQL database, so we'll go ahead and use that.
The next step is going to ask for cognitive skills. We're gonna skip over that, as it's primarily designed for AI engineers, where you'd add enriching data on top of the data that's actually being imported. As I said, we're gonna just skip right over the cognitive skills. Then this is where we're actually going to define the index, and I'm just gonna click through this really quickly. You're gonna choose a set of fields, or a set of columns from your database. Because this is a SQL database, you're gonna make sure that your key is pointing to the primary key of your table. Then you're gonna choose which fields are gonna be retrievable.
Now this is a customer table, so I'm gonna choose first name, middle name, last name, company name, email address, actually not email address, and salesperson. I'll just leave it at that for now, because the big key piece of a pull-based system is the indexer. The indexer is actually going to be a scheduled job that takes all of the data from your database and puts it into your index based on the structure that you defined in the previous screen. So it's gonna create an indexer. You're gonna specify a one-time run or a schedule. We'll talk about schedules in another video, because schedules require, as you'll see here, a high watermark column.
Now that schedule is going to use the high watermark column to specify a stopping point and a starting point should the indexer stop prematurely or stop because of the 24-hour time limit. You also have the ability to specify deletion detection, and then there are a number of advanced options. But it's this indexer, this job, that's actually going to take care of pulling all of the data out of, in this case, a SQL table and putting it into your index.
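If you define the data source in code rather than in the wizard, those same high watermark and deletion settings are expressed as policies on the data source. Here's a minimal sketch, assuming an Azure SQL source with a rowversion-style HighWaterMark column and an IsDeleted soft-delete column; both column names and the data source name are hypothetical:

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

public class DataSourceExample
{
    public static void Main()
    {
        // Placeholder service endpoint and key.
        var indexerClient = new SearchIndexerClient(
            new Uri("https://<your-service>.search.windows.net"),
            new AzureKeyCredential("<admin-key>"));

        var dataSource = new SearchIndexerDataSourceConnection(
            "customers-sql",                             // hypothetical data source name
            SearchIndexerDataSourceType.AzureSql,
            "<sql-connection-string>",
            new SearchIndexerDataContainer("Customers")) // hypothetical table name
        {
            // High watermark column the indexer uses to pick up where it left off.
            DataChangeDetectionPolicy =
                new HighWaterMarkChangeDetectionPolicy("HighWaterMark"),
            // Soft-delete column that tells the indexer which rows to remove.
            DataDeletionDetectionPolicy =
                new SoftDeleteColumnDeletionDetectionPolicy
                {
                    SoftDeleteColumnName = "IsDeleted",
                    SoftDeleteMarkerValue = "1"
                }
        };

        indexerClient.CreateOrUpdateDataSourceConnection(dataSource);
    }
}
```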
Now if we go back to the search service, the indexer can be created directly here as well. So if you wanna have more control over how your index is created and what values are put into it, as well as more control over the creation of your indexer, you can do that here. And you can see I already have access to a very different screen, where I choose my specific data source, I can create skill sets or populate skill sets if I already have them created, give it a description, specify a schedule, and then all of the advanced settings are down here at the bottom.
Indexers can also be created and modified via the SDK, so if there's something you want to do, such as create a more detailed schedule or update a schedule, you may want to do that in the SDK programming model. Now the last thing to talk about is loading data incrementally into your index. This is something that's gonna happen periodically over time, and it's primarily required for either large datasets or highly flexible datasets, meaning datasets that are gonna change a lot.
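As an illustration of that SDK path, here's a minimal sketch of creating a scheduled indexer, reusing the hypothetical data source name from the sketch above; the indexer and index names are also assumptions:

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

public class IndexerScheduleExample
{
    public static void Main()
    {
        var indexerClient = new SearchIndexerClient(
            new Uri("https://<your-service>.search.windows.net"),
            new AzureKeyCredential("<admin-key>"));

        // Hypothetical names; the data source and index must already exist.
        var indexer = new SearchIndexer(
            name: "customers-indexer",
            dataSourceName: "customers-sql",
            targetIndexName: "customers-index")
        {
            // Run once an hour; the high watermark policy on the data source
            // keeps each run incremental instead of a full re-crawl.
            Schedule = new IndexingSchedule(TimeSpan.FromHours(1))
            {
                StartTime = DateTimeOffset.UtcNow
            }
        };

        indexerClient.CreateOrUpdateIndexer(indexer);
    }
}
```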
The best way to handle this is through the push API, doing it programmatically so that your developer has complete control over the changes. This can also be required if you know that your indexes need to be as close to real-time accurate as possible; in that case, the developer can actually monitor for changes to a particular table, row, column, what have you, and then immediately push the update into the index at that time. You do have the ability, though, using an indexer, to make incremental updates to your index as well. Now this is primarily gonna happen when you again have a large dataset, and you can handle it by partitioning your data into smaller data sources.
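For that near-real-time push scenario, the SDK lets you merge changed fields into an existing document rather than re-uploading the whole thing. A minimal sketch, with a hypothetical index name and key field:

```csharp
using System;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

public class IncrementalPushExample
{
    public static void Main()
    {
        // Placeholder endpoint, index name, and key -- substitute your own.
        var client = new SearchClient(
            new Uri("https://<your-service>.search.windows.net"),
            "customers-index",
            new AzureKeyCredential("<admin-key>"));

        // MergeOrUpload updates just the fields you send for an existing
        // document (matched on the key), or creates the document if it's new.
        var batch = IndexDocumentsBatch.MergeOrUpload(new[]
        {
            new SearchDocument
            {
                ["CustomerID"] = "1042",          // hypothetical key field
                ["CompanyName"] = "Contoso, Ltd." // only this field is changed
            }
        });

        client.IndexDocuments(batch);
    }
}
```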
The data source does not have to be tied to a table; it can be tied to a SQL query, so you can create a view that covers A through M and another one that covers N through Z, and then create a data source from the SQL query for each one of those views. That is an example of partitioning your data. You would then create a separate indexer for each one of those and set up a schedule, thereby also providing incremental changes; there's a code sketch of this below. Hopefully, this gives you a good understanding of how to get data into your indexes. Let's dive deeper into the indexing process in the upcoming videos.
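And here's that partitioning sketch, assuming two hypothetical SQL views named CustomersAtoM and CustomersNtoZ that split the table alphabetically, with a data source and a scheduled indexer per partition, all feeding the same index:

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

public class PartitionedIndexersExample
{
    public static void Main()
    {
        var indexerClient = new SearchIndexerClient(
            new Uri("https://<your-service>.search.windows.net"),
            new AzureKeyCredential("<admin-key>"));

        // Hypothetical view names -- one partition of the Customers table each.
        foreach (var view in new[] { "CustomersAtoM", "CustomersNtoZ" })
        {
            var dataSource = new SearchIndexerDataSourceConnection(
                $"{view.ToLower()}-ds",
                SearchIndexerDataSourceType.AzureSql,
                "<sql-connection-string>",
                new SearchIndexerDataContainer(view));
            indexerClient.CreateOrUpdateDataSourceConnection(dataSource);

            // Each partition gets its own scheduled indexer, all targeting
            // the same index, so the partitions load independently.
            var indexer = new SearchIndexer(
                name: $"{view.ToLower()}-indexer",
                dataSourceName: dataSource.Name,
                targetIndexName: "customers-index")
            {
                Schedule = new IndexingSchedule(TimeSpan.FromHours(1))
            };
            indexerClient.CreateOrUpdateIndexer(indexer);
        }
    }
}
```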
Brian has been working in the Cloud space for more than a decade as both a Cloud Architect and Cloud Engineer. He has experience building Application Development, Infrastructure, and AI-based architectures using many different OSS and Non-OSS based technologies. In addition to his work at Cloud Academy, he is always trying to educate customers about how to get started in the cloud with his many blogs and videos. He is currently working as a Lead Azure Engineer in the Public Sector space.