This course focuses on the skills required to manage and maintain the indexing process for an Azure Cognitive Search solution. As data changes within a given data source, rebuilding an index or setting up an indexing schedule becomes essential. Understanding the functions related to the indexing process matters whenever the underlying data source receives periodic updates, and this course will teach you the skills to perform all of those functions.
Learning Objectives
- Manage re-indexing
- Rebuild indexes
- Schedule and monitor indexing
- Implement incremental indexing
- Manage concurrency
- Push data to an index
- Troubleshoot indexing for a pipeline
Intended Audience
- Developers who will be including full-text search in their applications
- Data Engineers focused on providing better accessibility to organizational data
- AI Engineers who will be providing AI combined with search functionality in their solutions
Prerequisites
Candidates for this course should have a strong understanding of data sources and the operational requirements around changes to those data sources. Candidates should also be able to use REST-based APIs and SDKs to build knowledge mining solutions on Azure.
Hi there, in this video we're going to take a look at how to schedule your indexes. Scheduling is controlled entirely by your indexer object, and the indexer is what flows the data into the index itself. So why would you schedule your indexes? First, and most importantly, your source data is probably going to change over time.
Now there are absolutely use cases, such as building an index on top of historical data that changes rarely or not at all, where a schedule isn't required and manual updates are enough. But most data, especially live data, is going to change over time, and you'll need to make sure that your index reflects all of those data source changes. Another reason is data volume: if your data source is so large, say terabytes or petabytes, that a single job run would exceed the maximum runtime of 24 hours, you'll need a schedule so that all of the data can be imported in batches. By default, the indexer sets a high-water mark and picks back up from where it left off on the next run.
A third reason would be an index that is being populated from multiple data sources. To prevent any kind of concurrency problems or conflicts, you may want to schedule your indexers so that the updates from each data source are applied separately.
Now, one note to keep in mind: indexers, by default, run only once, primarily during creation, though there is also an on-demand function. If you happen to be using the Import Data wizard in the Azure Portal, you can create a schedule at creation time. However, there are specific requirements, and they tie directly to your data source: the data source has to have the ability to expose modification dates, for example. If you define that during data source creation, then the schedule option will be available. As you can see here on the screen, you can set the schedule to hourly or daily, or give it a custom definition.
Now, one thing to keep in mind is that for hourly and daily, Azure decides what time of day the runs will occur. Custom gives you more control over how the schedule is defined, letting you specify an interval as well as a start date and time. However, this is a one-time option and can only be done in the Portal via the Import Data wizard. Any other schedule creation or updates will have to be done using the API.
So here's an example of what the REST API call would look like. It's a PUT against the indexers collection, as I've been talking about, and you specify your indexer name in the URL. When creating the schedule, you pass in three specific pieces of data: the name of the data source that the indexer runs against, the target index that the data will flow into, and of course the schedule itself, with an interval and a start time being the two primary pieces of data.
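To make the shape of that request concrete, here's a minimal Python sketch that builds the JSON body for the PUT call. The service, indexer, data source, and index names, along with the API version, are hypothetical placeholders; note that the interval is expressed as an ISO 8601 duration (Azure Cognitive Search accepts values between five minutes, PT5M, and one day, P1D).

```python
import json

# Hypothetical names -- substitute your own service, indexer,
# data source, and index names.
service = "my-search-service"
indexer_name = "hotels-indexer"
api_version = "2020-06-30"  # assumed API version

# The request is: PUT https://{service}.search.windows.net/indexers/{name}
url = (f"https://{service}.search.windows.net"
       f"/indexers/{indexer_name}?api-version={api_version}")

# The three pieces of data from the video: data source name,
# target index name, and the schedule itself.
payload = {
    "name": indexer_name,
    "dataSourceName": "hotels-datasource",
    "targetIndexName": "hotels-index",
    "schedule": {
        "interval": "PT2H",                    # run every two hours
        "startTime": "2024-01-01T00:00:00Z"    # optional first-run time
    }
}

body = json.dumps(payload)
```

The resulting `body` string would be sent with an `api-key` header to the URL above; the call itself is omitted here since it requires a live service.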
Now if you want to look at this from an actual developer's perspective, because I doubt you're creating your own REST API wrappers for Azure, you're probably going to use the .NET SDK or one of the others. Here, you create an IndexingSchedule object, where you define your start time, potentially an interval, and so on, and then you pass that object into a new SearchIndexer object, or update an existing SearchIndexer object, again supplying the data source name, the name of the index, and the schedule object you created.
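The object flow described above can be sketched with plain Python dataclasses that mirror the shape of the SDK's IndexingSchedule and SearchIndexer types. This is a stand-in for illustration only: the real classes live in the Azure SDKs (for example, Azure.Search.Documents.Indexes.Models in .NET), and the field names here are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Illustrative stdlib stand-ins mirroring the SDK object model;
# not the real Azure SDK classes.

@dataclass
class IndexingSchedule:
    interval: timedelta                    # how often the indexer runs
    start_time: Optional[datetime] = None  # first run; defaults to creation time

@dataclass
class SearchIndexer:
    name: str
    data_source_name: str     # data source the indexer reads from
    target_index_name: str    # index the data flows into
    schedule: Optional[IndexingSchedule] = None

# Build a schedule, then attach it to a new (or existing) indexer definition.
schedule = IndexingSchedule(interval=timedelta(hours=2),
                            start_time=datetime(2024, 1, 1))
indexer = SearchIndexer(name="hotels-indexer",
                        data_source_name="hotels-datasource",
                        target_index_name="hotels-index",
                        schedule=schedule)
```

In the real SDK you would then hand the indexer object to the client's create-or-update call; the pattern of composing the schedule first and passing it into the indexer is the same.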
So these are just some examples of how you would go about creating and modifying your schedules. As this is primarily a code-based function for Azure Cognitive Search, at the end of the course you will find a GitHub repository with a code sample showing how this can be done.
Brian has been working in the Cloud space for more than a decade as both a Cloud Architect and Cloud Engineer. He has experience building Application Development, Infrastructure, and AI-based architectures using many different OSS and Non-OSS based technologies. In addition to his work at Cloud Academy, he is always trying to educate customers about how to get started in the cloud with his many blogs and videos. He is currently working as a Lead Azure Engineer in the Public Sector space.