Creating & Testing a Search Index
Difficulty
Intermediate
Duration
1h
Description

This course focuses on the skills necessary to implement a knowledge-mining solution with a focus on the Cognitive Search solution. The course will walk through how to create a Cognitive Search solution and how to set up the process for importing data. Once the data sources have been set up properly, the course will teach you how to create a search index and then how to configure it to provide the best results possible.

Learning Objectives

  • Create a Cognitive Search solution
  • Import from data sources
  • Create, configure, and test indexes
  • Configure AutoComplete and AutoSuggest
  • Improve results based on relevance
  • Implement synonyms

Intended Audience

  • Developers who want to include full-text search in their applications
  • Data engineers focused on providing better accessibility to organizational data
  • AI engineers that provide AI combined with search functionality in their solutions

Prerequisites

To get the most out of this course, you should:

  • Have a strong understanding of data sources and how data will be needed by users consuming a Cognitive Search solution
  • Be able to use REST-based APIs and SDKs to build knowledge-mining solutions on Azure

Transcript

Hi there. In this video, we're gonna be talking about the last piece of the search service base functionality for your cognitive search solution. And that specifically is the index and the corresponding indexer. We're gonna be looking at how to create an index and then how to test it after the indexer has done its work.

So, first of all, what exactly is a search index within the context of a search service? Cognitive Search stores searchable content used for full text and filter queries in a search index, meaning that you're bringing data from your data source and making it available in a very performant way to be returned back in a search query. Indexes contain search documents which could be rows of data inside of a SQL database or they could be actual documents such as Word documents, PowerPoint documents, and so on.

Conceptually, a document is a single unit of searchable data in your index. Meaning that when you perform a search, it is gonna return a list of results back, each one of them being a document within the corresponding index. An index is defined by a schema and saved to the service, and the fields collection is typically the largest part of that schema. Meaning that, at least in the case of a database, you're mapping columns to fields in the index and then letting the index determine how best to return them back to you.

There are a number of different attributes that you set at the field level, and you're gonna see these when we go and actually create the index in the portal. First, is it searchable? Should you be able to run a full-text search on that particular field in the index? Can it be filtered on? Meaning can you decide, "Hey, look, I want to pull only the documents that have a particular value in that field." Can you sort on that particular field? These are very, very common. Everyone should understand what they correspond to.

The next though is a little bit trickier, facetable. I actually had to look this up myself when I first started working with search services 'cause I had never really heard the term before. It's a field used as a "Hit Count" category. So let's use an example, Amazon. When you do a search inside of Amazon, on the left-hand side, you'll get a list of categories that the results will actually correspond to. And in certain circumstances, they will actually put numbers in parentheses next to each one of those categories. That is the hit count. Meaning that there was a field somewhere in the database that corresponded to the category that the different results returned.
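To make the hit-count idea concrete, here is a minimal sketch in plain Python that counts how many result documents fall into each category, which is conceptually what a facet gives you. The sample documents and the category values are made up for illustration; in a real solution the search service does this counting for you on the server.

```python
from collections import Counter

# Hypothetical search results; field names and values are illustrative only.
results = [
    {"Name": "Road Bike", "ProductCategoryID": "Bikes"},
    {"Name": "Mountain Bike", "ProductCategoryID": "Bikes"},
    {"Name": "Helmet", "ProductCategoryID": "Accessories"},
]


def facet_counts(documents, field):
    """Count how many documents share each value of the given field."""
    return Counter(doc[field] for doc in documents if field in doc)


counts = facet_counts(results, "ProductCategoryID")
for value, hits in counts.most_common():
    # Mimics the "Category (n)" display next to a result list.
    print(f"{value} ({hits})")
```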

You need to specify a key, the unique identifier for the index, just like you do in a database. It helps with the partitioning if your index is gonna be very sizeable. And then lastly, should the field actually be returned in the result? So you can actually have a field that you're returning that is not filtered on, is not sortable, and can't necessarily even be searched for, but it is data that's relevant to the search result.
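The attributes described above can be sketched as a field definition in the JSON shape that the Search REST API uses for an index. This is only an illustration under assumptions: the index name, the field names, and the `make_field` helper are hypothetical, and the code only builds and prints the definition rather than sending it to a service.

```python
import json


def make_field(name, edm_type, *, key=False, searchable=False, filterable=False,
               sortable=False, facetable=False, retrievable=True):
    """Build one field entry with the per-field attributes set explicitly."""
    return {
        "name": name,
        "type": edm_type,
        "key": key,
        "searchable": searchable,
        "filterable": filterable,
        "sortable": sortable,
        "facetable": facetable,
        "retrievable": retrievable,
    }


# Hypothetical product fields; the key field is a string that uniquely
# identifies each document.
fields = [
    make_field("ProductID", "Edm.String", key=True),
    make_field("Name", "Edm.String", searchable=True, sortable=True),
    make_field("ProductCategoryID", "Edm.String", filterable=True, facetable=True),
    # Returned with results, but not searchable, filterable, or sortable.
    make_field("ThumbnailUrl", "Edm.String"),
]

index_definition = {"name": "products-index", "fields": fields}

# An index needs exactly one key field.
key_fields = [f["name"] for f in fields if f["key"]]
if len(key_fields) != 1:
    raise ValueError("an index needs exactly one key field")

print(json.dumps(index_definition, indent=2))
```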

Now let's take a look at how to actually do this inside of the portal. Okay, here we are back in the portal. And as you can see, we are in our overview page for the search service that we've been working with. But before we dive into the index and the indexer, I wanna talk about something that is specific to the search service, that you're not gonna find with a lot of the other services inside of Azure. And that's the fact that the portal is not going to be your be-all and end-all when it comes to managing the pieces of your search service such as the indexes, the data sources, and the indexers. And that's because there are actually two distinct APIs for the search service.

So if we go and take a look at the Azure Cognitive Search documentation, look here under the reference area, and choose any one of the supported languages (we'll choose JavaScript, as that's my preferred language), you'll see that there are actually two distinct APIs here. The Management API is the one that actually creates the service, allows you to modify the service, configure the service, and so on, but it does not have any ability to work with indexes, indexers, data sources, or things of that nature.

In order to actually do that, you would use the Search API. And the Search API will not only allow you to manipulate your indexes, indexers, data sources, but it will also allow you to perform queries. So at the end of the day, the Search API is gonna be the more powerful tool. I bring this up because in the portal, when you go to actually, for example, create an index, it makes some assumptions, and let's talk about that.
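One way to picture the split between the two APIs is by the endpoints they live on. The sketch below only builds URLs and sends nothing. The host names and path shapes follow the public REST conventions as I understand them, while the API versions, the resource group, and the service and index names are assumptions for illustration that you should check against the current documentation.

```python
# Both API versions below are assumptions; check the current documentation.
MANAGEMENT_API_VERSION = "2020-08-01"
SEARCH_API_VERSION = "2020-06-30"


def management_url(subscription_id, resource_group, service_name):
    """URL for the service resource itself (create, scale, configure)."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Search/searchServices/{service_name}"
        f"?api-version={MANAGEMENT_API_VERSION}"
    )


def search_url(service_name, collection, item_name=None):
    """URL for content inside the service: indexes, indexers, datasources."""
    path = f"/{collection}" + (f"/{item_name}" if item_name else "")
    return (
        f"https://{service_name}.search.windows.net{path}"
        f"?api-version={SEARCH_API_VERSION}"
    )


# Hypothetical names, for illustration only.
print(management_url("<subscription-id>", "my-rg", "my-search"))
print(search_url("my-search", "indexes", "products-index"))
```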

I already have an index and we'll take a look at that in a second. But if we go and pull up the new index screen, the first thing you'll notice is it pre-populates a key field with the field name of ID. If I were to then have that run against my product table, whose key is product ID, it would actually fail to map the fields between the index and the database, because there is no such column called ID. The portal is just a simple easy way for you to get started. You actually have to do the manual mapping yourself. However, there is one way to get around that, and that's to use the "Import Data" wizard that we talked about. It will actually pull the column names from your database and pre-populate all the fields with each one of them. And then you have the ability to change the attributes, remove them as necessary.

In a standard manual circumstance though, you're gonna have to write a code file in JavaScript, C#, Java, or Python to actually do that work, because the portal exposes just a very small subset of what the Search API provides. So I just wanted to make sure to call that out so that you don't believe that you can do everything that's possible from inside of the Azure portal.

Now let's take a quick look at our indexes. I've already created one to save time on all of the steps for adding the different fields. This index has already been created and populated. Matter of fact, you can see up here at the top that we've got 295 documents, meaning 295 records were found in the database. We go to the fields section and this is where I have my mappings. And you'll see each one of these field names corresponds to an actual column in the database, because that is a requirement for the mapping to occur automatically. The exception is when you do a manual mapping on your own using your code file. If you create your index in code, then you can provide your own field names in the search index and specify the mapping to the column names in your data source.
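The manual mapping mentioned here can be sketched as an indexer definition with a `fieldMappings` section, in the JSON shape used by the Search REST API. The indexer, data source, and index names are hypothetical, and `make_indexer` is a helper invented for this example; the code builds the definition without sending it anywhere.

```python
import json


def make_indexer(name, data_source, target_index, column_to_field):
    """Build an indexer definition that maps source columns to index fields."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": target_index,
        "fieldMappings": [
            {"sourceFieldName": column, "targetFieldName": field}
            for column, field in column_to_field.items()
            if column != field  # identical names need no explicit mapping
        ],
    }


# Hypothetical: the table's key column is ProductID, the index key is id.
indexer = make_indexer(
    "products-indexer",
    "products-ds",
    "products-index",
    {"ProductID": "id", "Name": "Name"},
)
print(json.dumps(indexer, indent=2))
```

Only the renamed column produces a mapping entry here; columns whose names already match an index field are left to the default behavior described in the transcript.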

So in here we've got a number of different string-based fields. If we take a quick look and add a new field, you can see that there is a large list of different data types that are supported. And of course, which ones you use is gonna be very much dependent on the data in your database. You'll see all of the attributes that we talked about: retrievable, filterable, sortable, facetable, and searchable. And you'll notice here under product category ID, I did in fact choose facetable, because just like the use case that we talked about for Amazon, that would be one that we would wanna use to show hit counts.

The other thing I wanna show is that for a new field, the minute we choose searchable, we get an option for an analyzer. There are two distinct families of analyzers that are provided by default, and that's Lucene and Microsoft. And then for each one of those, there are also different languages available. So it's gonna be up to you to determine, first of all, does your data contain multilingual text? And second, do you wanna use the Lucene analyzer or the Microsoft analyzer? You'll really need to do some research on the differences between the two, but I can tell you from personal experience that Lucene is going to solve the majority of your problems and will give you the analysis that you need for your index.
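In an index definition, the analyzer choice shows up as an `analyzer` property on a searchable field, with language analyzers named by a language code plus the family, for example `en.lucene` or `en.microsoft`. The following sketch assumes that naming pattern; the helper functions and the `Description` field are hypothetical, and not every language is necessarily available in both families.

```python
def language_analyzer(language_code, family="lucene"):
    """Compose a language analyzer name such as 'en.lucene'."""
    if family not in ("lucene", "microsoft"):
        raise ValueError("family must be 'lucene' or 'microsoft'")
    return f"{language_code}.{family}"


def searchable_field(name, language_code="en", family="lucene"):
    """Build a searchable string field that uses a language analyzer."""
    return {
        "name": name,
        "type": "Edm.String",
        "searchable": True,
        "analyzer": language_analyzer(language_code, family),
    }


# Hypothetical field; the default here follows the Lucene recommendation.
description = searchable_field("Description")
print(description)
```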

Lucene is the primary library that people have built search services on for decades now. I've used it in my own code over the years for different applications that I've written, and I highly recommend Lucene. You have all of this within the scope of your index. Let me remove this field because we don't need it, and we've already set everything up for this particular index.

Now, a couple other quick things. You can take a look at the JSON definition for your index, which makes changing your index a much easier process if you need to do it in a code file. You can also create what are called scoring profiles to help manipulate how the results are gonna be returned to your users. So as a quick example, we can create a new scoring profile, and I'll just call this "Test." A scoring profile adds either weights or functions to the actual results when they're returned.

So we can add some weights here by specifying a particular field that we wanna make sure is weighted heavily, let's say color, and then we can just specify a weight value. We can obviously add multiple weights if we wanted to. Now, functions are gonna be much more intense. When you add more than one, you also choose how their scores are aggregated: sum, average, minimum, maximum, or first matching. This is gonna be dependent more on numeric fields than anything else, so you'd add scoring functions when you wanna boost results based on the values in those fields.
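As a rough illustration, a scoring profile like the "Test" one could look as follows inside the index's JSON definition. The weight on `Color` mirrors the walkthrough; the `ListPrice` magnitude function, its ranges, and the boost values are assumptions added to show where a function and the aggregation setting go, not something configured in the video.

```python
import json

scoring_profile = {
    "name": "Test",
    # Text weights: matches in Color count more than in unweighted fields.
    "text": {"weights": {"Color": 3}},
    # Hypothetical function: boost documents by a numeric field's value.
    "functions": [
        {
            "type": "magnitude",
            "fieldName": "ListPrice",
            "boost": 2,
            "interpolation": "linear",
            "magnitude": {
                "boostingRangeStart": 0,
                "boostingRangeEnd": 100,
                "constantBoostBeyondRange": False,
            },
        }
    ],
    # How the scores of multiple functions are combined.
    "functionAggregation": "sum",
}

# Scoring profiles sit alongside the fields in the index definition.
index_patch = {"scoringProfiles": [scoring_profile]}
print(json.dumps(index_patch, indent=2))
```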

Okay. Now let's go take a quick look at the indexer, because the indexer is what actually processes the data and allows us to see those 295 documents. So we go onto our "Indexers" tab, and you can see immediately that the last run was a success, that it took 100 milliseconds, and that it's running every hour. So this is something that is already being done. Right now it is looking for differences to occur.
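The hourly run corresponds to a `schedule` section on the indexer, where the interval is written as an ISO 8601 duration such as `PT1H`. Here is a small sketch of building that value; the `iso_interval` helper is hypothetical, and the 5-minute minimum and one-day maximum it enforces are limits I believe the service applies, so treat them as assumptions.

```python
def iso_interval(minutes):
    """Convert whole minutes into an ISO 8601 duration like 'PT1H30M'."""
    # Assumed service limits: at least 5 minutes, at most one day.
    if not 5 <= minutes <= 24 * 60:
        raise ValueError("interval must be between 5 minutes and 1 day")
    hours, mins = divmod(minutes, 60)
    duration = "PT"
    if hours:
        duration += f"{hours}H"
    if mins:
        duration += f"{mins}M"
    return duration


# Fragment of an indexer definition that runs every hour.
schedule = {"schedule": {"interval": iso_interval(60)}}
print(schedule)
```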

Now, if you'll remember, when we went through the process of creating the data source, it required that you had change tracking turned on in order to run on a recurring schedule. Well, in fact, you have more options than what the portal provides, because using the API or the "Import Data" wizard, you have the ability to specify a date field as being a change delimiter, meaning that you can say, "Hey, I have a modified date. When that modified date changes, the data has changed, so update the index accordingly." So that is something else to keep in mind and just another reason for setting up your search service using the code APIs.
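In the REST API, the two options described here appear as a change detection policy on the data source: SQL integrated change tracking, or a "high water mark" column such as a modified date. The sketch below builds both variants. The data source name, table name, `ModifiedDate` column, and the `sql_data_source` helper are assumptions for illustration, and the connection string is a placeholder.

```python
import json


def sql_data_source(name, connection_string, table, high_water_mark_column=None):
    """Build an Azure SQL data source definition with change detection."""
    if high_water_mark_column:
        # Re-index rows whose column value has moved past the last run.
        policy = {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": high_water_mark_column,
        }
    else:
        # Rely on change tracking being enabled on the database and table.
        policy = {
            "@odata.type": "#Microsoft.Azure.Search.SqlIntegratedChangeTrackingPolicy"
        }
    return {
        "name": name,
        "type": "azuresql",
        "credentials": {"connectionString": connection_string},
        "container": {"name": table},
        "dataChangeDetectionPolicy": policy,
    }


data_source = sql_data_source(
    "products-ds", "<connection-string>", "Product", "ModifiedDate"
)
print(json.dumps(data_source, indent=2))
```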

But you can see we've actually processed, and there's the first run here, of all 295 documents. And just to prove this completely out to the end, let's go ahead and do a quick test. If you go into the "Index" screen, you have a very quick search explorer where you can run tests against your index, and it returns the results the same way the API would return them to you, which is in JSON format. So very, very simple test. I'm just gonna use an asterisk to search for everything, and there are all 295 of the JSON documents.
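The same asterisk test can be sketched as a request against the Search REST API. This example only constructs the request with the standard library and does not send it, since sending needs a real service and query key. The service name, index name, key placeholder, and API version are assumptions for illustration.

```python
import json
import urllib.request

SEARCH_API_VERSION = "2020-06-30"  # assumed version; check the docs


def build_search_request(service_name, index_name, api_key, search_text="*"):
    """Build (but do not send) a POST request that queries an index."""
    url = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{index_name}/docs/search?api-version={SEARCH_API_VERSION}"
    )
    # "*" matches every document; "count" asks for the total hit count.
    body = {"search": search_text, "count": True}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Hypothetical names and a placeholder key.
request = build_search_request("my-search", "products-index", "<query-key>")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` against a real service would return the matching documents as JSON, much like the search explorer shows them.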

Okay. So that's really all that there is with respect to indexes and indexers. And I kind of say all that there is tongue in cheek because you really do need to understand the APIs in order to create your search solution and make it part of your actual application. If you know that your index is going to change over time, understanding that API to be able to manipulate that index, update the indexer accordingly, and so on is going to be very, very important. But the most important thing, and I've been talking about this all throughout, is you need to know your data. I hope that this is helpful, and I hope to see you next time.

About the Author

Brian has been working in the Cloud space for more than a decade as both a Cloud Architect and Cloud Engineer. He has experience building Application Development, Infrastructure, and AI-based architectures using many different OSS and Non-OSS based technologies. In addition to his work at Cloud Academy, he is always trying to educate customers about how to get started in the cloud with his many blogs and videos. He is currently working as a Lead Azure Engineer in the Public Sector space.