Cloud Academy Team

October 8, 2019

Azure Search: How to Search for Text in Documents and Images

What is Azure Search?

Azure Search is an artificial intelligence (AI)-powered cloud search service that enables full-text searches within documents and images. It’s a managed service, and we’ll show you how to implement a search engine in a few steps.

In addition to the ease of creating and managing the service, one of the main advantages is being able to integrate with other Azure services immediately. In this post, we will create a search engine and integrate with Azure Cognitive Services (specifically Vision API), all through the Azure Portal in a few minutes.

For an introductory tour of Azure Storage Solutions, check out Cloud Academy’s Introduction to Azure Storage Solutions. This intermediate-level course covers SQL offerings (SQL DB and third party offerings of MySQL), managed NoSQL databases (DocumentDB and MongoDB), managed Redis Cache service, Azure Backup (backup-as-a-service), Site Recovery (for handling disaster recovery), and StorSimple (a hybrid cloud storage solution).

To search for text in images and documents with Azure Search, here are the simple steps that we will follow:

Create a Storage Account
Create a search engine
Test the operation of our service
Understand service considerations

1. Create a Storage Account

The first step for our cognitive search engine is to create a storage account in Azure, in which we will store the files we want to analyze.

We create our storage account by entering the requested parameters.

Once our account is created, within the Blob Storage section, we will create a container that will store our files.

To finish this stage and prepare for testing, we gather several documents and images to analyze and store them in our new container. For this example, we will upload an image that contains the text “read it.”
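If you prefer to script the upload instead of using the portal, the following is a minimal sketch that builds a Put Blob request with the Python standard library. It assumes you have generated a SAS token with write permission for the container; the account name, container name, file name, and token shown are hypothetical placeholders, and the request is only built here, not sent.

```python
import urllib.request
from urllib.parse import quote


def build_upload_request(account, container, blob_name, data, sas_token):
    """Build (but do not send) a Put Blob request for a block blob.

    Assumption: `sas_token` is a SAS query string that grants write
    access to the container.
    """
    url = (
        f"https://{account}.blob.core.windows.net/"
        f"{quote(container)}/{quote(blob_name)}?{sas_token.lstrip('?')}"
    )
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        # Put Blob needs to know which kind of blob is being created.
        headers={"x-ms-blob-type": "BlockBlob"},
    )


# Hypothetical values for illustration only.
request = build_upload_request(
    account="mystorageaccount",
    container="files",
    blob_name="sample image.png",
    data=b"placeholder bytes standing in for the image file",
    sas_token="sv=2019-02-02&sig=PLACEHOLDER",
)
print(request.get_method(), request.full_url)
# To actually upload, pass the request to urllib.request.urlopen(request).
```

Building the request separately from sending it makes it easy to inspect the URL and headers before anything touches your storage account.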

2. Create a search engine

Our storage account is now ready, so the next step is to create our search service with Azure Search.

As with our storage account, we must enter a unique name, because a URL will be associated with our service. We must also select the same region as our storage account (otherwise, we cannot link both services to use the cognitive services). Nothing else needs to be configured at this point.

Once the search service is enabled, we open it to create an index. An index is the structure Azure Search uses to store information. It is similar to a table in SQL, but with a more flexible structure.

You will have several options to create indexes. In our case, we will select the option “Import data.”

Within the options, we will select the “Azure Blob Storage” option. If this is your first time doing this, you will have to connect the data source.

The portal will ask you for a name for the connection. Ideally, choose something that identifies the content you are extracting. In the “data to extract” option, select “Content and metadata.” Finally, select the storage account and the container that has your files.

Then proceed to select “Next: Add cognitive search.”

Note: Before going to the next stage, Azure Search analyzes the content of our Blob Storage, so it is important that there is already a file (document or image) that the service can analyze to generate the appropriate index schema.
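For reference, the connection that the portal creates in this step corresponds to a data source definition in the Azure Search REST API. The sketch below only builds the URL and JSON body. The service name, data source name, and connection string are hypothetical placeholders, and the api-version is an assumption you should match to your service.

```python
import json

API_VERSION = "2019-05-06"  # assumption: use the version your service supports


def datasource_request(service, name, connection_string, container):
    """Return the URL and JSON body for creating a Blob Storage data source."""
    url = (
        f"https://{service}.search.windows.net/datasources"
        f"?api-version={API_VERSION}"
    )
    body = {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }
    return url, body


# Hypothetical values for illustration only.
url, body = datasource_request(
    service="my-search-service",
    name="files-datasource",
    connection_string="<storage-account-connection-string>",
    container="files",
)
print("POST", url)
print(json.dumps(body, indent=2))
```

Sending this body with a POST request and an `api-key` header containing an admin key would create the data source. The portal wizard does the equivalent for you.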

Our next step is to select the cognitive service to use.

You can use the limited, free option that comes with Azure Search (recommended for development and testing only) or an advanced plan for production.

Below, in “Add enrichments,” we enter a name for our skillset and check “Enable OCR,” since this option is responsible for extracting text from files that are not plain text (such as images) through computer vision.

The other options are kept by default.

You can also have the cognitive service find entities within the extracted text, such as the names of people, organizations, and locations. In our case, we will leave those options unselected and select “Next: Customize target index.”
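To make the “Enable OCR” option more concrete, here is a rough sketch of a skillset definition of the kind the wizard generates: an OCR skill that reads text from each image, followed by a merge skill that combines that text with the document's own content into “merged_content.” The skillset name and description are hypothetical, and the exact definition the portal produces may differ from this one.

```python
import json


def ocr_skillset(name):
    """Build a skillset body with an OCR skill followed by a merge skill."""
    return {
        "name": name,
        "description": "Extract text from images and merge it with document text",
        "skills": [
            {
                # Runs once per image that the indexer extracts from a file.
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": "/document/normalized_images/*",
                "detectOrientation": True,
                "inputs": [
                    {"name": "image", "source": "/document/normalized_images/*"}
                ],
                "outputs": [{"name": "text", "targetName": "text"}],
            },
            {
                # Inserts the OCR text into the document content at the
                # position where each image appeared.
                "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
                "context": "/document",
                "insertPreTag": " ",
                "insertPostTag": " ",
                "inputs": [
                    {"name": "text", "source": "/document/content"},
                    {
                        "name": "itemsToInsert",
                        "source": "/document/normalized_images/*/text",
                    },
                    {
                        "name": "offsets",
                        "source": "/document/normalized_images/*/contentOffset",
                    },
                ],
                "outputs": [{"name": "mergedText", "targetName": "merged_content"}],
            },
        ],
    }


print(json.dumps(ocr_skillset("files-skillset"), indent=2))
```

Entity recognition for people, organizations, and locations would be an additional skill in the same list. It is left out here to match the choices made in this walkthrough.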

Next, we must configure our index (the structure that stores the extracted information so it can be queried).

The important things here are to choose a suitable name for our index (in this case, “files”) and a key, which works like a primary key. In this case, the key is the field “metadata_storage_path”: the path of each file within Blob Storage, which is unique per file and which the wizard Base64-encodes by default so that it is a valid key.

As you can see, the service generates many fields that correspond to metadata of the file, but the field that matters to us is “merged_content.”

For each text field that you want to match against search terms, select the “Searchable” option. For each field that you want returned in query results, check “Retrievable.” We will leave the remaining options unchecked so that our index is smaller.
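The choices above can be sketched as an index definition for the REST API. This is an illustrative subset, not the full schema the wizard generates: the `field` helper is hypothetical, and only a few of the metadata fields are listed. Every attribute is set to false unless explicitly enabled, which mirrors leaving the other checkboxes empty.

```python
import json

FLAGS = ("key", "searchable", "filterable", "sortable", "facetable", "retrievable")


def field(name, edm_type, **enabled):
    """Describe one index field; every attribute is off unless enabled."""
    definition = {"name": name, "type": edm_type}
    for flag in FLAGS:
        definition[flag] = bool(enabled.get(flag, False))
    return definition


def files_index(name="files"):
    """Build a small index body with the fields discussed in this post."""
    return {
        "name": name,
        "fields": [
            field("metadata_storage_path", "Edm.String", key=True, retrievable=True),
            field("metadata_storage_name", "Edm.String", retrievable=True),
            field("merged_content", "Edm.String", searchable=True, retrievable=True),
            field("text", "Collection(Edm.String)", searchable=True, retrievable=True),
        ],
    }


print(json.dumps(files_index(), indent=2))
```

Keeping unused attributes switched off is what keeps the index smaller, since each enabled attribute adds to what the service has to store for that field.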

Finally, in the last stage, you will create an indexer: an automatic process that periodically scans our files, extracts the text, and stores it in the search engine. In this case, we will leave “Once” selected, since we will run it manually each time. In production, it can be scheduled at specific intervals according to the load of the systems that consume it. Click “Submit.”

Note: When the indexer is created, it is automatically executed to extract the initial information.
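As a rough illustration of what the indexer ties together, the sketch below builds an indexer definition for the REST API. All resource names are hypothetical, and the definition the portal generates may contain more settings. Leaving out the schedule corresponds to the “Once” option, while an ISO 8601 duration such as “PT2H” would run the indexer every two hours.

```python
import json


def indexer_definition(name, datasource, index, skillset, interval=None):
    """Build an indexer body linking a data source, skillset, and index."""
    definition = {
        "name": name,
        "dataSourceName": datasource,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                # Same choice as "Content and metadata" in the wizard.
                "dataToExtract": "contentAndMetadata",
                # Needed so the OCR skill receives images to read.
                "imageAction": "generateNormalizedImages",
            }
        },
        "fieldMappings": [
            {
                # Encode the blob path so it is a valid document key.
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "metadata_storage_path",
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
        "outputFieldMappings": [
            {
                "sourceFieldName": "/document/merged_content",
                "targetFieldName": "merged_content",
            }
        ],
    }
    if interval is not None:
        definition["schedule"] = {"interval": interval}
    return definition


# Hypothetical names for illustration only.
run_once = indexer_definition(
    "files-indexer", "files-datasource", "files", "files-skillset"
)
every_two_hours = indexer_definition(
    "files-indexer", "files-datasource", "files", "files-skillset", interval="PT2H"
)
print(json.dumps(every_two_hours["schedule"]))
```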

It may take a few seconds or minutes until the index and associated processes are configured (it depends on the number of files to be analyzed initially).
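If you would rather check progress from a script than wait in the portal, the indexer exposes a status endpoint. The sketch below builds that URL and summarizes a status response. The service and indexer names are hypothetical, the api-version is an assumption, and the sample payload is made up purely to show the shape of the fields being read.

```python
API_VERSION = "2019-05-06"  # assumption: use the version your service supports


def status_url(service, indexer):
    """URL of the indexer status endpoint (send a GET with an api-key header)."""
    return (
        f"https://{service}.search.windows.net/indexers/{indexer}/status"
        f"?api-version={API_VERSION}"
    )


def summarize_status(payload):
    """Turn an indexer status response into a one-line summary."""
    last = payload.get("lastResult")
    if not last:
        return "indexer has not run yet"
    status = last.get("status", "unknown")
    processed = last.get("itemsProcessed", 0)
    failed = last.get("itemsFailed", 0)
    return f"last run: {status}, {processed} processed, {failed} failed"


# Made-up payload for illustration; a real response has more fields.
sample = {
    "status": "running",
    "lastResult": {"status": "success", "itemsProcessed": 1, "itemsFailed": 0},
}
print(status_url("my-search-service", "files-indexer"))
print(summarize_status(sample))
```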

3. Test the operation of our service

The last step is to verify that the service has been able to extract the text from our image and save it to be searched. To do this, go to Azure Portal > Search service > Select the “Search explorer” option.

Verify That the Service Extracted the Text

To validate that your test file was loaded correctly, enter part of the text from our image into the search engine (for example: “read it”).

To search, write the search query as a query string. In our case, it will be:

search="read it"

If the query is valid and there are matching documents, the results are returned in JSON format.

Note: The field that matters to us is “merged_content” since it has all the text of the file concatenated for the search — unlike “text”, which is a string array for each paragraph detected in the file.
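The same query can be issued outside the portal through the REST API. The sketch below builds the request with the standard library and includes a small helper that pulls “merged_content” out of a response. The service name and key are hypothetical placeholders, the api-version is an assumption, and the request is only built here, not sent.

```python
import urllib.request
from urllib.parse import urlencode

API_VERSION = "2019-05-06"  # assumption: use the version your service supports


def build_query_request(service, index, search_text, api_key, select=None):
    """Build (but do not send) a GET request that searches an index."""
    params = {"api-version": API_VERSION, "search": search_text}
    if select:
        params["$select"] = ",".join(select)
    # Keep "$" and "," readable; everything else is percent-encoded.
    query = urlencode(params, safe="$,")
    url = f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"
    return urllib.request.Request(url, headers={"api-key": api_key})


def merged_contents(response):
    """Collect the merged_content field of every hit in a search response."""
    return [doc.get("merged_content", "") for doc in response.get("value", [])]


# Hypothetical values for illustration only.
request = build_query_request(
    service="my-search-service",
    index="files",
    search_text='"read it"',
    api_key="<query-key>",
    select=["metadata_storage_name", "merged_content"],
)
print(request.full_url)
```

Search results come back under a top-level `value` array, one object per matching document, which is why the helper iterates over that key.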

4. Understand service considerations

The main considerations to take into account when implementing a service like this are:

Check the number of files that arrive periodically, since the search engine can spend a long time extracting and saving information. The indexer has a “batch size” option to avoid problems with too much data in a short time.
Choose an Azure Search plan with enough capacity to store all your data.
Use all services within the same Azure region to avoid latency.
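As an illustration of the batch size option, the sketch below sets `batchSize` in the parameters of an indexer definition without modifying the original dictionary. The helper name and the value of 5 are hypothetical. The right number depends on how large your files are and how long each one takes to enrich.

```python
import copy


def with_batch_size(indexer_definition, batch_size):
    """Return a copy of an indexer definition with parameters.batchSize set."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    updated = copy.deepcopy(indexer_definition)
    updated.setdefault("parameters", {})["batchSize"] = batch_size
    return updated


# Hypothetical, trimmed-down indexer definition for illustration only.
indexer = {
    "name": "files-indexer",
    "dataSourceName": "files-datasource",
    "targetIndexName": "files",
}
smaller_batches = with_batch_size(indexer, 5)
print(smaller_batches["parameters"])
```

Smaller batches mean each indexer pass handles fewer files at once, which can help when OCR on large images makes individual documents slow to process.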