Applications of Alibaba Cloud PAI: Image Classification


Image Classification with PAI

This course covers three main tasks. The first is image classification based on TensorFlow, the second is sentiment classification of film reviews, and the last is product recommendation based on a collaborative filtering algorithm. All of them can be used to build machine learning models in PAI Studio by dragging and dropping components. In each of the following sections, we will first introduce the background knowledge of these tasks and then provide demonstrations showing you how to carry out these activities on Alibaba Cloud.


From the beginning of this section, we will work through three practical projects. The first is image classification based on TensorFlow, the second is sentiment classification of film reviews, and the last is product recommendation based on a collaborative filtering algorithm. All of them can be built as experiments in PAI Studio by dragging and dropping components. In each of the following sections, we will first introduce the background knowledge of the task, then walk through the complete operation process of the experiment, and finally give a video demonstration for you to follow along with.

Let's start with the image classification task. Image classification, one of the most common tasks in the field of computer vision and artificial intelligence, is the basis for many other visual tasks. Image classification distinguishes different types of images according to the different features reflected in the image information.

The input of an image classification task is an image, as shown in the figure. We see a cat, but in the computer's view the image is composed of pixels. Each pixel has a corresponding RGB value, so the image is stored in the form of a three-dimensional numerical matrix. At the same time, we give a fixed set of category labels in advance, such as cat, dog, hat, and mug. The process of the image classification task is as follows: for each given image, the image classification model is used to predict its corresponding category label. The image classification model shown in the figure gives the probability that the input image belongs to each of the four categories respectively, among which the probability of belonging to cat is the highest, so the model determines that the image is a cat. For humans, cat recognition is particularly easy because we have been exposed to a large number of cat images and have a very clear understanding of the characteristics of cats.
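The final step described above, turning raw class scores into probabilities and picking the highest, can be sketched in a few lines. This is a minimal illustration, not PAI code; the score values are hypothetical model outputs invented for the example.

```python
import numpy as np

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    exp = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return exp / exp.sum()

labels = ["cat", "dog", "hat", "mug"]          # the fixed category set
scores = np.array([3.2, 1.3, 0.2, -1.7])       # hypothetical model outputs
probs = softmax(scores)
# The class with the maximum probability is the predicted label
prediction = labels[int(np.argmax(probs))]     # "cat" for these scores
```

A real classifier produces the score vector from the pixel tensor; the decision rule at the end is exactly this argmax over the softmax probabilities.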
So image recognition is a very simple task for humans, but for computer vision algorithms it is a big challenge. The traditional method of image classification is feature description and detection. First, the image is binarised to obtain the approximate contour of the cat, and then edge and corner detection are carried out to extract hand-crafted features. However, traditional image classification methods can only deal with some simple images; if the information in the image becomes complex or the features are not obvious, they struggle.

The following situations are long-term challenges for computer vision image classification. In each of these cases, our goal is the same: to classify the image into the cat category. However, in the first image, the color and texture of the cat and the background are very close, which makes feature extraction difficult. In the second image, due to different illumination and angles, although both are images of cats, the difference in features may be very large. In the third picture, the cat has a very strange pose. The fourth cat is partially occluded by a wire mesh. None of these images is ideal for the image classification task, and there are various factors that interfere with cat feature extraction. When we see these images, we will not hesitate to identify them as cats, but if we use traditional methods based on hand-constructed features, it is very likely that we cannot obtain good results, and the generalization ability and robustness are poor. Therefore, the data-driven image classification approach emerged. The so-called data-driven method is to train a classifier with a large amount of data to carry out automatic feature extraction.
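The binarisation and edge-detection steps of the traditional pipeline can be sketched on a toy image. This is a deliberately simplified stand-in (a fixed threshold and a crude neighbour-difference edge detector) for what a real pipeline would do with proper thresholding and operators such as Canny; the 5×5 "image" is invented for illustration.

```python
import numpy as np

def binarize(gray, threshold=128):
    # Pixels brighter than the threshold become 1 (foreground), the rest 0.
    return (gray > threshold).astype(np.uint8)

def edge_map(binary):
    # Crude edge detector: mark pixels that differ from a right/below neighbour.
    dy = np.abs(np.diff(binary.astype(int), axis=0)).astype(np.uint8)
    dx = np.abs(np.diff(binary.astype(int), axis=1)).astype(np.uint8)
    edges = np.zeros_like(binary)
    edges[:-1, :] |= dy
    edges[:, :-1] |= dx
    return edges

# Toy 5x5 grayscale "image": a bright square on a dark background
gray = np.zeros((5, 5), dtype=np.uint8)
gray[1:4, 1:4] = 200
binary = binarize(gray)   # contour of the bright object
edges = edge_map(binary)  # only the object's boundary remains marked
```

Hand-crafted stages like these work when the object stands out cleanly from the background, which is exactly why they break down on the hard cases above.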
It's similar to how we teach a child to recognize objects by looking at pictures: we show the child many pictures of cats and say that this is a cat, and then show the child pictures of cats it has never seen before, and the child can quickly recognize them. Deep learning, a data-driven feature extraction method, is the mainstream method for image classification at present. Based on learning from a large number of samples, we can obtain a deep, dataset-specific feature representation, which is more efficient and accurate at feature extraction on that dataset, and the extracted abstract features are more robust and have better generalization ability.

However, deep learning methods require a large amount of data to train the model and demand high computing power. With the advent of the era of big data and the improvement of hardware, deep learning has achieved unprecedented development in recent years, which also reflects the requirements of deep learning algorithms for data volume and computing power. Deep learning methods serve more advanced semantic and abstract image processing tasks, while traditional image algorithms are more suitable for specific scenarios that can be manually defined, designed, and understood. In practical applications in the image field, deep learning methods are often combined with traditional image processing methods to maximize the effectiveness of the whole task.

Here we introduce several categories of image classification. According to the granularity of classification, it can be divided into three categories: cross-species semantic-level image classification, subclass (fine-grained) image classification, and instance-level image classification. Cross-species semantic-level image classification identifies objects of different categories at the level of different species, which is the most common case, such as cat-and-dog classification, because each category belongs to a different species.
Such image classification often has large inter-class variance and small intra-class variance. The dataset we use in the following example is shown in the figure. It includes 10 categories, including vehicles such as airplane and automobile, and animals such as bird and cat. You can think of it as two large categories, vehicle and animal, and inside the two categories are completely different species. They are completely distinguishable objects semantically, with large inter-class variance and small intra-class variance.

Subclass (fine-grained) image classification is at a lower level than cross-species image classification. It is often the classification of subcategories of the same large category, such as the classification of different birds, different dogs, or different car types. The following is an example of a fine-grained classification dataset for different birds: Caltech-UCSD Birds 200-2011. This is a dataset of birds containing 11,788 images of 200 categories. Meanwhile, each image provides 15 local part locations, as shown in the figure, with the head, eyes, tail, and other parts marked, along with a bounding box around the bird and a segmentation map at the semantic level. In this dataset, take the woodpecker as an example: there are a total of six different woodpeckers, and the differences between the categories are small. Looking at the picture with two categories of woodpeckers, we can see that the appearance of the two birds is very similar, and the only way to distinguish them is by the color and texture of the head. Therefore, in order to train such a classifier, we must make the classifier recognize these areas, which is a more difficult problem than cross-species semantic-level image classification.

The last category is instance-level image classification. If we want to distinguish between individuals, not just species or subcategories, that is a recognition problem.
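The idea of large inter-class variance and small intra-class variance can be made concrete with toy feature vectors. The two "classes" and their 2-D features below are invented purely for illustration; real features would come from a trained network.

```python
import numpy as np

# Toy 2-D feature vectors for two hypothetical classes
cats = np.array([[1.0, 1.1], [0.9, 1.0], [1.1, 0.9]])
planes = np.array([[5.0, 5.2], [5.1, 4.9], [4.9, 5.1]])

def intra_class_variance(samples):
    # Mean squared distance of samples to their own class centroid
    return np.mean(np.sum((samples - samples.mean(axis=0)) ** 2, axis=1))

def inter_class_distance(a, b):
    # Squared distance between the two class centroids
    return np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)

intra = max(intra_class_variance(cats), intra_class_variance(planes))
inter = inter_class_distance(cats, planes)
# For a cross-species task, inter is much larger than intra,
# which is what makes the classes easy to separate.
```

Fine-grained classification is hard precisely because this gap shrinks: different woodpecker species have centroids that sit close together in feature space.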
The most typical task of instance-level image classification is face recognition. In the face recognition task, a person's identity needs to be identified in order to complete tasks such as attendance tracking. Face recognition has always been a major topic in computer vision. Although face recognition has been developed over decades and has been widely applied, there are still problems that have not been completely solved. For example, it is difficult to handle occlusion, illumination, multi-pose, and other classic problems: the face might be blocked by hair or a mask, the light may be too bright or too dark, the face may be turned to the side, et cetera.

The following is an introduction to the basic process of a deep learning image classification task. The first step is to create a dataset and process the image data into a format that can be read by a neural network. The data in deep learning is stored in a multidimensional NumPy array, also known as a tensor. A tensor is a data container that is somewhat similar to a matrix, but a tensor is a generalization of a matrix to any number of dimensions; a matrix is equivalent to a two-dimensional tensor. As we mentioned in the previous course, image data for deep learning is input in the form of a four-dimensional tensor. At the same time, we need to divide the dataset into a training set and a prediction set.

The second step is model training. First, we build a classification model; the most commonly used model for image classification is the convolutional neural network. Then we use the data of the training set to train the model through the process of input-data forward propagation and error back propagation, and through several rounds of iteration we update the network parameters continuously. When the error is smaller than our expected value, the training ends.

The third step is model prediction. The pictures in the prediction set are pictures that the model has never seen before.
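The first step above, image data as a four-dimensional tensor plus a train/prediction split, can be sketched with NumPy. The batch size, image size, and labels here are arbitrary example values; a real experiment would load actual images.

```python
import numpy as np

# A batch of 8 RGB images, 32x32 pixels: a rank-4 tensor
# with shape (batch, height, width, channels).
images = np.random.rand(8, 32, 32, 3).astype(np.float32)
labels = np.array([0, 1, 0, 2, 1, 0, 2, 1])

# A matrix is just a rank-2 tensor: one channel of one image
matrix = images[0, :, :, 0]

# Simple hold-out split: first 6 samples for training,
# last 2 held back as the prediction set
train_x, test_x = images[:6], images[6:]
train_y, test_y = labels[:6], labels[6:]
```

This is the format the network reads: the model trains on `train_x`/`train_y`, and `test_x` plays the role of the pictures it has never seen.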
The model will predict a label for each image in the prediction set, and by comparing the predicted label with the real label of the prediction set, the classification performance of the model can be evaluated.

The convolutional neural network model for image classification has gone through several stages of development. Here we introduce two typical convolutional neural networks: LeNet and AlexNet. Let's start with LeNet. Born in 1998, this network is one of the earliest convolutional neural networks and achieved good results on the handwritten digit recognition task. The structure of LeNet is shown in the figure. Its input is image data of 32×32 pixels. First it goes through a convolutional layer for feature extraction and then a subsampling layer to reduce the data dimension, which is also called pooling. This is followed by another convolutional layer and subsampling layer, thus completing the feature extraction process. Finally, we add two fully connected layers. For the recognition of handwritten digits, we finally get a 10-dimensional feature vector, perform classification with softmax, and get the probability of each class; the class with the maximum probability indicates the classification result.

Since then, the basic architecture of the CNN has been settled: convolutional layer, pooling layer, and fully connected layer, which laid the foundation for subsequent CNN research. However, due to the limitations of data and computing power, research on deep learning did not make much progress for a while. In the 21st century, when the level of data and computing power had developed enough to support deep neural network computation, another powerful CNN model came into being: AlexNet.
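The shrinking spatial sizes in LeNet's conv–pool–conv–pool pipeline follow from simple arithmetic, and tracing them makes the architecture concrete. The sketch below assumes the standard LeNet-5 configuration (5×5 convolutions without padding, 2×2 subsampling, 16 feature maps before the fully connected layers).

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolutional layer."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of a subsampling (pooling) layer."""
    return (size - kernel) // stride + 1

# Trace a 32x32 input through LeNet's feature extractor:
s = 32
s = conv_out(s, kernel=5)   # C1: 5x5 conv        -> 28x28
s = pool_out(s)             # S2: 2x2 subsampling -> 14x14
s = conv_out(s, kernel=5)   # C3: 5x5 conv        -> 10x10
s = pool_out(s)             # S4: 2x2 subsampling -> 5x5
flattened = s * s * 16      # 16 feature maps -> 400-dim vector into the FC layers
```

The fully connected layers then map this 400-dimensional vector down to the 10-dimensional output that softmax turns into class probabilities.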
The network was designed by the 2012 ImageNet competition winners, Hinton and his student Alex, who finished 10.9 percentage points ahead of second place in the image classification competition, which opened the prelude to deep learning and convolutional neural network research. The first five layers of AlexNet are convolutional layers, the last three layers are fully connected layers, and the final softmax outputs 1,000 classes. AlexNet has greater depth and better performance in feature extraction. However, increasing the depth often causes problems such as slow training speed and overfitting. In order to solve these problems, AlexNet applied the ReLU function as the activation function and used two GPUs for calculation to improve the training speed. A local response normalization (LRN) layer is added after the first and second convolutional layers to improve the generalization ability of the model. A dropout mechanism is used in the fully connected layers to randomly ignore part of the neurons to prevent overfitting of the model. This is what makes AlexNet unique in model building. After that, a variety of convolutional neural network models with excellent performance were developed, such as VGG, GoogLeNet, ResNet, et cetera. The width, depth, and complexity of networks are constantly increasing, and new progress continues to be made in the task of image classification.

Now let's introduce some applications of image classification. The first is face recognition. Face recognition is often used as a means of verifying personal identity because the features of the face have a certain invariance and uniqueness. It is widely used in our daily life, including face recognition attendance systems, identity verification for taking trains and planes, face-scan payment, and so on. Existing face recognition technology is relatively mature and has many applications, but it also faces some challenges.
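Two of AlexNet's signature ingredients, the ReLU activation and dropout, are easy to show in isolation. This is a minimal NumPy sketch of the "inverted dropout" formulation, not AlexNet itself; the input vector and dropout rate are example values.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # AlexNet's activation: max(0, x), cheap to compute and
    # avoids the saturation problems of sigmoid/tanh.
    return np.maximum(0.0, x)

def dropout(activations, rate=0.5):
    # Randomly zero a fraction of neurons during training; survivors are
    # scaled by 1/(1-rate) so the expected activation is unchanged.
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

x = np.array([-1.0, 2.0, 3.0, -0.5, 4.0])  # pre-activation values
h = relu(x)            # negatives clamp to 0
out = dropout(h)       # roughly half the neurons are ignored this pass
```

At prediction time dropout is switched off; because of the scaling during training, no extra adjustment is needed.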
At the technical level, there are the influences of expression, partial occlusion, face image distortion, et cetera. At the ethical level, there are also challenges such as privacy disclosure, and there is still a long way to go. The second is object detection. There may be multiple objects in a photo, such as a dining table, chairs, teacups, flowerpots, et cetera. The task of object detection is to identify the position of each object in the picture and classify it into the corresponding category. This means that the machine can gain a preliminary understanding of the content of the image. The third is pedestrian detection, which identifies and marks pedestrians in a video. Pedestrian detection is widely used in the fields of foot traffic statistics, autonomous driving, intelligent robots, and so on. Finally, license plate recognition converts the license plate photos taken by a camera into text information, which is convenient for recording and processing. License plate recognition technology is already very mature, and we can see its application in parking lots, toll stations, and other places where many vehicles pass.

About the Author

Alibaba Cloud, founded in 2009, is a global leader in cloud computing and artificial intelligence, providing services to thousands of enterprises, developers, and government organizations in more than 200 countries and regions. Committed to the success of its customers, Alibaba Cloud provides reliable and secure cloud computing and data processing capabilities as a part of its online solutions.