Visual Development with Alibaba Cloud PAI-Studio
Alibaba Cloud PAI-Studio
This course introduces visual development through PAI. We will give a detailed introduction to the basic concepts and features of PAI-Studio, and then conduct a practical exercise with a simple machine learning experiment based on PAI-Studio to help you understand it better.
In this lesson, we are going to introduce visual development through PAI. In the first lesson, we learned that we can use PAI-Studio for visual development. By dragging and dropping components, you can build machine learning experiments with zero code. In this section, we will give a detailed introduction to the basic concepts and features of PAI-Studio, and then conduct a practical exercise with a simple machine learning experiment based on PAI-Studio to help you understand it better.
First, let's learn about the basic concepts of PAI-Studio. It provides a visual development environment for machine learning. Using PAI-Studio, you can build traditional machine learning and deep learning projects by dragging and dropping components, and achieve artificial intelligence development without the need to write code. PAI-Studio integrates abundant and mature traditional machine learning algorithms, as well as commonly used deep learning algorithms, which can meet the needs of artificial intelligence development.
PAI-Studio can also be applied to different scenarios, mainly tasks involving data analysis, classification, and prediction, such as commodity recommendation, financial risk control, advertising prediction, news classification, disease prediction, weather prediction, et cetera. The modeling interface of PAI-Studio is shown in the picture. The left sidebar presents the various algorithm components in a list, and they can be directly dragged and dropped into the visual interface for modeling.
A complete experiment model is shown on the right, with its components linked together in order, including training and predicting data, machine learning modules, and prediction and evaluation modules. Click Run at the top of the interface to run the experiment. Click Deploy to deploy the model as a RESTful API. AutoML is used for automatic parameter adjustment of the model. The buttons on the right adjust the display of the model, including zoom in and zoom out, full screen, fit to screen, et cetera.
PAI-Studio's primary target users are developers who use traditional machine learning and deep learning for data analysis. In order to use PAI-Studio, we need to master some basic knowledge, such as the theory of traditional machine learning and deep learning, and understand the basic functions of each component. Although writing code is rarely required, a certain background in Python development is still helpful when working with notebooks and developing your own algorithm packages.
In addition, PAI-Studio provides algorithm components that can be applied to a variety of real-world projects. For example, the text processing components can be used to complete a news classification task. The graph algorithm components can be used in financial risk control tasks. And the collaborative filtering algorithm can be used for commodity recommendation tasks on e-commerce platforms. Let's look at the main features of PAI-Studio.
First, it provides a variety of machine learning and deep learning algorithm components. As for traditional machine learning, there are binary classification, multiclass classification, clustering, regression, recommendation, and evaluation. In terms of deep learning, it supports some mainstream deep learning frameworks, such as TensorFlow, Caffe, MXNet, and PyTorch. In addition, there are data processing, feature engineering, statistical analysis, and other functional modules, which can meet the needs of the vast majority of machine learning tasks.
In terms of the development approach, PAI-Studio supports both building experiments from templates and building them manually by dragging and dropping. PAI-Studio provides a complete set of experiment templates that can be directly applied to practical tasks, such as image classification based on TensorFlow, public opinion analysis based on takeout reviews, news classification, et cetera. Select the corresponding experiment template in the home interface, and you can build the experiment with one click.
The advantage of building experiments with templates is that we can quickly create experiments and deploy models, and we only need to adjust some of the parameters. We can also build experiments manually by dragging and dropping modules. This approach is characterized by the freedom to combine the hundreds of algorithm components provided by PAI-Studio to build models that suit our needs, while supporting multiple data sources for different types of input data, such as MaxCompute table-structured data and OSS unstructured data.
In terms of parameter adjustment, in addition to manual tuning, PAI-Studio also supports AutoML automatic parameter adjustment to help users get the optimal model. We can choose the parameter adjustment method. Here we choose the Gaussian process (GAUSE) algorithm, a non-parametric Bayesian model that fits a surrogate model by continuously observing the performance of hyperparameter configurations. It then uses the predictive ability of that model to guide its decisions, so as to select appropriate hyperparameter results in a more targeted way within a limited number of attempts.
PAI provides a total of seven parameter adjustment methods, and you can choose one of them. In addition, you can set the data split ratio, the maximum number of iterations, the parameter range, and so on. Then, we configure the output parameters of the model. Here, we choose AUC as the evaluation criterion and save the top five models according to it.
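To make the tuning settings above more concrete, here is a minimal sketch of the general idea: try a bounded number of parameter configurations within given ranges, score each one, and keep the top five. It deliberately uses plain random search rather than the Gaussian process method, and it is not PAI's implementation. The function `toy_auc`, the parameter names, and the ranges are all hypothetical stand-ins for "train a model and measure AUC on the held-out split".

```python
import heapq
import random


def toy_auc(penalty, epsilon):
    """Hypothetical scoring function standing in for a real train-and-evaluate step."""
    # Purely illustrative: the score peaks at penalty=1.0 and epsilon=0.01.
    return 1.0 - 0.1 * abs(penalty - 1.0) - 2.0 * abs(epsilon - 0.01)


def random_search(score_fn, ranges, max_iterations, top_k, seed=0):
    """Sample configurations uniformly from `ranges` and keep the best `top_k`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    trials = []
    for _ in range(max_iterations):
        params = {name: rng.uniform(low, high) for name, (low, high) in ranges.items()}
        trials.append((score_fn(**params), params))
    # nlargest returns the trials sorted from best to worst score.
    return heapq.nlargest(top_k, trials, key=lambda trial: trial[0])


# Assumed parameter ranges, chosen only for this illustration.
search_ranges = {"penalty": (0.1, 5.0), "epsilon": (0.001, 0.1)}
best_trials = random_search(toy_auc, search_ranges, max_iterations=20, top_k=5)

for score, params in best_trials:
    print(round(score, 4), params)
```

A surrogate-model method such as the Gaussian process approach would replace the uniform sampling step with a choice informed by the scores observed so far. The surrounding loop and the top-five bookkeeping would stay the same.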
Finally, we return to the modeling interface. Click Run to run the model, and the automatic parameter adjustment process will run in sync with the model. PAI-Studio also supports uploading self-defined algorithms, which can be developed using the SQL, Spark 2.0, or PySpark 2.0 frameworks, encapsulated as components, and uploaded to PAI-Studio or published to the AI market for more people to use. We will not explain how to develop an algorithm package in detail here, but we will show how to publish one.
After we have developed a package, we select Algorithm Management in the PAI console, click Create a custom algorithm, and enter the name, description, algorithm framework, and type. Then, we upload the algorithm package we have developed. After setting up a series of parameters, we can use it as a single algorithm component, just like the other existing algorithm components. Let's compare the features of PAI-Studio with those of Python Scikit-learn.
Firstly, both support the development of traditional machine learning algorithms, such as regression, classification, clustering, collaborative filtering, et cetera. In terms of model development, PAI-Studio supports drag and drop development, while Scikit-learn only supports writing code manually. In terms of parameter adjustment, PAI-Studio supports automatic parameter adjustment, while Scikit-learn requires users to adjust parameters manually.
In terms of operating environment, with PAI-Studio, we can create and deploy instances in the cloud, and both CPU and GPU can be selected as computing resources. With Scikit-learn, we need to complete the installation and deployment ourselves, and its computing efficiency depends on our own hardware. Lastly, PAI-Studio supports visual modeling, while Scikit-learn does not. All in all, PAI-Studio's development approach is simpler, more intuitive, and more efficient than that of Scikit-learn. It also saves computing resources, is less difficult to operate, and is more user-friendly.
The general process of development with PAI-Studio is shown in the picture. Create a project, prepare the data source, upload the data, construct the model visually, and run the model. Let's learn how to fully implement a PAI-Studio project with an example. We will introduce an example of a binary classification task based on Linear SVM. First, let's briefly introduce binary classification tasks.
The binary classification problem has only two outcomes, yes and no, which can be called positive and negative examples. As shown in the figure below, there are multiple points in the vector space, and we need to find a suitable hyperplane, determine the parameters of the plane equation, and separate the positive and negative cases as far as possible. There are many algorithms that can realize binary classification, such as the perceptron, logistic regression, GBDT, SVM, and so on. The Linear Support Vector Machine algorithm is used in our example, and we will introduce the basic concept of the SVM algorithm next.
In the process of searching for the partition hyperplane, we will find that there are many hyperplanes that can divide the training samples into two categories. So, which one is the best? In other words, which hyperplane has the best generalization ability? The so-called generalization ability is the tolerance to local disturbance of the training samples. If noise in the data easily pushes samples outside the training set onto the other side of the hyperplane, so that they are misclassified, then the generalization ability of the model is very low. So, we should choose the hyperplane that lies in the middle between the two classes of samples. From this, we introduce the concepts of support vector and margin.
The support vectors are the training sample points closest to the hyperplane, and the margin is the sum of the distances from the two support vectors of different classes to the hyperplane. Our goal is to find the partition hyperplane with the largest margin, that is, to maximize the margin. This is the basic idea of SVM. The SVM algorithm can also be applied to the case where the training samples are linearly inseparable. In that case, the samples need to be mapped from the original space to a high-dimensional feature space, and then the linear partition is carried out. Our Linear Support Vector Machine algorithm, however, only needs to consider the case where the training samples are linearly separable.
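The margin idea can be checked numerically. The sketch below, which uses made-up two-dimensional points and two hand-picked candidate hyperplanes (all of them assumptions for illustration), computes the distance from the closest sample of each class to a hyperplane w·x + b = 0 and adds the two distances to obtain the margin.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, positives, negatives):
    """Sum of the distances from the closest positive and negative samples."""
    closest_positive = min(signed_distance(w, b, x) for x in positives)
    closest_negative = min(-signed_distance(w, b, x) for x in negatives)
    return closest_positive + closest_negative


# Hypothetical, linearly separable toy samples.
positives = [(2.0, 2.0), (3.0, 3.0)]
negatives = [(0.0, 0.0), (-1.0, 0.0)]

# Two candidate hyperplanes that both separate the samples.
margin_a = margin((1.0, 1.0), -2.0, positives, negatives)  # x1 + x2 - 2 = 0
margin_b = margin((1.0, 0.0), -1.0, positives, negatives)  # x1 - 1 = 0

print(margin_a, margin_b)
```

Both hyperplanes classify every toy sample correctly, but the first one has the larger margin, so it is the one the maximum-margin criterion prefers.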
When we use PAI-Studio to complete the Linear SVM binary classification task, we only need to have a general understanding of the input, output, and the basic parameters of the SVM algorithm, because the algorithm itself has been encapsulated into a visual component which does not need to be implemented manually. Let's start by creating an experiment.
We have learned how to create a project in the first lesson. Click on the project name from the console, go to the homepage of PAI-Studio, and create a blank experiment. Enter the name of the experiment, then the description, and select the path to save it. Next, we upload the data. The figure shows the training data and predicting data of the experiment. The amount of data is relatively small, and the SVM algorithm is well suited to training on small samples of data.
In each piece of data, Y is the label value, which represents the sample category. Positive one represents the positive example and negative one represents the negative example. F0 to f7 are the feature values, indicating that each sample is an eight-dimensional vector. The predicting data is in the same format, and the accuracy of the prediction can be obtained by comparing the predicted results with the real labels. We also need to pay attention to the format of the data. PAI-Studio only supports TXT and CSV formats for table-structured data, and the size cannot exceed 20 MB. Now, we upload the experimental data by creating a table.
In the first lesson, we looked at the two types of data supported by PAI: table-structured data stored by MaxCompute, and unstructured data such as images, text, and video stored by OSS. In this experiment, we upload table-structured data. First of all, we select the data source in the PAI-Studio interface, click Create table, and enter the name and life cycle of the table in the pop-up window. The life cycle represents the longest time for which the platform will keep the table. Below, we can click the plus sign to add columns, enter the name of each column, and determine the data type. The column names for this experiment are Y and f0 to f7, and the data type is double.
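Before uploading, it can be useful to check a local file against the table definition described above. The following is a minimal sketch under several assumptions: the file is a CSV with a header row, the label column is written in lower case as `y`, and the 20 MB limit is interpreted as 20 × 1024 × 1024 bytes. The function name `validate_csv` is hypothetical and is not part of PAI.

```python
import csv
import io

# Column layout from the lesson: one label column plus eight feature columns.
EXPECTED_COLUMNS = ["y"] + [f"f{i}" for i in range(8)]
MAX_BYTES = 20 * 1024 * 1024  # assumed interpretation of the 20 MB limit


def validate_csv(text):
    """Parse CSV text and check it against the expected columns and label values."""
    if len(text.encode("utf-8")) > MAX_BYTES:
        raise ValueError("file exceeds the 20 MB limit")
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for line_number, row in enumerate(reader, start=2):
        # Every column is declared as double, so each value must parse as a float.
        values = {name: float(row[name]) for name in EXPECTED_COLUMNS}
        if values["y"] not in (1.0, -1.0):
            raise ValueError(f"line {line_number}: label must be 1 or -1")
        rows.append(values)
    return rows


# Made-up sample content in the expected layout.
SAMPLE = (
    ",".join(EXPECTED_COLUMNS) + "\n"
    "1,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8\n"
    "-1,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1\n"
)
rows = validate_csv(SAMPLE)
print(len(rows), rows[0]["y"])
```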
After uploading the data, we drag and drop the components for the training and predicting data onto the operation interface. Then, we add a Full-Table Statistics module to each of them to view the statistical information of the data in the table. Right click View Data on the Full-Table Statistics module, and the statistics will be shown as in the picture on the right. This table calculates the information of each column. For example, total count represents the total amount of data in the column, and count represents the amount of data that is not empty. There are also the maximum, minimum, range, and so on. There are 26 statistics in total.
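A few of the per-column statistics named above are easy to reproduce by hand. The sketch below computes total count, non-empty count, minimum, maximum, range, and mean for a single column. The dictionary keys and the use of `None` for an empty cell are assumptions made for this illustration and do not reflect the module's actual output format.

```python
def column_stats(values):
    """Compute a handful of summary statistics for one column.

    Empty cells are represented as None (an assumption of this sketch).
    """
    present = [v for v in values if v is not None]
    stats = {
        "total_count": len(values),   # all rows, including empty cells
        "count": len(present),        # rows with a non-empty value
    }
    if present:
        stats["min"] = min(present)
        stats["max"] = max(present)
        stats["range"] = stats["max"] - stats["min"]
        stats["mean"] = sum(present) / len(present)
    return stats


# Made-up column with one empty cell.
example_stats = column_stats([1.0, None, 3.0, 5.0])
print(example_stats)
```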
After the data preparation is complete, we start modeling. The components used include the training and predicting data, Full-Table Statistics, the Linear SVM algorithm, Prediction, Confusion Matrix, and Binary Classification Evaluation. We connect them in order. The training data is input into the Linear SVM module, whose output to the Prediction module is a binary classification model with optimized parameters. Then, the predicting data and the model are input together into the Prediction Module to get the prediction results.
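The connections between the components form a directed graph, and a valid run order is any order in which every module runs after its inputs. As a rough illustration, and not a description of how PAI-Studio schedules modules internally, the sketch below encodes the experiment with hypothetical node names and derives one such order using Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each key is a module; its value is the set of modules it takes input from.
# Node names are hypothetical labels for the components in this experiment.
experiment = {
    "train_statistics": {"training_data"},
    "predict_statistics": {"predicting_data"},
    "linear_svm": {"training_data"},
    "prediction": {"linear_svm", "predicting_data"},
    "confusion_matrix": {"prediction"},
    "binary_evaluation": {"prediction"},
}

# static_order yields every node so that all of its inputs come first.
run_order = list(TopologicalSorter(experiment).static_order())
print(run_order)
```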
The Confusion Matrix Module, connected after the Prediction Module, is used to show the accuracy of the prediction. And the Binary Classification Evaluation Module will give a detailed evaluation report. Before the training starts, we have to configure the modules. The first step is the Fields Setting. An algorithm module cannot judge the function of each column of data by itself. So, in the Linear SVM Module, the Prediction Module, the Confusion Matrix Module, and the Binary Classification Evaluation Module, the feature columns and the label column need to be marked out respectively for training, predicting, and evaluating.
We can mark them by checking the corresponding columns in the Fields Setting bar on the right. Then, there is the Parameter Setting. In this experiment, only the Linear SVM Module needs Parameter Setting. The parameters that can be adjusted in this module are shown in the figure on the right. If the label value of the positive sample is not specified, the system will randomly select one from the label values in the data, that is, positive one or negative one.
If there is a large difference between the positive and negative samples, manual specification is recommended. The penalty factor is equivalent to the weight of the positive and negative examples. It ranges from zero to infinity, and the default value is 1.0. If the amount of data for a certain label is obviously less than that for the other label, its penalty factor can be increased appropriately.
Finally, the convergence coefficient, which ranges from zero to one, is equivalent to an error tolerance and provides the condition for ending the training. When the error rate of training is less than this coefficient, the training can be ended. After setting the parameters, we can right click on the SVM Module and select Model Option and then Model Description to view the specific description of the model. In the description interface on the right, we can view the basic information and parameter information of the model in detail.
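To show how per-class penalty factors and a convergence coefficient can enter training, here is a small linear SVM sketch trained by subgradient descent on the hinge loss. It is an illustration under stated assumptions, not the algorithm inside the PAI component: the learning rate, the regularization strength, the epoch limit, the toy data, and all function and parameter names are hypothetical. The stopping rule follows the description above, ending training once the training error rate falls below the coefficient.

```python
def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def train_linear_svm(samples, labels, pos_cost=1.0, neg_cost=1.0,
                     epsilon=0.001, learning_rate=0.1, reg=0.01, max_epochs=1000):
    """Subgradient descent on a class-weighted hinge loss with L2 regularization."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        for x, y in zip(samples, labels):
            cost = pos_cost if y > 0 else neg_cost  # per-class penalty factor
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            # Shrink the weights slightly (the L2 regularization step).
            w = [wi * (1.0 - learning_rate * reg) for wi in w]
            if y * score < 1.0:
                # The sample is inside the margin or misclassified: push it out.
                w = [wi + learning_rate * cost * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * cost * y
        errors = sum(1 for x, y in zip(samples, labels) if predict(w, b, x) != y)
        if errors / len(samples) < epsilon:
            break  # training error rate is below the convergence coefficient
    return w, b


# Made-up, linearly separable toy data with labels +1 and -1.
samples = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (0.0, -1.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
print(w, b, [predict(w, b, x) for x in samples])
```

Raising `pos_cost` or `neg_cost` makes mistakes on that class more expensive, which is the effect the lesson describes for a label with noticeably fewer samples.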
When the model and parameters are configured, we start training. Click Run at the top of the operation interface to start training. In the process of training, when the module is in progress, the corresponding arrow will turn into a dotted line. The module itself will also show that it is running, so you can check the progress of the whole experiment. When running is complete, a green check mark icon is displayed on the right side of the module. When the whole experiment is run successfully, all the modules are shown with a green check mark icon.
If it fails to run, the module with the error will display a red cross icon. You can check the details of the error in the Messages column on the right side of the interface for debugging. There are some other options for running a module. Right click on the module, and you can choose to run from here, stop here, run this node, or run with a small amount of data. When we are debugging the model and want to test the effect of running it, if the amount of data is too large and the running time of the complete task is too long, it may cause unnecessary waste of time and computing resources.
At this time, we can choose to run with a small amount of data to preliminarily verify the availability of the model. After running the experiment successfully, we enter the stage of model evaluation. We can right click View Data in the Prediction Module to view the prediction results. The first two columns show the comparison of the real labels and the predicted labels for the predicting set. As we can see, most of the predictions are accurate, with only two predictions being wrong. This is followed by the prediction score and some details.
Right click on the Confusion Matrix Module, and select View Evaluation to view the accuracy of the prediction using the confusion matrix. What is the confusion matrix? It counts the numbers of correct and incorrect classifications made by a statistical classification model and expresses the results in a matrix, which is called the confusion matrix. It can be used to evaluate the effect of the model and the advantages and disadvantages of the classifier. We can clearly see from the matrix on the right that the model we trained predicted negative one as negative one three times, positive one as positive one five times, and positive one as negative one twice. Out of 10 predictions, eight were right and two were wrong.
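The counts quoted above can be reproduced with a few lines of code. In this sketch the label lists are reconstructed from those counts (three true negatives, five true positives, and two positives predicted as negative). The order of the samples is an assumption and does not affect the result.

```python
from collections import Counter

# Labels reconstructed from the counts in the lesson; sample order is arbitrary.
actual = [-1, -1, -1, 1, 1, 1, 1, 1, 1, 1]
predicted = [-1, -1, -1, 1, 1, 1, 1, 1, -1, -1]

# Each key is an (actual, predicted) pair; each value is how often it occurred.
matrix = Counter(zip(actual, predicted))

correct = sum(count for (a, p), count in matrix.items() if a == p)
accuracy = correct / len(actual)

print("actual -1, predicted -1:", matrix[(-1, -1)])
print("actual +1, predicted +1:", matrix[(1, 1)])
print("actual +1, predicted -1:", matrix[(1, -1)])
print("actual -1, predicted +1:", matrix[(-1, 1)])
print("accuracy:", accuracy)
```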
In addition to the confusion matrix, the Binary Classification Evaluation Module is also used to evaluate the model. Right click View Evaluation Report in the Binary Classification Evaluation Module to view the evaluation details. In the Indexes interface, you can view AUC, KS, F1 Score, and other index values. In the Charts interface, we can view the various evaluation charts. The ROC curve is shown in the figure, and there are also KS, Lift, gain, precision-recall, and other curves. We can also switch between line charts and bar charts, and save the charts locally.
In addition to Binary Classification Evaluation and Confusion Matrix, the evaluation modules provided by PAI-Studio also include multiclass classification evaluation, clustering evaluation, and regression evaluation. We can choose the appropriate evaluation module according to our own needs. The above is the basic introduction to, and an example of, visual modeling with PAI-Studio. Next, we will conduct a practical demonstration of the experiment.
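Two of the index values mentioned above have short definitions that can be written out directly. The sketch below computes the F1 score from confusion-matrix counts, and AUC as the fraction of positive and negative pairs in which the positive sample receives the higher score. The prediction scores are made up for illustration, and the function names are hypothetical. This is not the evaluation module's implementation.

```python
def f1_score(true_positive, false_positive, false_negative):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * true_positive / (2 * true_positive + false_positive + false_negative)


def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == -1]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Counts from the lesson's confusion matrix: 5 true positives, 0 false positives,
# and 2 false negatives.
f1_value = f1_score(true_positive=5, false_positive=0, false_negative=2)

# Hypothetical labels and prediction scores for five samples.
example_labels = [1, 1, -1, 1, -1]
example_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
auc_value = auc(example_labels, example_scores)

print(round(f1_value, 4), round(auc_value, 4))
```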
Alibaba Cloud, founded in 2009, is a global leader in cloud computing and artificial intelligence, providing services to thousands of enterprises, developers, and government organizations in more than 200 countries and regions. Committed to the success of its customers, Alibaba Cloud provides reliable and secure cloud computing and data processing capabilities as a part of its online solutions.