In this section, we introduce sentiment analysis: its basic concepts, application scenarios, and related methods. We also conduct a practical sentiment analysis experiment based on PAI Studio.
Now, let's introduce a specific example: the sentiment analysis of IMDB movie reviews. IMDB is a large online database of movies, TV shows, and games that allows users to rate and review them. The chart on the left shows the list of the 250 highest rated movies, and the chart on the right shows a number of ratings and reviews for a particular movie. User reviews generally reflect the quality of a movie. We can see from the picture that users describe this as one of the best movies and a movie that everyone should see. Both users' comments on the movie are positive, reflecting that this is a movie worth seeing.
In addition to looking at the ratings, we can also identify the best movies by analyzing the emotional tendencies of user reviews. Analyzing users' reviews of movies is therefore valuable work, because positive and negative reviews guide audiences in different ways. We want to know which kind of review is more important for a movie. However, with the growth of internet information, the amount of review data keeps increasing. Manual counting cannot keep up with a large number of reviews, and its efficiency is very low. This calls for an automatic sentiment analysis method, and machine learning and deep learning can complete this task well. The following is a simple experiment on the sentiment classification of IMDB movie reviews, in which we need to divide reviews into positive and negative ones.
First, let's introduce the dataset of this experiment, the IMDB Large Movie Review Dataset. The dataset, created at Stanford University in 2011, consists of 50,000 reviews with a clear emotional bias. The training set and the testing set contain 25,000 reviews each, intended for binary text sentiment classification. The picture shows the file structure of the dataset, with each review stored in a TXT file. In our experiment, we use a small portion of the total dataset: 2,000 reviews from the training and testing sets, with 1,000 positive and 1,000 negative reviews.
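As a rough illustration of how such a balanced subset could be drawn, here is a minimal sketch. It assumes the reviews are already loaded as (label, text) pairs and that the subset is sampled at random; the transcript does not say how the 2,000 reviews were actually chosen, and the function name `sample_balanced` is hypothetical.

```python
import random

def sample_balanced(reviews, per_class=1000, seed=0):
    """Pick `per_class` positive and `per_class` negative reviews at random.

    `reviews` is a list of (label, text) pairs, with label 1 for positive
    and 0 for negative. Random sampling is an assumption of this sketch.
    """
    rng = random.Random(seed)
    positives = [r for r in reviews if r[0] == 1]
    negatives = [r for r in reviews if r[0] == 0]
    subset = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(subset)  # mix the two classes together
    return subset

# Toy stand-in for the full dataset: 5 positive and 5 negative reviews.
toy = [(1, f"good movie {i}") for i in range(5)] + [(0, f"bad movie {i}") for i in range(5)]
subset = sample_balanced(toy, per_class=2)
print(len(subset))
```

With the real dataset, `per_class=1000` would give the 2,000-review subset described above.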
Before uploading the data, we clean it to remove punctuation marks and special characters in the reviews. The fields of the data are shown in the left table, where label is the comment label: one represents a positive review and zero represents a negative review, and its data type is int. Review is the concrete content of the comment, and its data type is string. The right table is a partial display of the dataset, with each row containing a review and its corresponding label. This is the overall process of the experiment. We input the original review data, and first we remove meaningless stopwords to avoid their impact on the judgment of emotional tendency.
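The cleaning step mentioned above (removing punctuation marks and special characters before upload) might look like the following sketch. The exact rules used in the experiment are not given, so the regular expressions here are assumptions; in particular, stripping HTML tags such as `<br />` is an extra step added because raw IMDB reviews often contain them.

```python
import re

def clean_review(text):
    """Keep only letters, digits and single spaces.

    The rules below are illustrative assumptions, not the course's exact ones.
    """
    text = re.sub(r"<[^>]+>", " ", text)         # assumed extra step: drop HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # punctuation and special characters
    return re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace

print(clean_review("Great movie!!! <br />A must-see... 10/10 :)"))
# Great movie A must see 10 10
```

Note that this simple rule also splits contractions, so "don't" becomes "don t"; whether that matters depends on the stopword list used later.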
Next, we use Doc2Vec to vectorize the text, mapping each review to a multidimensional vector. Then we train a binary classification model to distinguish between reviews with different emotional tendencies, and finally we evaluate the performance of the model. We do this experiment with the help of PAI Studio. First, enter the working interface of PAI Studio and click new experiment under experiment in the left bar. Enter the name, description, and storage path of the experiment in the pop-up window, and click OK to create the experiment.
Now let's upload our dataset to PAI. Here we can create structured table data: go to data source in the left sidebar and click create table, then enter the name of the table and its lifecycle, which is the maximum time that the table can be stored. At the bottom of the window, we set up the columns. Our data has only two columns, an int label column and a string content column. We then upload the raw data from a local TXT or CSV file and select the row and column delimiters. If the data appears as shown in the picture, we have successfully uploaded the data.
After the data is successfully uploaded, a visual component is generated for drag-and-drop use. In addition to uploading the dataset, we also need to upload the list of stopwords for the language. Stopwords are mainly function words that carry no actual meaning, such as the, is, at, and which. Eliminating stopwords in information retrieval can save storage space and improve search efficiency, and the same is true in machine learning. We create a table in the same way, noting that there is only one column in this table. Once it is created, a visual component is also generated, which we will use later.
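To make the idea of stopword removal concrete, here is a minimal sketch. The stopword set below is a tiny illustrative list, not the table uploaded in the experiment, and `remove_stopwords` is a hypothetical helper rather than a PAI component.

```python
def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token found in `stopwords` (case-insensitive)."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stopwords)

# Illustrative list only; the real stopword table is much longer.
stopwords = {"the", "is", "at", "which", "a", "this"}

filtered = remove_stopwords("This is the best movie at the festival", stopwords)
print(filtered)  # best movie festival
```

Using a set for the stopwords keeps each membership check fast, which matters once the list and the number of reviews grow.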
Next, we enter the core part of the experiment, model building. Drag and drop the appropriate components from the component list and connect them into a model as shown in the figure; the model can also be created from a template. We will introduce the model in five parts. The first two parts are the raw review data and the list of stopwords, which we have just introduced. The third part is text vectorization. The film review text without stopwords is vectorized through a Doc2Vec module, and each text is transformed into a multidimensional semantic vector, which serves as the representation of the text and the input of the machine learning model.
In the fourth part, we divide the input text into a training set and a testing set according to a certain proportion, which you can set yourself. For sentiment classification we use a simple binary classification model, logistic regression, which can distinguish positive samples from negative samples. The fifth part handles prediction and evaluation. After model training is completed, predictions are made on the held-out testing set, and the classification performance of the model is then evaluated through the confusion matrix component and the binary classification evaluation component.
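The split in the fourth part can be sketched as follows. This is only an illustration of dividing data by a proportion; the 0.8 fraction, the seed, and the function name `train_test_split` are assumptions, and PAI's split component may behave differently in detail.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows, then cut them into a training part and a testing part."""
    rows = list(rows)                   # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)   # fixed seed makes the split repeatable
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Toy data: ten (label, review) pairs.
rows = [(i % 2, f"review {i}") for i in range(10)]
train, test = train_test_split(rows, train_fraction=0.8)
print(len(train), len(test))  # 8 2
```

Shuffling before cutting matters when the source data is ordered by label, since otherwise one split could contain only one class.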
Now let's look at the data after the Doc2Vec component. Each row represents a vector, and each column represents one component of that vector. In this way, each review is mapped to a vector in the multidimensional vector space as an input to the machine learning model. Next, we will explain the parameter settings of the logistic regression classification module. Let's first introduce the basic idea of the logistic regression model. Its input is one vector per sample, and the weighted sum of the elements of that vector is passed to the sigmoid activation function.
Finally, a value between zero and one is obtained, which indicates the probability that the sample is a positive example rather than a negative example. Training a logistic regression model means building a loss function based on the principle of maximum likelihood estimation and minimizing it by updating the weights of the model with the gradient descent method. In PAI Studio, the logistic regression algorithm is already encapsulated, and we just need to adjust the parameters.
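The idea described here, a weighted sum passed through a sigmoid and weights updated by gradient descent on a likelihood-based loss, can be sketched in plain Python. This is a teaching sketch on made-up two-dimensional data, not PAI's encapsulated implementation; the learning rate, the number of steps, and all function names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability that sample x is positive: sigmoid of the weighted sum."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def log_loss(weights, bias, X, y):
    """Average negative log-likelihood (the maximum likelihood loss)."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, label in zip(X, y):
        p = predict_proba(weights, bias, x)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(X)

def gradient_step(weights, bias, X, y, lr=0.5):
    """One gradient descent update of the weights and the bias."""
    n = len(X)
    grad_w = [0.0] * len(weights)
    grad_b = 0.0
    for x, label in zip(X, y):
        err = predict_proba(weights, bias, x) - label
        for j, xj in enumerate(x):
            grad_w[j] += err * xj / n
        grad_b += err / n
    new_weights = [w - lr * g for w, g in zip(weights, grad_w)]
    return new_weights, bias - lr * grad_b

# Made-up vectors standing in for Doc2Vec outputs; 1 = positive, 0 = negative.
X = [[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]]
y = [1, 1, 0, 0]

weights, bias = [0.0, 0.0], 0.0
initial_loss = log_loss(weights, bias, X, y)
for _ in range(100):
    weights, bias = gradient_step(weights, bias, X, y)
final_loss = log_loss(weights, bias, X, y)
print(initial_loss, final_loss)
```

On this toy data the loss falls from its starting value as the weights are updated, which is the behavior the training process above relies on.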
The first parameter is the regularization type. In machine learning, the regularization term is a penalty term that is added to the loss function. L1 regularization penalizes the sum of the absolute values of the elements in the weight vector, and L2 regularization penalizes the sum of the squares of the elements in the weight vector, that is, the square of its L2 norm. Using a regularization term can prevent the model from overfitting to a certain extent. In this module, we can choose L1 regularization, L2 regularization, or none.
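The two penalty terms can be written out as a short sketch. The L2 penalty is shown in its common squared form (the L2 norm itself is the square root of that sum); which exact convention PAI uses, including any constant factor, is not stated in this walkthrough, so treat the functions below as illustrative.

```python
def l1_penalty(weights, coefficient):
    """Coefficient times the sum of absolute values of the weights."""
    return coefficient * sum(abs(w) for w in weights)

def l2_penalty(weights, coefficient):
    """Coefficient times the sum of squared weights (squared L2 norm)."""
    return coefficient * sum(w * w for w in weights)

weights = [3.0, -4.0]
print(l1_penalty(weights, 0.1))  # 0.1 * (3 + 4)
print(l2_penalty(weights, 0.1))  # 0.1 * (9 + 16)
```

Either penalty is added to the loss during training, so larger weights cost more and the model is pushed toward simpler solutions.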
The second parameter is the maximum iterations, which is the maximum number of rounds of model training. The default value is 100. The ultimate goal of model iteration is to achieve convergence, maximizing the accuracy and minimizing the loss function. The right value therefore depends on the convergence speed and the expected goal of the model, and you can adjust it yourself.
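An iteration cap usually works together with a convergence tolerance, the last parameter described in this section. The sketch below shows how the two interact on a toy one-dimensional problem. It assumes the tolerance is compared against the change in loss between iterations, which is a common convention; the exact rule inside PAI is not spelled out here, and `minimize` is a hypothetical helper.

```python
def minimize(loss_fn, step_fn, params, max_iter=100, epsilon=1e-6):
    """Iterate until the loss changes by less than `epsilon` or `max_iter` is hit.

    Returns the final parameters and the number of iterations actually run.
    """
    previous = loss_fn(params)
    for iteration in range(1, max_iter + 1):
        params = step_fn(params)
        current = loss_fn(params)
        if abs(previous - current) < epsilon:  # assumed convergence test
            return params, iteration
        previous = current
    return params, max_iter

# Toy problem: minimize (w - 3)^2 by gradient descent with learning rate 0.1.
loss = lambda w: (w - 3.0) ** 2
step = lambda w: w - 0.1 * 2.0 * (w - 3.0)

w, iterations = minimize(loss, step, params=0.0, max_iter=100, epsilon=1e-6)
print(w, iterations)
```

Here training stops well before the cap of 100 because the tolerance is reached first; with a very small cap, the loop would instead stop at the cap before converging.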
The third parameter, the regularization coefficient, is the coefficient of the regularization term. It applies only if there is a regularization term in the loss function; if the regularization type is set to none, it has no effect. The last parameter is the minimum convergence deviance: when the improvement in the loss function falls below this value, model training ends. You can use the default value or modify it yourself. After model training is completed, we can right-click the prediction component to view the prediction results of the model.
As shown in the table, label is the actual label of the text and prediction result is the label predicted by the model. Prediction score is the probability of the predicted label. If the probability of the sample being a positive example is greater than 0.5, it is determined to be a positive sample; otherwise it is a negative sample.
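The decision rule just described is a simple threshold, sketched below. The function name `to_label` and the sample scores are made up for illustration.

```python
def to_label(positive_probability, threshold=0.5):
    """Return 1 (positive) if the probability exceeds the threshold, else 0."""
    return 1 if positive_probability > threshold else 0

scores = [0.91, 0.50, 0.12]
labels = [to_label(s) for s in scores]
print(labels)  # [1, 0, 0]
```

A score of exactly 0.5 is not greater than the threshold, so under this rule it falls on the negative side.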
Finally, we evaluate the performance of the model. The main diagonal of the confusion matrix represents the correctly predicted samples, and from the confusion matrix we can intuitively see that most of the samples were correctly predicted. The following table shows the specific prediction data for both positive and negative examples. The accuracy and F1 value of the model have both reached more than 70%, which is a relatively satisfactory result. The above is the introduction to, and an example of, the task of text sentiment analysis.
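For readers who want to see how the confusion matrix, accuracy, and F1 value relate, here is a small sketch on made-up labels. The numbers are not the results of the experiment, and the function names are hypothetical rather than those of the PAI evaluation components.

```python
def confusion_matrix(actual, predicted):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Share of all samples that lie on the main diagonal (tp and tn)."""
    return (tp + tn) / (tp + fp + fn + tn)

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up labels for eight reviews.
actual    = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]

tp, fp, fn, tn = confusion_matrix(actual, predicted)
print(tp, fp, fn, tn)            # 3 1 1 3
print(accuracy(tp, fp, fn, tn))  # 0.75
print(f1_score(tp, fp, fn))      # 0.75
```

The two diagonal counts, tp and tn, are the correctly predicted samples referred to above.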
So now let's move on to the practical demonstration. This is a practical demonstration of movie review sentiment classification using the PAI console. Open the website of the PAI console, then click model training and select studio modeling visualization. On the right side is your project list. If you don't have any project yet, you should create one first. Once you have created a project, click the machine learning entry that corresponds to your project.
Now you arrive at a new website named Machine Learning Platform for AI. Similar to the previous experiment, you first need to create a new project. In this experiment, you clone a template preset by PAI and modify some information in it. Click the home icon in the leftmost column, then find the template named "public opinion risk control based on takeaway reviews". Click the create button and type your experiment's name and description in the pop-up dialog. Choose where your project is saved within your own folder, and when everything is ready, click the OK button.
The next step is to upload the movie review dataset and the stopword table. Go to the graphical data interface by clicking the data source icon, then click create table. In this interface, type the name of your table and its lifecycle. The first thing you need to upload is the movie review dataset. Look at this page: the dataset consists of two columns. The first column holds the labels of the samples, and the rest of the content is the movie reviews. In this dataset, one in the label column means a positive sample and zero means a negative sample.
Back in PAI, the following part shows how to fill in the schema. The first column is the polarity in this dataset; to make the next part more convenient, you had better name this column label. The other column holds the sentences of the film reviews, so the type of this column is string. After clicking the next button, you should modify the row delimiter and the column delimiter. The row delimiter is \n and the column delimiter is a comma. After this modification, upload the dataset with the select file button, and you can see the content of the dataset you uploaded. Finally, click the OK button. If no error message pops up, the dataset has been uploaded successfully. Next, you need to upload the stopword table. Click the create table button and type your table name and lifecycle. This data consists of strings arranged in a single column, so the type of the only column in this dataset is string. Complete the remaining steps as before, then find and upload the correct file.
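The schema and delimiters chosen for the review table correspond to a very simple file layout, which the sketch below reads locally. This is not how PAI ingests the file; it only illustrates the format (an int label, a comma, then the review text, one record per line). The function name `parse_reviews` and the sample rows are made up.

```python
import csv
import io

def parse_reviews(raw_text):
    """Parse 'label,review' records separated by newlines into (int, str) pairs."""
    rows = []
    reader = csv.reader(io.StringIO(raw_text), delimiter=",")
    for record in reader:
        if not record:            # skip blank lines
            continue
        label = int(record[0])    # first column: label, type int
        review = ",".join(record[1:])  # remaining text: review, type string
        rows.append((label, review))
    return rows

sample = "1,one of the best movies i have seen\n0,boring plot and weak acting\n"
rows = parse_reviews(sample)
print(rows)
```

Because the reviews were cleaned of punctuation before upload, a comma is a safe column delimiter: it should not appear inside the review text itself.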
After uploading the datasets, you need to modify some attributes in this experiment. Click the top left node and type the name of the movie review dataset that you uploaded. The dataset in the top right node should be changed to the name of the stopword dataset. When all is done, click the run button. This experiment usually takes about six minutes, and you can view the information in those nodes by right-clicking a node and selecting view data. The final trained model is in the model column. This concludes the practical demonstration of movie review sentiment classification using the PAI console.