Notebook Style Development with Alibaba Cloud PAI-DSW


Alibaba Cloud PAI-DSW

Notebook Style Development with PAI-DSW

In this section, we will look at another component that PAI provides, PAI-DSW, and the Notebook-based approach to development. We will learn the basic concepts, functional characteristics, and basic operations of PAI-DSW, and demonstrate some practical experiments showing how PAI-DSW can be used.


In this section, we will look at another component that PAI provides, PAI-DSW, and the Notebook-based approach to development. We will learn the basic concepts, functional characteristics, and basic operations of PAI-DSW, and walk through practical experiments for you to do with it.
First, let's take a look at a basic introduction to PAI-DSW. Its full name is Data Science Workshop. It is an integrated development environment in the cloud that provides a friendly Notebook interface for machine learning tasks. First, it integrates the open source JupyterLab. You may know Jupyter Notebook as an open source web-based application for interactive programming. JupyterLab is an updated version of Jupyter Notebook that makes the interface more user-friendly, with more customizable features and extensions, and we can experience the abundant functions of JupyterLab in PAI-DSW. Notebook also allows you to do all of your Python work, from writing code to debugging and running it, without any extra operations and maintenance. In addition, PAI-DSW provides a variety of computing resources, including different configurations of CPU and GPU resources, and users can choose according to their own computing needs. Finally, the model trained in PAI-DSW can be deployed as a RESTful interface for one-stop machine learning.
This is the user interface of PAI-DSW. At the top of the interface is the toolbar, which is used for basic operations. On the left is the auxiliary toolbar, which includes the File Browser, Running Terminals and Kernels, Git, Commands, Tutorial, Property Inspector, Open Tabs, Table of Contents, Code Snippet Explorer, and finally, Extension Manager. In the middle is the DSW Launcher interface, where we can create new Python code files, open the terminal to run command lines, and create text files and Markdown files as documentation.
From the Launcher we can also use TensorBoard to view the training process and use Show Contextual Help to view code documentation from the active kernel, updated in real time. On the right side of the screen is the Resource Monitor, which shows whether CPU or GPU resources are currently being used and how heavily each resource is being used.
In terms of target users, PAI-DSW is mainly aimed at developers who use deep learning and SQL statements for data analysis, because PAI-DSW supports various deep learning frameworks such as TensorFlow and PyTorch, and supports writing and running SQL statements. In terms of expected background knowledge, PAI-DSW requires developers to have a foundation in deep learning, a grasp of the basics of neural networks, and some knowledge of Python development. Unlike the zero-code development of PAI-Studio, PAI-DSW requires a higher level of coding ability because developers write code in Notebook.
In terms of application scenarios, PAI-DSW is suitable for deep learning and big data development. It integrates PAI-EasyVision, an enhanced visual intelligence algorithm package that helps computer vision developers build visual models and apply them to production tasks such as target detection and OCR. It integrates PAI-EasyTransfer, a deep transfer learning framework that helps developers working on natural language processing build transfer learning models easily and quickly and complete tasks such as text classification. The integrated PAI-EasyASR speech intelligence enhancement algorithm package helps developers of speech applications build speech models and apply them to tasks such as speech recognition. In short, PAI-DSW provides a more convenient framework and environment for development on big data, images, text, and speech, which are the basic themes of artificial intelligence.
Let's take a look at the main characteristics of PAI-DSW.
First, it supports real-time resource monitoring. During algorithm development, CPU and GPU usage is displayed on the right of the interface, so users can see how computing resources are being used and find out whether any resource has become a bottleneck.
Second, PAI-DSW supports a variety of data sources, including MaxCompute, OSS, and NAS. We covered MaxCompute and OSS data storage in the previous sections, so here we mainly introduce NAS storage. Its full name is Network Attached Storage, a distributed file system whose benefits include shared access, scalability, high availability, and high performance. By default, the system disk provided by a PAI-DSW instance is temporary storage. After the instance is stopped or deleted, the system clears the data. If you want to store the data permanently, you need to mount your own NAS.
Third, PAI-DSW supports writing and running SQL statements. SQL is a database query and programming language used to query, update, and manage data in relational database systems. PAI-DSW's preset SQL editor supports syntax highlighting, intelligent prompts, and automatic function completion. It allows you to read MaxCompute table data directly and perform big data development tasks with SQL statements after configuring the data source just once.
Fourth, PAI-DSW supports a variety of resource models, including pure CPU and a variety of GPU computing cards. When you create a project, you can select the appropriate computing resources based on your computing needs and budget, and you can even adjust them while the instance is running.
Fifth, PAI-DSW supports switching among different resources. After the project is created, you can switch between CPU and GPU at any time, or change to a different configuration of the same computing resource.
When the demand for computing is large, you can switch to computing resources with a higher configuration, and switch back to a lower configuration when your demand is small. This can effectively reduce cost and prevent computing resources from being wasted.
Finally, PAI-DSW provides built-in big data development packages and algorithm libraries. If they do not meet developers' needs, it also supports custom installation of third-party libraries without complex instructions, which is very convenient and fast.
If this is your first time using PAI-DSW, you can start with the preset cases, each of which provides complete code to run, including automatic hyperparameter optimization, EasyVision image target detection, the EasyRec recommendation system, and so on. There is no need for complex operations. We just need to enter the PAI-DSW development environment, click Tutorial in the left auxiliary toolbar, download the required cases, and open the model file in Notebook, as shown in the picture on the right. By clicking the run icon, you can run the Notebook code cells one by one to complete the experiment.
To complete a deep learning modeling task, we go through the following steps. First, we create an instance in PAI-DSW. Second, since the data used in deep learning is typically unstructured data such as images and text, the OSS storage service needs to be activated. Third, we upload the data to OSS. Fourth, we write code in Notebook to develop the model. After successfully running the model, we can save it for subsequent deployment.
Let's look at some of the basic operations of PAI-DSW, including instance creation, instance management, resource monitoring and switching, third-party library management, data uploading, and reading and writing OSS and MaxCompute data.
First, let's learn how to create an instance. Click DSW Notebook Service in the console, and then click Create Instance.
If you are creating a PAI-DSW instance for the first time, you need to purchase the service. In the instance creation window, we first enter the name of the instance and then select the PAI-DSW version, which is either the individual version or the on-sale version. The individual version has more functions than the on-sale version. Next, select the region of the instance, preferably one close to your location. This reduces network latency and improves connection quality. Then we choose the payment method, which in our case is pay-as-you-go, where the price is determined by how long you use the instance.
Next, we choose the computing resources. Depending on actual need, we choose CPU or GPU for computing. If you are doing deep learning experiments, you need to use a GPU. We can also choose the specification of the computing resources, such as the number of CPU cores, memory, bandwidth, GPU cards, and so on. Different types of resources have different hourly prices. However, we do not need to worry about choosing the wrong computing resources, because after the project is created we can adjust them freely according to the actual situation.
At the end of creating the instance, we configure the storage, the image, and the Virtual Private Cloud (VPC). The system disk of a PAI-DSW instance is only temporary storage, and once the instance is stopped or deleted, the stored files are cleared. If the instance files need to be stored permanently, you should configure NAS storage. To create your own NAS file system, click Add NAS File System below. Next, we select the image. You can choose a preset PAI image, or click the custom image button and fill in a publicly accessible Docker Registry address. Finally, if the instance needs to be attached to a VPC, you can choose a VPC you have already created, or go to the console and create a new one. This completes the instance creation.
When the instance is created, it is displayed in the instance list. We can manage the instances we have created, including starting, stopping, and deleting them and saving an image. Click Stop to stop a running instance, at which point the system stops billing for a pay-as-you-go instance. When exiting PAI-DSW, we need to make sure that the instance is truly stopped. Otherwise, unnecessary costs may be incurred. If the instance is stopped, you can click Start to start it manually. After the instance is started, its status changes to Running, and for a pay-as-you-go instance, the system starts to charge. It is therefore recommended to stop the instance once you have completed training so as not to incur additional costs. If you no longer need the instance for training, you can delete it. After the instance is deleted, the data cannot be recovered, so delete with care.
Now let's look at monitoring and switching resources. As we said before, while a PAI-DSW instance is running, its computing resources can be monitored, on the premise that the instance is pay-as-you-go. Different computing resources can be switched freely according to actual usage. Click the blue button in the right sidebar of the PAI-DSW environment to enter the resource monitoring and switching interface. Here we can see a real-time display of CPU, GPU, and memory usage and temperature, and we can switch computing resources. PAI-DSW provides a variety of CPU and GPU types to choose from, and different resource models are priced differently. After resources are switched successfully, the previous resources are no longer charged, and the system charges according to the pricing of the new resources. At the same time, the previous results of running code become invalid and the code needs to be run again. Therefore, it is recommended not to switch resources frequently unless the actual application requires it.
Next, we introduce the management of third-party libraries.
If we use the Python development environment, we can install, view, and uninstall third-party libraries in the terminal, and the commands are very simple. The first is the installation operation, which simply requires replacing "your library name" in the command with the name of the third-party library to be installed. Use the pip list command to view a list of all installed third-party libraries. When uninstalling third-party libraries, again, you only need to replace "your library name" with the name of the library you want to uninstall. It is important to note that you can only uninstall third-party libraries that you have installed yourself. The development environments provided by PAI-DSW include Python 2, Python 3, PyTorch, and TensorFlow 2.0. Third-party libraries are installed into Python 3 by default. If you need to install them in another environment, you must manually switch the environment before installing them.
Now let's learn how to upload files in PAI-DSW. In the JupyterLab Notebook programming environment, you can upload local files by clicking the folder icon in the left auxiliary toolbar and then clicking the upload icon in the shortcut toolbar. Uploads also support resuming from a breakpoint: if the transfer is interrupted by a failure, it can continue from where it stopped.
Now let's introduce how to read and write OSS files in PAI-DSW. For file storage, in addition to uploading files directly from your local machine, connecting to OSS storage is also supported. PAI-DSW has the oss2 Python package pre-installed, so we can use its Python API directly to read and write OSS data. First, we need to authenticate and initialize using the access key ID and secret we obtained before, and we need to set up the OSS bucket that the files are stored in. We just need to replace the corresponding parts of the sample code with our access key ID and secret, OSS path, and bucket name. Next comes the reading and writing of OSS data.
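As a rough illustration of the authentication and read/write steps described here, the sketch below uses the oss2 package that the lecture says is pre-installed. It is not the course's own sample code: the helper names, the environment variable names, and any endpoint, bucket, or object key you pass in are placeholders chosen for illustration, and oss2 is imported lazily so the helpers can be read and exercised without credentials.

```python
import os


def make_bucket(endpoint, bucket_name):
    """Authenticate with an access key pair and return an oss2.Bucket.

    Assumption: credentials are read from the hypothetical environment
    variables OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET.
    """
    import oss2  # imported here so the helpers below do not require oss2

    auth = oss2.Auth(
        os.environ["OSS_ACCESS_KEY_ID"],
        os.environ["OSS_ACCESS_KEY_SECRET"],
    )
    return oss2.Bucket(auth, endpoint, bucket_name)


def read_all(bucket, key):
    """Read an entire object and return its bytes."""
    return bucket.get_object(key).read()


def read_range(bucket, key, start, end):
    """Read bytes start..end (both inclusive) of an object."""
    return bucket.get_object(key, byte_range=(start, end)).read()


def write(bucket, key, data):
    """Create or overwrite an object with the given content."""
    return bucket.put_object(key, data)


def append(bucket, key, data, position=0):
    """Append data to an appendable object.

    position must equal the current length of the object (0 for a new one).
    Returns the position to use for the next append.
    """
    result = bucket.append_object(key, position, data)
    return result.next_position
```

In a notebook cell this might be used as `bucket = make_bucket(endpoint, "my-bucket")`, then `write(bucket, "demo/hello.txt", b"hello")` and `read_range(bucket, "demo/hello.txt", 0, 1)`, where the endpoint is the OSS endpoint of your region and the bucket and key names are placeholders.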
The sample code shows how to read an entire file, how to read a certain range of data, how to write data to OSS, and how to append data. Again, we just need to replace the yellow parts of the code with our own file path and the content that needs to be appended. We have shown how to read and write files in an ordinary Python program. PAI-DSW also provides data read and write APIs for TensorFlow and PyTorch. In addition, training logs and models can also be stored in OSS. Please refer to the PAI-DSW help documentation for details.
Besides reading and writing OSS files, PAI-DSW supports reading and writing MaxCompute table data. We can use PyODPS to communicate with data in MaxCompute or PAI-Studio. ODPS is a data processing service independently developed by Alibaba Cloud, mainly used for storage and computation of batch structured data. PyODPS is a Python SDK provided by Alibaba Cloud that offers basic operations on ODPS objects. First, let's install PyODPS using the command in step one. Next, let's run the code from step two to read data from MaxCompute. The yellow parts should be modified to match your own setup, including the access key ID and secret, the MaxCompute project name, the table name, and the endpoint of your region.
Finally, we will learn how to export the Notebook file. When developing in Notebook, you might want to export the Notebook file to view locally or share. PAI-DSW supports exporting Notebook files in various formats, such as HTML, LaTeX, PDF, and Markdown. Exporting is very simple: select Export Notebook As from the File menu and choose the export format from the list that follows.
That concludes the introduction to the basic operations of PAI-DSW.
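The MaxCompute read described in step two above can be sketched with PyODPS roughly as follows. This is a minimal sketch rather than the course's own code: the function names, the environment variable names, and the ten-row default are assumptions made for illustration, and the project, table, and endpoint values must come from your own account.

```python
import os


def connect(project, endpoint):
    """Create a PyODPS entry object for one MaxCompute project.

    Assumption: the access key pair is stored in the hypothetical environment
    variables ODPS_ACCESS_KEY_ID and ODPS_ACCESS_KEY_SECRET.
    """
    from odps import ODPS  # provided by the pyodps package

    return ODPS(
        os.environ["ODPS_ACCESS_KEY_ID"],
        os.environ["ODPS_ACCESS_KEY_SECRET"],
        project,
        endpoint=endpoint,
    )


def head(o, table_name, limit=10):
    """Return the first `limit` records of a table as lists of column values."""
    table = o.get_table(table_name)
    with table.open_reader() as reader:
        # Slicing the reader fetches only the requested range of records.
        return [list(record.values) for record in reader[0:limit]]
```

For example, `o = connect("my_project", endpoint)` followed by `head(o, "my_table")` would return up to ten rows, where the project and table names are placeholders and the endpoint is the MaxCompute endpoint of your region.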

About the Author

Alibaba Cloud, founded in 2009, is a global leader in cloud computing and artificial intelligence, providing services to thousands of enterprises, developers, and government organizations in more than 200 countries and regions. Committed to the success of its customers, Alibaba Cloud provides reliable and secure cloud computing and data processing capabilities as a part of its online solutions.