Author: 47Line Technology
June 18, 2014

Cloud computing is revolutionizing every aspect of the IT landscape, and data processing is no different. Whether it is image processing, video encoding, or other batch processing jobs (periodic report generation, healthcare data processing, etc.), Amazon Web Services provides the ability to tap into a scalable, on-demand cloud infrastructure.


[Andrea] I am Andrea, Head of Content Strategy at cloudacademy.com. I am so pleased to be your host here today, together with Paddy, who is Co-founder and Head of Strategy at 47Line Technology.

Welcome, Paddy. - [Paddy] Thanks, Andrea. - With me there is also Jithendra, Senior Product Head at 47Line Technology. Welcome to you too, Jithendra.

Okay. Today our friends will be the main panelists in this webinar, where we will see how to process large data sets on AWS using their software, Batchly.

Batchly is a great piece of software. It has been crafted by the excellent team at 47Line Technology, and it is great to have it on your side while performing complex computing tasks like working with large datasets. It has a lot of features to schedule, manage, and run large processing jobs on AWS, especially because it allows you to abstract away all the complexities of using AWS itself. Large data processing is not an easy task, and it needs excellent efficiency and performance to be done in an effective way. So you really need great tools like Batchly in your toolkit.

During this webinar we will see how Batchly is fit for this job and why you should take advantage of it if you are about to do some serious work, and we will see it in action. Anyway, I don't want to steal a minute of time from Paddy and Jithendra. So without further ado I'll give the floor to our friends from 47Line to let them go deeper into the details and tell us everything about Batchly. If there are questions, feel free to write to me through the chat. We will have a question-and-answer session towards the end. So Paddy, Jithendra, it's over to you guys.

[Paddy] Thanks, Andrea. It is a pleasure to be here, and thanks for hosting us at this webinar. - Yeah. It's a pleasure to have you here.

So we are here to talk about processing large data sets on AWS using, of course, our product Batchly. How we are actually tapping into AWS is what we're going to talk about in this webinar. It's going to be a short webinar, and we only ask for about 30 minutes of the audience's attention. Presenting will be Paddy, that is me, and Jithendra from 47Line. I'll talk a little bit about large data sets, the current model and why it does not work, and how AWS is a great platform, and then Jittu will talk about Batchly. What you see on this screen are some numbers: the size of the digital universe as of last year is about 4.4 zettabytes.

That is 4.4 trillion GB of data generated last year. One third of this data is generated by enterprises, and it is expected to grow exponentially over the next few years, 10 times actually, to 44 zettabytes by 2020. So that's a lot of data being generated: log data, financial data, medical data, images, videos, and a whole bunch of other data that we are creating day in and day out. Enterprises have to process this data for their business needs, from both operational and strategic perspectives.

Everything comes in large amounts. How do enterprises handle this large data? How do they store it? How do they retrieve it? How do they process it, and how do they handle failure or build in fault tolerance? These are significant questions that enterprises face when it comes to processing large data, and as you can see on this slide, you can't use traditional systems to process large data.

You will have to think differently. You will have to bring in scale and parallelism to handle this vast amount of data efficiently and to gain strategic benefit from both cost and time perspectives. That is where the cloud is a great platform.

Your traditional model is not going to work for processing large data; the cloud is a game changer, and the cloud is the way to process large data sets. We will stick to Amazon Web Services here specifically when talking about the cloud. You can see the attributes of a public cloud service like AWS: service-oriented, scalable, shared, metered by use, outcome-focused, and delivered over the Internet. Basically there is no CapEx, no upfront investment, unless you go for a reserved instance, which is a different ballgame. So there is no CapEx when it comes to AWS, ongoing costs are low, and there are a lot of benefits in terms of the flexibility and agility it provides.

The icing on the cake with the Amazon Web Services platform when it comes to processing large data sets is the spot instance. Spot instances are unused capacity made available on the spot market within AWS. These are instances available at deep discounts, up to 10 times cheaper than a regular On-Demand instance, and they are a great way to process your data. But there are trade-offs with spot instances. If somebody outbids your spot price, your instance is going to go away. Also, spot instances are not always available.

You might encounter a situation where there is no spare capacity available in a particular region, so no spot instance is available at a given time. You will have to handle such failures or unavailability issues within your application when accessing spot instances. Just to set the context: we are talking about large data that is being generated in different formats and forms, it is very difficult to process this data in the traditional model, and the cloud is a great way to process it. But the cloud, Amazon Web Services, is quite complex for an enterprise to come in and adopt easily.
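To make that trade-off concrete, here is a minimal sketch, in Python with boto3, of one way an application might request a spot instance and fall back to On-Demand when the spot market has no capacity. The AMI ID, instance type, and bid price are placeholder values, and this is not Batchly's actual implementation, just an illustration of the failure handling described above.

```python
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def launch_worker(ami="ami-12345678", itype="m3.large", max_bid="0.02"):
    """Try a spot request first; fall back to On-Demand if it isn't filled."""
    resp = ec2.request_spot_instances(
        SpotPrice=max_bid,
        InstanceCount=1,
        LaunchSpecification={"ImageId": ami, "InstanceType": itype},
    )
    req_id = resp["SpotInstanceRequests"][0]["SpotInstanceRequestId"]

    time.sleep(60)  # give the spot market a moment to fill the request
    req = ec2.describe_spot_instance_requests(
        SpotInstanceRequestIds=[req_id])["SpotInstanceRequests"][0]
    if req["State"] == "active":
        return req["InstanceId"]  # spot capacity was available

    # No spare capacity (or bid too low): cancel the request and pay the
    # On-Demand rate so the batch still makes progress.
    ec2.cancel_spot_instance_requests(SpotInstanceRequestIds=[req_id])
    od = ec2.run_instances(ImageId=ami, InstanceType=itype,
                           MinCount=1, MaxCount=1)
    return od["Instances"][0]["InstanceId"]
```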

As you can see on this screen, this is just a snapshot of the different instance types that are available on AWS.

The last time I checked there were about 34 instance types across six different instance families in AWS, and that is not even considering EBS-Optimized variants, the different availability zones, and the different regions. To all this complexity you add spot instances, in terms of bidding strategy and in terms of handling failures.

That is a lot of complexity being brought in. Though AWS is a great platform for handling large data sets, there are a lot of complexities for an enterprise to adopt or tap into AWS, and that is where Batchly comes in.

Batchly is essentially a batch processing solution in the cloud. It offers a layer of abstraction around AWS and lets you tap into the power of AWS without really getting into the details of managing infrastructure, handling spot instances, and handling failures. With that introduction, I will hand it over to Jithendra to talk more about the product Batchly and also show a video of how it works in specific use cases. Over to you, Jithendra.

[Jithendra] Thank you, Paddy. As my colleague Paddy mentioned, we love AWS. We have been working with AWS for the last four years, and it has been phenomenal for us in terms of what capabilities are available in AWS and how they can be utilized. But what we have noted of late is that people are having difficulties getting on board with AWS, and that could simply be because of the success of AWS. AWS wanted to offer a lot of services to a very diverse set of users. What we have today is a lot of features available, and people getting confused over what the best approach is to take.

What we have done is take all the good things out of AWS and create a platform that makes it easier for anyone to process large amounts of data. Let me tell you exactly what Batchly is. Batchly is a cloud-based parallel processing system.

It's essentially a container into which you put your code. It adapts itself to the infrastructure available on AWS, identifies the right way to process, and offers a simple endpoint for specifying how the processing should be done.

What we have noted is that there are a lot of features available on AWS, but there are no real SLA terms that make it human-friendly. So we have brought in a concept within Batchly that makes it easy for you to set up an SLA. Say you have a large amount of data and you want to process it. There are only two factors that we consider: how quickly do you want it processed, and at what cost do you want it processed? Let's take a simple use case: image processing.

You have an S3 bucket in which you have kept, say, a million images, and you can come over and specify that you want all of those images processed within an hour.

Batchly takes care of abstracting all the complexities involved, and it ensures that the right size of instance and the right number of instances are chosen so that your SLA is met. That is one way for Batchly to process. There is another SLA that we bring in, based on cost. You can come over and say: I have a million images, I don't care how long it takes, but I want them processed within $20 or $25. You set a budget for your processing.
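To make the time-based SLA concrete, here is the back-of-the-envelope arithmetic a scheduler has to do to pick an instance count for a deadline. The per-image processing time and workers-per-instance figures below are invented for illustration; they are not Batchly's actual parameters.

```python
import math

def instances_for_deadline(num_items, secs_per_item, deadline_secs,
                           workers_per_instance=4):
    """Smallest instance count that finishes num_items before the deadline."""
    total_work_secs = num_items * secs_per_item
    capacity_per_instance = deadline_secs * workers_per_instance
    return math.ceil(total_work_secs / capacity_per_instance)

# One million images at ~0.5 s each, a one-hour SLA, 4 workers per instance:
# 1,000,000 * 0.5 / (3600 * 4) = 34.7, so 35 instances are needed.
print(instances_for_deadline(1_000_000, 0.5, 3600))  # -> 35
```

A cost-bound SLA inverts the same calculation: given a dollar budget and per-hour instance prices, you solve for how many instance-hours you can afford and let the deadline float.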

So what Batchly does is go into the spot market. The spot market is essentially the additional capacity that AWS always maintains to ensure that whenever you request an On-Demand instance, that instance can be provisioned. On the side, AWS makes these instances available for everyone to bid on and take. They come at very steep discounts: you get somewhere between a 70 and 90 percent discount in some cases. The only catch is that if there is a surge of requests for On-Demand instances, the one you have bid on and taken could be taken away from you.

This plays along nicely with what Batchly does. It's essentially a container that can process things in parallel, and with its fault tolerance and error tracking features, we can reprocess items if they fail, or if an instance is taken away midway we can replay the same logic and ensure that things get done. This abstraction helps us lower the cost of processing: if we get enough spot instances, we promise we can give you somewhere between a 60 and 80 percent cost reduction.
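The webinar doesn't say how Batchly implements this replay internally. One common way to get the same behavior on AWS is a work queue whose messages only disappear after a worker confirms success; if a spot instance is reclaimed mid-item, the message's visibility timeout expires and another worker picks it up. A minimal sketch with boto3 and a placeholder queue URL:

```python
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-items"  # placeholder

def worker_loop(process_one):
    """Pull work items forever; delete each message only after success."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   MaxNumberOfMessages=10,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            process_one(msg["Body"])  # must be idempotent: it may be replayed
            # Deleting only after success means a killed instance's in-flight
            # items reappear on the queue and get reprocessed elsewhere.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```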

Batchly's architecture is quite simple. It is somewhat like an ETL model or an EMR model, but there is a central controller that acts as a broker, taking care of processing everything and ensuring that the data is massaged and offered to agents in a form acceptable to them. Agents are essentially copies of your code running in parallel across multiple instances, and we use the entire feature set of AWS to ensure that there is tracking, there is logging, and all the good things that come with the AWS platform are made available.

The entire application is accessible and visible through our portal, where you can come over, configure, and monitor everything. You don't really have to launch any instance yourself or log in to see what is happening. All the data is readily available through the portal. You can monitor, take reports, run, stop; you can do everything without the complication of actually putting your hands on any kind of infrastructure. So what Batchly does for you is respect your SLA, irrespective of the size of your operation.

Today you could have a million images that you want processed within an hour, and tomorrow you could have a billion images and still say that you want them processed within an hour. Irrespective of the size of the data, whatever your SLA is, Batchly will respect it and ensure that your work is done within your [inaudible 0:15:19] time and cost budget. Depending on the need, we scale out or scale in, essentially keeping the elasticity of AWS within your application and ensuring that things are done at the pace required. In addition, we monitor your code execution: you write your code within our container, and we take care of retrying, tracking all the errors, logging them, and surfacing them through our portal, so that in case something goes wrong, you get to know what really happened and can take corrective action. We take care of automating all of the infrastructure. We ensure that if an instance goes down, a new one comes up, that an adequate pool of instances is always maintained so the work that needs to be done is done on time, and that everything is cleaned up afterwards to keep the overall cost low.

In addition, we save cost every time. Whether you select a time-based or a cost-based budget, we always ensure that the instances you have taken up are of the right size and utilized to the fullest, and we always bid for spot instances to ensure there are cost savings available from using Batchly. A simple example of a time-bound SLA: when you say you want your work done in 10 hours, we ensure that only two instances are running, and we monitor to ensure that within 10 hours your work is done. If you then want to speed things up and do it in two hours, we ensure that sufficient additional instances are added to the pool so the work is done in two hours.

So we always keep track of what is happening, we always monitor your instances, and whenever something goes wrong we take corrective action immediately. On a cost-bound model, you can set a budget of $500 and we may keep three On-Demand instances and one spot instance running to ensure that the work is done within $500. If you come over and reduce it to, say, $250, we will go ahead, kill the On-Demand instances, launch a lot more spot instances at an adequate price, and monitor the price to ensure that the budget is always adhered to and your work still gets done. In these ways, your SLA is always respected, and we always give you feedback on what is happening and at what pace your work is being done. All that information is readily available on the portal.

So how does one get started with Batchly? As I mentioned, Batchly is essentially a container. We have a single interface against which you can write code, and this code does a single unit of work. If you're processing, say, a million images, we don't want you to write code for a million images; you just define how a single image gets processed, we take that, package it into our container, launch it, and ensure that all of your images are processed (a minimal sketch of what such an interface might look like follows below). The rest of the setup is simply on the portal. You come over, you configure your location and where you have kept your files or data, you define your SLA, and you can monitor online; when your work is done we notify you, and you can always come over and look around.

There are a lot of interesting use cases when it comes to using spot instances. In analytics, both in the financial world and in the technical world, there are a lot of log files generated and a lot of good data you can pick out of them, and many customers who don't want to spend too much money prefer to use spot instances for that kind of analytics, where the cost is low but the output is pretty high. There is a lot of image and video encoding kind of work, where you are allowed to skip a file or reprocess a file, because it's a lot cheaper to do it that way than to invest money in infrastructure and keep it running at all times. A lot of data is generated for geospatial analysis or scientific computing; these are good use cases to try on Batchly. Again, the internet is a growing world and there is a lot of information available on it, so web crawling is another area where you dig in and find out what is available; a lot of interesting data comes up whenever you do this kind of crawling. These are all ideal candidates for Batchly, and we have a small demo. It showcases how things run and what is out there. It's a small video, and I'll skip over certain portions to make sure we get through the demo in the shortest time.
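As promised above, here is a sketch of what a single-unit-of-work interface might look like. The class and method names are invented for illustration; Batchly's real API is not shown in the webinar.

```python
from abc import ABC, abstractmethod

class UnitOfWork(ABC):
    """Hypothetical contract: the platform calls process() once per item."""

    @abstractmethod
    def process(self, item_key: str) -> None:
        """Handle exactly one input item; fan-out is the platform's job."""

class ThumbnailJob(UnitOfWork):
    def process(self, item_key: str) -> None:
        # e.g. fetch the object at item_key from S3, resize it,
        # and write the thumbnail back to an output bucket.
        print(f"resizing {item_key}")

# The container would instantiate ThumbnailJob and call process() for each
# object key in the input bucket, in parallel across however many
# instances the SLA demands.
```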

This is a small use case that we have taken: about 5.5 million images, which are available in one of our S3 buckets.

You come over and configure; essentially you set up a template, what we call a job. You specify what your input source is and how you want it processed. Every time you run it you can keep changing things, for example: I want my processing to be done in US East, because that is where my data is. You can set up your SLA and keep changing it; the first time you could have, say, about 5.5 million images, the second time you could have 10 million images, and you may want to play around with the SLA because you have time constraints. Batchly takes care of all of the AWS setup: it goes in, creates your security group, and ensures that the entire processing is done within your account, in a completely locked-down environment. All of this happens within your account, so even though Batchly runs as a service, it remains independent of your code and your data.

All of your data is within your account, and all of the instances run within your account. You could even have use cases where you have a database within your own data center.

You could have a VPC connection back into AWS, and your instances could be running within the VPC on AWS while your data still resides in your own database. These are the kinds of use cases where you can use the cloud in a very innovative model. Depending on the data to be processed, Batchly automatically increases the number of instances as required. It bids on your behalf: as you can see, it keeps looking through the pricing history to understand the current price and the historic prices, and at what price you should bid to get an instance cheaply (a sketch of that kind of lookup follows below). At the end of the job you get a summary with data on the total cost of processing, the amount of savings you got, how long it took, and at what rate it processed. All this information is available, and with it you can always keep track of what is happening within your instances and know exactly what is going on out there.
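The pricing-history lookup the demo describes can be sketched with boto3's spot price history call; the instance type and the 10 percent margin below are arbitrary choices for illustration, not Batchly's actual bidding strategy.

```python
from datetime import datetime, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Look at the last 24 hours of spot prices for one instance type and pick
# a bid slightly above the recent maximum, so the request is likely to
# survive normal price fluctuations.
history = ec2.describe_spot_price_history(
    InstanceTypes=["m3.large"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(days=1),
)

prices = [float(p["SpotPrice"]) for p in history["SpotPriceHistory"]]
if prices:
    bid = max(prices) * 1.10  # 10% headroom over the recent peak
    print(f"recent max ${max(prices):.4f}/hr, suggested bid ${bid:.4f}/hr")
```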

With this, I am going to hand it over to Paddy to take you through what we do. - [Paddy] Thanks, Jithendra. I am just going to wrap up by talking a little bit about 47Line.

47Line has been a team working in the cloud computing space for the last six years. It is one of the earliest adopters of Amazon Web Services globally. As a team we have more than 600 months of combined cloud experience and proven thought leadership in cloud-based architecture and design.

Batchly is one of the solutions that we've built. We have one more solution, called Priority Engine; I hope we get an opportunity to talk about that in due course, maybe through Cloud Academy. So that is what we had. I hope this webinar gave you a perspective on processing large data sets using AWS and how Batchly abstracts the complexity of using AWS while still tapping its benefits. That's all we had. We are happy to take some questions.

That's great. This has been great for me as well. Thank you, Paddy, and thank you, Jithendra. I especially liked the use cases part. I found it very interesting, because talking about large data set processing makes you think it is some exotic thing that never happens in your own work, but we've seen things like video encoding and analytics, which are quite common in many companies. So it is not such a specialist thing as one might expect. I also very much appreciated the quotation from Grace Hopper; I think it captured the bare essence of cloud computing long before the cloud [inaudible 0:24:42] Anyway, I see a couple of questions before we run out of time. Just choose which one of you wants to answer. The first question is: does Batchly work within the customer's existing AWS account? - [Jithendra] It works only within the customer's AWS account. So even though Batchly [inaudible 0:25:14] as part of onboarding.

Either you give us your credentials so that we can create a restricted user within your account and make AWS calls on your behalf, or we share the exact things we require access to, so that you create a restricted user and share those credentials with us. What Batchly does is store some data around the job template and the like, but all of the information that needs to be maintained stays within your AWS account. Your data resides within your AWS account, your processing happens within your AWS account, and most of the metrics we capture are also within your AWS account. In addition, if there are any failures, the logs and all that information are also available there. So Batchly as a system does not maintain any customer data. All it maintains is your user ID and access keys. Apart from that, we don't maintain anything.
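The restricted-user option can be sketched with boto3's IAM calls. The user name and the policy below are hypothetical; the webinar does not spell out which permissions Batchly actually asks for, so this just illustrates the shape of a locked-down service user with EC2 rights and read access to a single bucket.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical least-privilege policy: EC2 control plus read-only access
# to one S3 bucket holding the input data.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["ec2:*"], "Resource": "*"},
        {"Effect": "Allow",
         "Action": ["s3:GetObject", "s3:ListBucket"],
         "Resource": ["arn:aws:s3:::my-image-bucket",
                      "arn:aws:s3:::my-image-bucket/*"]},
    ],
}

iam.create_user(UserName="batchly-service")
iam.put_user_policy(UserName="batchly-service",
                    PolicyName="batchly-restricted",
                    PolicyDocument=json.dumps(policy))

# The access key pair is what you would hand to the service.
key = iam.create_access_key(UserName="batchly-service")["AccessKey"]
print(key["AccessKeyId"])
```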

That's great. It looks like we have time for one more question. The question is: you mentioned huge cost savings with spot instances. How does Amazon manage to do that? - [Paddy] Jittu, I will take it. This is determined by Amazon Web Services, but as I said initially when I spoke about spot instances, this is excess capacity, spare capacity, that Amazon has and offers via the spot market. It is not always going to be available: only when demand for On-Demand instances is low are those instances moved to the spot market, and in situations where users require a lot of On-Demand instances, spot instance availability is reduced. So this is just excess capacity, and Amazon has found a brilliant way to offer value to users at a reduced price, so it's really well suited in terms of both dollars and operations.

Okay. That's great. - [Jithendra] Actually, I'd like to add to what Paddy mentioned. Amazon is not losing money by giving capacity out on the spot market; Amazon is actually making money. If you notice, cloud is all about capacity being made available at all times for people to take. Amazon has no choice but to maintain excess capacity so that they can make that promise and ensure it is always kept. So instead of just leaving that capacity idle, I think it was pretty smart of Jeff Bezos to say: why not offer it to users and take it back when needed, because they know that at some point it's going to be taken back and they're going to [inaudible 0:28:18]. So it's a way for Amazon to make money from excess capacity, compared to all the other cloud vendors who just have it but make nothing out of it.

All right. We've definitely run out of time. It has been a great half hour with you, and a very interesting webinar. I am amazed by the things you are doing at 47Line. Keep up the good work, and I am looking forward to our next webinar. Jithendra and Paddy, thank you very much. Thank you to everyone who tuned in for this session.

Great. Bye-bye. Bye everybody.