Skip to main content

BigML: Machine Learning made easy


BigML offers a managed platform to build and share your datasets and models

Machine Learning as a Service (MLaaS) has become a real thing in the Cloud market and BigML‘s mission is simple and clear: making Machine Learning easy, beautiful and understandable for everybody.
I would say that BigML offers something much closer to Software as a Service (SaaS) than its IaaS and PaaS competitors. I recently played with AmazonML, AzureML and Google Prediction API, all of which are part of rich ecosystems of web services, from Cloud storage and CDNs, to VPCs, deployment automation and much more.
On the other hand BigML, while somehow remaining platform-agnostic, has successfully managed to exploit existing Cloud solutions to its advantage. Consider, for example, how it allows data imports from AWS S3, MS Azure, Google Storage, Google Drive, Dropbox, etc. This detail seems trivial at first, but might be a game changer in the long-term: once public Cloud infrastructures become a commodity, a cross-provider solution will be the best option.

BigML Features

Being focused “only” on Machine Learning, BigML offers a wider set of features, all well integrated within a usable Web UI. As you would expect, you can load your dataset, train and evaluate your models, and generate new predictions (either one by one or in a batch).
Here is a list of additional useful features I haven’t see elsewhere:

  • Plenty of ways to load your raw data, including most Cloud storage systems, public URLs or your own CSV/ARFF files.
  • A vast gallery of free datasets and models to play with, well organized into categories and publicly accessible.
  • Clustering algorithms and visualization: data analysis and visualization tools are essential to come up with a high-quality model.
  • Anomaly detection: dealing with outliers can be a pain, and detecting pattern anomalies can save you time and money, even before hitting your model.
  • Flexible pricing: you can choose between subscription plans (starting from $15/mo for students), pay as you go with BigML credits, or even buy your personal VPC.

How to import your data

Depending on your use scenario, you may want to import your data from an existing Cloud storage system, provide a public URL, or directly upload a CSV file. In Development mode you can even create an inline source on the fly.
BigML - Import your data
Even at this step, BigML offers a nice set of features:

  • CSV parsing configuration.
  • Fields type selection.
  • Strings locale selection (English, Dutch, Spanish, or Portuguese).
  • Headers parsing (CSV with or without a header row).
  • Date-time fields expansion.
  • Text analysis (language detection, tokenization, stop words, stemming).

Interestingly, you can update your Source configuration at any time, without any additional upload.
As soon as your Source is ready and correctly parsed, you can use it to generate a new Dataset. Alternatively, you can import ready-to-use data from their public datasets gallery.

Datasets are fully reusable, expandable and exportable

BigML Datasets are very easy to reuse, edit, expand and export. Indeed, you can easily rename and add descriptions to each one of your fields, add new ones (through normalization, discretization, mathematical operations, missing values replacement, etc), and generate subsets based on sampling or custom filters.
Furthermore – even before training your model – you are given values distribution and statistics for each field, and also a great Dynamic Scatterplot tool to visualize your data, two dimensions at a time. Here you may want to explore your features space, look for patterns, export chars or simply have fun.
BigML - Dataset Analysis
In practice, Datasets are your starting point for most operations. Let’s assume that our goal is training and evaluating a classification model. We will first need to split our dataset into smaller training and test sets: you can achieve this with the Training and Test set Split operation. Of course you are free to select how to allocate your records: 80/20 is the default split logic. This process will actually create two new independent datasets, which you can analyze and manipulate as you want.
As soon as the split is completed, you’ll want to select your new training set and launch the Configure Model operation.

BigML decision trees

A Machine Learning model could potentially be anything that’s able to analyze your raw data (eventually labelled) and that can somehow learn how to deal with new, unseen data.
Decision Trees are probably the most intuitive type of model you can build. They are easy to visualize and understand, and easier to store, since they can be almost directly converted into procedural code. Even if you are not a programmer, you can think of it as a nested structure of binary decisions.
That’s exactly what every BigML model is. When you train a model – starting from a training Dataset – you are asked to select your objective field (i.e. your target column). In order to reduce the effect of data overfitting and the size of your model, you will normally need some sort of statistical pruning, although you can decide to disable it.
Optionally, you can configure a few more options.

  • Missing split: whether or not to include missing values when choosing a split (disabled by default).
  • Node threshold: maximum number of nodes (512 by default).
  • Weights: you can choose to weight your records specifying one weighting field, or assign relative weights to your classes.
  • Sampling and Ordering: you can choose a custom sub-sample and shuffling logic.

BigML - Model Summary Report
At the end of the training process you will be able to visualize your model and obtain an informative report to better understand which of your fields are more relevant (see the screenshot above). Moreover, your model is graphically represented as an actual tree (below on the left) or as a sunburst (below on the right).
BigML - Model visualization
At this point you can already start generating new predictions, but of course we want to evaluate our model’s accuracy first.

Ensembles can improve your prediction accuracy

Ensembles, involving multiple alternative models which will provide better predictive performance, are a well documented way to improve your single-model systems accuracy. Each model can be trained using a subset of your data, or focused on specific classes, so that they will collaborate in generating a better prediction.
In BigML you can easily train Decision Forests with the Configure Ensemble dataset operation: you are simply asked how many models will be trained.
This approach drastically corrects the decision trees’ habit of overfitting your training data – and therefore improves your overall accuracy. In my case, I managed to improve my accuracy of 3% using an ensemble of 10 models, which might make sense if you can afford the additional time.
I also generated an ensemble of 100 models but, even though it increased the accuracy of an additional 1%,  it was clearly not a great idea, both in terms of cost and speed.
BigML - Ensemble

How to evaluate your results

Being able to quickly evaluate your models and compare multiple evaluations are critical features for a Machine Learning as a Service product, and I personally believe BigML has done a great job.
Typically, you want to test your models against a smaller part of your dataset. We previously created a 20% test set and I used it to generate an evaluation for both my model and my ensemble. You can either launch the Evaluate operation on your model, or the Evaluate a Model operation on your dataset. Not much configuration is needed, unless you have specific sampling or ordering needs.
Each Evaluation is an object itself and will be listed in your Evaluations list. Of course, based on your model type (regression or classification) you’ll be shown different kinds of metrics. If your model is a classifier you’ll be shown a Confusion Matrix, including statistics like Accuracy, Precision, Recall, F-Measure and Phi.
In case your Confusion Matrix is too big to be rendered on a web page (let’s say you have 6 possible classes as I did), you will be able to download it in Excel format. It will look similar to the figure below, where the top legend shows you what each single cell means with respect to the first element of the main diagonal (you can do the same for each cell on the main diagonal).
BigML - Confusion Matrix
But how about my ensemble? Does it perform better?
Apparently, it does. It achieved +2.82% of overall accuracy, and as high as 5.34% for some classes. You can compare two evaluations with the Compare Evaluation operation and be shown the same statistics of a single evaluation, plus a relative percentage for each metric.
BigML - Evaluations ComparisonMy model alone was pretty effective and I probably wouldn’t choose to pay the additional cost of an ensemble – both in terms of price and speed – although in many cases overfitting will kill your predictive power and ensembles can drastically boost your accuracy.

Generating new Predictions with API bindings and BigMLer

I would say that BigML is both user-friendly and developer-friendly. They took the time to code plenty of API clients, and even a command-line tool called BigMLer.
Of course you can perform every operation mentioned above via API, but I believe that offline phases are better handled with a clear and solid UI, especially during the model and dataset definition.
I chose the Python binding and coded a simple script to generate new predictions.

from bigml.api import BigML
from bigml.model import Model
from bigml.ensemble import Ensemble
USE_ENSEMBLE = False
labels = {
    '1': 'walking', '2': 'walking upstairs',
    '3': 'walking downstairs', '4': 'sitting',
    '5': 'standing', '6': 'laying'
}
def main():
    api = BigML("alex-1", "YOUR_API_KEY", storage="./cache")
    if USE_ENSEMBLE:
        predictor = Ensemble('ensemble/5557c358200d5a7b4300001e', api=api)
    else:
        predictor = Model('model/5557ac99200d5a7b42000001', api=api)
    #generate new prediction
    #Note: params might differ btw Ensemble and Model, this is an example
    prediction = predictor.predict(get_input_data(), with_confidence=True)
    label = prediction[0]
    confidence = prediction[1]
    print("You are currently %s (class %s, %s%%)." % (labels[label], label, confidence) )
def get_input_data():
    """ Retrieve input data from local CSV file """
    with open('record.csv') as f:
        record_str = f.readline()
    #generate 'fieldN' dict (I had 562 un-named columns!)
    record = {}
    for i,val in enumerate(record_str.split(',')):
        record['field%s' % (i+2)] = val
    return record
if __name__ == '__main__':
    main()

As far as performance is concerned, calls to my model took between 1.5 and 2 seconds, until I enabled the local storage option: this will store all your models parameters locally and avoid blocking API calls for every future prediction. With a warmed up local cache, my script execution time dropped down to 150ms. My 10 models ensemble took about 20 seconds to load at first, while every next call took only 1 second at most: it’s actually about 10 times slower than a single model – as 10 predictions are performed – but I think 1% of additional prediction confidence is worth the time. Please also keep in mind that my model worked on more than 560 input features and 6 possible output classes, therefore I’m sure that an average model would run much faster than mine!
Alternatively, you can convert your model to procedural code in fifteen different languages/formats by clicking on Download Actionable Model. I gave the Python version a try and it really takes just a few milliseconds to execute locally: this might be a good solution in case you don’t want to install new libraries (for example, I can think of embedded devices or network isolated clients).
I am definitely satisfied about the service and – as a developer – I greatly appreciate the effort in supporting so many programming languages and platforms, making everyone’s work simpler.

Written by

Alex is a Software Engineer with a great passion for music and web technologies. He's experienced in web development and software design, with a particular focus on frontend and UX.

Related Posts

— November 21, 2018

Google Cloud Certification: Preparation and Prerequisites

Google Cloud Platform (GCP) has evolved from being a niche player to a serious competitor to Amazon Web Services and Microsoft Azure. In 2018, research firm Gartner placed Google in the Leaders quadrant in its Magic Quadrant for Cloud Infrastructure as a Service for the first time. In t...

Read more
  • AWS
  • Azure
  • Google Cloud
— October 30, 2018

Azure Stack Use Cases and Applications

This is the second of a two-part series covering Azure Stack. Our first post provided an introduction to Azure Stack. Why would your organization consider using Azure Stack? What are the key differences between Azure Stack and Microsoft Azure? In this post, we'll begin to answer bot...

Read more
  • Azure
  • Hybrid Cloud
  • Virtualization
— October 3, 2018

Highlights from Microsoft Ignite 2018

Microsoft Ignite 2018 was a big success. Over 26,000 people attended Microsoft’s flagship conference for IT professionals in sunny Orlando, Florida. As usual, Microsoft made a huge number of announcements, ranging from minor to major in importance. To save you the trouble of sifting thr...

Read more
  • Azure
  • Ignite
— September 20, 2018

Planning for Microsoft Ignite 2018 Sessions: What Not to Miss

Cloud Academy is proud to be a sponsor of the Microsoft Ignite Conference to be held September 24 - 28 in Orlando, Florida. This is Microsoft’s biggest event of the year and is a great way to stay up to date on how to get the most from Microsoft’s products. In this post, I’ll help you p...

Read more
  • Azure
— September 18, 2018

How to Optimize Cloud Costs with Spot Instances: New on Cloud Academy

One of the main promises of cloud computing is access to nearly endless capacity. However, it doesn’t come cheap. With the introduction of Spot Instances for Amazon Web Services’ Elastic Compute Cloud (AWS EC2) in 2009, spot instances have been a way for major cloud providers to sell sp...

Read more
  • AWS
  • Azure
  • Google Cloud
— August 23, 2018

What are the Benefits of Machine Learning in the Cloud?

A Comparison of Machine Learning Services on AWS, Azure, and Google CloudArtificial intelligence and machine learning are steadily making their way into enterprise applications in areas such as customer support, fraud detection, and business intelligence. There is every reason to beli...

Read more
  • AWS
  • Azure
  • Google Cloud
  • Machine Learning
— July 5, 2018

How Does Azure Encrypt Data?

In on-premises environments, data security is typically a siloed activity, with a company's security team telling the internal technology groups (server administration, database, networking, and so on) what needs to be protected against intrusion.This approach is absolutely a bad...

Read more
  • Azure
— June 26, 2018

Disadvantages of Cloud Computing

If you want to deliver digital services of any kind, you’ll need to compute resources including CPU, memory, storage, and network connectivity. Which resources you choose for your delivery, cloud-based or local, is up to you. But you’ll definitely want to do your homework first.Cloud ...

Read more
  • AWS
  • Azure
  • Cloud Computing
  • Google Cloud
Albert Qian
— June 19, 2018

Preparing for the Microsoft Azure 70-535 Exam

The credibility of Microsoft Azure continues to grow in the first quarter of 2018 with an increasing number of enterprises migrating their workloads, resulting in a jump for Azure from 10% to 13% in market share. Most organizations will find that simply “lifting and shifting” applicatio...

Read more
  • Azure
  • Compute
  • Database
  • Security
— April 12, 2018

Azure Migration Strategy: A Checklist to Get Started

By now, you’ve heard it many times and from many sources: cloud technology is the future of IT. If your organization isn’t already running critical workloads on a cloud platform (and, if your career isn’t cloud-focused), you’re running the very real risk of being overtaken by nimbler co...

Read more
  • Azure
— March 2, 2018

Three Must-Use Azure Security Services

Keeping your cloud environment safe continues to be the top priority for the enterprise, followed by spending, according to RightScale’s 2018 State of the Cloud report.The safety of your cloud environment—and the data and applications that your business runs on—depends on how well you...

Read more
  • Azure
  • Security
— February 15, 2018

Is Multi-Cloud a Solution for High Availability?

With the average cost of downtime estimated at $8,850 per minute, businesses can’t afford to risk system failure. Full access to services and data anytime, anywhere is one of the main benefits of cloud computing.By design, many of the core services with the public cloud and its underl...

Read more
  • AWS
  • Azure
  • Cloud Adoption
  • Google Cloud