Skip to main content

BigML: Machine Learning Made Easy

BigML offers a managed platform to build and share your datasets and models

Machine Learning as a Service (MLaaS) has become a real thing in the Cloud market and BigML‘s mission is simple and clear: making Machine Learning easy, beautiful and understandable for everybody.

I would say that BigML offers something much closer to Software as a Service (SaaS) than its IaaS and PaaS competitors. I recently played with AmazonML, AzureML and Google Prediction API, all of which are part of rich ecosystems of web services, from Cloud storage and CDNs, to VPCs, deployment automation and much more.
On the other hand, BigML, while somehow remaining platform-agnostic, has successfully managed to exploit existing Cloud solutions to its advantage. Consider, for example, how it allows data imports from AWS S3, MS Azure, Google Storage, Google Drive, Dropbox, etc. This detail seems trivial at first, but might be a game changer in the long-term: once public Cloud infrastructures become a commodity, a cross-provider solution will be the best option.

BigML Features

Being focused “only” on Machine Learning, BigML offers a wider set of features, all well integrated within a usable Web UI. As you would expect, you can load your dataset, train and evaluate your models, and generate new predictions (either one by one or in a batch).
Here is a list of additional useful features I haven’t see elsewhere:

  • Plenty of ways to load your raw data, including most Cloud storage systems, public URLs or your own CSV/ARFF files.
  • A vast gallery of free datasets and models to play with, well organized into categories and publicly accessible.
  • Clustering algorithms and visualization: data analysis and visualization tools are essential to come up with a high-quality model.
  • Anomaly detection: dealing with outliers can be a pain, and detecting pattern anomalies can save you time and money, even before hitting your model.
  • Flexible pricing: you can choose between subscription plans (starting from $15/mo for students), pay as you go with BigML credits, or even buy your personal VPC.

How to import your data

Depending on your use scenario, you may want to import your data from an existing Cloud storage system, provide a public URL, or directly upload a CSV file. In Development mode, you can even create an inline source on the fly.
BigML - Import your data
Even at this step, BigML offers a nice set of features:

  • CSV parsing configuration.
  • Fields type selection.
  • Strings locale selection (English, Dutch, Spanish, or Portuguese).
  • Headers parsing (CSV with or without a header row).
  • Date-time fields expansion.
  • Text analysis (language detection, tokenization, stop words, stemming).

Interestingly, you can update your Source configuration at any time, without any additional upload.

As soon as your Source is ready and correctly parsed, you can use it to generate a new Dataset. Alternatively, you can import ready-to-use data from their public datasets gallery.

Datasets are fully reusable, expandable and exportable

BigML Datasets are very easy to reuse, edit, expand and export. Indeed, you can easily rename and add descriptions to each one of your fields, add new ones (through normalization, discretization, mathematical operations, missing values replacement, etc), and generate subsets based on sampling or custom filters.

Furthermore – even before training your model – you are given values distribution and statistics for each field, and also a great Dynamic Scatterplot tool to visualize your data, two dimensions at a time. Here you may want to explore your features space, look for patterns, export chars or simply have fun.
HAR Dataset sample
In practice, Datasets are your starting point for most operations. Let’s assume that our goal is training and evaluating a classification model. We will first need to split our dataset into smaller training and test sets: you can achieve this with the Training and Test set Split operation. Of course, you are free to select how to allocate your records: 80/20 is the default split logic. This process will actually create two new independent datasets, which you can analyze and manipulate as you want.
As soon as the split is completed, you’ll want to select your new training set and launch the Configure Model operation.

BigML decision trees

A Machine Learning model could potentially be anything that’s able to analyze your raw data (eventually labeled) and that can somehow learn how to deal with new, unseen data.

Decision Trees are probably the most intuitive type of model you can build. They are easy to visualize and understand, and easier to store, since they can be almost directly converted into procedural code. Even if you are not a programmer, you can think of it as a nested structure of binary decisions.

That’s exactly what every BigML model is. When you train a model – starting from a training Dataset – you are asked to select your objective field (i.e. your target column). In order to reduce the effect of data overfitting and the size of your model, you will normally need some sort of statistical pruning, although you can decide to disable it.

Optionally, you can configure a few more options.

  • Missing split: whether or not to include missing values when choosing a split (disabled by default).
  • Node threshold: maximum number of nodes (512 by default).
  • Weights: you can choose to weight your records specifying one weighting field, or assign relative weights to your classes.
  • Sampling and Ordering: you can choose a custom sub-sample and shuffling logic.

BigML - Model Summary Report
At the end of the training process, you will be able to visualize your model and obtain an informative report to better understand which of your fields are more relevant (see the screenshot above). Moreover, your model is graphically represented as an actual tree (below on the left) or as a sunburst (below on the right).
BigML - Model visualization
At this point, you can already start generating new predictions, but of course, we want to evaluate our model’s accuracy first.

Ensembles can improve your prediction accuracy

Ensembles, involving multiple alternative models which will provide better predictive performance, are a well-documented way to improve the accuracy of your single-model system. Each model can be trained using a subset of your data, or focused on specific classes so that they will collaborate in generating a better prediction.

In BigML you can easily train Decision Forests with the Configure Ensemble dataset operation: you are simply asked how many models will be trained.

This approach drastically corrects the decision trees’ habit of overfitting your training data – and therefore improves your overall accuracy. In my case, I managed to improve my accuracy of 3% using an ensemble of 10 models, which might make sense if you can afford the additional time.

I also generated an ensemble of 100 models but, even though it increased the accuracy of an additional 1%,  it was clearly not a great idea, both in terms of cost and speed.
BigML - HAR Ensemble

How to evaluate your results

Being able to quickly evaluate your models and compare multiple evaluations are critical features for a Machine Learning as a Service product, and I personally believe BigML has done a great job.

Typically, you want to test your models against a smaller part of your dataset. We previously created a 20% test set and I used it to generate an evaluation for both my model and my ensemble. You can either launch the Evaluate operation on your model or the Evaluate a Model operation on your dataset. Not much configuration is needed unless you have specific sampling or ordering needs.

Each Evaluation is an object itself and will be listed in your Evaluations list. Of course, based on your model type (regression or classification) you’ll be shown different kinds of metrics. If your model is a classifier you’ll be shown a Confusion Matrix, including statistics like Accuracy, Precision, Recall, F-Measure and Phi.

In case your Confusion Matrix is too big to be rendered on a web page (let’s say you have 6 possible classes as I did), you will be able to download it in Excel format. It will look similar to the figure below, where the top legend shows you what every single cell means with respect to the first element of the main diagonal (you can do the same for each cell on the main diagonal).
BigML - Confusion Matrix
But how about my ensemble? Does it perform better?

Apparently, it does. It achieved +2.82% of overall accuracy and as high as 5.34% for some classes. You can compare two evaluations with the Compare Evaluation operation and be shown the same statistics of a single evaluation, plus a relative percentage for each metric.
BigML - Evaluations ComparisonMy model alone was pretty effective and I probably wouldn’t choose to pay the additional cost of an ensemble – both in terms of price and speed – although in many cases overfitting will kill your predictive power and ensembles can drastically boost your accuracy.

Generating new Predictions with API bindings and BigMLer

I would say that BigML is both user-friendly and developer-friendly. They took the time to code plenty of API clients, and even a command-line tool called BigMLer.

Of course, you can perform every operation mentioned above via API, but I believe that offline phases are better handled with a clear and solid UI, especially during the model and dataset definition. Choose the Python binding and coded a simple script to generate new predictions.

from bigml.api import BigML
from bigml.model import Model
from bigml.ensemble import Ensemble
labels = {
    '1': 'walking', '2': 'walking upstairs',
    '3': 'walking downstairs', '4': 'sitting',
    '5': 'standing', '6': 'laying'
def main():
    api = BigML("alex-1", "YOUR_API_KEY", storage="./cache")
        predictor = Ensemble('ensemble/5557c358200d5a7b4300001e', api=api)
        predictor = Model('model/5557ac99200d5a7b42000001', api=api)
    #generate new prediction
    #Note: params might differ btw Ensemble and Model, this is an example
    prediction = predictor.predict(get_input_data(), with_confidence=True)
    label = prediction[0]
    confidence = prediction[1]
    print("You are currently %s (class %s, %s%%)." % (labels[label], label, confidence) )
def get_input_data():
    """ Retrieve input data from local CSV file """
    with open('record.csv') as f:
        record_str = f.readline()
    #generate 'fieldN' dict (I had 562 un-named columns!)
    record = {}
    for i,val in enumerate(record_str.split(',')):
        record['field%s' % (i+2)] = val
    return record
if __name__ == '__main__':

As far as performance is concerned, calls to my model took between 1.5 and 2 seconds, until I enabled the local storage option: this will store all your models’ parameters locally and avoid blocking API calls for every future prediction. With a warmed up local cache, my script execution time dropped down to 150ms. My 10 models ensemble took about 20 seconds to load at first, while every next call took only 1 second at most: it’s actually about 10 times slower than a single model – as 10 predictions are performed – but I think 1% of additional prediction confidence is worth the time. Please also keep in mind that my model worked on more than 560 input features and 6 possible output classes, therefore I’m sure that an average model would run much faster than mine!

Alternatively, you can convert your model to procedural code in fifteen different languages/formats by clicking on Download Actionable Model. I gave the Python version a try and it really takes just a few milliseconds to execute locally: this might be a good solution in case you don’t want to install new libraries (for example, I can think of embedded devices or network isolated clients).

I am definitely satisfied with the service and – as a developer – I greatly appreciate the effort in supporting so many programming languages and platforms, making everyone’s work simpler.


Written by

Alex Casalboni

Alex is a Software Engineer with a great passion for music and web technologies. He's experienced in web development and software design, with a particular focus on frontend and UX.

Related Posts

Guy Hummel
— June 26, 2019

Running Apache Spark on Azure Databricks

In this article, we’ll cover how to set up an Azure Databricks cluster and how to run queries in an interactive notebook. However, this article only scratches the surface of what you can do with Azure Databricks. If you would like to learn more, including how to create graphs, run sched...

Read more
  • Azure
Thomas Mitchell
— June 25, 2019

Understanding the Core Azure Architectural Components

Microsoft Azure relies on a few key architectural components to provide redundancy and high availability. Core Azure architectural components include Azure regions, Azure Availability Zones, resource groups, and the Azure Resource Manager. In this article, we’ll discuss the basics ab...

Read more
  • Azure
Guy Hummel
— June 6, 2019

How to Become a Microsoft Certified Azure Administrator

Microsoft Azure is one of the hottest cloud services on the planet, and it’s growing at a phenomenal rate. This rapid growth has created a huge demand for people who know how to administer and manage Azure implementations. To make it easier for employers to verify the skills of Azure...

Read more
  • Azure
  • exam
Guy Hummel
— May 20, 2019

Preparing for the Microsoft AZ-900 Exam

Microsoft has offered Azure certification exams for years, but until recently, they were all meant for technical IT professionals. Now non-technical professionals, such as salespeople and managers, can take the new AZ-900 exam to prove their understanding of Azure fundamentals. People w...

Read more
  • Azure
Thomas Mitchell
— May 16, 2019

An Overview of Azure Storage (Part 2)

Archive Storage Archive Storage offers the lowest storage costs of all Azure storage. Its retrieval costs, however, are higher when compared to Hot and Cool storage. The archive tier of storage is designed for data that can tolerate several hours of latency when being retrieved. It’s a...

Read more
  • Azure
  • Storage
Thomas Mitchell
— May 14, 2019

An Overview of Azure Storage (Part 1)

Welcome to part one of our series on Azure Storage. Stay tuned for the second part. An Overview of Azure Storage Microsoft Azure Storage is a cloud-based storage offering that provides multiple storage solutions for organizations. In addition to a massively scalable object store for ...

Read more
  • Azure
  • Storage
Giacomo Marinangeli
— March 29, 2019

NEW: Custom Hands-On Labs for Azure and Google Cloud Platform

Harvard Business Review recently estimated that some 90% of corporate training never gets applied on the job. Given the $200B training industry, that is a staggering amount of waste. One reason for the disconnect? Lack of context. Cloud Academy’s platform was built to make it extraor...

Read more
  • Azure
  • Content Engine
  • Google Cloud Platform
  • Hands-on Labs
Guy Hummel
— March 28, 2019

How to Become a Microsoft Certified Azure Solutions Architect

Microsoft Azure is the fastest growing cloud provider. Azure’s revenue grew an incredible 76% in the last quarter of 2018. As more and more businesses move their IT infrastructure to Microsoft’s cloud platform, the demand for Azure professionals keeps rising. Since there are relatively ...

Read more
  • Azure
Nitheesh Poojary
— March 20, 2019

What is Heroku? Getting Started with PaaS Development

So just what is Heroku? It's a service for developers eager to get their applications online without having to worry about infrastructure details. Metered, pay-as-you-go Cloud Computing services come in all kinds of flavors. Infrastructure as a Service (IaaS) offerings like AWS allow e...

Read more
  • Azure
  • Development & Deploy
Nitheesh Poojary
— March 12, 2019

Understanding Object Storage and Block Storage Use Cases

Cloud Computing, like any computing, is a combination of CPU, memory, networking, and storage. Infrastructure as a Service (IaaS) platforms allow you to store your data in either Block Storage or Object Storage formats. Understanding the differences between these two formats - and how ...

Read more
  • Azure
  • Storage
Thomas Mitchell
— January 29, 2019

What are Azure Blueprints?

What are Azure Blueprints? Blueprints, in the traditional sense, are used by architects and engineers to design and build new things.  They are used to ensure that the final products are built to specifications and in compliance with certain standards and requirements. Azure Bluepri...

Read more
  • Azure
  • Blueprints
Thomas Mitchell
— January 22, 2019

Azure Hybrid Identity Authentication Methods

The move to the cloud is picking up steam.  As such, many corporations are beginning to find themselves supporting a mixture of on-prem apps as well as cloud apps. Users are finding that they need access to this mix of applications as well.  As one would expect, this can become a challe...

Read more
  • Azure
  • Hybrid Cloud
  • Hybrid Identity