Automated Machine Learning in the Cloud
Reading time: 8 mins
By Kumar Singh, Research Director, Data, SAPinsider
Auto-WEKA, the first free and open-source automated machine learning (AutoML) library, was released in 2013, which means automation of the data science process has been around for nearly a decade. These tools, however, have evolved extensively in recent years. The initial technologies focused mainly on algorithm selection and hyperparameter tuning, which automate aspects of the work done by data scientists. They are not, however, practical for the day-to-day work of a data analyst.
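The algorithm selection and hyperparameter tuning those early tools automated boils down to a search loop over candidate configurations. The sketch below is a minimal, hypothetical illustration in plain Python; the parameter grid and the scoring function are invented for the example and stand in for real model training and validation:

```python
from itertools import product

def tune(score_fn, grid):
    """Score every hyperparameter combination and return the best one,
    a toy stand-in for what AutoML tuning automates."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        s = score_fn(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

# Hypothetical scoring function: pretend validation accuracy peaks
# at depth=3 and learning_rate=0.1.
def mock_score(p):
    return 1.0 - abs(p["depth"] - 3) * 0.1 - abs(p["learning_rate"] - 0.1)

grid = {"depth": [1, 3, 5], "learning_rate": [0.01, 0.1, 0.5]}
best, score = tune(mock_score, grid)
print(best)  # the combination closest to the mock optimum
```

Real AutoML systems replace the exhaustive loop with smarter strategies (Bayesian optimization, early stopping), but the contract is the same: the user supplies data and an objective, and the tool searches the configuration space.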
AutoML tools have since evolved toward the democratization of data science. They now take a broader scope, encompassing the automation of the entire data-to-insights pipeline: from data cleaning through feature selection and feature creation to algorithm tuning and even operationalization. And with the advent of the cloud, the case for AutoML has become much more substantial. This article will explore:
- Why the need for automated ML has been fueled by the cloud
- The automated ML solutions offered by the hyperscalers
- Critical aspects SAPinsiders need to be aware of when strategizing about automated ML in the cloud
Why AutoML in the Cloud?
As the amount of information owned by organizations grew exponentially (and continues to grow), those businesses often found themselves trapped with data insights that increased only linearly, unable to leverage the massive amounts of information at their disposal. Organizations are increasingly looking to the cloud for data management infrastructure because of the scalability and flexibility that data management in the cloud allows.
The next step is making the best use of the massive amount of data generated by organizations. And this is where artificial intelligence (AI) and machine learning (ML) tools come into the picture. However, organizations need more than just access to insights from these models; they need these insights fast.
Accelerated modeling is critical for the move into enterprise AI, and it is a function of scale: using more data for more data projects, at a faster rate, with the goal of automating everything. While a cloud-based infrastructure is undoubtedly a critical part of achieving this, it must be combined with ML models operationalized through self-service analytics programs. Another factor is talent: data scientists are expensive and extremely difficult to find, so hiring exponentially more of them is not realistic for most companies. As a result, many organizations are shifting toward developing more citizen data scientists to support rapidly accelerating data efforts and a growing number of ML projects in production, fueled by the cloud.
AutoML, or augmented analytics, allows citizen data scientists (analysts or business users) to take on more advanced, less mundane work, freeing data scientists to focus on more specialized tasks. This arrangement benefits the business: a large staff of citizen data scientists, supported by data scientists and accelerated AI modeling, can deliver more data projects.
Leading Cloud-Based Analytics Tools
While the market for AutoML tools is rapidly getting crowded, the three leading hyperscalers lead the pack in terms of the end-to-end capability desired in automated ML solutions. This section covers the automated ML tools from these hyperscalers. Please note that this is not a comparison or ranking but an alphabetical listing of key features of the AutoML solutions from the three leading hyperscalers:
- AWS SageMaker Autopilot
- Azure AI Automated ML
- Google Cloud AutoML
AWS SageMaker Autopilot
Amazon SageMaker Autopilot covers the entire MLOps pipeline. With this tool, you can automatically build, train, and tune ML models in an intuitive interface while maintaining complete visibility and control. SageMaker Autopilot explores different models to determine the best one for the data and the problem type. You can also deploy the model directly to production with one click, or iterate on the recommended solutions with Amazon SageMaker Studio to further improve model quality. Some of the critical features of Autopilot are:
Automatic data pre-processing and feature engineering. SageMaker Autopilot automatically fills in missing data, provides statistical insights about columns in your dataset, and automatically extracts information from non-numeric columns, such as time and date information from timestamps.
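The kind of automatic pre-processing described here can be pictured with a small sketch in plain Python. The column names, fill strategy, and derived features below are invented for illustration; Autopilot's actual logic is far more sophisticated:

```python
from datetime import datetime
from statistics import mean

# Hypothetical raw dataset with a missing value and a timestamp column.
rows = [
    {"amount": 120.0, "ts": "2021-06-01T09:30:00"},
    {"amount": None,  "ts": "2021-06-02T17:45:00"},  # missing value
    {"amount": 80.0,  "ts": "2021-06-05T08:15:00"},
]

# 1. Fill missing numeric values with the column mean.
col_mean = mean(r["amount"] for r in rows if r["amount"] is not None)
for r in rows:
    if r["amount"] is None:
        r["amount"] = col_mean

# 2. Extract numeric features (weekday, hour) from the timestamp column.
for r in rows:
    t = datetime.fromisoformat(r.pop("ts"))
    r["weekday"], r["hour"] = t.weekday(), t.hour

print(rows[1])  # imputed amount plus derived calendar features
```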
Automatic ML model selection. Based on your data, Amazon SageMaker Autopilot infers the type of prediction that is most appropriate, such as binary classification, multi-class classification, or regression. Then, SageMaker Autopilot explores high-performing algorithms such as gradient boosting decision trees, feedforward deep neural networks, and logistic regression, training and optimizing hundreds of models based on these algorithms to find the model that best fits your data.
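The problem-type inference step can be sketched as a few heuristics over the target column. This is a hypothetical simplification (the thresholds and rules are invented for the example), not SageMaker's actual logic:

```python
def infer_problem_type(target_values, max_classes=20):
    """Guess the prediction type from a target column, roughly the way
    AutoML tools do. Thresholds here are invented for illustration."""
    distinct = set(target_values)
    if len(distinct) == 2:
        return "binary classification"
    if all(isinstance(v, str) for v in distinct) or len(distinct) <= max_classes:
        return "multi-class classification"
    return "regression"

print(infer_problem_type([0, 1, 1, 0]))                    # binary classification
print(infer_problem_type(["cat", "dog", "bird"]))          # multi-class classification
print(infer_problem_type([x * 0.37 for x in range(100)]))  # regression
```

Once the problem type is fixed, the tool knows which algorithm families (gradient boosted trees, neural networks, logistic regression, and so on) and which objective metrics are applicable.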
Model leaderboard. You can review all the ML models Amazon SageMaker Autopilot automatically generates for your data. Using the list of models, you can view metrics such as accuracy, precision, recall, and area under the curve (AUC), as well as review model details, such as the impact of features on predictions, and deploy the model that is most suitable for your use case.
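Conceptually, a leaderboard is just the candidate models ranked by the objective metric. A minimal sketch, with invented model names and metric values:

```python
# Hypothetical candidates produced by an AutoML run.
candidates = [
    {"name": "xgboost-01", "auc": 0.91, "precision": 0.84},
    {"name": "mlp-02",     "auc": 0.88, "precision": 0.90},
    {"name": "linear-03",  "auc": 0.79, "precision": 0.75},
]

# Rank candidates by the chosen objective metric, best first.
leaderboard = sorted(candidates, key=lambda m: m["auc"], reverse=True)
best_model = leaderboard[0]
print(best_model["name"])  # xgboost-01
```

The value of the feature is that nothing forces you to deploy the top entry; a model slightly lower on AUC but higher on precision may better fit the use case.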
Feature importance. SageMaker Autopilot generates an explanation report, developed by Amazon SageMaker Clarify, that helps you to explain how the models created by SageMaker Autopilot make predictions. In addition, you can identify the percentage contribution of each attribute in your training data to the predicted result. The higher the percentage, the stronger the impact of that feature on your model.
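The percentage-contribution view can be sketched by normalizing raw attribution scores so they sum to 100. The feature names and raw numbers below are invented for illustration; the real report is produced by SageMaker Clarify:

```python
# Hypothetical raw attribution scores (e.g., mean absolute contribution
# of each feature to the prediction).
raw_importance = {"income": 4.2, "age": 2.1, "tenure": 0.7}

# Normalize to percentage contributions.
total = sum(raw_importance.values())
percent = {f: round(100 * v / total, 1) for f, v in raw_importance.items()}

for feature, pct in sorted(percent.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: {pct}%")
```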
For a detailed review, you can visit the product page: https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html.
Azure AI Automated ML
Azure AI's Automated ML is another best-in-class tool that enables users to automatically build and deploy predictive models using either a no-code user interface or an SDK (used chiefly by developers). It also covers the end-to-end model development pipeline: it can help with data pipelines and preparation, create ML models customized to the input data, and refine the underlying algorithm and its hyperparameters, increasing productivity across the entire ML pipeline. Some of the key features of Azure Automated ML are:
Drag-and-drop functionality. Users have access to features like the designer, which has modules for data transformation, model training, and evaluation, making it easy to create and publish ML pipelines.
Automated ML algorithm selection. Allows users to rapidly create accurate models for classification, regression, and time-series forecasting. To keep these from becoming black-box models whose underlying algorithms users cannot understand, the feature uses model interpretability to help explain how each model was built.
Data labeling. Data preparation is a significant pain point in ML model development, and there are features to make this task easier and accelerate the process. It helps users prepare data quickly, manage and monitor labeling projects, and automate iterative tasks with ML–assisted labeling.
Responsible machine learning. Get model transparency through training and inferencing with interpretability capabilities. Assess model fairness through disparity metrics and mitigate unfairness. Help protect data with differential privacy and confidential ML pipelines.
For a detailed review, you can visit the product page.
Google Cloud Vertex AI
Google Cloud's AutoML solution allows users to train high-quality ML models easily, with minimal ML expertise. Like the other hyperscalers' offerings, it comprises a range of sub-products that let users run a wide range of AI modeling approaches on various types of data. The key features of Vertex AI, a unified platform to build, deploy, and scale models, can be described through the essential components of model building and deployment:
Training. Models can be trained on Vertex AI using AutoML, or using custom training if you need the additional customization options available in AI Platform Training. In custom training, you can choose different machine types to power your training jobs, enable distributed training, use hyperparameter tuning, and accelerate training with GPUs.
Deployment. Vertex AI lets you deploy models and get an endpoint to serve predictions, whether or not the model was trained on Vertex AI. Like other hyperscaler products, this is a key feature because it allows you to migrate models seamlessly.
Data labeling. Data labeling jobs let you customize ML models by labeling a dataset you plan to use for training. You can request a label for your video, image, or text data. This accelerates the model development timeline extensively.
Feature Store. This is a fully managed repository for storing, serving, and sharing ML feature values. Feature Store takes care of all underlying infrastructure to support the functions mentioned. You can use it for storage and compute resources, for instance, and scale it easily as needed.
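The idea behind a feature store can be sketched as a keyed repository that teams write feature values into and read back at training or prediction time. This is a toy in-memory illustration with an invented API, not the Vertex AI Feature Store interface:

```python
class FeatureStore:
    """A toy in-memory feature store: feature values are stored per
    entity and served back by name at prediction time."""

    def __init__(self):
        # (entity_type, entity_id) -> {feature_name: value}
        self._store = {}

    def write(self, entity_type, entity_id, features):
        self._store.setdefault((entity_type, entity_id), {}).update(features)

    def read(self, entity_type, entity_id, feature_names):
        row = self._store.get((entity_type, entity_id), {})
        return {f: row.get(f) for f in feature_names}

fs = FeatureStore()
fs.write("customer", "c-42", {"avg_order_value": 58.0, "orders_90d": 7})
print(fs.read("customer", "c-42", ["avg_order_value", "orders_90d"]))
```

The managed service adds what this sketch omits: shared access across teams, low-latency online serving, point-in-time lookups for training, and automatic scaling of the underlying storage and compute.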
For a detailed review, you can visit the product page.
What Does This Mean for SAPinsiders
- Differentiate data scientists from citizen data scientists. For non-data scientists to effectively participate in data projects, there needs to be a significant mindset shift around data tooling. Citizen data scientists often do not possess advanced skills in feature engineering, parameter optimization, algorithm comparison, and so on. However, as AutoML and, more importantly, augmented analytics technologies emerge, not everyone involved in the data pipeline needs these skills, because parts of the pipeline can be automated. It is, however, critical to understand exactly where you need data scientists versus citizen data scientists.
- Drive usability, stability, and transparency. Cloud-based AutoML solutions that put the power of ML in the hands of citizen data scientists must meet these three criteria, and any other options you consider should be assessed against them as well. An easy-to-use system should be accessible to non-developers with little technical expertise. Ensure that the solution you choose provides context-sensitive help and explanations for different parts of the data process, as well as a visual, code-free user interface. To execute augmented analytics, users must have access to a system that works from one end of the data pipeline to the other. And an accurate description of the algorithms used gives citizen data scientists the knowledge they need to trust the outcomes and determine whether they are appropriate for the project at hand.
- Leverage adaptability as the secret sauce. The idea behind augmented analytics, a tool primarily used by citizen data scientists, is that they can contribute to data projects independently, but that does not mean those projects will not be used by others (namely data scientists). An automated system needs to serve as a starting point for custom development and dedicated learning by data scientists. The outputs, for example, should be translatable into Python code for complete learning, including feature transformation and cross-validation. Of course, it cannot be overstated that adding these features to an augmented analytics or AutoML platform does not mean anyone can create models and push them into production without oversight, review, or input from someone specialized in the field (such as a data scientist).
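As a concrete example of the kind of code an AutoML output should translate into, the sketch below shows the cross-validation scaffolding mentioned above in plain Python. It only generates the k-fold index splits; the model training inside each fold is omitted, and the fold count and dataset size are invented for the example:

```python
def k_fold_splits(n_rows, k):
    """Yield (train_indices, test_indices) pairs for k-fold
    cross-validation over a dataset of n_rows rows."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        # Spread the remainder over the first folds so sizes differ by at most 1.
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        start += size
        yield train, test

splits = list(k_fold_splits(10, 3))
print([len(test) for _, test in splits])  # [4, 3, 3]
```

A data scientist picking up a citizen data scientist's project can start from generated code like this, swap in a different validation scheme, and extend the pipeline, which is exactly the adaptability argument above.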