Predict as you train
--
AI models are at the heart of every AI solution; there is no solution without a model. It is often overlooked, however, that model building is only one part of the entire AI lifecycle and has no business impact in itself.
Even worse, data science work is supported by plenty of different frameworks and technologies that are isolated from production.
Model development requires continuous “try and learn” and cannot be compared to a software development phase. The need for different approaches along the AI lifecycle, however, is no justification for different technologies in different phases of an AI project.
Moving from proof-of-concept to AI in production is still very time-consuming and comes with many pitfalls. Given that more than 85% of projects still remain alchemy and never make it into production, it is time to re-envision the entire process of how AI solutions are built.
It is the AI solution that augments decisions and creates impact and ROI. So, priority must be given to accelerating the building of the entire AI solution instead of solely improving the model experimentation phase.
Minimizing “time-to-value” is the top business priority, and unifying technologies along the AI lifecycle is the number one strategy to build the right solution for the right problem in time.
There is Databricks’ Unified Data Analytics Platform, right?
When we talk about PredictiveWorks. and its innovative “predict as you train” feature, it is about a unified & code-free approach to boost the entire AI lifecycle. And that is definitely different.
From a top-level view, the AI lifecycle can be roughly organized into three phases:
Data integration & understanding, model building, and prediction. The latter describes the application of AI models in production, deployed as AI solutions. All phases are based on data workflows with overlapping functionality.
From a business perspective, it is hard to understand why these workflows cannot be built with the same technology, run on the same platform and reuse the same components in different phases.
An example:
Most AI models work with mathematical data representations and need upstream stages to prepare real-world data. The same stages are needed for model building and prediction.
So, why should we build them twice for different phases and with different technologies?
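To make this concrete, here is a minimal sketch in plain Apache Spark; the class and column names are hypothetical, but they illustrate how one and the same feature-preparation stage can serve both phases:

```java
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class SharedFeatureStage {

  // One shared preparation stage: assembles raw numeric columns into the
  // feature vectors that most AI models expect. (Column names are assumptions.)
  public static Dataset<Row> toFeatures(Dataset<Row> raw) {
    VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[] {"amount", "frequency", "recency"})
        .setOutputCol("features");
    return assembler.transform(raw);
  }
}
```

Whether the downstream stage trains a model or applies a trained one, this upstream preparation is identical and only has to be built and configured once.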
“Predict as you train” is an initiative that ties all phases closely together: with the same platform and technology, and with an embedded model management as the glue between model training and prediction.
The aim is to reuse the same component for the same task in different phases along the entire AI lifecycle.
PredictiveWorks. implements “predict as you train” on top of Google’s data fusion platform CDAP and is powered by Apache Spark. Apache Spark’s DataFrames are used as a unified integration technology for
- Deep & Machine Learning,
- Business Rules & SQL Queries, and
- Natural Language & Time Series processing.
PredictiveWorks. ships with the following framework integrations:
- Analytics Zoo from Intel for deep learning use cases,
- Apache Spark MLlib for machine learning,
- JBoss Drools for business rule processing,
- Apache Spark SQL for query processing,
- John Snow Labs Spark NLP for natural language processing, and
- SparkTIME by Dr. Krusche & Partner for time series analytics.
They are provided as fine-grained data workflow plugins: pre-built, configurable software artifacts for Google’s CDAP. PredictiveWorks. comes with 150+ plugins for a wide range of data operations to orchestrate AI-driven workflows with the same platform and technology, without writing a line of code.
Example: Model training (SparkSink plugin)
The image below shows the Java source of a configurable K-Means SparkSink plugin for building a segmentation model based on Apache Spark MLlib.
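As a rough sketch, assuming CDAP’s SparkSink API and Spark MLlib’s K-Means, such a plugin could look as follows; class, configuration and field names are illustrative, not actual PredictiveWorks. code:

```java
import io.cdap.cdap.api.annotation.Description;
import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.api.plugin.PluginConfig;
import io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext;
import io.cdap.cdap.etl.api.batch.SparkSink;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("KMeansSink")
@Description("Trains a K-Means segmentation model from prepared feature records.")
public class KMeansSink extends SparkSink<StructuredRecord> {

  private final Config config;

  public KMeansSink(Config config) {
    this.config = config;
  }

  @Override
  public void run(SparkExecutionPluginContext context, JavaRDD<StructuredRecord> input) {
    // Upstream stages have already turned raw data into numeric feature fields;
    // the plugin merely extracts them into MLlib vectors. Field names are assumptions.
    JavaRDD<Vector> features = input.map(record -> {
      Double amount = record.get("amount");
      Double frequency = record.get("frequency");
      return Vectors.dense(amount, frequency);
    });

    // Train the segmentation model; k and the iteration count come from the
    // point-and-click plugin configuration.
    KMeansModel model = KMeans.train(features.rdd(), config.k, config.maxIterations);

    // Persist the model so that the prediction plugin and the model registry
    // can pick it up later. The path is a placeholder.
    model.save(context.getSparkContext().sc(), config.modelPath);
  }

  // Hypothetical configuration, surfaced as point-and-click parameters.
  public static class Config extends PluginConfig {
    @Description("Number of clusters.")
    public Integer k;

    @Description("Maximum training iterations.")
    public Integer maxIterations;

    @Description("Location to persist the trained model.")
    public String modelPath;
  }
}
```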
Example: Model prediction (SparkCompute plugin)
PredictiveWorks. comes with a symmetric K-Means SparkCompute plugin to leverage a trained segmentation model in a model prediction workflow.
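Again a condensed, hypothetical sketch rather than the actual code, this time assuming CDAP’s SparkCompute API: the plugin loads the previously trained model and appends a segment assignment to every record.

```java
import java.util.ArrayList;
import java.util.List;

import io.cdap.cdap.api.annotation.Description;
import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.api.data.schema.Schema;
import io.cdap.cdap.api.plugin.PluginConfig;
import io.cdap.cdap.etl.api.batch.SparkCompute;
import io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vectors;

@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("KMeansPredictor")
@Description("Assigns each record to a segment of a trained K-Means model.")
public class KMeansPredictor extends SparkCompute<StructuredRecord, StructuredRecord> {

  private final Config config;

  public KMeansPredictor(Config config) {
    this.config = config;
  }

  @Override
  public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
      JavaRDD<StructuredRecord> input) {
    // Load the model that the training (SparkSink) plugin persisted.
    KMeansModel model = KMeansModel.load(context.getSparkContext().sc(), config.modelPath);

    return input.map(record -> {
      // Same assumed feature fields as in the training sketch.
      Double amount = record.get("amount");
      Double frequency = record.get("frequency");
      int cluster = model.predict(Vectors.dense(amount, frequency));
      return withCluster(record, cluster);
    });
  }

  // Rebuilds the record with one additional 'cluster' field.
  private static StructuredRecord withCluster(StructuredRecord record, int cluster) {
    List<Schema.Field> fields = new ArrayList<>(record.getSchema().getFields());
    fields.add(Schema.Field.of("cluster", Schema.of(Schema.Type.INT)));
    Schema schema = Schema.recordOf("prediction", fields);

    StructuredRecord.Builder builder = StructuredRecord.builder(schema);
    for (Schema.Field field : record.getSchema().getFields()) {
      builder.set(field.getName(), record.get(field.getName()));
    }
    return builder.set("cluster", cluster).build();
  }

  // Hypothetical configuration: where to find the trained model.
  public static class Config extends PluginConfig {
    @Description("Location of the trained model.")
    public String modelPath;
  }
}
```

Note the symmetry: both plugins consume the same prepared feature fields; only the terminal stage of the workflow differs.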
The illustrated SparkSink and SparkCompute plugins differ slightly in their functionality, as they perform different tasks with respect to the K-Means model. These plugins alone, however, do not make a workflow. The upstream plugins needed to connect to data sources and transform data into feature vectors are the same in both phases and can be reused.
So, most of the work in the prediction phase is already done in the training phase.
Embedded Model Management
What is missing so far is an embedded model management that glues training and prediction workflows together. The model management of PredictiveWorks. was inspired by Databricks’ MLflow and is likewise implemented around the concept of model training runs. It is based on three components:
- Model Registry
- Model Recorder
- Model Viewer
The model registry manages every model type (e.g., K-Means) in its own big data table and registers a time series of model runs.
The model recorder writes model parameters, stage (experimentation, production, etc.), version, evaluation metrics and output files into the registry, and reads model instances from there. It is embedded into every training and prediction plugin.
The model viewer provides a comprehensive UI to compare different model building runs and to find out, e.g., which tuning parameters work best.
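The recorder’s actual contract is not shown in this article; as a hypothetical sketch, derived only from the description above, it could look like this:

```java
import java.util.Map;

// Hypothetical sketch of a model recorder contract: one entry per training run.
public interface ModelRecorder {

  // Writes a completed run into the model registry's big data table.
  void track(ModelRun run);

  // Reads a registered model instance back, e.g., for a prediction plugin.
  ModelRun read(String modelType, String version);

  // One row in the registry's time series of runs for a model type.
  final class ModelRun {
    public final String modelType;            // e.g., "KMeans"
    public final String version;              // version of this run
    public final String stage;                // experimentation, production, ...
    public final Map<String, String> params;  // tuning parameters of the run
    public final Map<String, Double> metrics; // evaluation metrics
    public final String outputPath;           // location of the model files

    public ModelRun(String modelType, String version, String stage,
                    Map<String, String> params, Map<String, Double> metrics,
                    String outputPath) {
      this.modelType = modelType;
      this.version = version;
      this.stage = stage;
      this.params = params;
      this.metrics = metrics;
      this.outputPath = outputPath;
    }
  }
}
```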
Example: Model recorder embedded in SparkSink plugin
The image below illustrates how the K-Means model recorder is embedded into the Java code of the plugin for model building. Note that the compute function defines the main API method of SparkSink artifacts.
Reminder: plugins are configurable, pre-compiled software artifacts. The image offers a view under the hood at source level and illustrates how model recorders are embedded.
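Continuing the hypothetical training sketch from above (same assumed names and scope, not the actual source), the recorder call could sit right after training:

```java
// Inside the training plugin's main method, right after the model is trained.
KMeansModel model = KMeans.train(features.rdd(), config.k, config.maxIterations);
model.save(context.getSparkContext().sc(), config.modelPath);

// Record the run so the registry keeps a traceable history of parameters,
// stage, version, metrics and the location of the model files.
recorder.track(new ModelRecorder.ModelRun(
    "KMeans",                                          // model type
    "1",                                               // version of this run
    "experimentation",                                 // lifecycle stage
    Map.of("k", String.valueOf(config.k)),             // tuning parameters
    Map.of("cost", model.computeCost(features.rdd())), // evaluation metric (WSSSE)
    config.modelPath));                                // model output files
```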
What is achieved?
- The plethora of different AI frameworks, libraries and technologies is reduced to a single Java & Scala platform with a code-free, point-and-click orchestration experience.
- Google’s data fusion platform CDAP is extended to offer the full spectrum of machine intelligence. Data integration blends seamlessly into every AI-driven data workflow.
- Apache Spark is at the heart, and its unified DataFrame technology is used to expose a limited set of proven analytics frameworks as configurable & pre-compiled software plugins.
- A comprehensive model management, inspired by Databricks’ MLflow, is embedded into every training & prediction plugin and reduces operationalization efforts to a minimum.
“Predict as you train” is one of the main innovative features of PredictiveWorks. and empowers small teams to perform AI projects at lightning speed.
Originally published by Dr. Stefan Krusche