Predict as you train

Dec 16, 2020


AI models are at the heart of every AI solution. There is no solution without a model. It is often neglected, however, that model building is only a part of the entire AI lifecycle, and does not have any business impact in itself.

And even worse, data science work is supported by plenty of different frameworks and technologies that are isolated from production.

There is no justification for different technologies in different phases.

Model development requires a continuous “try and learn” approach and cannot be compared to the software development phase. The need for different approaches along the AI lifecycle, however, is no justification for different technologies in different phases of an AI project.

Moving from proof-of-concepts to AI in production is still very time-consuming and comes with many pitfalls. Given that more than 85% of the projects still remain alchemy and never make it into production, it is time to re-envision the entire process of how AI solutions are built.

It is the AI solution that augments decisions and creates impact and ROI. So, priority must be given to accelerating the entire AI solution building process instead of solely improving the model experimentation phase.

Minimizing “time-to-value” is the top business priority and unifying technologies along the AI lifecycle is the number one strategy to build the right solution for the right problem in time.

There is Databricks’ Unified Data Analytics Platform, right?

When we talk about PredictiveWorks. and its innovative “predict as you train” feature, it is about a unified & code-free approach to boost the entire AI lifecycle. And that is definitely different.

From a top-level view, the AI lifecycle can be roughly organized into three different phases:

Data integration & understanding, model building, and prediction. The latter describes the application of AI models in production, deployed as AI solutions. All phases are based on data workflows with overlapping functionality.

From a business perspective, it is hard to understand why these workflows cannot be built with the same technology, run on the same platform and reuse the same components in different phases.

An example:

Most AI models work with mathematical data representations and need upstream stages to prepare real-world data. The same stages are needed for model building and prediction.

So, why should we build them twice for different phases and with different technologies?
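The point can be illustrated with a minimal plain-Java sketch. It is not the PredictiveWorks. or CDAP API: the class `SharedStageSketch`, the record layout ("age, income"), the scaling constants and the trivial mean-vector "model" are all assumptions made for illustration. What it shows is a single feature-preparation stage, defined once, that is used by both the training path and the prediction path.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only: not the PredictiveWorks. or CDAP API.
public class SharedStageSketch {

    // One upstream stage, defined once: turns a raw record into a feature vector.
    // The record layout ("age, income") and the scaling constants are assumptions.
    static final Function<String, double[]> TO_FEATURES = line -> {
        String[] parts = line.split(",");
        double age = Double.parseDouble(parts[0].trim()) / 100.0;
        double income = Double.parseDouble(parts[1].trim()) / 100_000.0;
        return new double[] {age, income};
    };

    // Training phase: prepare all records, then fit a trivial "model" (the mean vector).
    static double[] train(List<String> rawRecords) {
        double[] mean = new double[2];
        for (String line : rawRecords) {
            double[] f = TO_FEATURES.apply(line);
            mean[0] += f[0] / rawRecords.size();
            mean[1] += f[1] / rawRecords.size();
        }
        return mean;
    }

    // Prediction phase: the very same stage prepares the incoming record,
    // then the record is scored by its distance to the mean vector.
    static double predict(double[] model, String rawRecord) {
        double[] f = TO_FEATURES.apply(rawRecord);
        return Math.hypot(f[0] - model[0], f[1] - model[1]);
    }

    public static void main(String[] args) {
        double[] model = train(List.of("30, 50000", "50, 70000"));
        System.out.println(predict(model, "40, 60000"));
    }
}
```

Because `TO_FEATURES` is the only place where raw data is interpreted, a change to the record layout has to be made once and automatically applies to both phases.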

“Predict as you train” is an initiative to closely tie all phases together: with the same platform and technology, and with embedded model management as the glue between model training and prediction.

The aim is to reuse the same component for the same task in different phases along the entire AI lifecycle.

PredictiveWorks. implements “predict as you train” on top of Google’s data fusion platform CDAP and is powered by Apache Spark. Apache Spark’s DataFrames are used as a unified integration technology for

  • Deep & Machine Learning,
  • Business Rules & SQL Queries, and
  • Natural Language & Time Series processing.

PredictiveWorks. ships with the following framework integrations:

  • Analytics Zoo from Intel for deep learning use cases,
  • Apache Spark MLlib for machine learning,
  • JBoss Drools for business rule processing,
  • Apache Spark SQL for query processing,
  • John Snow Labs Spark NLP for natural language processing, and
  • SparkTIME by Dr. Krusche & Partner for time series analytics.

They are provided as fine-grained data workflow plugins: pre-built configurable software artifacts for Google’s CDAP. PredictiveWorks. comes with 150+ plugins for plenty of data operations to orchestrate AI-driven workflows with the same platform and technology without writing a line of code.

Example: Model training (SparkSink plugin)

The image below shows the Java source of a configurable K-Means SparkSink plugin for building a segmentation model based on Apache Spark MLlib.

Example: Model prediction (SparkCompute plugin)

PredictiveWorks. comes with a symmetric K-Means SparkCompute plugin to leverage a trained segmentation model in a model prediction workflow.

The illustrated SparkSink and SparkCompute plugins differ slightly in their functionality, as they perform different tasks with respect to the K-Means model. These two plugins alone, however, do not make a workflow. The necessary upstream plugins, which connect to data sources and transform data into feature vectors, are the same and can be reused.

So, most of the work in the prediction phase is done already in the training phase.
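The symmetry between the training and the prediction side can be sketched in plain Java. This is an illustration under stated assumptions, not the Spark MLlib or CDAP plugin code: the class `KMeansSketch` and its methods are hypothetical, the algorithm is a bare Lloyd iteration, and the centroids are initialized with the first k points purely to keep the result deterministic (Spark MLlib uses a different initialization by default).

```java
import java.util.Arrays;

// Hypothetical illustration only: plain Java, not the Spark MLlib or CDAP plugin API.
public class KMeansSketch {

    // "Sink" side: consumes feature vectors and produces a model (the centroids).
    // Initializing with the first k points is a simplification for determinism.
    static double[][] train(double[][] points, int k, int iterations) {
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: every point goes to its nearest centroid.
            for (double[] p : points) {
                int c = predict(centroids, p);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            // Update step: each centroid moves to the mean of its points.
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                for (int d = 0; d < dim; d++) {
                    centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    // "Compute" side: applies a trained model to one feature vector
    // and returns the index of the nearest centroid.
    static int predict(double[][] centroids, double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] features = {{0, 0}, {10, 10}, {0, 1}, {10, 11}};
        double[][] model = train(features, 2, 5);
        System.out.println(Arrays.deepToString(model));
        System.out.println(predict(model, new double[] {9, 9}));
    }
}
```

Both methods expect the same kind of input, a feature vector, which is why everything upstream of them can be shared between a training and a prediction workflow.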

Embedded Model Management

What is missing so far is embedded model management that glues training and prediction workflows together. Model management in PredictiveWorks. was inspired by Databricks’ MLflow and is likewise built around the concept of model training runs. It is based on three components:

  • Model Registry
  • Model Recorder
  • Model Viewer

The model registry manages every model type (e.g., K-Means) in its own big data table and registers time series of model runs.

The model recorder writes model parameters, stage (experimentation, production etc.), version, evaluation metrics and output files into the registry and reads model instances from there. It is embedded into every training and prediction plugin.

The model viewer provides a comprehensive UI to validate different model building runs and to find out, e.g., which tuning parameters work best.
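How the three components relate can be sketched with an in-memory stand-in. None of this is the PredictiveWorks. API: the class `ModelRegistrySketch`, the `ModelRun` fields and the method names are hypothetical, a list per model type stands in for a big data table, and output files are left out to keep the sketch short. The metric name `wssse` is just an example key.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only: an in-memory stand-in, not the PredictiveWorks. API.
public class ModelRegistrySketch {

    // One recorded model run; the fields follow the items named in the text.
    static final class ModelRun {
        final long timestamp;
        final String version;
        final String stage;
        final Map<String, Double> params;
        final Map<String, Double> metrics;

        ModelRun(long timestamp, String version, String stage,
                 Map<String, Double> params, Map<String, Double> metrics) {
            this.timestamp = timestamp;
            this.version = version;
            this.stage = stage;
            this.params = params;
            this.metrics = metrics;
        }
    }

    // Registry: one list of runs per model type, standing in for one table per type.
    private final Map<String, List<ModelRun>> tables = new HashMap<>();

    // Recorder, write side: called at the end of a training run.
    void record(String modelType, ModelRun run) {
        tables.computeIfAbsent(modelType, t -> new ArrayList<>()).add(run);
    }

    // Recorder, read side: the most recent run of a model type in a given stage.
    Optional<ModelRun> latest(String modelType, String stage) {
        ModelRun best = null;
        for (ModelRun run : tables.getOrDefault(modelType, Collections.emptyList())) {
            if (run.stage.equals(stage) && (best == null || run.timestamp > best.timestamp)) {
                best = run;
            }
        }
        return Optional.ofNullable(best);
    }

    // Viewer-style query: the run with the lowest value of a metric
    // (assumes a metric where lower is better).
    Optional<ModelRun> bestBy(String modelType, String metric) {
        ModelRun best = null;
        for (ModelRun run : tables.getOrDefault(modelType, Collections.emptyList())) {
            Double value = run.metrics.get(metric);
            if (value != null && (best == null || value < best.metrics.get(metric))) {
                best = run;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        ModelRegistrySketch registry = new ModelRegistrySketch();
        registry.record("KMeans", new ModelRun(1L, "v1", "experimentation",
                Collections.singletonMap("k", 3.0), Collections.singletonMap("wssse", 12.5)));
        registry.record("KMeans", new ModelRun(2L, "v2", "production",
                Collections.singletonMap("k", 4.0), Collections.singletonMap("wssse", 9.1)));
        System.out.println(registry.latest("KMeans", "production").get().version);
    }
}
```

In this reading, a training plugin would call `record` after fitting a model, and a prediction plugin would call `latest` to pick up the model it should apply.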

Example: Model recorder embedded in SparkSink plugin

The image below illustrates how the K-Means model recorder is embedded into the Java code of the plugin for model building. Note that the compute function defines the main API method of SparkSink artifacts.

Reminder: Plugins are configurable pre-compiled software artifacts. The image reveals a view under the hood at source level and illustrates how model recorders are embedded.

What is achieved?

  • The plethora of different AI frameworks, libraries and technologies is reduced to a single Java & Scala platform with a code-free point-and-click orchestration experience.
  • Google’s data fusion platform CDAP is extended to offer the full spectrum of machine intelligence. Data integration seamlessly integrates into every AI-driven data workflow.
  • Apache Spark is at the heart and its unified DataFrames technology is used to expose a limited set of proven analytics frameworks as configurable & pre-compiled software plugins.
  • A comprehensible model management, inspired by Databricks’ MLflow, is embedded into every training & prediction plugin and reduces operationalization efforts to a minimum.

“Predict-as-you-train” is one of the main innovative features of PredictiveWorks. and empowers small teams to perform AI projects at lightning speed.

Originally published by Dr. Stefan Krusche




PredictiveWorks. is a declarative (code-free) AI software factory that revolutionizes the AI production process.