After you create the model, a summary and a visualization are generated. The model is empty at first while calculations are in progress, so we recommend that you check back after 24 hours (errors surface within a few hours). After the model is set up, it is updated daily. You can also track model performance, which shows what happens to customers after their predictive scores have been computed.
A predictive score between 0 and 100 is generated for every customer in the scoring audience. The predictive scores are available on customer profiles and can be used for lookups, segmentation, syndications, and exports.
After you set up the predictive score definition (predictive model), an AI model is trained on historical data to learn behavioral rules that separate converting customers from non-converting customers. Blueshift automatically derives customer behaviors from customers' clickstream events and campaign engagements. Specifically, the platform derives hundreds of behaviors for each event, such as recency, frequency, time spent, and catalog affinities (category affinity, brand affinity, product attribute affinity). These behaviors, along with user attributes, are referred to as features (or variables) and are fed into the AI model. The model then learns the optimal combination of features that leads to conversion, producing a scoring function.
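As a rough illustration of the idea (not Blueshift's actual feature pipeline), deriving recency, frequency, and a simple category affinity from raw events might look like the following sketch. The event shape and field names here are assumptions made for the example:

```python
from datetime import datetime, timezone

def derive_features(events, now=None):
    """Derive simple behavioral features (recency, frequency, category
    affinity) from a customer's clickstream events.

    `events` is a list of dicts with "timestamp" (ISO 8601) and
    "category" keys -- a simplified stand-in for real event payloads.
    """
    now = now or datetime.now(timezone.utc)
    if not events:
        return {"recency_days": None, "frequency": 0, "top_category": None}
    times = [datetime.fromisoformat(e["timestamp"]) for e in events]
    counts = {}
    for e in events:
        counts[e["category"]] = counts.get(e["category"], 0) + 1
    return {
        "recency_days": (now - max(times)).days,      # days since last event
        "frequency": len(events),                     # total event count
        "top_category": max(counts, key=counts.get),  # crude category affinity
    }

events = [
    {"timestamp": "2024-06-10T00:00:00+00:00", "category": "shoes"},
    {"timestamp": "2024-06-01T00:00:00+00:00", "category": "shoes"},
    {"timestamp": "2024-06-05T00:00:00+00:00", "category": "bags"},
]
print(derive_features(events, now=datetime(2024, 6, 15, tzinfo=timezone.utc)))
# → {'recency_days': 5, 'frequency': 3, 'top_category': 'shoes'}
```

In practice these derived columns, together with user attributes, form the feature vector the model is trained on.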
Blueshift uses a white-box approach to building an AI model, so the model can be interpreted for insights and performance. The following reports are available in the Model Summary dashboard, each focused on a different aspect of the model.
|Report|Description|How to use it|
|---|---|---|
|Analyze and Chart scores|Shows the predictive scores for each probability bucket from 0 to 100.|Use this report to slice and dice predicted probability scores against the observed conversion rates.|
|Model accuracy|Shows the data used for training and testing, along with standard model accuracy metrics.|Data analysts and data scientists can use this information to judge model accuracy.|
|Insights|Shows the features (derived behaviors and user attributes) that were most meaningful in predicting the goal events.|Use this report to understand which features/variables contribute the most to prediction.|
Predictive scores are likelihood scores between 0 and 100: a lower score represents a lower chance of conversion, and a higher score a higher chance. The expectation for an ideal model is that a group of customers with score X converts at an X% rate; that is, customers with score 0 convert at 0%, customers with score 1 at 1%, …, and customers with score 100 at 100%.
To analyze performance, Blueshift tests the model on past customers it has not seen (customers intentionally set aside for testing). These customers are assigned a score, grouped by score range (0-10, 10-20, 20-30, and so on), and each group's scores are compared with its actual conversion rate. In a good model, each decile's conversion rate falls close to its decile boundaries (within about ±10 percentage points). If that's not the case, more data might be required, or the model may be overfit.
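The decile comparison described above can be sketched as follows. The input format and the ±10-point tolerance are simplified assumptions for illustration, not Blueshift's internal implementation:

```python
def calibration_by_decile(scores, converted):
    """Group customers into score deciles (0-10, 10-20, ..., 90-100) and
    compare each decile's observed conversion rate with its score range.

    `scores` are 0-100 predictive scores; `converted` are 0/1 outcomes.
    """
    buckets = {}  # decile index -> (customer count, conversion count)
    for score, outcome in zip(scores, converted):
        d = min(int(score) // 10, 9)  # score 100 falls into the top decile
        n, conv = buckets.get(d, (0, 0))
        buckets[d] = (n + 1, conv + int(outcome))
    report = {}
    for d, (n, conv) in sorted(buckets.items()):
        rate = 100.0 * conv / n
        lo, hi = d * 10, d * 10 + 10
        report[f"{lo}-{hi}"] = {
            "n": n,
            "conversion_rate": round(rate, 1),
            # "close enough": within +-10 points of the decile's boundaries
            "within_bounds": (lo - 10) <= rate <= (hi + 10),
        }
    return report
```

A well-calibrated model would show, for example, customers in the 40-50 decile converting somewhere near 40-50%.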
This report includes more advanced statistics on the model for data science users. It shows the size of the training and validation datasets, along with the AUC (Area Under the ROC Curve) and log loss for the underlying AI model. An AUC above 0.65 is considered good, while an AUC of 0.65 or below indicates low predictive accuracy. The chart shows the data for the training and holdout datasets: training data is used to build the model, and the holdout dataset is used to test the model and create the performance reports.
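For reference, both metrics can be computed directly from their standard definitions. This is a generic sketch, not Blueshift's internal code:

```python
import math

def auc(labels, probs):
    """Area under the ROC curve: the probability that a randomly chosen
    converter is scored above a randomly chosen non-converter
    (ties count one half)."""
    pos = [p for l, p in zip(labels, probs) if l == 1]
    neg = [p for l, p in zip(labels, probs) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(labels, probs, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the
    predicted probabilities (lower is better)."""
    total = 0.0
    for l, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(l * math.log(p) + (1 - l) * math.log(1 - p))
    return total / len(labels)

labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
print(round(auc(labels, probs), 2))  # 0.75
```

An AUC of 0.5 means the model ranks converters no better than chance, which is why values meaningfully above that threshold (such as 0.65) matter.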
A ranked list of feature/variable importance is shown, where the top-ranked features contribute the most to decision making in the AI model. The length of each bar shows the relative strength of the feature.
Unlike the model visualization, which shows data that has already been observed, model performance tracks what happens to customers after their predictive scores have been computed. To measure the real-world performance of the predictive score, Blueshift tracks all customers scored on a given date, similar to cohort analysis in analytics tools. Customers are grouped into deciles (0-10, 10-20, 20-30, and so on) based on the scores assigned to them, and the score deciles are then compared with their actual conversion rates. A good model shows conversion rates that increase with the deciles.
Campaigns or external experiments targeted at scored customers could influence performance, so this report is not meant to measure performance lift or support definitive conclusions. Also, the historical context and conditions under which the AI model was trained could differ from current conditions. For example, historical behaviors could have come from a past holiday sale or a period when a site-wide promotional campaign was live. Hence, actual performance might deviate from the expected performance (score X converting at X%). Overall, you should expect a directional increase in conversion rate in ascending order of deciles.
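The directional check on a cohort report can be sketched as follows; the input format is a hypothetical simplification of the decile rows described above:

```python
def decile_trend(decile_rates):
    """Return True if observed conversion rates rise with score deciles.

    `decile_rates` maps decile labels ("0-10", ..., "90-100") to the
    conversion rate (%) observed for customers scored in that decile.
    """
    order = sorted(decile_rates, key=lambda label: int(label.split("-")[0]))
    rates = [decile_rates[label] for label in order]
    # directionally good model: each decile converts at least as well
    # as the one below it
    return all(lower <= upper for lower, upper in zip(rates, rates[1:]))

decile_trend({"0-10": 1.5, "10-20": 4.0, "20-30": 9.5})  # → True
```

Even when absolute rates drift from the expected score-X-converts-at-X% pattern, this ordering is the signal to look for.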
To measure true lift, you can run A/B tests.