How I Cut Model Training Time by 93% with Snowflake-Powered MLOps


How MLOps Saved $1.8M in ARR for a SaaS Giant

Pedro Marques

TL;DR

Problem: 5-hour model training time and 46% precision.

Solution:

  • Removed low-value features.
  • Balanced positive and negative class weights.
  • Parallelized training processes.

Impact:

  • Training time: ↓93% (5 hours to 20 minutes).
  • Precision: ↑30% (0.46 to 0.60); Recall: ↑39% (0.36 to 0.50).
  • $1.8M potential ARR protected in July predictions.

“It has been six months since the team started building the model, but I’m not super happy with the results.”

That remark from the Chief Data Officer kicked off what would become an interesting journey into ML optimization. What began as a routine investigation led not only to significant time savings (↓93%) but also to a clear performance improvement in our ML model.

The numbers were compelling: training time dropped by 93%, and the improved model positioned us to protect $1.8M in ARR, all from one month of focused work. But the real story isn’t in the numbers; it’s in how we got there.

Found this useful so far? Follow me on Medium for more practical insights on technical leadership and automation solutions.

The Problem: Slow, Ineffective Models

Picture this: Data scientists would start training the model, head to the pub for a pint, and return to find the model still running.

That was a real data scientist workflow… minus the pub part.

The impact? Skilled data scientists were spending an entire day training a model instead of focusing on strategic revenue protection.

At the start of the project, the Lead Data Scientist left the company, forcing me to pivot from pure infrastructure work to a 70/30 split: 70% of my time on manual model maintenance and only 30% left for MLOps initiatives. With that constraint, I prioritized three high-impact optimizations to deliver quick wins and long-term scalability.

The Three-Part Time-Cutting Strategy

1. Removed Low-Value Features (↓10%)

A feature selection analysis revealed that eight features were redundant, heavily skewed, or had low variance. After testing the removal of several feature combinations, I found that dropping just two of them preserved the evaluation metrics while reducing training time by 10% (5h to 4h30min).
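
To make that concrete, here is a minimal sketch of the kind of checks such an analysis involves; it is not the project’s actual code. It assumes the training data has been pulled into a pandas DataFrame (for example via train_df.to_pandas()), and the feature names and thresholds below are hypothetical.

import pandas as pd

# Hypothetical feature names; the post doesn't list the real ones.
FEATURE_COLS = ["tenure_months", "monthly_usage", "support_tickets", "discount_pct"]

def flag_low_value_features(df: pd.DataFrame,
                            var_threshold: float = 0.01,
                            skew_threshold: float = 3.0,
                            corr_threshold: float = 0.95) -> dict:
    """Flag features that look near-constant, heavily skewed, or redundant."""
    features = df[FEATURE_COLS].select_dtypes("number")

    # Near-zero variance: the column barely changes, so it adds little signal.
    variances = features.var()
    low_variance = variances[variances < var_threshold].index.tolist()

    # Heavy skew: long-tailed columns that may need transforming or dropping.
    skews = features.skew().abs()
    skewed = skews[skews > skew_threshold].index.tolist()

    # Redundancy: feature pairs that are almost perfectly correlated.
    corr = features.corr().abs()
    cols = list(corr.columns)
    redundant_pairs = [(a, b)
                       for i, a in enumerate(cols)
                       for b in cols[i + 1:]
                       if corr.loc[a, b] > corr_threshold]

    return {"low_variance": low_variance,
            "skewed": skewed,
            "redundant_pairs": redundant_pairs}

Features flagged this way were then removed one combination at a time and re-evaluated against the baseline metrics, as described above.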

2. Quick Win: Balanced Positive and Negative Weights (↓61%)

47% of the training dataset was synthetic data generated by SMOTE to balance the minority positive class, which made up only 6% of the records. This inflated dataset slowed training, so after some research I found the XGBoost parameter scale_pos_weight, which rebalances the weights of the positive and negative classes and makes SMOTE unnecessary. After implementing it, training time dropped by 61% (4h30min to 1h46min). It also came with an unexpected bonus: precision increased by 30% (0.46 to 0.60) and recall rose by 39% (0.36 to 0.50).

import snowflake.ml.modeling.xgboost as sw_xgb
import snowflake.snowpark.functions as F

# Count churned (positive) records and derive the non-churned count.
num_churned = train_df.select(F.sum(F.col("CHURN_LABEL"))).collect()[0][0]
total_records = train_df.count()
num_not_churned = total_records - num_churned

# Weight the minority positive class directly instead of oversampling with SMOTE.
model = sw_xgb.XGBClassifier(
    random_state=42,
    scale_pos_weight=num_not_churned / num_churned,
)
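
For a sense of scale: with the positive class at roughly 6% of the records, scale_pos_weight works out to about 0.94 / 0.06 ≈ 15.7, so each churned example carries roughly 16 times the weight of a non-churned one during training.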

3. Parallelized Training Processes (↓81%)

I first tested parallelization locally by setting n_jobs=-1 on XGBoost’s scikit-learn interface, which reduced training time by 95%. Due to governance concerns, the model couldn’t run locally; it had to run on Snowflake. But when I ran it on Snowflake? Zero improvement. After researching how to run concurrent tasks with worker processes on Snowflake, I discovered joblib’s parallel_backend with the loky backend. When I implemented it? Still zero improvement. Almost ready to give up, I tried the threading backend instead, and training time dropped by 81% (1h46min to 20min). Suddenly we could run about 20 experiments a day, which could significantly improve churn prediction precision over time.

from joblib import parallel_backend
from snowflake.ml.modeling.model_selection import GridSearchCV as SnowGridSearchCV

grid_search = SnowGridSearchCV(
    ...  # estimator, param_grid, and column settings as in the original pipeline
)

# The loky (process-based) backend gave no speedup here; the threading backend did.
with parallel_backend('threading'):
    grid_search.fit(train_df)
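
Why threading and not loky? I can only offer a hypothesis rather than a verified explanation: with snowflake-ml-python the heavy lifting runs inside the Snowflake warehouse, so client-side workers spend most of their time waiting on remote calls. Threads can share the same Snowpark session and keep several of those calls in flight at once, whereas process-based workers would each need their own serialized session state, which doesn’t help when the bottleneck isn’t local CPU.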


When the Chief Data Officer reviewed the results, he noted that they “looked pretty good”.

The Impact: Immediate Results and Long-Term Value

The numbers speak for themselves:

Metric | Before | After | Improvement
Training Time | 5 hours | 20 minutes | ↓93%
Precision | 46% | 60% | ↑30%
Recall | 36% | 50% | ↑39%
Experiments per Day | 1 | 24 | ↑2300%
Potential ARR Protected | - | $1.8M | -
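
(The experiments-per-day figure presumably reflects back-to-back runs over a standard 8-hour working day: 480 minutes ÷ 20 minutes ≈ 24.)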

Business Impact

  • The improved model identified at-risk customers with higher accuracy, protecting $1.8M in ARR in July alone.
  • Reducing training time to 20 minutes enabled data scientists to focus on strategic tasks, accelerating innovation.
  • The optimized pipeline, built on reusable CI/CD automation and monitoring, serves as a blueprint for future models, reducing development time and costs.

Key Takeaways

  • Focus on optimizations with the greatest ROI.
  • Tie technical improvements to revenue and efficiency.
  • Create reusable frameworks to scale AI initiatives.
  • Deliver results despite resource limitations.

How are you balancing AI innovation with operational efficiency?