KISS at OpenAI, #batchforlife, and data science conspiracies

OpenAI. Simple batch processing. SageMaker Algorithms. Data science is changing.

Sep 01, 2023

Keep it simple, stupid. From OpenAI to simple batch processing to the simplicity of AWS SageMaker Algorithms, that’s the theme this week.

Our industry is moving fast, and as data scientists we sometimes forget that the core of our work is simple elegance. We need to strive for simple models, simple architectures, and simple data transformations (when possible, of course).

This week we will look at the conversations that happened about simplicity in data science, learn about batch processing post-deployment (#batchforlife), and see how easy SageMaker Algorithms are to use (🤯)… oh, and a data science conspiracy that impacts all of us.

KISS at OpenAI

Friendly reminder: The fancier the model, the less likely it is to work.

Benchmarking against simple models is an art that junior data scientists (and some seniors) struggle with.

Something in the water this week inspired conversations about simplicity in ML across LinkedIn and Twitter.

One thing that is unfortunately not simple is evaluating the quality of speech-to-text models, especially in an obscure language like Flemish, with words like muggengeheugen and a new dialect every two streets (ask Niels Nuyttens to pronounce this word and then ask Wiljan Cools too...).

Normally you need humans to review all the files and then calculate a mean opinion score. But Silke Plessers wrote a blog post researching the use of PCA-based reconstruction error to automatically evaluate quality.

Check out Silke’s Blog
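To make the idea of PCA-based reconstruction error a bit more concrete, here is a minimal sketch. It is a generic illustration and not the method from Silke's blog: the feature vectors are synthetic, and the function names (`fit_pca`, `reconstruction_error`) are made up for this example. The intuition is that PCA fitted on good samples learns the structure they share, so a sample that does not follow that structure reconstructs poorly.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return the mean and the top principal directions of the rows of X."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(X, mean, components):
    """Mean squared error per row after projecting onto the components and back."""
    centered = X - mean
    reconstructed = centered @ components.T @ components
    return ((centered - reconstructed) ** 2).mean(axis=1)

rng = np.random.default_rng(0)

# Hypothetical "good" feature vectors: close to a 2-D subspace of a 10-D space.
basis = rng.normal(size=(2, 10))
good = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 10))

# Hypothetical "bad" samples: unstructured noise that ignores that subspace.
bad = rng.normal(size=(20, 10))

mean, components = fit_pca(good, n_components=2)
good_err = reconstruction_error(good, mean, components)
bad_err = reconstruction_error(bad, mean, components)

print("mean error, good samples:", good_err.mean())
print("mean error, bad samples: ", bad_err.mean())
```

In a real setting the rows would be features extracted from the audio or transcripts, and choosing those features and the number of components is where most of the work goes.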

Unfortunately for Greg at OpenAI, LLM evaluation is not simple either. Hopefully this research will lead to automated (quantitative) evaluation of LLMs.

#batchforlife

Deploying models in production is already complex enough. Thankfully most machine learning use cases today need batch processing, not streaming. Maria Vechtomova and co. at Marvelous MLOps wrote a great post about deploying models in batch mode.

Marvelous MLOps Substack

What do ML engineers deploy: batch use case

In the article Deployment strategies for ML products, we talked about the need for 3 environments with access to production data (DEV, ACC, PRD) and how those environments are used in the ML deployment process. We have touched a bit on what exactly is being deployed, but it is good to come up with some concrete examples…


Even as streaming use cases become more common, batch processing isn’t going anywhere, especially in AWS SageMaker, where batch transforms make it simple.
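For anyone who has not run a batch job before, the core pattern is small. The sketch below is plain Python, not SageMaker code: `predict_batch` is a hypothetical stand-in for a real model, and the batch size is arbitrary. It only shows the shape of the work: read records in fixed-size chunks, score each chunk, collect the results.

```python
from itertools import islice

def batches(records, batch_size):
    """Yield lists of up to batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def predict_batch(batch):
    """Hypothetical stand-in for a model: flags values above a threshold."""
    return [1 if x > 0.5 else 0 for x in batch]

def run_batch_job(records, batch_size=3):
    """Score all records batch by batch and collect the predictions."""
    predictions = []
    for batch in batches(records, batch_size):
        predictions.extend(predict_batch(batch))
    return predictions

scores = run_batch_job([0.1, 0.9, 0.4, 0.7, 0.6, 0.2, 0.8])
print(scores)  # [0, 1, 0, 1, 1, 0, 1]
```

Because the job works on a finite set of records on a schedule, there is no always-on endpoint to keep healthy, which is a large part of why batch stays simple.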

SageMaker Algorithms and their beautiful simplicity

SageMaker Algorithms allow you to take a model from training to deployment very simply.

Check out the full blog about SageMaker Algorithms and how to deploy them.

Speaking of SageMaker Algorithms 🙈

Data science conspiracy

There is a conspiracy that affects all of data science.

Data science is changing. You need to learn post-deployment data science because all of the value your models bring only comes once they have been deployed.

I just published an intro course on the concepts of machine learning monitoring. This lays the groundwork for the more in-depth course to come.

Take the course here

Shout out to all the great people in the MLOps and Post-Deployment Data Science community

Thanks Shyam Swaroop, James Nesfield, and Nico Verheyden (I promise I won't 🥺)

Thanks Raphaël Hoogvliets for some great conversations this week and giving some insight into how you run your batch processes

Gokhan Ciflikli thanks for writing a great blog on feature drift. One of the best pieces on the topic, check it out -> https://www.gokhan.io/python/model-monitoring-nannyml/

Last but not least, Stijn (Stan) Christiaens the biggest Stan.

And in the final hour Raghu Venkat. Indeed it is! Great to have you around.

And of course everyone else mentioned in this edition: Maria Vechtomova, Silke Plessers, Bojan Tunguz, Ph.D. There are of course many more people, but these are the ones I could remember. Thanks, even if I didn’t mention you here.

Thanks again, until next time, and don’t forget to:

Check out the full blog about SageMaker Algorithms and how to deploy them.

Post Deployment Data Science Newsletter
