How to Build a Market Leading AI Product: Strategies for Hiring Elite ML Teams

Erik Gafni

January 1, 2025

This document gives general guidelines for hiring your AI team, covering role types and expectations, matching positions to project requirements, and interview structure.

Types of Positions

ML Engineer (MLE)

An MLE is the Swiss army knife and the bread and butter of an ML team. They are always very strong software engineers with at least an intermediate level of ML knowledge. They could work on any part of the ML system stack, but are usually either writing code to support R&D efforts or writing code to keep things running smoothly in production. They spend most of their time in an IDE rather than in notebooks: they run experiments, develop data pre-processing techniques, run open-source repositories, sometimes reproduce papers, and even do ETL or data engineering in a pinch.

An MLE might not have an advanced degree in ML, but is at least self-taught via the many online courses and books available. They might not have a deep understanding of foundational ML mathematics such as probability, statistics, and linear algebra, but they know enough to be dangerous.

These roles vary significantly: some MLEs never really touch the model, while others write custom model architectures. The most productive person on a team is usually the MLE who can both write production-quality software and understand the ML science.

ML Scientist / Researcher / Data Scientist

An ML Scientist spends most of their time improving a model by running experiments. This generally involves analyzing model performance characteristics, investigating the training and test data, trying new pre-processing or data augmentations, exploring new neural network architectures, trying different optimization techniques, and staying on top of the academic literature.

They’re usually working on a problem with an unknown solution (research) and spend a lot of time in Jupyter notebooks. Ideally they have deep domain experience in whatever your product requires, but this is not strictly necessary: someone who knows one ML domain can pick up another quickly, especially if there are other domain experts on the team to train them.

An ML Scientist is usually capable of creating custom neural network architectures for a particular type of dataset and/or task. Top performers on Kaggle are often very good ML Scientists.

MLOps Engineer

An MLOps Engineer does DevOps for a machine learning team. They are responsible for making the development experience of the ML Scientists and Engineers as incredible as possible, shortening their dev cycle and maximizing their efficiency.

They are responsible for standard DevOps tasks such as CI/CD, version control, cloud services, logging, billing, and infrastructure. They tailor these solutions for ML teams: making sure CI/CD has access to a GPU, running the ML pipeline end-to-end every night to catch model regressions, automating model deployment and observability, providing ML job-submission infrastructure, making environments reproducible, dockerizing repositories, etc.

They additionally perform ML-specific tasks such as model versioning and packaging, experiment tracking, ML/data orchestration, monitoring of ML metrics and performance, and alerting.

If you want your team to move fast, hire an MLOps specialist as early as possible to keep your ML team efficient and productive and to avoid things like model regressions! At the very least, make sure you have basic MLOps in place, like CI/CD.
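The nightly end-to-end run described for the MLOps role can finish with a simple regression gate. The following is a minimal sketch, not a prescribed implementation: the function name, the metric names, the values, and the absolute tolerance are all hypothetical, and it assumes every metric is higher-is-better.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MetricCheck:
    """Result of comparing one metric from the nightly run to its baseline."""
    name: str
    baseline: float
    current: Optional[float]
    passed: bool


def check_for_regressions(baseline, current, tolerance=0.01):
    """Compare higher-is-better metrics against a stored baseline.

    A metric fails if it is missing from the current run or has dropped
    by more than `tolerance` (absolute) below the baseline value.
    """
    results = []
    for name, base_value in sorted(baseline.items()):
        new_value = current.get(name)
        passed = new_value is not None and new_value >= base_value - tolerance
        results.append(MetricCheck(name, base_value, new_value, passed))
    return results


if __name__ == "__main__":
    # Hypothetical metric values; a real job would load these from the
    # experiment tracker or a metrics file written by the nightly pipeline.
    baseline = {"accuracy": 0.91, "recall": 0.84}
    nightly = {"accuracy": 0.915, "recall": 0.80}

    failures = [c for c in check_for_regressions(baseline, nightly) if not c.passed]
    for check in failures:
        print(f"REGRESSION {check.name}: {check.baseline} -> {check.current}")
    # A CI job would typically exit non-zero here to fail the build.
```

Keeping the gate as a pure function over two metric dictionaries makes it easy to unit test and to call from whichever CI system the team already uses.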

ML Data / ML ETL Engineer

The ML Data / ETL Engineer sometimes falls under the “MLOps” category, but we believe this role is important enough to deserve its own position. We are huge fans of Dagster for all things ETL. Note that ETL is often still owned by MLEs, but we discuss it in this section.

The ML ETL Engineer deals with ingesting and digesting all of the raw data that goes into model training and validation, and builds the orchestration system that automates the ML life cycle: data ingestion, data pre-processing, model training, model optimization, and model deployment. The system should provide continuous testing and deployment of all stages of the pipeline, as well as model observability so that the team is alerted if a model starts to underperform.
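The life cycle above is essentially a dependency graph of stages. As a rough sketch of that idea only, the example below uses the standard library's `graphlib` rather than Dagster or Airflow, which add scheduling, retries, and observability on top. The `run_pipeline` helper and the placeholder handlers are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages that must finish before it starts.
# Stage names follow the life cycle listed above.
PIPELINE = {
    "data_ingestion": set(),
    "data_preprocessing": {"data_ingestion"},
    "model_training": {"data_preprocessing"},
    "model_optimization": {"model_training"},
    "model_deployment": {"model_optimization"},
}


def run_pipeline(stages, handlers):
    """Run every stage once, in dependency order, and return that order."""
    order = list(TopologicalSorter(stages).static_order())
    for stage in order:
        handlers[stage]()
    return order


if __name__ == "__main__":
    # Placeholder handlers that only log; real ones would call the
    # ingestion, training, and deployment code.
    handlers = {name: (lambda n=name: print(f"running {n}")) for name in PIPELINE}
    run_pipeline(PIPELINE, handlers)
```

Declaring the dependencies as data, rather than hard-coding the call order, is the same design choice an orchestrator makes: adding a validation stage later means adding one entry to the graph.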

Data Engineers generally focus on orchestration: getting the raw, unprocessed data into the hands of the MLEs with minimum friction. They generally don’t have enough data science and model-specific knowledge to handle the entire data stack, so the model-specific parts are normally handled by the MLEs.

MLEs are usually responsible for the pre-processing steps of the data, for example assembling the training dataset from 10 different tables with some complex pandas/polars/PySpark code.
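To make the "assemble a training dataset from several tables" step concrete, here is a small sketch. It uses the standard library's `sqlite3` so that it runs without extra dependencies; in practice the same joins would usually be written in pandas, polars, or PySpark. All table names, column names, and values are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with three small hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE users (user_id INTEGER PRIMARY KEY, account_age_days INTEGER);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INTEGER);
        CREATE TABLE labels (user_id INTEGER PRIMARY KEY, churned INTEGER);
        INSERT INTO users VALUES (1, 400), (2, 30), (3, 90);
        INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
        INSERT INTO labels VALUES (1, 0), (2, 1);
    """)
    return conn


def build_training_rows(conn):
    """Join per-user features onto the labels, one row per labeled user."""
    query = """
        SELECT l.user_id,
               u.account_age_days,
               COALESCE(o.order_count, 0) AS order_count,
               l.churned
        FROM labels AS l
        JOIN users AS u ON u.user_id = l.user_id
        LEFT JOIN (
            SELECT user_id, COUNT(*) AS order_count
            FROM orders
            GROUP BY user_id
        ) AS o ON o.user_id = l.user_id
        ORDER BY l.user_id
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in build_training_rows(make_demo_db()):
        print(row)
```

The `LEFT JOIN` with `COALESCE` keeps labeled users who have no orders, and starting from the labels table drops users without a label. Both are small decisions that affect what the model is trained on.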

ML Lead

An ML Lead can come in various forms. An ML Tech Lead makes architectural decisions and is very hands-on, but doesn’t actually manage people. A Principal ML Lead makes architectural decisions, is hands-on, and manages people.

An ML Director is relatively hands-off when it comes to coding, but manages people and high-level strategy. The ideal candidate has a lot of experience getting models into production and leading teams, and understands all of the base skill sets required to do ML (science, engineering, ML practice, and systems architecture). These people are very rare, but if you can get one on your team early, you avoid the kind of technical debt that kills companies.

An ML Lead has seen the common pitfalls, like not having CI/CD implemented from the beginning, and will make sure the data is collected properly and models are assessed accurately so you don’t overinflate your results (ML models often look like they’re working when they have really just learned some spurious correlation). They’ll also make sure that basic MLOps is in place so that the ML Scientists are doing research and data science, rather than spending time hacking together horrible solutions to infrastructure problems.

Newer teams usually can’t afford a full-time ML Lead, but many of our customers find fractional and interim ML team leads extremely valuable. This person must be a superb cultural fit and a team player focused on enabling everyone else on the team, and thus should usually be heavily incentivized with stock options, particularly if full time.

Assessment of Required Skills

During an interview, you should rank people along the following axes of base skills. Here are the high-level attributes you should be interviewing for in each position.

Base Skills

ML Practice
- Experience at a big tech company or fast growing startup working on ML
- Worked on an ML team of 4+
- Has built and shipped production models end-to-end that had business impact
- Knows the ML ecosystem and libraries like pytorch_lightning
- Skilled at: training, evaluating, and deploying models

Software Engineering
- Worked on a large production code base with a team of 4+
- Shipped and maintained production code with large user base
- Has worked on a large, complex distributed software system
- Skilled at: Python, Git, cloud, web development, Linux, Bash

ML Theory / Math
- MSc or PhD in math or AI
- Or equivalent experience, e.g. success in another scientific and quantitative field such as physics or bioinformatics
- Published scientific papers in prestigious journals
- Domain experience with your problem (e.g. NLP, vision, audio, etc.)
- Has won Kaggle competitions
- Skilled at: ML theory, probability, statistics, linear algebra, optimization

ML Systems Architecture
- Orchestration of complex data and machine learning systems
- Orchestrates data ingestion, data pre-processing, model training, model optimization, model deployment
- Skilled at: Airflow, Dagster, distributed systems, data pre-processing, production systems

Team Composition

This section explains who to hire and when.

Guidelines for team composition change a lot from project to project. If you’re building your own custom neural network or gradient-boosted tree model, you’ll need a lot more Machine Learning Scientists. If you’re wrapping GPT-4 via an API and using Pinecone, you mostly need MLEs.

A general rule of thumb is to start your team with MLE generalists and only hire Machine Learning Scientists, MLOps Engineers, or Data / ETL Engineers as you need those parts of your system to advance faster. You can also cover many of these functions with fractional-time consultants who specialize in them, without having to hire people to work on them full time! This is incredibly beneficial for startups.

For example, many MLOps tasks require a lot of specific expertise and can be set up very quickly by a part-time consultant. This is also often work that employees don’t want to do because it is one-off and doesn’t really help their personal career growth, so it hurts morale to make them do it.

Additionally, once these tasks are done, they don’t require a lot of maintenance. It is also often true that ML Scientists cannot do their job well until baseline MLOps, ML Engineering, and Data Engineering / ETL are already set up.

Essential tasks that can be accomplished by a specialized fractional-time AI engineer at lightning speed:

- Interim Team Lead
- CI/CD with GPUs
- Cloud infrastructure as code with Terraform
- Cloud billing alerts
- Cloud-agnostic ML training job infrastructure
- Model distillation / optimization
- Setting up ETL with Dagster

Interviewing For Different Positions

Here are some example questions that assess each of the main skill sets required by the different positions on a team. This is an area where Eventum excels and a big part of our value proposition, particularly for companies that do not yet have much internal ML experience and thus have little hope of conducting a successful technical interview.

Interviewing well is quite difficult, and we aim for a very low false positive rate: if we’re not completely sure that someone is great, we won’t hire them.

Example Interview Questions

ML Practice
- Walk me through a past project, while I dig into all the gory details.
- Mock ML problem: Here is a set of data and a modeling objective, walk me through how you would solve it
- How do you configure your training pipeline?
- How do you do hyper-parameter sweeps?
- Do you use PyTorch Lightning? Why or why not?
- How would you set up multi-host distributed GPU training?
- How do you set up CI/CD in an ML system?

AI Domain specific questions
- How does self supervised learning work in computer vision?
- What are some common ways to do cross validation in time series?
- What is the best way to validate GPT-4 output?
- What are some things you can do with open-source LLMs that you can’t do with closed-source ones?
- What are some of the tricks involved with training an LLM?
- What is LoRA and what is it used for?
- What are some different methods to fine tune Stable Diffusion? What are their advantages / disadvantages?
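For the time-series cross-validation question in the list above, one common answer is an expanding-window (walk-forward) split. The sketch below is only one way to write it: the function name and parameters are illustrative, and it assumes the samples are already sorted by time.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs for time-ordered data.

    Each fold trains on everything before its test window, so the model
    never sees observations from the future of the data it is scored on.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_splits):
        test_start = first_test_start + fold * test_size
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test


if __name__ == "__main__":
    for train, test in expanding_window_splits(n_samples=10, n_splits=3, test_size=2):
        print(f"train={train} test={test}")
```

A strong candidate will usually contrast this with ordinary shuffled k-fold, which leaks future information into training when observations are ordered in time.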

Software Engineering
- Basic LeetCode-style coding tests of increasing difficulty
- What is your preferred way to do package management in production?
- What does git squashing do?
- What is a git rebase?
- Explain Big O run-time analysis
- A take home assignment

ML Theory / Math
- General questions on ML theory, probability, statistics, linear algebra, and optimization
- Explain the optimization procedure of a GAN
- Explain the architecture of a ConvNet
- Explain Batchnorm and Dropout. Why do they help?
- What do the QKV of Attention represent?
- Explain Double Descent
- Explain Diffusion
- What are some techniques required to train a transformer?
- When would you use a UNet or a ResNet?
- What are skip connections for?
- What is model distillation? What are some techniques to do it?
- What is a p-value?
- What is Bayes’ rule?
- What is Bias vs Variance?
- Explain PCA
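For the Bayes’ rule question in the list above, a short numeric sketch shows the kind of reasoning a good answer should reach. The function name and the numbers in the usage example are hypothetical, and the sketch assumes a binary hypothesis.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) via Bayes' rule for a binary hypothesis H and evidence E.

    prior               = P(H)
    likelihood          = P(E | H)
    false_positive_rate = P(E | not H)
    """
    # Total probability of observing the evidence, P(E).
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


if __name__ == "__main__":
    # Hypothetical numbers: 1% base rate, 90% detection rate, 5% false alarms.
    print(round(posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05), 3))
```

The point candidates should be able to articulate is that a low base rate keeps the posterior small even when the detector looks accurate.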

ML Systems Architecture
- Explain the design decisions of Airflow vs. Dagster vs. a bunch of microservices
- How do you ensure the privacy and security of sensitive data in an ML system?
- What considerations do you take into account when designing a scalable and reliable ML system?
- How do you handle model versioning and model serving in a production environment?
- Can you discuss the challenges and considerations in integrating real-time streaming data with an ML system?
- How do you approach feature engineering and feature stores in a large-scale ML system?
- How do you design for model retraining and updating in a dynamic ML system?

Culture
- Unfortunately, this is one of the hardest, if not impossible, things to assess in an interview, and it is often left to a “gut feeling”. This is an area where Eventum’s consult-to-hire model provides a huge advantage: you get to take your time working with someone part time, at low risk, before bringing them on full time.
- How well can this person communicate?
- Do they have attitude issues?
- Are they a team player?
- Are they self-motivated?
- Do they go out of their way to solve problems and provide value?
