ML-Infrastructure: Build vs. Buy vs. Open-Source

Notes from TWIMLCon’s Unconference session

At the inaugural TWIMLCon in San Francisco, I led an unconference session focused on a topic that is on the minds of many companies: whether to build ML infrastructure in-house, whether to buy it, or whether to just leverage open-source.

This unconference session turned out to be very popular, with lots of lively discussion. Here are the notes from the session (anonymized for privacy):

Why (not) use Open-Source ML Infrastructure?

Most teams that find themselves in need of ML Infrastructure start by looking at reference implementations of ML Platforms from large tech companies like Uber (Michelangelo) and Airbnb (Bighead). These platforms provide a good rubric of what a typical ML platform might require. However, since none of these implementations are open-source, all teams must decide whether to build, buy, or piece together open-source components for their infrastructure.

The general consensus among the attendees was that current open-source offerings for ML Infrastructure, though progressing rapidly, are not mature enough for most teams (unlike, e.g., MySQL for databases or Airflow for data pipelines). They either require an immense amount of setup and hacking or are inadequate in the functionality they offer. In addition, someone on the team must become an expert in that platform and keep up with its frequent changes.

When to Build ML Infrastructure In-House?

The consensus was that building in-house makes sense if your application has a very specialized use case (e.g., prediction latency must be <20 ms, or your models must work with a legacy model serving system). In this case, a team might spend a long time customizing an off-the-shelf offering, making the Buy decision less worthwhile. In addition, if your setup requires a lot of customization, it is worth building the required competency in-house instead of depending on a third party.

We took an informal poll on how long it took a team to build their ML Platform in-house. The answers spanned quite a spectrum:

Depending on the complexity of the platform, building in-house can take anywhere from a couple of solid ML Engineers working over a few quarters (small shop) to a dozen engineers and two years (for a serious deployment).
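To put that range in comparable units, here is a rough back-of-the-envelope sketch in engineer-months. The specific numbers are assumptions made for illustration, not figures from the poll: "a couple" is read as 2 engineers and "a few quarters" as 3 quarters (9 months).

```python
def engineer_months(engineers: int, months: int) -> int:
    """Total effort as headcount multiplied by calendar duration."""
    return engineers * months

# Assumption: "a couple" = 2 engineers, "a few quarters" = 3 quarters (9 months).
small_shop = engineer_months(engineers=2, months=9)

# "A dozen engineers and two years" = 12 engineers for 24 months.
serious_deployment = engineer_months(engineers=12, months=24)

print(small_shop, serious_deployment)    # 18 288
print(serious_deployment / small_shop)   # 16.0
```

Under these assumptions, the two ends of the spectrum differ by more than an order of magnitude in effort, which is worth keeping in mind when scoping an in-house build.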

An attendee noted that sometimes it may be unclear what a team requires from their ML Infrastructure. In that case, trying to build a prototype in-house can help identify the gaps and loopholes to better inform the Build vs. Buy decision.

Why Buy ML Infrastructure?

The strongest reason to buy an ML Platform vs. building in-house is that building in-house represents an opportunity cost.

The time that your team spends building ML Infrastructure is the time spent not doing something else, e.g., product features, better instrumentation.

— TWIMLCon Attendee

If building infrastructure is not going to help you differentiate in your business (a low leverage activity), don’t build in-house. Particularly if you’re on a small team, building infrastructure is not the best use of your resources; they could be better spent on product development and servicing customers.

When asked to list the justifications to upper management for buying an ML Platform, attendees listed the following:

It increases the productivity of the data science team (as noted by an attendee, “even if I can save one hour for an engineer every day, that’s 300+ hours saved a year”)
Everything in one place (“My team doesn’t need to go to five different places to get something done”)
The team needs to have access to the latest and greatest tools and techniques.
We can bring more products to market, faster.

How do I run a process to Buy ML Infrastructure?

First, know what you are looking for. This was probably the point most highlighted by multiple attendees. For instance, have answers to the following questions:

Which infrastructure does my ML Platform need to support? E.g. AWS? GCP? Azure? On-prem?
What language support do I require: R, Python, Java?
What problem am I looking to solve with this platform? What do I really want to get out of my ML Platform? E.g. scalable training, faster deployment, improved collaboration among team members, model monitoring, etc. Different vendors are stronger in different areas.
When do I need the platform to be fully up and running?
What is my realistic budget incl. setup fees, support contracts, onboarding and training?
What tangible business metrics do I want to impact with this purchase? What does success look like in this case?
Will we need a large amount of support or are we well equipped to service the platform ourselves?

Once you know what you want:

Research vendor offerings. Read articles and blogs. Listen to podcasts. Talk to your peers at other companies.
Do demos with vendors and understand if their offering will meet your needs. (Make sure you are armed with your list of requirements to prevent your team from getting dazzled by a shiny toy.)
For the vendors that meet your needs, do a thorough trial. Make sure you involve your stakeholders (e.g., IT, Engineering, DevOps, PM, Leadership) so you have adequate buy-in internally.
Pick a platform that fits your budget, meets your needs, and comes with the support you require.

As you might imagine, this process can take several weeks, so plan accordingly.
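One way to keep the requirements list front and center during demos and trials is a simple weighted scorecard. The sketch below is only an illustration: the requirement names, weights, vendor names, ratings, and costs are all hypothetical, and nothing here was prescribed in the session. It filters out vendors that miss a must-have or exceed the budget, then ranks the rest.

```python
from dataclasses import dataclass

# Hypothetical must-haves and weights; replace with your own requirements.
MUST_HAVES = {"runs_on_aws", "python_support"}
WEIGHTS = {
    "scalable_training": 3,
    "faster_deployment": 3,
    "model_monitoring": 2,
    "collaboration": 1,
}

@dataclass
class Vendor:
    name: str
    features: set       # hard requirements the vendor covers
    scores: dict        # 0-5 rating per weighted criterion, from demos and trials
    annual_cost: float  # incl. setup fees, support, onboarding, training

def weighted_score(vendor: Vendor) -> int:
    """Sum of weight * rating; unrated criteria count as 0."""
    return sum(w * vendor.scores.get(c, 0) for c, w in WEIGHTS.items())

def shortlist(vendors, budget):
    """Drop vendors missing a must-have or over budget, then rank the rest."""
    eligible = [
        v for v in vendors
        if MUST_HAVES <= v.features and v.annual_cost <= budget
    ]
    return sorted(eligible, key=weighted_score, reverse=True)

# Made-up vendors, purely for illustration.
vendors = [
    Vendor("VendorA", {"runs_on_aws", "python_support"},
           {"scalable_training": 4, "faster_deployment": 3,
            "model_monitoring": 5, "collaboration": 2}, 80_000),
    Vendor("VendorB", {"runs_on_aws"},  # misses python_support
           {"scalable_training": 5, "faster_deployment": 5,
            "model_monitoring": 5, "collaboration": 5}, 50_000),
    Vendor("VendorC", {"runs_on_aws", "python_support", "r_support"},
           {"scalable_training": 5, "faster_deployment": 4,
            "model_monitoring": 2, "collaboration": 3}, 120_000),
]

for v in shortlist(vendors, budget=150_000):
    print(v.name, weighted_score(v))
```

Treating must-haves as a hard filter rather than as heavily weighted criteria keeps an impressive demo from compensating for a missing requirement, which is the "shiny toy" risk mentioned above.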

Advice for Vendors of ML Infrastructure

The biggest, and rather surprising, piece of advice was: push your customers on their needs first. Often, the customer is looking to better understand the landscape of tools out there and see what might match their needs.

So helping the customer identify and articulate their needs is extremely valuable.

Offer to partner with customers on trials to identify where your platform might have gaps with respect to their use case. You’d rather know this early on than once a trial is done. Communicate clearly what integrations you offer now and which you are willing to offer in the future.

Articulate accurately what you do and do not do — if a platform does everything, it does nothing.

— Platform Buyer

Vendors are over-promising and under-delivering. If you can do the opposite, you will stand out!

Thank you to TWIMLCon for hosting this Unconference session and to everyone who attended. If I missed anything or if you have any questions, please reach out at manasi@verta.ai.

Unrelated to TWIMLCon, my company works with many teams who have faced the build-vs-buy-vs-open-source dilemma and we have created a cheat sheet of tradeoffs involved in ML Infrastructure.

Submit your email to check it out, and feel free to reach out with questions.

About Manasi:

Manasi Vartak is the founder and CEO of Verta, an MIT-spinoff building software to enable production machine learning. Verta grew out of Manasi’s Ph.D. work at MIT CSAIL on ModelDB. Manasi previously worked on deep learning as part of the feed-ranking team at Twitter and dynamic ad-targeting at Google. Manasi is passionate about building intuitive data tools like ModelDB and SeeDB, helping companies become AI-first, and figuring out how data scientists and the organizations they support can be more effective. She got her undergraduate degrees in computer science and mathematics from WPI.

About Verta:

Verta builds software for the full ML model lifecycle, from model versioning to model deployment and monitoring, all tied together with collaboration capabilities so your AI & ML teams can move fast without breaking things. We are a spin-out of MIT CSAIL, where we built ModelDB, one of the first open-source model management systems.
