The most sophisticated AI systems to date are capable of impressive feats, from driving cars through city streets to writing human-like prose. But they share a common bottleneck: hardware. Developing systems at the cutting edge often requires enormous computing power. For example, training AlphaFold, the DeepMind program that predicts protein structures, took a cluster of hundreds of GPUs. Underscoring the problem, one source estimates that it would take 355 years to train GPT-3, the language-generating system from AI startup OpenAI, on a single GPU.
New techniques and chips designed to accelerate certain aspects of AI system development promise to reduce hardware requirements (and in some cases already have). But working with these techniques demands expertise that can be hard for smaller companies to come by. At least, that's the claim of Varun Mohan and Douglas Chen, co-founders of infrastructure startup Exafunction. Emerging from stealth today, Exafunction is developing a platform that abstracts away the complexity of using hardware to train AI systems.
“Improvements [in AI] are often underpinned by significant increases in … computational complexity. As a result, companies are forced to make large investments in hardware to realize the benefits of deep learning. This is very difficult because the technology is improving so quickly, and the size of the workloads grows rapidly as deep learning proves its value within a company,” Chen told TechCrunch in an email interview. “The specialized accelerator chips required to perform deep learning computations at scale are scarce. Using these chips effectively also requires esoteric knowledge that is uncommon among deep learning practitioners.”
With $28 million in venture capital, $25 million of which came from a Series A round led by Greenoaks with participation from Founders Fund, Exafunction is looking to address what it sees as a symptom of the shortage of AI expertise: idle hardware. GPUs and the aforementioned specialized chips used to “train” AI systems (i.e., fed the data the systems use to make predictions) are often underutilized. Because they complete some AI workloads so quickly, they sit idle waiting for other parts of the hardware stack, such as processors and memory, to catch up.
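To make the idle-hardware problem concrete, here is a toy calculation (not anything Exafunction has published; the step timings are invented) showing how a fast accelerator ends up busy only a small fraction of the time when data preparation and compute run serially rather than overlapped:

```python
# Illustrative sketch: why fast accelerators sit idle when the rest of
# the pipeline can't keep up. All timings below are hypothetical.

def gpu_utilization(gpu_step_s: float, data_prep_s: float) -> float:
    """Fraction of wall-clock time the GPU is busy when data loading
    and compute run serially (no prefetching or overlap)."""
    return gpu_step_s / (gpu_step_s + data_prep_s)

# Suppose the GPU finishes its part of each training step in 20 ms,
# but the CPU-side data pipeline takes 120 ms per batch:
util = gpu_utilization(0.020, 0.120)
print(f"GPU busy {util:.0%} of the time")  # GPU busy 14% of the time
```

Under these assumed numbers the GPU is busy only about a seventh of the time, in the same ballpark as the sub-15% average utilization reported for many real deployments.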
Lukas Biewald, founder of AI development platform Weights &amp; Biases, reports that nearly a third of his company’s customers use less than 15% of their GPU capacity on average. Meanwhile, in a 2021 survey commissioned by Run:AI, which competes with Exafunction, only 17% of companies said they were able to achieve “high utilization” of their AI resources, while 22% said their infrastructure mostly sits idle.
The costs add up. According to Run:AI, 38% of companies had an annual budget for AI infrastructure, including hardware, software, and cloud fees, exceeding $1 million as of October 2021. OpenAI is estimated to have spent $4.6 million training GPT-3.
“Most companies working with deep learning go into business to focus on their core technology, not to spend their time and bandwidth worrying about optimizing resources,” Mohan said via email. “We believe there is no meaningful competitor solving the problem we are focused on: abstracting away the challenges of managing accelerated hardware such as GPUs while delivering superior performance to customers.”
The seed of an idea
Before co-founding Exafunction, Chen was a software engineer at Facebook, where he helped build tooling for devices like the Oculus Quest. Mohan was the CTO of Nuro, an autonomous delivery startup, responsible for managing the company’s autonomous infrastructure teams.
“As our deep learning workloads [at Nuro] grew in complexity and demand, it became clear that there was no purpose-built solution for scaling our hardware accordingly,” said Mohan. “Simulation is a strange problem. Perhaps counterintuitively, as your software improves, you need to simulate even more iterations to find edge cases. The better your product, the harder you have to look for bugs. We learned how difficult this was the hard way, and spent thousands of engineering hours trying to squeeze more performance out of the resources we had.”
Exafunction customers either connect to the company’s managed service or deploy Exafunction’s software in a Kubernetes cluster. The technology dynamically allocates resources, offloading computation onto “cost-effective hardware” such as spot instances whenever it is available.
Mohan and Chen demurred when asked about the inner workings of the Exafunction platform, preferring to keep those details under wraps for now. But they explained that, at a high level, Exafunction leverages virtualization to run AI workloads even when hardware availability is constrained, ostensibly boosting utilization while lowering costs.
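Since the company won't say how its virtualization works, the following is only a generic sketch of the underlying idea, not Exafunction's design: workloads are submitted against a virtual pool rather than a named device, and a scheduler places each job on whatever accelerator is free, preferring cheaper capacity (such as spot instances) first. All class names and prices here are hypothetical.

```python
# Generic sketch of decoupling workloads from specific accelerators.
# Not Exafunction's actual architecture; names and prices are made up.

from dataclasses import dataclass, field

@dataclass
class Accelerator:
    name: str
    hourly_cost: float
    busy: bool = False

@dataclass
class VirtualPool:
    devices: list = field(default_factory=list)

    def acquire(self) -> Accelerator:
        """Pick the cheapest idle device; callers never name a device."""
        idle = [d for d in self.devices if not d.busy]
        if not idle:
            raise RuntimeError("no capacity available")
        best = min(idle, key=lambda d: d.hourly_cost)
        best.busy = True
        return best

    def release(self, dev: Accelerator) -> None:
        dev.busy = False

pool = VirtualPool([
    Accelerator("on-demand-a100", hourly_cost=4.10),  # hypothetical price
    Accelerator("spot-a100", hourly_cost=1.23),       # hypothetical price
])

dev = pool.acquire()
print(dev.name)  # spot-a100: the cheaper spot capacity is claimed first
```

The point of the abstraction is the `acquire`/`release` boundary: because jobs never bind to a particular machine, the scheduler is free to shift them onto whatever hardware is cheapest or least contended at that moment.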
Exafunction’s reluctance to reveal details about its technology, including whether it supports cloud-hosted accelerator chips such as Google’s tensor processing units (TPUs), gives some cause for concern. But to quell doubts, Mohan, without naming names, said that Exafunction already manages GPUs for “some of the most sophisticated autonomous vehicle companies and organizations at the forefront of computer vision.”
“Exafunction provides a platform that decouples workloads from acceleration hardware such as GPUs, ensuring maximum utilization: lowering costs, improving performance, and enabling companies to take full advantage of … [The] platform allows teams to consolidate their work on a single platform, without the headache of stitching together a disparate set of software libraries,” he added. “We expect [Exafunction’s product] to advance the market by doing for deep learning what AWS did for cloud computing.”
Mohan may have grand plans for Exafunction, but the startup isn’t the only one applying the concept of “intelligent” infrastructure allocation to AI workloads. Beyond Run:AI, whose product likewise creates an abstraction layer to optimize AI workloads, Grid.ai offers software that lets data scientists train AI models across hardware in parallel. For its part, Nvidia sells AI Enterprise, a suite of tools and frameworks that enables companies to virtualize AI workloads on Nvidia-certified servers.
But Mohan and Chen see a massive addressable market despite the crowdedness. In conversation, they positioned Exafunction’s subscription-based platform not just as a way to knock down barriers to AI development, but as a way for companies facing supply chain constraints to “unlock more value” from existing hardware. (GPUs have become a hot commodity in recent years for a number of reasons.) There’s always the cloud, but, according to Mohan and Chen, it can drive up costs. One estimate found that training an AI model using on-premises hardware is up to 6.5 times cheaper than the least expensive cloud-based alternative.
“While deep learning has a nearly infinite number of applications, two of the ones we’re most excited about are autonomous vehicle simulation and video inference at scale,” Mohan said. “Simulation is at the heart of all software development and validation in the autonomous vehicle industry… Deep learning has also led to exceptional advances in automated video processing, with applications across a wide range of industries. [But] while GPUs are essential to autonomous vehicle companies, their hardware is often underutilized, despite its price and scarcity. [Computer vision applications are] also demanding of computing resources, [because] each new video stream is effectively a stream of data; every camera outputs millions of frames per day.”
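The “millions of frames per day” figure holds up under simple arithmetic; the frame rate below is an assumption (30 fps is typical for automotive cameras, not a number from the article):

```python
# Back-of-the-envelope check on the frames-per-day claim.
fps = 30                         # assumed typical camera frame rate
seconds_per_day = 24 * 60 * 60   # 86,400 seconds in a day
frames_per_day = fps * seconds_per_day
print(f"{frames_per_day:,} frames/day per camera")  # 2,592,000 frames/day per camera
```

At roughly 2.6 million frames per camera per day, a vehicle carrying several cameras generates tens of millions of frames daily, which is why per-frame inference cost dominates these workloads.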
Mohan and Chen say the Series A capital will be used to expand the Exafunction team and deepen the product. The company will also invest in optimizing AI system runtime “for the most latency-sensitive applications” (such as autonomous driving and computer vision).
“While we are currently a strong and flexible development-first team, we look forward to rapidly growing the size and capabilities of our organization in 2022,” Mohan said. “Across virtually every industry, it is clear that as workloads become more complex (and more companies want to leverage deep learning), demand for compute far outstrips [supply]. While the pandemic has highlighted these concerns, this phenomenon and the bottlenecks associated with it are likely to worsen in the coming years, especially as cutting-edge models grow exponentially more demanding.”
Credit: techcrunch.com