Today, foundation model training at scale (and large-scale training in general) depends on a tight coupling of a specific hardware configuration to the runtime environment and software dependencies. This coupling heavily constrains the settings (platform and infrastructure) and the windows of time in which such workloads can run, in part because demand for this infrastructure exceeds supply. It further limits participation to scientists who have access both to large-scale computational resources and to the systems expertise required to make progress under such tight coupling.
IBM Research is addressing this problem with a cloud-native software/middleware stack designed to (1) use cloud-native abstractions to decouple software from the platform and infrastructure; (2) leverage hardware and middleware features to improve speed and scalability when they are available at runtime; and (3) provide an interface for scientists and domain experts that requires no expertise beyond writing and running Python programs on their laptops.
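To make goal (3) concrete, the sketch below shows the kind of artifact such an interface targets: an ordinary, single-process Python training routine that a scientist can write and test on a laptop, which a decoupling middleware layer could then schedule onto whatever accelerated infrastructure is available at runtime. This is a hypothetical illustration only, not the actual IBM stack API; the function, model, and data here are invented for the example.

```python
# Hypothetical illustration -- not the IBM middleware API.
# Plain PyTorch code a domain scientist might write and run locally;
# the middleware's role (per the abstract) is to run the same program
# at scale without requiring cluster-specific changes.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train(model: nn.Module, dataset: TensorDataset, epochs: int = 1) -> nn.Module:
    """Ordinary single-process training loop; no infrastructure-specific code."""
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model


if __name__ == "__main__":
    # Runs as-is on a laptop with synthetic data; scaling it out is the
    # middleware's responsibility rather than the scientist's.
    data = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    train(model, data)
```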
Presenter Bios
Marquita Ellis is an early-career researcher at the IBM Thomas J. Watson Research Center. Within Hybrid Cloud Research at IBM, Ellis’s interests include combining HPC and cloud technologies to accelerate discovery and to broaden participation in the sciences and information technology for all.
Davis Wertheimer is a Research Scientist at the IBM Thomas J. Watson Research Center. His current work tackles the same fundamental problem of deep learning under constraints, with the aim of democratizing the power of large neural models, though the scale, domain, and constraints are now entirely different.