Rethinking AI from the Ground Up: Hongseok Namkoong Aims for Smarter Data, Not Bigger Models

A Columbia-led framework challenges AI’s obsession with scale—and opens the door to more efficient, principled development.

September 03, 2025

In today’s AI arms race, bigger often wins. Tech companies compete to build ever-larger models trained on massive datasets, pouring resources into compute infrastructure and experimentation. But Hongseok Namkoong, Assistant Professor in the Decision, Risk, and Operations Division at Columbia Business School and a member of the Data Science Institute, is taking a different approach—one that could reshape how AI systems are trained in the first place.

“We’re still just beginning to understand how to systematically choose and curate training data,” Namkoong says.

Namkoong’s research is grounded in a core belief: the most important challenges in AI today involve making systems more trustworthy, adaptable, and empirically grounded. He focuses on what he sees as a foundational but understudied problem—how to select and combine the right data sources for training machine learning models.

A New Framework for Data Mixture Optimization

In a recent study, Namkoong and his team introduced a Bayesian optimization framework for data mixture optimization—a method for selecting the most effective combination of data sources when training large models. Until now, researchers have typically relied on ad hoc methods, intuition, or trial-and-error to make these decisions.

This new framework uses probabilistic scaling laws to extrapolate from small-scale experiments to larger training regimes, adaptively selecting data mixtures, model sizes, and training durations. The method yielded up to 3.3x efficiency gains over standard approaches such as random search and traditional Bayesian optimization.
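To make the idea of extrapolating from small experiments concrete, here is a minimal sketch in Python. It is a toy illustration, not the team's method: it fits a plain (non-probabilistic) power law by least squares, it does no adaptive selection, and the mixture names, token budgets, and loss values are all made up for the example.

```python
import numpy as np


def fit_power_law(tokens, losses):
    """Fit loss ~ a * tokens**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(tokens), np.log(losses), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(a, b, tokens):
    """Evaluate the fitted power law at a given token budget."""
    return a * tokens ** (-b)


# Hypothetical small-scale training budgets (in tokens).
small_budgets = np.array([1e6, 2e6, 4e6, 8e6])

# Synthetic losses for two hypothetical data mixtures (assumed values).
# mix_A looks better at every small budget, but mix_B improves faster.
small_scale_losses = {
    "mix_A": 12.0 * small_budgets ** -0.08,
    "mix_B": 30.0 * small_budgets ** -0.13,
}

# Budget of the large run we actually care about (assumed).
target_budget = 1e9

fits = {
    name: fit_power_law(small_budgets, losses)
    for name, losses in small_scale_losses.items()
}
predicted = {
    name: predict_loss(a, b, target_budget) for name, (a, b) in fits.items()
}
best_at_target = min(predicted, key=predicted.get)

for name in small_scale_losses:
    print(f"{name}: smallest-budget loss {small_scale_losses[name][0]:.2f}, "
          f"predicted loss at target {predicted[name]:.2f}")
print("Preferred mixture at target scale:", best_at_target)
```

In this toy setup, the mixture that looks best in the small runs is not the one the fitted curves favor at the larger budget, which is the kind of decision that extrapolation is meant to inform. The framework described in the article goes further, using probabilistic scaling laws and choosing adaptively which mixtures, model sizes, and training durations to try next.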

“This optimization problem wasn’t even articulated before this paper,” Namkoong says.

“We are the first to frame the question, propose an answer—and now we’re going to dive into the research. My hope is to take the entire field in a new direction.”

Namkoong sees this work as a step toward helping a broader range of researchers and institutions build effective AI systems. By replacing trial-and-error with principled guidance, the framework makes it easier to optimize training strategies in a variety of settings.

“My goal is to help the research community develop tools that are broadly accessible—not only to major technology companies, but to smaller institutions and nonprofits as well,” he says.

“Better foundations does not mean a newer generation of transformers. It means a better way to curate data sets. And in some sense, this is a woefully understudied topic in the research community.”

Infrastructure That Accelerates Research

The project was supported in part by Empire AI, which provided access to high-performance computing resources across New York State. That access allowed Namkoong’s team to dramatically accelerate their work.

“With Empire AI, what would have taken a year or two using our own machines took just 2–3 weeks,” Namkoong says.

While infrastructure played a key role in scaling the experiments, the intellectual work—the problem formulation, modeling, and strategy—was developed at Columbia, where Namkoong’s lab is advancing a broader research agenda focused on trustworthy, efficient, and adaptable AI.