AI | October 17, 2025

Goldman Sachs Warns of AI Data Crisis

Goldman Sachs has raised a critical alarm for the AI industry: the supply of human-generated data for training language models has effectively run out. This scarcity threatens to slow AI progress and the development of better tools.

The Synthetic Data Problem

According to Neema Raphael, Goldman Sachs' chief data officer and head of data engineering, the gap is currently being filled with synthetic data: information generated by earlier AI models. While technically unlimited, this approach comes with serious risks:

Lower quality information
Loss of human elements in training data
Potential degradation of future AI models

A Possible Solution: Corporate Data Vaults

Raphael believes there's still untapped potential locked away in proprietary corporate databases, beyond the reach of the public internet. This includes:

Trading flows
Customer interactions
Internal business records

Goldman Sachs itself holds vast amounts of such data. However, there's a catch: "The challenge is understanding the data, understanding the business context, and then being able to normalize it in a way that makes sense for the company to consume it," Raphael explained.

What This Means for AI's Future

The AI industry stands at a crossroads: rely on potentially lower-quality synthetic data, or unlock the proprietary corporate data that is currently kept under lock and key.
