Synthetic data’s fine line between reward and disaster

Up to 20% of the data used for training AI is already synthetic — that is, generated rather than obtained by observing the real world — with LLMs using millions of synthesized samples. That could reach up to 80% by 2028 according to Gartner, adding that by 2030, it’ll be used for more business decision making than real data. Technically, though, any output you get from an LLM is synthetic data.

AI training is where synthetic data shines, says Gartner principal researcher Vibha Chitkara. “It effectively addresses many inherent challenges associated with real-world data, such as bias, incompleteness, noise, historical limitations, and privacy and regulatory concerns, including personally identifiable information,” she says.

Generating large volumes of training data on demand is appealing compared to slow, expensive gathering of real-world data, which can be fraught with privacy concerns, or just not available. Synthetic data ought to help preserve privacy, speed up development, and be more cost effective for long-tail scenarios enterprises couldn’t otherwise tackle, she adds. It can even be used for controlled experimentation, assuming you can make it accurate enough.

source

Leave a Comment

Your email address will not be published. Required fields are marked *