MIT startup DataCebo offers a tool to evaluate synthetic data
DataCebo, a spin-off from the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), offers a new tool, dubbed SDMetrics (Synthetic Data Metrics), to help companies benchmark the quality of machine-generated synthetic data by comparing it to real datasets.
The tool, an open-source Python library for model-agnostic evaluation of synthetic tabular data, defines metrics covering statistical fidelity, efficiency, and data privacy, according to Kalyan Veeramachaneni, principal investigator at MIT and co-founder of DataCebo.
“For tabular synthetic data, there is a need to create metrics that quantify how synthetic data compares to real data. Each metric measures a particular aspect of the data, such as coverage or correlation, allowing you to identify specific elements that have been preserved or overlooked during the data synthesis process,” said Neha Patki, co-founder of DataCebo.
Metrics such as CategoryCoverage and RangeCoverage can quantify whether a company’s synthetic data covers the same range of possible values as the real data, Patki added.
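Conceptually, coverage metrics score how much of the real data’s value space the synthetic data reproduces. The sketch below illustrates the two ideas in plain Python; the function names and formulas are ours, not SDMetrics’ actual implementation:

```python
def category_coverage(real_values, synthetic_values):
    """Fraction of distinct real categories that also appear in the
    synthetic column (1.0 = every real category is covered)."""
    real_categories = set(real_values)
    if not real_categories:
        return 1.0  # nothing to cover
    covered = real_categories & set(synthetic_values)
    return len(covered) / len(real_categories)


def range_coverage(real_values, synthetic_values):
    """How much of the real column's numeric range the synthetic
    column spans, clamped to [0, 1]."""
    real_min, real_max = min(real_values), max(real_values)
    if real_max == real_min:
        return 1.0  # a degenerate range is trivially covered
    syn_min, syn_max = min(synthetic_values), max(synthetic_values)
    overlap = min(real_max, syn_max) - max(real_min, syn_min)
    return max(0.0, min(1.0, overlap / (real_max - real_min)))


# Example: the synthetic column misses one category and part of the range.
print(category_coverage(["a", "b", "c", "a"], ["a", "b", "b"]))  # 2 of 3 categories
print(range_coverage([0, 10], [2, 10]))  # → 0.8
```

A score of 1.0 means the synthetic column fully covers the real column’s categories or numeric range; lower scores flag values the generator never produced.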
“To compare correlations, the software developer or data scientist downloading SDMetrics can use the CorrelationSimilarity metric. There are a total of over 30 metrics and more are still in development,” Veeramachaneni said.
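The idea behind a correlation-similarity score is to compare the Pearson correlation of a column pair in the real data against the same pair in the synthetic data, and normalize the gap to a score between 0 and 1. The following is one plausible formulation, not necessarily the library’s exact definition:

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_similarity(real_pair, synthetic_pair):
    """Score in [0, 1]: 1.0 when the synthetic column pair reproduces the
    real pair's correlation exactly, 0.0 at maximal disagreement.
    (An assumed normalization, not copied from SDMetrics.)"""
    r_real = pearson(*real_pair)
    r_syn = pearson(*synthetic_pair)
    return 1 - abs(r_real - r_syn) / 2
```

A synthetic pair that preserves the real correlation scores near 1.0; one with an inverted relationship scores near 0.0.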
Synthetic Data Vault generates synthetic data
The SDMetrics library, according to Veeramachaneni, is part of the Synthetic Data Vault (SDV) project which was first launched at MIT’s Data to AI Lab in 2016. As of 2020, DataCebo owns and develops all aspects of SDV.
The Vault, an ecosystem of open-source libraries for generating synthetic data, was launched with the idea of helping companies build data models to develop new software and applications internally.
“While there is a lot of work in the area of synthetic data, particularly in self-driving cars or imagery, little is being done to help businesses take advantage of it,” Veeramachaneni said.
“The SDV was developed to ensure that companies could download the synthetic data generation packages in cases where no data was available or there was a risk of compromising data privacy,” added Veeramachaneni.
Under the hood, the company says it uses several graphical modeling and deep learning techniques, including Copulas, CTGAN, and DeepEcho.
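As a rough intuition for the copula approach: model each column’s marginal distribution separately, capture cross-column dependence as a single correlation in Gaussian space, then sample new rows that respect both. The toy two-column sketch below, built only on the standard library, is purely illustrative; the Copulas library is far more general:

```python
import random
from statistics import NormalDist


def _pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def gaussian_copula_sample(col_a, col_b, n, seed=0):
    """Sample n synthetic (a, b) rows mimicking two paired real columns:
    empirical marginals plus one Gaussian correlation for dependence."""
    nd = NormalDist()
    m = len(col_a)
    sa, sb = sorted(col_a), sorted(col_b)

    # 1. Map each value to its empirical rank in (0, 1), then to a z-score.
    def to_z(col, ordered):
        return [nd.inv_cdf((ordered.index(v) + 0.5) / m) for v in col]

    za, zb = to_z(col_a, sa), to_z(col_b, sb)
    rho = max(-1.0, min(1.0, _pearson(za, zb)))  # dependence in normal space

    # 2. Draw correlated standard normals (2x2 Cholesky factor), then
    # 3. push them back through the empirical marginals (inverse rank).
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)
        g1 = u1
        g2 = rho * u1 + (1 - rho ** 2) ** 0.5 * u2
        i = min(m - 1, int(nd.cdf(g1) * m))
        j = min(m - 1, int(nd.cdf(g2) * m))
        rows.append((sa[i], sb[j]))
    return rows
```

Sampled rows draw their values from the real columns but in freshly generated combinations, preserving each column’s distribution and the pair’s dependence structure.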
According to Veeramachaneni, the Copulas library has been downloaded more than a million times, and models built with this technique are used by major banks, insurance companies, and companies that focus on clinical trials.
CTGAN, a neural network-based model, has been downloaded over 500,000 times.
Datasets with multiple tables or time-series data are also supported, the founders of DataCebo said.
Copyright © 2022 IDG Communications, Inc.