By AI Trends Staff
Ensuring that the huge volumes of data on which many AI applications rely are unbiased and comply with restrictive data privacy regulations is a challenge that a new industry is positioning itself to address: synthetic data production.
Synthetic data is computer-generated data that can be used as a substitute for data from the real world. It does not explicitly represent real individuals. “Think of this as a digital mirror of real-world data that is statistically reflective of that world,” stated Gary Grossman, senior VP of the Technology Practice at Edelman, a public relations and marketing consultancy, in a recent account in VentureBeat. “This enables training AI systems in a completely virtual realm.”
As a rule, the more data an AI algorithm can train on, the more accurate and effective its results tend to be.
To help meet the demand for data, more than 50 software suppliers have developed synthetic data products, according to research last June by StartUs Insights, consultants based in Vienna, Austria.
One alternative for responding to privacy concerns is anonymization: masking or eliminating personal data such as names and credit card numbers from ecommerce transactions, or removing identifying content from healthcare records. “But there is growing evidence that even if data has been anonymized from one source, it can be correlated with consumer datasets exposed from security breaches,” Grossman states. Such re-identification can even be done by correlating anonymized data with public sources, with no security breach required.
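To see how such correlation can work in principle, consider a minimal sketch in Python. Every record, the field names (zip, birth_year, sex), and the helper reidentify are invented for illustration and do not come from Grossman's account. The sketch assumes an "anonymized" table with names removed and a public table that shares a few attributes with it.

```python
# Hypothetical data: an "anonymized" purchase table (names removed) and a
# public table that still carries names. Every value here is made up.
anonymized = [
    {"zip": "02139", "birth_year": 1984, "sex": "F", "purchase": "insulin"},
    {"zip": "94105", "birth_year": 1990, "sex": "M", "purchase": "laptop"},
]
public = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1984, "sex": "F"},
    {"name": "Bob Example", "zip": "94105", "birth_year": 1990, "sex": "M"},
    {"name": "Carol Example", "zip": "94105", "birth_year": 1971, "sex": "F"},
]

# Attributes that are not names but can still single a person out.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def reidentify(anon_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Join the two tables on the shared attributes.

    A record counts as re-identified only when exactly one public row matches.
    """
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])
    matches = []
    for row in anon_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:
            matches.append((names[0], row["purchase"]))
    return matches


linked = reidentify(anonymized, public)
for name, purchase in linked:
    print(name, "->", purchase)
```

In this toy case both purchases are tied back to a name even though the anonymized table never contained one, which is the weakness that synthetic data aims to avoid.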
A primary tool for building synthetic data is the same one used to create deepfake videos: generative adversarial networks (GANs), which pair two neural networks. One network, the generator, produces the synthetic data, and the second, the discriminator, tries to detect whether it is real. Both networks learn over time, with the generator improving the quality of its data until the discriminator cannot tell the difference between real and synthetic.
One goal of synthetic data is to correct for bias found in real-world data. “By more completely anonymizing data and correcting for inherent biases, as well as creating data that would otherwise be difficult to obtain, synthetic data could become the saving grace for many big data applications,” Grossman states.
Big tech companies including IBM, Amazon, and Microsoft are working on synthetic data generation. However, it is still early days and the developing market is being led by startups.
A few examples:
AiFi — Uses synthetically generated data to simulate retail stores and shopper behavior;
AI.Reverie — Generates synthetic data to train computer vision algorithms for activity recognition, object detection, and segmentation;
Anyverse — Simulates scenarios to create synthetic datasets using raw sensor data, image processing functions, and custom LiDAR settings for the automotive industry.
Synthetic Data Can Be Used to Improve Even High-Quality Datasets
Even if you have a high-quality dataset, acquiring synthetic data to round it out often makes sense, suggests Dawn Li, a data scientist at the Innovation Lab of Finastra, a company providing enterprise software to banks, writing in InfoQ.
For example, if the task is to predict whether a piece of fruit is an apple or an orange, and the dataset has 4,000 samples for apples and 200 samples for oranges, “Then any machine learning algorithm is likely to be biased towards apples due to the class imbalance,” Li stated. If 3,800 synthetic examples of oranges are generated, the two classes are balanced at 4,000 samples each, so the model is no longer skewed toward either fruit and can make more accurate predictions.
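The arithmetic of the fruit example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two features (weight and a redness score), their distributions, and the helper jitter_oversample are hypothetical, and the synthetic rows come from simple noise jittering of real rows, a much cruder technique than the generative models discussed in this article.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical features: (weight in grams, redness score from 0 to 1).
apples = [(random.gauss(180, 15), random.gauss(0.8, 0.05)) for _ in range(4000)]
oranges = [(random.gauss(150, 12), random.gauss(0.3, 0.05)) for _ in range(200)]


def jitter_oversample(samples, n_new, noise=0.02):
    """Create n_new synthetic rows by copying random real rows and adding
    small Gaussian noise scaled to each feature's magnitude."""
    synthetic = []
    for _ in range(n_new):
        base = random.choice(samples)
        synthetic.append(tuple(v + random.gauss(0, noise * abs(v)) for v in base))
    return synthetic


# 4,000 apples minus 200 oranges leaves a gap of 3,800 rows to fill.
synthetic_oranges = jitter_oversample(oranges, len(apples) - len(oranges))
balanced_oranges = oranges + synthetic_oranges

print(len(apples), len(balanced_oranges))  # prints "4000 4000"
```

Jittering only produces near-copies of the 200 real oranges, so it balances the class counts without adding much new information; a trained generator is meant to do better on that second point.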
When data you wish to share contains personally identifiable information (PII) and anonymizing it would take too long to be practical, synthetic samples generated from the real dataset can preserve the important characteristics of the real data and can be shared without the risk of invading privacy and leaking personal information.
Privacy issues are paramount in financial services. “Financial services are at the top of the list when it comes to concerns around data privacy. The data is sensitive and highly regulated,” Li states. As a result, the use of synthetic data has grown rapidly in financial services. Real financial data is also slow to accumulate, because it takes time for real-world experience to build up, whereas synthetic data can be generated and put to use immediately.
A popular method for generating synthetic data, in addition to GANs, is the use of variational autoencoders, neural networks whose goal is to predict their input. Traditional supervised machine learning tasks have an input and an output. With autoencoders, the goal is to use the input to predict and try to reconstruct the input itself. The network has an encoder and a decoder. The encoder compresses the input, creating a smaller version of it. The decoder takes the compressed input and tries to reconstruct the original input. In this way, by scaling down the data in the encoder and building it back up in the decoder, the network learns how to represent the data. “If we can accurately rebuild the original input, then we can query the decoder to generate synthetic samples,” Li stated.
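The compress, rebuild, and sample cycle can be illustrated without a neural network. The Python sketch below is a deliberately simplified stand-in, not a variational autoencoder: it uses a linear encoder and decoder that project 2-D points onto their direction of greatest variance (equivalent to keeping one principal component). The dataset, the function names encode and decode, and the choice to draw new codes from a Gaussian are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(1)

# Hypothetical 2-D data lying close to the line y = 2x + 1.
real = []
for _ in range(500):
    x = random.uniform(0, 10)
    real.append((x, 2 * x + 1 + random.gauss(0, 0.05)))

mean_x = statistics.fmean([p[0] for p in real])
mean_y = statistics.fmean([p[1] for p in real])

# Covariance entries of the centered data.
sxx = statistics.fmean([(p[0] - mean_x) ** 2 for p in real])
syy = statistics.fmean([(p[1] - mean_y) ** 2 for p in real])
sxy = statistics.fmean([(p[0] - mean_x) * (p[1] - mean_y) for p in real])

# Direction of greatest variance (first principal axis of a 2x2 covariance).
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
axis = (math.cos(theta), math.sin(theta))


def encode(point):
    """Compress a 2-D point into a single number (the latent code)."""
    return (point[0] - mean_x) * axis[0] + (point[1] - mean_y) * axis[1]


def decode(code):
    """Rebuild a 2-D point from its one-number code."""
    return (mean_x + code * axis[0], mean_y + code * axis[1])


# Reconstruction check: how far is each rebuilt point from the original?
errors = [math.dist(p, decode(encode(p))) for p in real]
mean_error = statistics.fmean(errors)

# "Query the decoder": draw new codes and decode them into synthetic points.
code_std = statistics.pstdev([encode(p) for p in real])
synthetic = [decode(random.gauss(0, code_std)) for _ in range(5)]

print("mean reconstruction error:", round(mean_error, 3))
for x, y in synthetic:
    print(round(x, 2), round(y, 2))
```

Because the decoder rebuilds the originals closely, the points it produces from freshly drawn codes fall along the same line as the real data. A variational autoencoder follows the same pattern with nonlinear networks and a learned latent distribution.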
To validate the synthetic data, Li suggested using statistical similarity and machine learning efficacy. To assess similarity, view side-by-side histograms, scatterplots, and cumulative sums of each column to confirm that the real and synthetic distributions look alike. Next, compute the correlations between columns and plot a correlation matrix for each of the real and synthetic datasets to get an idea of how similar or different the correlations are.
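A numeric counterpart to these visual checks might look like the following Python sketch. It is an assumption-laden illustration: the two columns (income and spend) are invented, the "synthetic" table is simply a second draw from the same process standing in for a generator's output, and the helpers pearson and histogram are written here rather than taken from any library.

```python
import math
import random
import statistics

random.seed(2)


def make_rows(n):
    """Hypothetical two-column dataset: income and a correlated spend column."""
    rows = []
    for _ in range(n):
        income = random.gauss(50, 10)
        rows.append((income, 0.3 * income + random.gauss(0, 2)))
    return rows


real = make_rows(2000)       # stands in for the real dataset
synthetic = make_rows(2000)  # stands in for a generator's output


def column(rows, i):
    return [r[i] for r in rows]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def histogram(values, lo, hi, bins=10):
    """Share of values falling in each of `bins` equal-width bins on [lo, hi)."""
    counts = [0] * bins
    for v in values:
        if lo <= v < hi:
            counts[int((v - lo) / (hi - lo) * bins)] += 1
    return [c / len(values) for c in counts]


# Per-column comparison: means, standard deviations, and largest histogram gap.
for i, name in enumerate(["income", "spend"]):
    r, s = column(real, i), column(synthetic, i)
    lo, hi = min(r + s), max(r + s) + 1e-9
    gap = max(abs(a - b) for a, b in zip(histogram(r, lo, hi), histogram(s, lo, hi)))
    print(name, round(statistics.fmean(r), 2), round(statistics.fmean(s), 2),
          round(statistics.pstdev(r), 2), round(statistics.pstdev(s), 2), round(gap, 3))

# Cross-column comparison: with two columns the correlation matrix is one number.
corr_real = pearson(column(real, 0), column(real, 1))
corr_synth = pearson(column(synthetic, 0), column(synthetic, 1))
print(round(corr_real, 3), round(corr_synth, 3))
```

Small gaps between the real and synthetic figures suggest the columns and their relationship have been preserved; what counts as small enough depends on the application.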
To assess machine learning efficacy, choose a target variable or column, train a model on the synthetic data to predict it, and select some evaluation metrics to measure how well that model performs. “If it performs well upon evaluation on real data, then we have a good synthetic dataset,” Li stated.
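One way to read this check is "train on synthetic, test on real." The Python sketch below illustrates that reading under several assumptions: the two-feature dataset is invented, the "synthetic" training set is a second draw from the same process standing in for a generator's output, and a simple nearest-centroid classifier takes the place of whatever model a team would actually use.

```python
import math
import random
import statistics

random.seed(3)


def make_rows(n):
    """Hypothetical labeled rows: (feature_1, feature_2, label)."""
    rows = []
    for _ in range(n):
        label = random.choice([0, 1])
        center = (0.0, 0.0) if label == 0 else (4.0, 4.0)
        rows.append((random.gauss(center[0], 1), random.gauss(center[1], 1), label))
    return rows


real_test = make_rows(1000)        # stands in for held-out real data
synthetic_train = make_rows(1000)  # stands in for a generator's output


def fit_centroids(rows):
    """Train a nearest-centroid classifier: one mean point per label."""
    centroids = {}
    for label in {r[2] for r in rows}:
        members = [r for r in rows if r[2] == label]
        centroids[label] = (statistics.fmean([m[0] for m in members]),
                            statistics.fmean([m[1] for m in members]))
    return centroids


def predict(centroids, row):
    """Return the label whose centroid is closest to the row's features."""
    return min(centroids, key=lambda label: math.dist(centroids[label], row[:2]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(centroids, r) == r[2] for r in rows) / len(rows)


model = fit_centroids(synthetic_train)      # train on synthetic data only
real_accuracy = accuracy(model, real_test)  # evaluate on real data only
print("accuracy on real data:", round(real_accuracy, 3))
```

A useful comparison is the same model trained on real data: if the two accuracy figures are close, the synthetic data has captured what the model needs.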
Best Practices for Working with Synthetic Data
Best practices for working with synthetic data were suggested in a recent account in AIMultiple written by Cem Dilmegani, founder of the company, which seeks to “democratize” AI.
First, work with clean data. “If you don’t clean and prepare data before synthesis, you can have a garbage in, garbage out situation,” he stated. He recommended following the principles of data cleaning and of data “harmonization,” in which the same attributes from different sources are mapped to the same columns.
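Harmonization in this sense can be sketched as a column-renaming step applied before synthesis. In the Python illustration below, the two source tables, their column names, the shared schema, and the helper harmonize are all hypothetical.

```python
# Hypothetical exports from two systems that name the same attributes differently.
source_a = [{"cust_name": "Alice Example", "dob": "1984-03-01", "zip": "02139"}]
source_b = [{"customer": "Bob Example", "birth_date": "1990-07-15", "postal_code": "94105"}]

# One mapping per source: original column name -> shared column name.
COLUMN_MAPS = {
    "a": {"cust_name": "name", "dob": "birth_date", "zip": "postal_code"},
    "b": {"customer": "name", "birth_date": "birth_date", "postal_code": "postal_code"},
}


def harmonize(rows, column_map):
    """Rename each row's keys to the shared schema; fail loudly on unmapped columns."""
    harmonized = []
    for row in rows:
        unknown = set(row) - set(column_map)
        if unknown:
            raise KeyError(f"unmapped columns: {sorted(unknown)}")
        harmonized.append({column_map[key]: value for key, value in row.items()})
    return harmonized


combined = harmonize(source_a, COLUMN_MAPS["a"]) + harmonize(source_b, COLUMN_MAPS["b"])
for row in combined:
    print(row)
```

Raising an error on an unmapped column, rather than silently dropping it, is a design choice made here so that a new or misspelled attribute is noticed before it reaches the synthesis step.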
Also, assess whether the synthetic data is similar enough to real data for its application area. Its usefulness will depend on the technique used to generate it. The AI development team should analyze the use case and decide whether the generated synthetic data is a good fit for it.
Finally, outsource support if necessary. The team should identify the organization’s synthetic data capabilities and outsource based on the capability gaps. The two steps of data preparation and data synthesis can be automated by software suppliers, he suggests.