SYNTHETIC DATA GENERATION FOR ENABLING PRIVACY-PRESERVING CYBERSECURITY RESEARCH AND MODEL TRAINING

Abstract
High-quality datasets are essential to modern cybersecurity research, yet they are often unavailable because of privacy policies and dataset limitations. Synthetic data generation offers a promising solution: it allows intrusion detection systems and threat models to be trained without exposing sensitive real data. This research addresses two primary questions: Can synthetic network traffic accurately emulate real-world data for cybersecurity purposes? Can privacy-preserving mechanisms defend against advanced attacks? We evaluate five generative methods, namely CTGAN, CopulaGAN, tabular diffusion models, and their differentially private (DP-augmented) variants, on the NSL-KDD and CICIDS-2017 datasets. We assess the synthetic data by statistical fidelity, downstream classifier performance (AUC, F1-score), diversity, and resistance to membership inference and reconstruction attacks. Results indicate that GAN-based models achieve more than 90% fidelity and keep classifier AUC within 3% of real-data baselines, while diffusion models provide higher diversity at the cost of greater computation. Integrating DP-SGD reduces attack success to near-random accuracy with little loss of utility. Some limitations remain, however. Synthetic data may fail to capture intricate correlations present in real traffic, and strict privacy budgets (ε ≤ 1) substantially degrade downstream performance, which highlights difficult trade-offs among fidelity, diversity, and privacy protection. Our contribution is threefold: (1) a rigorous comparative benchmark of synthetic data methods for cybersecurity; (2) empirical validation of DP-augmented synthesis as a feasible and resilient option; and (3) a best-practice framework that balances utility, diversity, and privacy while acknowledging known limitations. This work supports responsible AI and ethical data sharing in cybersecurity while making the trade-off between privacy and utility explicit.
Keywords
CTGAN, conditional GAN, CopulaGAN, cybersecurity, differential privacy, diffusion model, generative models, membership inference, synthetic data
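
To make the abstract's notion of "near-random" attack accuracy concrete, the following is a minimal sketch of a distance-based membership inference test scored by AUC. It is an illustration under stated assumptions, not the evaluation pipeline used in this work: the helper names (nearest_distance, attack_auc, sample), the Gaussian toy data, and the two toy "generators" are all hypothetical. An AUC close to 0.5 means the attacker cannot distinguish training records from unseen records; an AUC close to 1.0 indicates that the synthetic data leaks membership.

```python
import math
import random


def nearest_distance(record, synthetic):
    """Euclidean distance from `record` to its closest synthetic record."""
    return min(math.dist(record, s) for s in synthetic)


def attack_auc(members, non_members, synthetic):
    """AUC of a distance-based membership inference attack.

    A record is scored as more likely to be a training member the closer it
    lies to some synthetic record. The AUC is the probability that a random
    member receives a higher score than a random non-member (ties count half).
    """
    member_scores = [-nearest_distance(r, synthetic) for r in members]
    outsider_scores = [-nearest_distance(r, synthetic) for r in non_members]
    wins = 0.0
    for m in member_scores:
        for o in outsider_scores:
            if m > o:
                wins += 1.0
            elif m == o:
                wins += 0.5
    return wins / (len(member_scores) * len(outsider_scores))


def sample(rng, n, dims=4):
    """Draw n toy records from a standard Gaussian (hypothetical data)."""
    return [tuple(rng.gauss(0.0, 1.0) for _ in range(dims)) for _ in range(n)]


rng = random.Random(0)
members = sample(rng, 200)       # records the generator was "trained" on
non_members = sample(rng, 200)   # held-out records from the same distribution

# Toy leaky generator: copies training rows with tiny noise (memorization).
leaky = [tuple(x + rng.gauss(0.0, 0.01) for x in row) for row in members]
# Toy well-behaved generator: fresh draws from the underlying distribution.
fresh = sample(rng, 200)

leaky_auc = attack_auc(members, non_members, leaky)
fresh_auc = attack_auc(members, non_members, fresh)
print(f"leaky generator attack AUC: {leaky_auc:.2f}")
print(f"fresh generator attack AUC: {fresh_auc:.2f}")
```

In this toy setting, the memorizing generator should yield an attack AUC far above 0.5, while the generator that samples independently of the training rows should stay close to 0.5. The pairwise AUC loop is quadratic and is meant only for small illustrative samples.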