Differentially Private Sliced Inverse Regression: Minimax Optimality and Algorithm
1. Data
Abstract
We use two real datasets in the analysis:
- The supermarket dataset – Each record corresponds to a daily observation collected from a major supermarket in northern China. The response of interest is the number of customers on a particular day, and each predictor represents the sales volume of a specific product on that day.
- The Arcene dataset – This dataset is used to distinguish cancer from normal patterns in mass-spectrometric data. It is a two-class classification problem with continuous input variables.
Availability
The data files are not included in the supplementary materials. However, the Arcene dataset is publicly available and can be accessed online:
Guyon, I., Gunn, S., Ben-Hur, A., & Dror, G. (2004). Arcene Dataset. UCI Machine Learning Repository. Retrieved from https://archive.ics.uci.edu/dataset/167/arcene.
The supermarket dataset cannot be made publicly available due to data sharing restrictions.
2. Code
Abstract
All code is written in R and organized into modular scripts. It includes core functions for:
- Differentially private histogram slicing
- Sparse variable selection via noisy peeling
- Generalized eigenvalue decomposition with added noise
- Performing (sparse) differentially private SIR
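As an illustration of the first item, the sketch below builds a differentially private histogram of a bounded response with the Laplace mechanism and reads slice boundaries off the noisy CDF. It is a minimal sketch under stated assumptions, not the code shipped in functions.R: the names `rlaplace_sketch`, `dp_histogram`, and `dp_slice_breaks`, the bin count, the known range `[lower, upper]`, and the add/remove-one-record notion of neighbouring datasets are all assumptions made for the example.

```r
# Hypothetical helper: draw n Laplace(0, scale) variates by inverse-CDF sampling.
rlaplace_sketch <- function(n, scale) {
  u <- runif(n, min = -0.5, max = 0.5)
  -scale * sign(u) * log(1 - 2 * abs(u))
}

# Hypothetical sketch: epsilon-DP histogram of y on an assumed range [lower, upper].
# Assumption: neighbouring datasets differ by adding or removing one record, so a
# single record changes exactly one bin count by 1 and Laplace(1 / epsilon) noise
# per bin is enough.
dp_histogram <- function(y, n_bins, lower, upper, epsilon) {
  breaks <- seq(lower, upper, length.out = n_bins + 1)
  y <- pmin(pmax(y, lower), upper)  # clip to the assumed range
  counts <- as.numeric(table(cut(y, breaks, include.lowest = TRUE)))
  noisy <- counts + rlaplace_sketch(n_bins, scale = 1 / epsilon)
  noisy <- pmax(noisy, 0)           # post-processing: no negative counts
  list(breaks = breaks, counts = noisy)
}

# Slice boundaries taken as interior quantiles of the noisy histogram CDF.
dp_slice_breaks <- function(hist, n_slices) {
  total <- sum(hist$counts)
  if (total <= 0) stop("noisy histogram is empty")
  cdf <- cumsum(hist$counts) / total
  probs <- seq_len(n_slices - 1) / n_slices
  # For each target probability, use the right edge of the first bin whose CDF reaches it.
  idx <- vapply(probs, function(p) which(cdf >= p)[1], integer(1))
  unique(c(hist$breaks[1], hist$breaks[idx + 1], hist$breaks[length(hist$breaks)]))
}

set.seed(1)
y <- rnorm(500)
h <- dp_histogram(y, n_bins = 50, lower = -4, upper = 4, epsilon = 1)
slice_breaks <- dp_slice_breaks(h, n_slices = 5)
```

The resulting `slice_breaks` can be passed to `cut()` to assign each response to a slice. Because the boundaries depend on the data only through the noisy counts, computing them is post-processing and consumes no additional privacy budget in this sketch.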
Availability
The full R code for this paper is provided, including:
- One file containing necessary functions
- Two files for the simulation settings in the main manuscript
- Two files for real data analyses
- Two additional scripts for DP histogram slicing and DP cross-validation (referenced in the supplementary materials)
Description
The following R scripts are included for simulations, empirical analyses, and result visualization:
- functions.R Contains utility functions for:
  - Computing projection matrices
  - Estimating quantiles based on histogram CDFs
  - Differentially private histogram estimation
  - Generating autoregressive covariance matrices
  - Peeling algorithm for differentially private top‑s selection
- low_dim.R Generates simulation results for Table 1 in the main manuscript (low-dimensional settings). Results are stored in the result folder.
- high_dim.R Generates simulation results for Table 2 in the main manuscript (high-dimensional settings). Results are stored in the result folder.
- table.R Summarizes outcomes from low_dim.R and high_dim.R and compiles the final result tables for inclusion as Table 1 and Table 2 in the manuscript.
- supermarket.R Applies DP-SIR to the supermarket dataset, evaluating predictive performance using generalized additive models (GAMs). Supports Figure 1, Figure 2, and Table 3 in the main manuscript.
- arcene.R Applies DP-SIR to the Arcene dataset for binary classification, benchmarking performance via logistic regression and ROC-AUC analysis. Supports Figure C.1 in the supplementary material.
- illustration_hist.R Illustrates the effects of differentially private histogram slicing on eigenstructure and subspace error. Supports Figures C.2–C.11 in the supplementary material.
- tuns.R Compares the proposed cross-validation estimator with an oracle benchmark. Supports Figure C.12 in the supplementary material.
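The peeling utility listed under functions.R can be illustrated with a generic noisy top-s selection. This is a minimal sketch, not the routine in functions.R: the names `rlaplace_sketch` and `noisy_peeling_sketch` are hypothetical, and the Laplace scale is left as a caller-supplied argument `noise_scale` because the calibration to the privacy budget is not stated in this document.

```r
# Hypothetical helper: draw n Laplace(0, scale) variates by inverse-CDF sampling.
rlaplace_sketch <- function(n, scale) {
  u <- runif(n, min = -0.5, max = 0.5)
  -scale * sign(u) * log(1 - 2 * abs(u))
}

# Hypothetical sketch of noisy peeling: pick s coordinates of v one at a time,
# each time taking the largest noisy magnitude among those not yet selected.
# noise_scale is an assumption supplied by the caller, not a calibrated value.
noisy_peeling_sketch <- function(v, s, noise_scale) {
  d <- length(v)
  stopifnot(s >= 1, s <= d, noise_scale >= 0)
  selected <- integer(0)
  for (j in seq_len(s)) {
    noisy <- abs(v) + rlaplace_sketch(d, noise_scale)  # fresh noise each round
    noisy[selected] <- -Inf                            # peel off earlier picks
    selected <- c(selected, which.max(noisy))
  }
  # Release the selected entries with additional noise; all others are set to zero.
  out <- numeric(d)
  out[selected] <- v[selected] + rlaplace_sketch(s, noise_scale)
  list(index = selected, value = out)
}

set.seed(2)
v <- c(5, -0.1, 0.2, -7, 0.05, 3)
res <- noisy_peeling_sketch(v, s = 3, noise_scale = 0.1)
```

With `noise_scale = 0` the procedure reduces to ordinary top-s selection by magnitude, which is a convenient sanity check before adding noise.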
3. Instructions for Use
Reproducibility Scope
All plots and tables in the paper can be reproduced using the scripts above, although results based on the supermarket dataset require access to the restricted data. Simulation outputs are stored in the result folder.
Requirements
- R version: ≥ 4.0
- Required packages:
install.packages(c("geigen", "VGAM", "Matrix", "MASS", "glmnet", "ROCR", "pROC", "plotROC", "mgcv", "cowplot", "foreach", "doParallel", "xtable", "patchwork"))
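To avoid reinstalling packages that are already present, the same list can be filtered first. The following is a small sketch using only base R; the variable names `pkgs` and `missing_pkgs` are illustrative, and the install call is left commented out so that the check itself has no side effects.

```r
pkgs <- c("geigen", "VGAM", "Matrix", "MASS", "glmnet", "ROCR", "pROC",
          "plotROC", "mgcv", "cowplot", "foreach", "doParallel", "xtable",
          "patchwork")

# Keep only the packages that cannot be loaded from the current library paths.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!available]

if (length(missing_pkgs) > 0) {
  message("Packages still to install: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
}
```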
Some scripts assume a registered parallel backend:
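library(doParallel)  # also attaches foreach and parallel
cl_size <- 10  # number of worker processes; adjust to the available cores
cl <- makeCluster(cl_size)
registerDoParallel(cl)
# Call stopCluster(cl) once the scripts have finished.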
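Once a backend is registered, a simulation loop can be distributed with foreach. The sketch below is illustrative only: `one_replicate` is a hypothetical stand-in for a single simulation run, and two workers are used to keep the example light.

```r
library(doParallel)  # attaches foreach and parallel as dependencies

cl <- makeCluster(2)
registerDoParallel(cl)

# Hypothetical stand-in for one simulation replicate.
one_replicate <- function(seed) {
  set.seed(seed)
  mean(rnorm(100))
}

# Each replicate runs on a worker; results are combined into a numeric vector.
results <- foreach(i = 1:8, .combine = c) %dopar% one_replicate(i)

stopCluster(cl)  # release the workers when finished
```

Seeding inside each replicate keeps the results reproducible regardless of how iterations are assigned to workers.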
