generate_blp_panel
- pymc_marketing.customer_choice.synthetic_data.generate_blp_panel(*, T=50, J=4, K=2, L=2, R_geo=1, true_alpha=-2.0, true_beta=None, sigma_alpha=0.5, sigma_beta=None, instrument_strength=0.7, price_xi_corr=0.6, xi_sigma=0.3, xi_product_sigma=0.5, sigma_eta=1.0, market_size=5000, n_dgp_draws=5000, region_heterogeneity=0.0, return_truth=False, random_seed=None)
Generate a synthetic BLP-style aggregate-share panel.
Produces a long-format DataFrame suitable for fitting pymc_marketing.customer_choice.BayesianBLP (and for unit tests of the same), together with the latent parameters that generated it when return_truth=True. The data-generating process explicitly induces correlation between the price residual η_jt and the structural error ξ_jt (controlled by price_xi_corr), so that no-IV fits exhibit the expected endogeneity bias on the price coefficient and IV fits can be shown to recover it.
- Parameters:
- T
Number of periods per region.
- J
Number of inside products. An outside good (row label "outside") is added on top.
- K
Number of product characteristics x_jt.
- L
Number of instruments z_jt.
- R_geo
Number of regions. Defaults to 1; set >1 together with region_heterogeneity > 0 to test hierarchical pooling.
- true_alpha
Population-level price coefficient (should be negative).
- true_beta
Population-level characteristic coefficients, shape (K,). Defaults to a vector of ones.
- sigma_alpha
Across-consumer SD of the price coefficient.
- sigma_beta
Across-consumer SD of each characteristic coefficient, shape (K,). Zero entries indicate no heterogeneity on that characteristic. Defaults to all zeros (heterogeneity only on price).
- instrument_strength
Magnitude of the first-stage instrument loading π_z = instrument_strength / sqrt(L). Set small (e.g. 0.1) to simulate weak instruments.
- price_xi_corr
Correlation Cor(η_jt, ξ̃_jt) of the joint price-residual / structural-error draws. Drives the endogeneity bias.
- xi_sigma
SD of the time-varying part ξ̃_jt.
- xi_product_sigma
SD of the product fixed effect ξ_j.
- sigma_eta
SD of the price first-stage residual η_jt.
- market_size
Total category volume per market. Used both to scale the Multinomial draws of observed shares and as the n column on the returned panel.
- n_dgp_draws
Number of QMC draws used to compute the true mixed-logit shares. Should be much larger than the number of draws the model itself uses so that the DGP is essentially exact.
- region_heterogeneity
Across-region SD applied to α_r and β_r. Zero (default) produces homogeneous regions; positive values produce heterogeneous ones.
- return_truth
If True, return (df, truth_dict); otherwise return df.
- random_seed
Seed or np.random.Generator.
- Returns:
- df
pd.DataFrame. Long-format panel with columns ["region", "market", "period", "product", "share", "n", "price", "x_0", ..., "x_{K-1}", "z_0", ..., "z_{L-1}"]. The outside good appears once per market with price and all characteristics / instruments set to zero. market is a global integer; period is the integer time index within a region (0..T-1); region is a string label. Pass time_col="period" to pymc_marketing.customer_choice.BayesianBLP to make the time dimension first-class for counterfactuals.
- truth
dict, optional. Returned only when return_truth=True. Contains the population and per-cell parameters that generated the panel, which is useful for recovery tests. See source for the exact keys.
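As a rough illustration of how market_size and n_dgp_draws enter the data-generating process, the sketch below averages logit choice probabilities over simulated consumers to approximate mixed-logit shares for one market, then draws observed counts from a Multinomial. It is not the library's implementation: it uses plain pseudo-random draws rather than QMC, puts heterogeneity only on price, and relies on hypothetical prices and ξ values chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

J = 3                 # inside products in this single illustrative market
n_dgp_draws = 5000    # simulated consumers used to approximate the shares
market_size = 5000    # total category volume for the market
true_alpha, sigma_alpha = -2.0, 0.5

# Hypothetical prices and structural errors (not produced by the library).
price = np.array([1.0, 1.5, 2.0])
xi = np.array([0.2, 0.0, -0.1])

# Consumer-specific price coefficients: alpha_i ~ N(true_alpha, sigma_alpha^2).
# Assumption: pseudo-random normal draws stand in for the QMC draws.
alpha_i = true_alpha + sigma_alpha * rng.standard_normal(n_dgp_draws)

# Utilities per consumer and product; the outside good is normalised to 0
# and appended as the last column.
u_inside = alpha_i[:, None] * price[None, :] + xi[None, :]
u = np.column_stack([u_inside, np.zeros(n_dgp_draws)])

# Numerically stable softmax over the J inside goods plus the outside good.
u -= u.max(axis=1, keepdims=True)
probs = np.exp(u)
probs /= probs.sum(axis=1, keepdims=True)

# "True" shares: choice probabilities averaged over simulated consumers.
true_shares = probs.mean(axis=0)

# Observed shares: Multinomial counts scaled by market size.
counts = rng.multinomial(market_size, true_shares)
observed_shares = counts / market_size
```

A larger n_dgp_draws reduces simulation error in true_shares, while a larger market_size reduces the sampling noise separating observed_shares from true_shares.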
Examples
Generate a small panel with strong instruments and a bias-inducing price/structural-error correlation:
from pymc_marketing.customer_choice import generate_blp_panel

df, truth = generate_blp_panel(
    T=30,
    J=3,
    K=2,
    L=2,
    true_alpha=-2.0,
    price_xi_corr=0.6,
    random_seed=42,
    return_truth=True,
)
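The endogeneity that price_xi_corr induces can be illustrated without the library. The following NumPy sketch is a simplified linear stand-in, not the generator's actual code: it assumes independent standard-normal instruments, replaces the mixed-logit share inversion with a linear mean utility delta = true_alpha * price + xi, and omits product fixed effects and characteristics. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n, L = 20_000, 2
instrument_strength, price_xi_corr = 0.7, 0.6
sigma_eta, xi_sigma, true_alpha = 1.0, 0.3, -2.0

# Joint draw of the price residual eta and structural error xi
# with correlation price_xi_corr.
off_diag = price_xi_corr * sigma_eta * xi_sigma
cov = np.array([[sigma_eta**2, off_diag],
                [off_diag, xi_sigma**2]])
eta, xi = rng.multivariate_normal(np.zeros(2), cov, size=n).T

# First stage: each instrument loads on price with
# pi_z = instrument_strength / sqrt(L).
# Assumption: instruments are independent standard normals.
z = rng.standard_normal((n, L))
pi_z = instrument_strength / np.sqrt(L)
price = z @ np.full(L, pi_z) + eta

# Assumption: a linear mean utility stands in for the inverted shares.
delta = true_alpha * price + xi

# OLS of delta on price ignores Cov(price, xi) and is therefore biased.
X = np.column_stack([np.ones(n), price])
alpha_ols = np.linalg.lstsq(X, delta, rcond=None)[0][1]

# Two-stage least squares: project price on the instruments,
# then regress delta on the fitted price.
Z = np.column_stack([np.ones(n), z])
price_hat = Z @ np.linalg.lstsq(Z, price, rcond=None)[0]
X_hat = np.column_stack([np.ones(n), price_hat])
alpha_iv = np.linalg.lstsq(X_hat, delta, rcond=None)[0][1]

print(f"OLS estimate: {alpha_ols:.3f}")
print(f"IV estimate:  {alpha_iv:.3f}")
```

Under these assumptions the OLS slope is pulled toward zero by roughly price_xi_corr * sigma_eta * xi_sigma / (instrument_strength**2 + sigma_eta**2), about 0.12 for the values used here, while the two-stage estimate stays close to true_alpha. Lowering instrument_strength toward 0.1 makes the two-stage estimate noticeably noisier, which is the weak-instrument regime mentioned in the parameter description.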