generate_blp_panel

pymc_marketing.customer_choice.synthetic_data.generate_blp_panel(*, T=50, J=4, K=2, L=2, R_geo=1, true_alpha=-2.0, true_beta=None, sigma_alpha=0.5, sigma_beta=None, instrument_strength=0.7, price_xi_corr=0.6, xi_sigma=0.3, xi_product_sigma=0.5, sigma_eta=1.0, market_size=5000, n_dgp_draws=5000, region_heterogeneity=0.0, return_truth=False, random_seed=None)

Generate a synthetic BLP-style aggregate-share panel.

Produces a long-format DataFrame suitable for fitting pymc_marketing.customer_choice.BayesianBLP (and for unit-testing it). When return_truth=True, the latent parameters that generated the panel are returned as well. The data-generating process deliberately correlates the price residual η_jt with the structural error ξ_jt (controlled by price_xi_corr), so that fits without instruments exhibit the expected endogeneity bias on the price coefficient, while IV fits can be shown to recover it.

Parameters:
T

Number of periods per region.

J

Number of inside products. An outside good (row label "outside") is added on top.

K

Number of product characteristics x_jt.

L

Number of instruments z_jt.

R_geo

Number of regions. Defaults to 1; set R_geo > 1 together with region_heterogeneity > 0 to test hierarchical pooling.

true_alpha

Population-level price coefficient (should be negative).

true_beta

Population-level characteristic coefficients, shape (K,). Defaults to a vector of ones.

sigma_alpha

Across-consumer SD of the price coefficient.

sigma_beta

Across-consumer SD of each characteristic coefficient, shape (K,). Zero entries indicate no heterogeneity on that characteristic. Defaults to all zeros (heterogeneity only on price).

instrument_strength

Magnitude of the first-stage instrument loading π_z = instrument_strength / sqrt(L). Use a small value (e.g. 0.1) to simulate weak instruments.

price_xi_corr

Correlation Cor(η_jt, ξ̃_jt) of the joint price-residual / structural-error draws. Drives the endogeneity bias.

xi_sigma

SD of the time-varying part ξ̃_jt of the structural error ξ_jt.

xi_product_sigma

SD of the product fixed effect ξ_j.

sigma_eta

SD of the price first-stage residual η_jt.

market_size

Total category volume per market. Used both to scale the Multinomial draws of observed shares and as the n column on the returned panel.

n_dgp_draws

Number of QMC draws used to compute the true mixed-logit shares. Should be much larger than the number of draws the model itself uses so that the DGP is essentially exact.

region_heterogeneity

Across-region SD applied to α_r and β_r. Zero (default) produces homogeneous regions; positive values produce heterogeneous ones.

return_truth

If True, return (df, truth_dict); otherwise return df.

random_seed

Seed or np.random.Generator.
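As a rough illustration of how instrument_strength, price_xi_corr, sigma_eta and xi_sigma interact, the sketch below computes the loading π_z and draws a correlated pair (η, ξ̃). The bivariate-normal form, the linear first stage and all variable names are assumptions made for illustration; the generator's actual implementation is in the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Values mirror the documented defaults.
L = 2
instrument_strength = 0.7
price_xi_corr = 0.6
sigma_eta = 1.0
xi_sigma = 0.3
n = 20_000  # illustrative number of (product, market) cells

# Instrument loading as given in the parameter description.
pi_z = instrument_strength / np.sqrt(L)

# ASSUMPTION: (eta, xi_tilde) are jointly Gaussian with the requested correlation.
off_diag = price_xi_corr * sigma_eta * xi_sigma
cov = np.array([[sigma_eta**2, off_diag],
                [off_diag, xi_sigma**2]])
eta, xi_tilde = rng.multivariate_normal(np.zeros(2), cov, size=n).T

# ASSUMPTION: a linear first stage with exogenous standard-normal instruments.
z = rng.standard_normal((n, L))
price_shift = z @ np.full(L, pi_z) + eta

# The price component inherits correlation with xi_tilde through eta,
# while the instruments stay (approximately) uncorrelated with it.
corr_eta_xi = np.corrcoef(eta, xi_tilde)[0, 1]
corr_price_xi = np.corrcoef(price_shift, xi_tilde)[0, 1]
corr_z_xi = np.corrcoef(z[:, 0], xi_tilde)[0, 1]
```

Under these assumptions the price term is correlated with the structural error while each instrument is not, which is the pattern an IV fit relies on.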

Returns:
df : pd.DataFrame

Long-format panel with columns ["region", "market", "period", "product", "share", "n", "price", "x_0", ..., "x_{K-1}", "z_0", ..., "z_{L-1}"]. The outside good appears once per market with price and all characteristics / instruments set to zero. market is a global integer; period is the integer time index within a region (0..T-1); region is a string label. Pass time_col="period" to pymc_marketing.customer_choice.BayesianBLP to make the time dimension first-class for counterfactuals.

truth : dict, optional

Returned only when return_truth=True. Contains the population and per-cell parameters that generated the panel, which is useful for parameter-recovery tests. See source for the exact keys.
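The documented layout of the outside good can be checked with a few lines of pandas. The helper check_outside_good and the hand-built toy frame below are hypothetical and only stand in for a generated panel; the inside-product labels "p0" and "p1" are placeholders, not the labels the generator uses.

```python
import pandas as pd


def check_outside_good(df: pd.DataFrame) -> bool:
    """Return True if "outside" appears exactly once per market with zero price."""
    outside = df[df["product"] == "outside"]
    # Count outside rows per market; markets with none get a count of 0.
    counts = (
        outside.groupby("market")
        .size()
        .reindex(df["market"].unique(), fill_value=0)
    )
    once_per_market = counts.eq(1).all()
    zero_price = outside["price"].eq(0).all()
    return bool(once_per_market and zero_price)


# Toy frame using a subset of the documented columns (hypothetical values).
toy = pd.DataFrame(
    {
        "market": [0, 0, 0, 1, 1, 1],
        "product": ["p0", "p1", "outside", "p0", "p1", "outside"],
        "price": [1.2, 0.8, 0.0, 1.1, 0.9, 0.0],
    }
)
ok = check_outside_good(toy)
```

The same helper could be applied to the DataFrame returned by generate_blp_panel as a quick sanity check before fitting.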

Examples

Generate a small panel with strong instruments and a bias-inducing price/structural-error correlation:

from pymc_marketing.customer_choice import generate_blp_panel

df, truth = generate_blp_panel(
    T=30,
    J=3,
    K=2,
    L=2,
    true_alpha=-2.0,
    price_xi_corr=0.6,
    random_seed=42,
    return_truth=True,
)
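To see how n_dgp_draws and market_size enter, the sketch below computes mixed-logit shares for a single market by averaging over quasi-Monte Carlo draws of the price coefficient, then draws observed shares from a Multinomial. It uses scipy.stats.qmc with scrambled Sobol points; the prices, the non-price utilities and the restriction to a single random coefficient are illustrative assumptions, not the generator's own code.

```python
import numpy as np
from scipy.stats import norm, qmc

true_alpha, sigma_alpha = -2.0, 0.5
market_size = 5000

# Hypothetical prices and non-price mean utilities for J = 3 inside products.
price = np.array([1.0, 1.5, 2.0])
delta_x = np.array([1.0, 1.5, 2.2])

# Scrambled Sobol points mapped to normal draws of the price coefficient.
sobol = qmc.Sobol(d=1, scramble=True, seed=0)
u = sobol.random_base2(m=12).ravel()      # 2**12 = 4096 draws
u = np.clip(u, 1e-12, 1 - 1e-12)          # guard against ppf(0) or ppf(1)
alpha_i = true_alpha + sigma_alpha * norm.ppf(u)

# Utilities per draw; the outside good is normalised to zero utility.
v = delta_x[None, :] + alpha_i[:, None] * price[None, :]
v = np.column_stack([v, np.zeros(alpha_i.size)])
v -= v.max(axis=1, keepdims=True)         # numerical stability
probs = np.exp(v)
probs /= probs.sum(axis=1, keepdims=True)

# Averaging choice probabilities over draws gives the mixed-logit shares.
true_shares = probs.mean(axis=0)

# Observed shares: Multinomial counts scaled by the market size.
rng = np.random.default_rng(0)
counts = rng.multinomial(market_size, true_shares)
observed_shares = counts / market_size
```

More draws reduce the integration error in true_shares, while a larger market_size reduces the sampling noise in observed_shares.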