[MPlus] Latent Class Analysis / Latent Profile Analysis
Latent Class Analysis (LCA) and Latent Profile Analysis (LPA) are powerful statistical methods for identifying unobserved subgroups within a population. These approaches have gained significant popularity in social sciences, psychology, and health research over the past decade.
While these methods share similar goals, they differ in one key aspect:
- Latent Class Analysis (LCA) works with categorical indicator variables (e.g., yes/no responses, Likert scales treated as categorical). LCA identifies classes based on different probabilities of endorsing items.
- Latent Profile Analysis (LPA) works with continuous indicator variables (e.g., test scores, physiological measurements). LPA identifies profiles based on different mean levels on the continuous indicators.
Both methods aim to identify homogeneous subgroups (latent classes or profiles) within heterogeneous populations, allowing researchers to understand patterns that might not be immediately apparent in the data.
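The difference between the two methods can be written compactly as a finite mixture model. The notation below (K classes, class proportions \pi_k, J indicators) is introduced here only for illustration and is not Mplus syntax. It assumes indicators are independent within a class (local independence) and, for LCA, that items are binary.

```latex
% LPA: continuous indicators y_{ij} with class-specific means and variances
f(\mathbf{y}_i) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J}
    \mathcal{N}\!\left(y_{ij} \mid \mu_{jk}, \sigma^2_{jk}\right)

% LCA: binary indicators u_{ij} with class-specific endorsement probabilities
P(\mathbf{u}_i) = \sum_{k=1}^{K} \pi_k \prod_{j=1}^{J}
    \rho_{jk}^{\,u_{ij}} \left(1 - \rho_{jk}\right)^{1 - u_{ij}}
```

In both cases the classes differ only in their parameters: means (and possibly variances) for LPA, endorsement probabilities for LCA.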
The Modeling Process
Running an LCA/LPA typically involves several stages:
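- Data Preparation: Ensure your data is in a format Mplus can read (e.g., space-delimited, tab-delimited, or CSV). The data file must contain numbers only, with no header row; variable names are supplied in the input file instead. Keep names short and clear, because Mplus variable names are limited to 8 characters.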
- Model Enumeration (Step 1 Syntax): Fit models with different numbers of classes (e.g., 1-class, 2-class, 3-class, etc.). Compare model fit statistics to determine the optimal number of classes that best represent the heterogeneity in your data.
- Saving Class Assignments (Step 2 – Part of Step 1 Syntax): Use the SAVEDATA command to save classification information (most likely class membership and posterior probabilities) for each individual.
- Relating Classes to Other Variables (Step 3 Syntax): Use the 3-step approach (or related methods) to examine predictors (covariates) of class membership or differences in outcomes across classes, properly accounting for classification error.
- Interpretation & Visualization: Understand the meaning of each class/profile and visualize the patterns.
Step 1: Running the Basic LCA/LPA Model (Class Enumeration)
The primary goal here is to find the best number of classes (let’s call this k). You’ll run separate Mplus analyses, incrementing k each time (e.g., run a 2-class model, then a 3-class, then a 4-class…).
Here’s a sample Mplus syntax for a 3-class LPA model. Explanations follow the syntax.
TITLE: 3-Class LPA
DATA:
FILE IS "data.dat";
FORMAT IS FREE;
VARIABLE:
NAMES ARE id stud_id item1 item2 item3 item4 age gender;
USEVARIABLES ARE item1 item2 item3 item4;
MISSING ARE .;
IDVARIABLE IS id;
CLASSES = C(3);
ANALYSIS:
TYPE = MIXTURE;
ESTIMATOR = MLR;
STARTS = 500 100;
PROCESSORS = 4;
LRTSTARTS = 0 0 100 20;
MODEL:
%OVERALL%
! C ON age gender;  ! 1-step covariate model, left out during class enumeration
%C#1%
[item1-item4];
item1-item4;
%C#2%
[item1-item4];
item1-item4;
%C#3%
[item1-item4];
item1-item4;
OUTPUT:
TECH1 TECH8 TECH11 TECH14;
SAMPSTAT;
STANDARDIZED (STDYX);
SAVEDATA:
FILE IS lpa_3class_results.dat;
SAVE = CPROBABILITIES;
PLOT:
TYPE = PLOT3;
SERIES = item1 item2 item3 item4 (*);
Line-by-Line Explanation for Beginners:
- TITLE: Simple description for your output file header.
- DATA: Tells Mplus where to find your data (FILE IS) and how it’s arranged (FORMAT IS). FREE means values are separated by spaces, tabs, or commas.
- VARIABLE: This is crucial.
- NAMES ARE: Lists all variables in your .dat file in the exact order they appear.
- USEVARIABLES ARE: Specifies only the variables Mplus should use as indicators to form the classes in this specific analysis.
- MISSING ARE: Tells Mplus what value(s) represent missing data.
- IDVARIABLE IS: Optional, names a variable holding unique case identifiers. Useful for merging saved data later.
- CLASSES = C(k): This is the core command defining the latent class variable. C is the name chosen here for the latent class variable (a common convention, not a requirement). (k) specifies the number of classes to extract (here, 3).
- ANALYSIS: Controls the estimation process.
- TYPE = MIXTURE: This must be specified for LCA/LPA.
- ESTIMATOR = MLR: Choose the statistical method. ML (Maximum Likelihood) is standard, but MLR provides robust standard errors and a scaled chi-square, better if indicators aren’t perfectly normally distributed (common in LPA). For LCA with only categorical indicators, ML is often sufficient.
- STARTS = # #: Mixture models can have multiple solutions (local maxima). Using many random starting value sets helps ensure Mplus finds the best one (global maximum). 500 100 means 500 initial stage starts, refining the best 100 in the final stage.
- PROCESSORS = #: If your computer has multiple cores, this can speed things up.
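- LRTSTARTS = # # # #: Controls the random starts used in the bootstrap draws of the Bootstrap Likelihood Ratio Test (BLRT, requested with TECH14), which compares the current k-class model to the k-1 class model. The first two numbers set the initial-stage starts and final-stage optimizations for the k-1 class model, and the last two set them for the k-class model (e.g., 0 0 100 20). Mplus estimates the k-1 class model internally, so no separate run is needed.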
- MODEL: Defines the statistical model itself.
- %OVERALL%: Parameters defined here apply to all classes unless overridden in class-specific blocks. (Example: C ON age gender; would estimate the effect of age and gender on class membership while the classes are being formed. This is a 1-step approach, distinct from the recommended 3-step approach. It also requires age and gender in USEVARIABLES, and covariates are usually left out while enumerating classes.)
- %C#1%, %C#2%, %C#3%: These blocks define parameters specific to each latent class.
- [item1-item4];: Square brackets [] denote means (for continuous indicators in LPA) or thresholds (for categorical indicators in LCA, written as [item1$1]; and declared with the CATEGORICAL option). This command estimates the mean of item1 through item4 specifically for this class.
- item1-item4;: Listing variables without brackets refers to their variances within this class. By default, Mplus holds the variances of continuous indicators equal across classes; mentioning them in every class-specific block, as done here, frees them so each class gets its own variances. For LCA, you’d typically focus only on the thresholds within the [].
- OUTPUT: Specifies what results to print in the output file.
- TECH1: Model parameter specifications.
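- TECH8: Prints the optimization history (also shown on screen while the model runs), which helps you monitor long runs and check convergence. Entropy does not require TECH8; it is part of the default output for TYPE = MIXTURE.
- TECH11/TECH14: Request the likelihood ratio tests. TECH11 gives the Lo-Mendell-Rubin (LMR) test and TECH14 gives the bootstrap test (BLRT).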
- SAMPSTAT: Basic descriptive statistics of the used variables.
- STANDARDIZED (STDYX): Provides standardized parameter estimates, useful for comparing effect sizes.
- SAVEDATA: Saves information to a new data file.
- FILE IS: Name of the file to be created.
- SAVE = CPROBABILITIES: Saves the estimated posterior probabilities of belonging to each class for every individual, along with their most likely class assignment (based on the highest probability). This file is what you need for plotting outside Mplus and for carrying out a 3-step analysis manually.
- PLOT: Requests plots generated by Mplus (viewable through the Plot menu of the Mplus Editor after the run).
- TYPE = PLOT3: Provides profile plots showing estimated means (LPA) or probabilities (LCA) for each class.
- SERIES = item1 item2 item3 item4 (*): Specifies which variables define the profiles to be plotted. The (*) tells Mplus to place the variables at the x-axis values 1, 2, 3, … in the order listed. All classes are drawn on the same graph automatically.
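Since enumeration means running the same input several times with a different k, you can generate the input files instead of editing them by hand. The following is a minimal Python sketch, not part of Mplus: the function names (build_input, write_inputs) and the output file names are hypothetical, and the template simply repeats the variable names used in this post. TECH11/TECH14 are left out for k = 1 because they compare against a k-1 class model.

```python
from pathlib import Path

# Template mirrors the example syntax in this post; adjust names to your data.
TEMPLATE = """TITLE: {k}-Class LPA
DATA:
FILE IS "data.dat";
FORMAT IS FREE;
VARIABLE:
NAMES ARE id stud_id item1 item2 item3 item4 age gender;
USEVARIABLES ARE item1 item2 item3 item4;
MISSING ARE .;
IDVARIABLE IS id;
CLASSES = C({k});
ANALYSIS:
TYPE = MIXTURE;
ESTIMATOR = MLR;
STARTS = 500 100;
MODEL:
{class_blocks}
OUTPUT:
{output_line}
"""


def build_input(k: int) -> str:
    """Return the text of a k-class LPA input file (hypothetical helper)."""
    # One class-specific block per class: free means and free variances.
    blocks = "\n".join(
        f"%C#{c}%\n[item1-item4];\nitem1-item4;" for c in range(1, k + 1)
    )
    # The LRTs compare k vs. k-1 classes, so they make no sense for k = 1.
    output_line = "TECH1 TECH11 TECH14;" if k > 1 else "TECH1;"
    return TEMPLATE.format(k=k, class_blocks=blocks, output_line=output_line)


def write_inputs(max_k: int, out_dir: str = ".") -> list:
    """Write lpa_1class.inp ... lpa_<max_k>class.inp and return their paths."""
    paths = []
    for k in range(1, max_k + 1):
        path = Path(out_dir) / f"lpa_{k}class.inp"
        path.write_text(build_input(k))
        paths.append(path)
    return paths


# Preview the 2-class input without writing anything to disk.
print(build_input(2))
```

Calling write_inputs(5) would create five input files in the current folder, which you then run in Mplus one by one.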
Choosing the Number of Classes:
After running models for k=2, 3, 4, etc., compare:
- Information Criteria: AIC, BIC, sample-size Adjusted BIC (aBIC). Lower values generally indicate better fit relative to model complexity. BIC is often preferred for class enumeration.
- Likelihood Ratio Tests: Lo-Mendell-Rubin (LMR) LRT and Bootstrap LRT (BLRT). A significant p-value (< .05) suggests the k-class model fits significantly better than the k-1 class model. Non-significance suggests the k-1 model might be sufficient.
- Entropy: Ranges from 0 to 1. Higher values (e.g., > 0.80) indicate clearer separation between classes. Low entropy suggests individuals are not clearly assigned to a single class.
- Interpretability & Theory: Do the resulting classes make theoretical sense? Are the class sizes reasonable (avoiding tiny classes unless theoretically justified)?
Often, there’s no single “perfect” answer. Researchers weigh these different indices alongside theoretical considerations.
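Mplus prints the information criteria for you, but it helps to see how they are built from the log-likelihood, the number of free parameters, and the sample size. Here is a small Python sketch using the standard formulas. The log-likelihoods, parameter counts, and sample size below are invented for illustration and are not results from any real data set.

```python
import math


def information_criteria(loglik: float, n_params: int, n: int) -> dict:
    """AIC, BIC and sample-size adjusted BIC from a model's log-likelihood."""
    deviance = -2.0 * loglik
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * math.log(n),
        # The adjusted BIC replaces n with (n + 2) / 24 in the penalty term.
        "aBIC": deviance + n_params * math.log((n + 2) / 24),
    }


# HYPOTHETICAL values: k -> (log-likelihood, number of free parameters).
# Parameter counts follow 9k - 1 for 4 indicators with class-specific
# means and variances (8 per class) plus k - 1 class proportions.
runs = {2: (-2510.4, 17), 3: (-2472.9, 26), 4: (-2466.1, 35)}
n = 400  # hypothetical sample size

table = {k: information_criteria(ll, p, n) for k, (ll, p) in runs.items()}
best_k = min(table, key=lambda k: table[k]["BIC"])

for k, ic in table.items():
    print(k, {name: round(value, 1) for name, value in ic.items()})
print("Lowest BIC at k =", best_k)
```

In this made-up example the 4-class model has the best log-likelihood, but its extra parameters are penalized enough that the 3-class model has the lowest BIC.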
Step 2: Saving Class Information
This isn’t really a separate Mplus run, but rather utilizes the SAVEDATA command within the Step 1 syntax for your chosen best-fitting model. As shown above:
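SAVEDATA:
FILE IS lpa_FINAL_results.dat;
SAVE = CPROBABILITIES;
This creates a new data file (lpa_FINAL_results.dat) containing the analysis variables (here item1-item4), the ID variable, and any variables listed under AUXILIARY, plus new columns. Variables that appear only in NAMES are not saved. The exact column order is listed under SAVEDATA INFORMATION at the end of the output. The new columns typically include: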
- Posterior probabilities for each class (e.g., CPROB1, CPROB2, CPROB3 for a 3-class model).
- The most likely class membership (C).
This file is the basis for custom plots and, if you carry out the three steps manually, for the 3-step approach.
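Once the file is saved, you can read it outside Mplus and check it. The Python sketch below parses a few lines in the whitespace-separated layout Mplus writes, confirms that the saved class is the one with the highest posterior probability, and computes the relative entropy from the posterior probabilities. The three data rows are invented, and the column order is an assumption; always take the real order from the SAVEDATA INFORMATION section of your output. An asterisk is treated as missing, which is the default missing flag in saved files.

```python
import math

# HYPOTHETICAL excerpt of a saved file; column order is an assumption.
SAMPLE = """\
 3.200 4.100 3.900 4.400 101.000 0.950 0.040 0.010 1.000
 1.100 1.500     * 1.200 102.000 0.020 0.900 0.080 2.000
 2.900 1.300 3.800 1.000 103.000 0.100 0.150 0.750 3.000
"""
COLUMNS = ["item1", "item2", "item3", "item4", "id",
           "CPROB1", "CPROB2", "CPROB3", "C"]


def parse_savedata(text: str, columns: list) -> list:
    """Turn whitespace-separated rows into dicts; '*' becomes None."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = line.split()
        if len(values) != len(columns):
            raise ValueError("column count does not match: " + line)
        rows.append({name: (None if v == "*" else float(v))
                     for name, v in zip(columns, values)})
    return rows


def relative_entropy(prob_rows: list) -> float:
    """1 - sum_i sum_k(-p_ik * ln p_ik) / (n * ln K); 1 = perfect separation."""
    n, k = len(prob_rows), len(prob_rows[0])
    total = -sum(p * math.log(p) for row in prob_rows for p in row if p > 0)
    return 1.0 - total / (n * math.log(k))


rows = parse_savedata(SAMPLE, COLUMNS)
probs = [[r["CPROB1"], r["CPROB2"], r["CPROB3"]] for r in rows]

# The saved class should equal the class with the highest posterior probability.
mismatches = [r["id"] for r, p in zip(rows, probs)
              if p.index(max(p)) + 1 != int(r["C"])]
print("Rows where C is not the modal class:", mismatches)
print("Relative entropy:", round(relative_entropy(probs), 3))
```

With real data you would replace SAMPLE with the contents of the saved file, for example via open("lpa_FINAL_results.dat").read().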
Step 3: Relating Classes to Covariates and Outcomes (The 3-Step Approach)
Simply assigning individuals to their most likely class (saved in Step 2) and then running regressions or ANOVAs using this assignment as a predictor/outcome could lead to biased estimates by treating the assigned class as perfectly measured, ignoring the inherent uncertainty in classification.
The 3-Step Approach corrects for this classification error. Mplus offers several methods, including R3STEP (Asparouhov & Muthén, 2014) and BCH (Bolck, Croon, & Hagenaars, 2004; Vermunt, 2010).
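To make "classification error" concrete: the posterior probabilities let you estimate how often a true member of class c ends up assigned to class s. The sketch below estimates P(assigned = s | true = c) as the sum of the class-c posterior probabilities over everyone assigned to s, divided by the sum over everyone. This is the kind of table the 3-step corrections build on. The function name and the four rows of posterior probabilities are hypothetical.

```python
def classification_error_matrix(probs: list) -> list:
    """Rows = true class c, columns = assigned (most likely) class s.

    Entry [c][s] estimates P(assigned = s | true = c) from posterior
    probabilities. Assumes every class has non-zero total posterior weight.
    """
    k = len(probs[0])
    modal = [row.index(max(row)) for row in probs]  # most likely class per case
    matrix = []
    for c in range(k):
        weight = sum(row[c] for row in probs)  # expected size of class c
        matrix.append([
            sum(row[c] for row, m in zip(probs, modal) if m == s) / weight
            for s in range(k)
        ])
    return matrix


# HYPOTHETICAL posterior probabilities for four people, two classes.
posterior = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
for row in classification_error_matrix(posterior):
    print([round(value, 3) for value in row])
```

Off-diagonal entries are the misclassification rates. If they were all zero, using the assigned class directly would be harmless; the larger they are, the more the naive analysis is biased.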
Here’s how you implement the R3STEP method in Mplus to examine how age and gender predict class membership. With the AUXILIARY option, Mplus carries out all three steps automatically in a single run on the original data file, so the class model is specified exactly as in Step 1. How outcome_score differs across classes is tested in a separate run with an outcome-oriented setting such as BCH (see the comment in the syntax):
TITLE: 3-Class LPA - Step 3 Analysis (R3STEP);
DATA:
FILE IS "data.dat";
FORMAT IS FREE;
VARIABLE:
NAMES ARE id stud_id item1 item2 item3 item4 age gender;
USEVARIABLES ARE item1 item2 item3 item4;
MISSING ARE .;
IDVARIABLE IS id;
CLASSES = C(3);
! Covariates predicting class membership, corrected for classification error
AUXILIARY = age (R3STEP) gender (R3STEP);
! Distal outcome: use a separate run and add the variable to NAMES, e.g.
! AUXILIARY = outcome (BCH);
ANALYSIS:
TYPE = MIXTURE;
ESTIMATOR = MLR;
STARTS = 500 100;
MODEL:
%OVERALL%
%C#1%
[item1-item4];
item1-item4;
%C#2%
[item1-item4];
item1-item4;
%C#3%
[item1-item4];
item1-item4;
OUTPUT:
TECH1;
- DATA: FILE IS: Points to the original data file, not the SAVEDATA file. In the automatic procedure Mplus re-estimates the class model and works out the classification error itself.
- VARIABLE: NAMES ARE: Lists every column of the data file, including the covariates (age, gender). A distal outcome must also be a column in this file. Because Mplus variable names are limited to 8 characters, outcome_score needs a shorter name such as outcome.
- USEVARIABLES ARE: Lists only the class indicators (item1-item4). Auxiliary variables must not appear here.
- AUXILIARY = age (R3STEP) gender (R3STEP);: This is the core command for the 3-step approach.
- AUXILIARY tells Mplus to relate these variables to the class variable after the classes have been formed, using a method that accounts for classification error.
- (R3STEP) treats the listed variables as predictors of class membership. For distal outcomes, use an outcome setting instead: (BCH), (DU3STEP), or (DE3STEP) for continuous outcomes and (DCAT) for categorical outcomes. BCH is often recommended for continuous outcomes.
- Request predictors and outcomes in separate runs rather than mixing settings in one AUXILIARY statement.
- ANALYSIS: TYPE = MIXTURE;: Still required. Use the same STARTS as in your final Step 1 model so that the same solution is reproduced.
- MODEL: Identical to the Step 1 model. Do not add C ON age gender; here, since that would turn the analysis into the 1-step approach. Note also that outcome_score ON C; is not valid Mplus syntax, because an observed variable cannot be regressed on the latent class variable with ON.
- Results: The R3STEP output reports a multinomial logistic regression of C on age and gender, i.e., the probability of being in each class relative to a reference class, corrected for classification error. The BCH output reports the mean of the outcome in each class with overall and pairwise equality tests. This is like an ANOVA, corrected for classification error.
- OUTPUT/PLOT: Standard requests. Note that Mplus plots are less informative for Step 3 relationships compared to Step 1 profiles.
Interpreting and Visualizing the Results
1. Interpreting the Classes (from Step 1 Output):
- LPA: Look at the estimated means ([item#];) for each indicator within each class (%C#1%, %C#2%, etc.). Create a table or plot comparing these means across classes to understand the unique profile of each group. For example, Class 1 might have high means on all engagement items (“Highly Engaged”), Class 2 low means (“Disengaged”), and Class 3 high on some but low on others (“Selectively Engaged”). Check variances too – some classes might be more homogeneous than others.
- LCA: Look at the estimated item probabilities (derived from intercepts/thresholds in the output) for each indicator within each class. Which items are likely to be endorsed by members of each class? Class 1 might have a high probability of saying “yes” to items A and B but not C, while Class 2 has the opposite pattern.
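For binary LCA items, the link between a class-specific threshold and the endorsement probability is a simple logistic transform: P(u = 1 | class) = 1 / (1 + exp(threshold)). Mplus also prints the probabilities directly in the probability-scale part of the output, so the Python sketch below is mainly a way to check your reading of the thresholds. The class labels and threshold values are invented for illustration.

```python
import math


def endorsement_probability(threshold: float) -> float:
    """P(u = 1 | class) for a binary item, given its logit-scale threshold."""
    return 1.0 / (1.0 + math.exp(threshold))


# HYPOTHETICAL thresholds for three binary items in two classes.
thresholds = {
    "Class 1": {"itemA": -2.0, "itemB": -1.5, "itemC": 1.8},
    "Class 2": {"itemA": 1.6, "itemB": 2.1, "itemC": -1.9},
}

for label, items in thresholds.items():
    probabilities = {name: round(endorsement_probability(tau), 2)
                     for name, tau in items.items()}
    print(label, probabilities)
```

A negative threshold means the item is more likely than not to be endorsed in that class, and a positive threshold means the opposite.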
2. Interpreting Auxiliary Relationships (from Step 3 Output):
- Predictors (C ON …): Interpret the multinomial logistic regression coefficients. A significant coefficient for age predicting C#1 means age is associated with the likelihood of belonging to Class 1 rather than the reference class (by default, the last class).
- Outcomes: Interpret the class-specific means and their equality tests. A significant overall test indicates that the mean of the outcome differs across classes, and the pairwise tests show which classes differ from each other.
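Logit coefficients are hard to read directly, so it often helps to convert them into odds ratios and predicted class probabilities. The Python sketch below does this for a multinomial logistic model in which the last class is the reference (its logit is fixed at 0). All intercepts and slopes are invented, and the helper names are hypothetical.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Odds ratio for a one-unit increase in the predictor."""
    return math.exp(coefficient)


def class_probabilities(intercepts: list, slopes: list, x: dict) -> list:
    """Predicted class probabilities; the last class is the reference.

    intercepts: one value per non-reference class
    slopes:     one dict {predictor: coefficient} per non-reference class
    x:          predictor values for the person of interest
    """
    logits = [a + sum(b[name] * x[name] for name in b)
              for a, b in zip(intercepts, slopes)]
    logits.append(0.0)  # reference class
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# HYPOTHETICAL estimates for C#1 and C#2 (Class 3 is the reference).
intercepts = [0.4, -0.2]
slopes = [{"age": 0.05, "gender": -0.3}, {"age": -0.02, "gender": 0.6}]

print("Odds ratio for age, C#1 vs. reference:", round(odds_ratio(0.05), 3))
person = {"age": 10.0, "gender": 1.0}
print([round(p, 3) for p in class_probabilities(intercepts, slopes, person)])
```

Predicted probabilities for a few typical covariate profiles are usually easier to present than a table of logits.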
3. Visualization:
While Mplus offers basic plots (PLOT: TYPE=PLOT3;), they can be limited in customization. It’s often better to export the results (means/probabilities from the output file, or individual data from the SAVEDATA file) and use other software like R, Python, or even Excel for more polished and informative visualizations.
Ref: https://www.researchgate.net/figure/Radar-plot-of-LCA-results-This-visualization-represents-class-specific-posterior_fig1_335462225
- Line Plots:
- What: Plot the estimated mean score for each indicator variable (X-axis) separately for each latent class (different colored lines).
- How: Extract (or copy) the class-specific means from the Mplus output (under MODEL RESULTS -> Means). Input these into your plotting software.
- Example: You’d see distinct lines representing the average profile of the “Highly Engaged,” “Disengaged,” and “Selectively Engaged” students across the different engagement items.
- Radial / Spider Plots:
- What: Plot indicators as axes radiating from a central point. Each class is represented by a shape connecting its values (probabilities for LCA, or standardized means for LPA) on each axis. Allows quick visual comparison of the overall pattern/shape of profiles.
- How: Extract class-specific item probabilities (LCA) or standardized means (LPA – use STDYX output) from the Mplus output. Each axis represents an indicator, and you plot the value for each class on its corresponding axis, connecting the dots for each class.
- Example: You could easily compare the “stat distribution” across different engagement dimensions for each student profile simultaneously. One profile might look like a star skewed towards behavioral engagement, another towards cognitive engagement.
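If your plotting tool has no built-in radar chart, the geometry is easy to compute yourself. The Python sketch below turns one class's values into polygon corner points, with the first indicator at 12 o'clock and the rest following clockwise. The function name and the profile values are hypothetical; pass the resulting points to any tool that can draw a closed line.

```python
import math


def radar_polygon(values: list) -> list:
    """Return (x, y) corner points for one class profile on a radar chart.

    Each value is the distance from the centre along its indicator's axis.
    The first point is repeated at the end so the outline closes.
    """
    n = len(values)
    points = []
    for j, radius in enumerate(values):
        # Start at the top (90 degrees) and move clockwise around the circle.
        angle = math.pi / 2 - 2 * math.pi * j / n
        points.append((radius * math.cos(angle), radius * math.sin(angle)))
    points.append(points[0])
    return points


# HYPOTHETICAL endorsement probabilities for four items in one class.
profile = [0.9, 0.2, 0.7, 0.4]
for x, y in radar_polygon(profile):
    print(round(x, 3), round(y, 3))
```

For LPA, put the indicators on a common scale first (for example, standardized means shifted to be non-negative), because a radius cannot be negative.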
Getting Data for Plotting:
- Class Profiles (Means/Probabilities): Manually copy these from the relevant section of the Mplus .out file.
- Individual Data: Use the SAVEDATA file generated in Step 2. This file contains individual scores, class assignments, and probabilities, which you can load into R/Python/Excel for more complex plots (e.g., plotting individual trajectories within classes if you have longitudinal data).