[Python] Data Analysis using LLMs (ChatGPT/Claude/Gemini)
Large language models have become useful tools for data analysis tasks, from generating code to suggesting statistical approaches. Recent research demonstrates that LLMs can effectively support data preprocessing, exploratory analysis, visualization, and statistical modeling when prompted appropriately. However, LLM-generated code achieves below 40% accuracy in biomedical data science tasks without careful prompting strategies, rising to 74% with structured approaches (reference). This gap between potential and performance highlights the importance of effective human-AI collaboration in data analysis workflows.
Most researchers already know that uploading data to LLMs raises privacy and compliance concerns. Even for publicly available data, Data Use Agreements typically restrict sharing with third parties, and much of what we work with is sensitive or restricted. But here’s what’s often overlooked: you don’t need to share your data to use LLMs for analysis. You describe your data structure (e.g., variable names, types, coding schemes), specify the tasks you need done, and the LLM generates code that you run locally. The model never sees your actual data.
This guide shows how to work with LLMs this way. Rather than treating the LLM as either a black box or a replacement for analyst judgment, the approach here positions the LLM as a collaborative tool: one that can suggest approaches, generate code, and accelerate routine tasks while you maintain control over your data and methodological decisions.
Data Analysis using LLMs: Overview
* without sharing your data with LLMs

I think effective LLM use in data analysis requires balancing two modes: constraining the model when you know what you want, and inviting suggestions when you’re exploring options.
- When to constrain: If you have a specific analysis in mind, provide exact specifications. LLMs generate responses by predicting statistically likely continuations. Without constraints, they fill ambiguity with plausible defaults that may not match your data structure, variable names, or analytic goals.
- When to invite suggestions: If you’re uncertain about appropriate methods, let the LLM propose options. LLMs have been trained on extensive statistical and programming documentation. They can suggest analyses you might not have considered, identify potential issues with proposed approaches, and explain tradeoffs between methods.
Stage 1: Data Preparation
Data preparation includes importing, merging, subsetting, and restructuring datasets. This stage establishes the analytic dataset.
Providing context
LLMs cannot see your data. They don’t know your variable names, data types, or coding schemes. Before requesting any data manipulation, establish context:
I have two dataframes loaded in Python:
- demographics (n=5,000): participant_id, age, gender, race_ethnicity
- survey (n=4,800): participant_id, phq9_1 through phq9_9
Merge using left join on participant_id.
Keep all rows from demographics.
Name the result merged_df.
This prompt specifies what exists (dataframe names, variables), what to do (left join), and what to produce (merged_df). The model doesn’t need to guess.
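Given that prompt, the generated code should reduce to a single pandas merge. Here is a minimal sketch of what to expect, assuming demographics and survey are already loaded as described:

```python
# A minimal sketch, assuming demographics and survey are already in memory
# as described in the prompt above.
merged_df = demographics.merge(
    survey,
    on="participant_id",
    how="left",              # keep all rows from demographics
    validate="one_to_one",   # optional: fail loudly if participant_id is duplicated
)
print(merged_df.shape)       # expect 5,000 rows
```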
Why codebooks matter
Value coding is where LLM errors frequently occur. If your income variable uses codes 1-8 for income brackets and 99 for “refused,” the model won’t know this unless you specify it:
Variable: household_income
Type: Ordinal
Values:
1 = Less than $10,000
2 = $10,000-$19,999
...
8 = $100,000 or more
99 = Refused (treat as missing)
Create binary variable low_income:
1 if household_income <= 3
0 if household_income >= 4
NaN if household_income == 99
Providing the codebook prevents the model from treating 99 as a high income value or misinterpreting ordinal scales.
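A correct response to that prompt is a short recode. A minimal sketch of how it might look, assuming merged_df is already in memory and coded as in the codebook:

```python
import numpy as np

# A minimal sketch of the recode above, assuming merged_df is already loaded.
income = merged_df["household_income"].replace(99, np.nan)   # 99 = Refused -> missing
merged_df["low_income"] = np.where(income <= 3, 1.0, 0.0)    # 1 = codes 1-3, 0 = codes 4-8
merged_df.loc[income.isna(), "low_income"] = np.nan          # keep refused as missing
```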
Stage 2: Variable Construction
Creating derived variables (composite scores, recoded categories, transformed measures) requires precise specifications when you know what you want, but it can also benefit from LLM suggestions when standard approaches exist.
Specifying transformations
For standard clinical measures, specify the scoring rules:
I have phq9_1 through phq9_9, each coded 0-3.
Create phq9_total as the sum of all 9 items.
Create depression_severity using PHQ-9 clinical cutoffs:
0 = "Minimal" (score 0-4)
1 = "Mild" (score 5-9)
2 = "Moderate" (score 10-14)
3 = "Moderately severe" (score 15-19)
4 = "Severe" (score 20-27)
Handle missing: if any item is missing, both derived variables should be NaN.
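With that specification, the generated code should enforce the missing-data rule explicitly. A minimal sketch, assuming the items live in merged_df:

```python
import pandas as pd

# A minimal sketch of the specification above; assumes phq9_1 ... phq9_9 are in merged_df.
phq_items = [f"phq9_{i}" for i in range(1, 10)]

# min_count=9 forces the sum to NaN when any of the 9 items is missing.
merged_df["phq9_total"] = merged_df[phq_items].sum(axis=1, min_count=9)

# Clinical cutoffs: 0-4, 5-9, 10-14, 15-19, 20-27 -> codes 0-4 (Minimal ... Severe).
merged_df["depression_severity"] = pd.cut(
    merged_df["phq9_total"],
    bins=[-1, 4, 9, 14, 19, 27],
    labels=[0, 1, 2, 3, 4],
)
```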
Asking for suggestions
When you’re unsure about standard approaches, invite the LLM to suggest options:
I have a continuous income variable (annual household income in dollars).
I want to create a categorical version for analysis.
What are common approaches for categorizing income in health research?
What are the tradeoffs between using:
- Quartiles based on my sample
- Federal Poverty Level thresholds
- Fixed dollar ranges
The LLM can explain that FPL-based categories allow comparison across studies but require household size adjustment, while sample quartiles maximize statistical power but limit generalizability. This kind of methodological guidance is where LLMs add value beyond code generation.
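If you settle on sample quartiles, the resulting code is essentially a one-liner with pd.qcut. A sketch, assuming a continuous column named annual_income (the column name is illustrative):

```python
import pandas as pd

# Illustrating only the sample-quartile option; FPL-based categories would
# additionally need household size and year-specific thresholds (not shown).
merged_df["income_quartile"] = pd.qcut(
    merged_df["annual_income"],
    q=4,
    labels=["Q1 (lowest)", "Q2", "Q3", "Q4 (highest)"],
)
```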
Stage 3: Statistical Analysis
Choosing appropriate analyses depends on variable measurement levels and research questions. This stage benefits most from the collaborative approach, using LLM suggestions when exploring options, then switching to precise specifications when executing.
Asking for analysis suggestions
When uncertain about the appropriate analysis, describe your variables and research question:
I want to understand the relationship between these variables:
DV: depression_severity (ordinal, 5 levels: minimal to severe)
IV: age_group (ordinal, 4 levels: 18-29, 30-44, 45-64, 65+)
Control: gender (binary), income (ordinal, 5 levels)
Research question: Is the distribution of depression severity different across age groups, controlling for gender and income?
What statistical approaches would be appropriate? What are the assumptions and tradeoffs?
The LLM might suggest ordinal logistic regression for the outcome with ordinal predictors, or a chi-square test if you only care about bivariate association, or cumulative link models if proportional odds assumptions hold. It can explain when each is appropriate and what to check.
Specifying analysis execution
Once you’ve chosen an approach, switch to precise specification:
Run ordinal logistic regression:
- DV: depression_severity (ordinal, 5 levels)
- IV: age_group (reference: 18-29)
- Controls: gender, income_category
Dataframe: merged_df
Output: odds ratios with 95% CI for each predictor
Check proportional odds assumption
Use Python statsmodels
Code only.
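For this specification, the LLM would likely reach for statsmodels’ OrderedModel. The sketch below shows the core fit and odds-ratio extraction, assuming the column names from earlier examples; a formal proportional odds check (e.g., a Brant test) is not built into statsmodels, so it is omitted here:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# A sketch of the specification above; column names follow the earlier examples.
cols = ["depression_severity", "age_group", "gender", "income_category"]
data = merged_df[cols].dropna()

# OrderedModel expects an ordered categorical outcome; the severity codes (0-4)
# sort in increasing severity, so ordering by value is correct here.
y = data["depression_severity"].astype(pd.CategoricalDtype(ordered=True))

# Dummy-code categorical predictors; with drop_first=True the first level of
# age_group (18-29) is dropped and acts as the reference category.
X = pd.get_dummies(data[["age_group", "gender", "income_category"]],
                   drop_first=True).astype(float)

result = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)

# Predictor coefficients come first in params; the threshold (cutpoint)
# parameters follow, so slice them off before exponentiating.
k = X.shape[1]
or_table = pd.DataFrame({
    "OR": np.exp(result.params[:k]),
    "CI_lower": np.exp(result.conf_int().iloc[:k, 0]),
    "CI_upper": np.exp(result.conf_int().iloc[:k, 1]),
})
print(or_table.round(2))
```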
Reference: Matching analyses to variable types
| Research Question | DV Type | IV Type | Common Approaches |
|---|---|---|---|
| Group mean differences | Continuous | Categorical (2 groups) | Independent t-test, Mann-Whitney U |
| Group mean differences | Continuous | Categorical (3+ groups) | ANOVA, Kruskal-Wallis |
| Association between categories | Categorical | Categorical | Chi-square, Fisher’s exact |
| Correlation | Continuous | Continuous | Pearson r, Spearman ρ |
| Predict binary outcome | Binary | Any | Logistic regression |
| Predict continuous outcome | Continuous | Any | Linear regression |
| Predict ordinal outcome | Ordinal | Any | Ordinal logistic regression |
| Predict count outcome | Count | Any | Poisson, negative binomial |
Examples
| Scenario | Suggested Analysis |
|---|---|
| Compare depression scores across racial/ethnic groups | One-way ANOVA or Kruskal-Wallis (check normality) |
| Predict service utilization (yes/no) from income, insurance, need | Logistic regression |
| Examine if discrimination predicts distress, controlling for demographics | Multiple linear regression |
| Compare pre/post intervention anxiety scores | Paired t-test or Wilcoxon signed-rank |
| Test if age moderates the social support-depression relationship | Moderated regression with interaction term |
Stage 4: Data Visualization
Visualization choices follow from variable types and analytic goals. LLMs can both suggest appropriate visualizations and generate customized code.
Asking for visualization suggestions
When you’re not sure what visualization fits your data:
I have the following analysis result:
- Mean PHQ-9 scores across 5 racial/ethnic groups
- With 95% confidence intervals
- Sample sizes ranging from 89 to 2,341 across groups
What visualization would effectively communicate these results for a journal article?
What should I consider regarding the unequal sample sizes?
The LLM might suggest a bar chart with error bars, note that confidence interval widths will vary with sample size (which is informative, not a problem), and recommend ordering groups by sample size or effect size rather than alphabetically.
Providing reference images
One of the most effective ways to communicate visualization style is to attach an image directly. Most LLM interfaces now support image uploads. When you find a visualization you like in a journal article, a report, or online, you can attach it and ask the LLM to replicate the style:
[Attach image of visualization you like]
Recreate this style of visualization for my data:
- X-axis: race_ethnicity (5 categories)
- Y-axis: phq9_mean with 95% CI error bars
- Dataframe: summary_stats
Match the color scheme, font style, and layout as closely as possible.
This approach is far more effective than trying to describe visual elements in words. The LLM can see the spacing, color palette, axis formatting, legend placement, and overall aesthetic, then translate those into code. If the reference image uses a specific style (e.g., minimalist with no gridlines, or a particular color gradient), the LLM can identify and replicate it.
You can also attach multiple reference images:
[Attach image 1: bar chart style you like]
[Attach image 2: color palette reference]
Create a bar chart using the layout style from image 1 and the color scheme from image 2.
My data: ...
Specifying visualization components
For precise control, specify each component:
Create a grouped bar chart:
Data:
- X-axis: age_group (4 categories)
- Y-axis: percentage with depression (moderate or above)
- Grouping: gender (2 categories)
- Dataframe: prevalence_df
Formatting:
- Figure size: (10, 6)
- Colors: #2c7bb6 for male, #d7191c for female
- Y-axis range: 0 to 40
- Y-axis label: "Prevalence (%)"
- Font: Arial, 12pt for axis labels, 14pt for title
- Error bars: 95% CI
- Legend: upper right, no frame
- Remove top and right spines
Export: PNG at 300 DPI
Use matplotlib. Code only.
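The code that comes back should map each specification to an explicit matplotlib call. A sketch of what to expect, assuming prevalence_df contains age_group, gender, prevalence, and a ci_95 column holding the half-width of the confidence interval (these column names are assumptions, not from the prompt):

```python
import matplotlib.pyplot as plt
import numpy as np

# A sketch of the specification above; column names in prevalence_df are assumed.
plt.rcParams["font.family"] = "Arial"   # needs Arial installed; matplotlib falls back otherwise

groups = ["18-29", "30-44", "45-64", "65+"]
colors = {"Male": "#2c7bb6", "Female": "#d7191c"}
x = np.arange(len(groups))
width = 0.38

fig, ax = plt.subplots(figsize=(10, 6))
for i, sex in enumerate(["Male", "Female"]):
    sub = prevalence_df[prevalence_df["gender"] == sex].set_index("age_group").loc[groups]
    ax.bar(x + (i - 0.5) * width, sub["prevalence"], width,
           yerr=sub["ci_95"], capsize=4, color=colors[sex], label=sex)

ax.set_ylim(0, 40)
ax.set_ylabel("Prevalence (%)", fontsize=12)
ax.set_xticks(x)
ax.set_xticklabels(groups, fontsize=12)
ax.legend(loc="upper right", frameon=False)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

fig.savefig("prevalence_by_age_gender.png", dpi=300, bbox_inches="tight")
```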
Matching visualizations to data types
| Purpose | Variable Types | Visualization Options |
|---|---|---|
| Distribution of one continuous variable | Continuous | Histogram, density plot, box plot |
| Distribution of one categorical variable | Categorical | Bar chart |
| Compare continuous across groups | Continuous × Categorical | Grouped box plot, violin plot, bar chart with error bars |
| Compare proportions across groups | Binary × Categorical | Grouped bar chart, stacked bar chart |
| Relationship between two continuous | Continuous × Continuous | Scatter plot, with optional regression line |
| Distribution of ordinal across groups | Ordinal × Categorical | Stacked bar chart (100%), diverging bar chart |
| Change over time | Continuous × Time | Line chart |
Some Prompting Tips
These patterns come from real data analysis workflows. Each addresses a specific failure mode in how LLMs generate code.
1. State that objects already exist, with names and structure
Why: LLMs are trained on tutorials and documentation where code is self-contained. Without explicit context, they default to generating pd.read_csv(), dummy data, or invented variable names, wasting tokens and requiring manual cleanup.
The dataframe is called merged_df (n=4,800 rows).
It's already loaded in memory.
Columns include: participant_id, age, gender, race_ethnicity, phq9_total
Don't include pd.read_csv() or any data loading code.
Don't simulate or create example data.
2. Request minimal, localized edits (not full rewrites)
Why: LLMs optimize for coherent, complete responses. When you paste code asking for a fix, rewriting everything feels more “complete” than a surgical edit. But rewrites introduce new bugs, break working logic, and make diffs unreadable.
In this code, revise only the _extract_age_data method.
Keep the function signature and return type.
Don't modify any other methods in the class.
Don't change existing imports.
For small fixes:
Change only lines 127-130.
The rest of the code is working. Don't rewrite it.
3. Specify your environment and what’s already imported
Why: LLMs don’t know if you’re in Jupyter, Colab, VS Code, or a production server. Without this context, they hedge by including all possible setup code, authentication, imports, drive mounting, most of which you don’t need.
Environment: Jupyter Notebook in Google Colab
Already imported: pandas as pd, numpy as np, matplotlib.pyplot as plt
Google Drive is mounted at /content/drive
4. Provide codebook details for variable manipulation
Why: This is where LLM errors are most dangerous. If you ask to “recode income as binary,” the LLM will guess at cutoffs, miss special codes (like 99 for “refused”), and produce code that runs without errors but generates wrong results.
Variable: household_income
Type: Ordinal (integer codes)
Current values:
1 = Less than $10,000
2 = $10,000-$19,999
3 = $20,000-$29,999
4 = $30,000-$39,999
5 = $40,000 or more
99 = Refused (treat as missing)
Create binary variable low_income:
1 if household_income <= 2
0 if household_income >= 3
NaN if household_income == 99
5. Localize debugging with line numbers and error messages
Why: Vague debugging requests (“why doesn’t this work?”) activate the LLM’s tendency to speculate about multiple possible causes. You get a list of things that might be wrong instead of a fix for what is wrong.
Line 127 raises KeyError: 'Spanish_Total'
I verified the column exists with df.columns.tolist()
The dataframe has columns: English_Total, Spanish_Total, Other_Total
Show me the most likely cause and a minimal fix.
6. Build in fallback logic for messy data
Why: LLMs generate code for clean, well-structured data. Real-world data, especially from OCR, web scraping, or inconsistent sources, has variations that break brittle patterns. Asking for fallbacks upfront prevents code that works on your test case but fails on edge cases.
This regex needs to handle OCR variations.
If the primary pattern fails, try a more permissive pattern.
If both fail, return None instead of raising an error.
Add logging to track which pattern succeeded.
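The resulting code should try patterns in order and degrade gracefully. A sketch with illustrative regexes (the actual patterns depend on your OCR output):

```python
import logging
import re

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# The regexes below are illustrative placeholders, not from the original prompt.
PRIMARY = re.compile(r"Total:\s*(\d{1,3}(?:,\d{3})*)")      # well-formed "Total: 1,234"
PERMISSIVE = re.compile(r"T[o0]tal\W{0,3}(\d[\d,\s]*)")     # tolerate common OCR noise

def extract_total(text: str):
    """Return the extracted number as a string, or None if both patterns fail."""
    match = PRIMARY.search(text)
    if match:
        logger.info("primary pattern matched")
        return match.group(1)
    match = PERMISSIVE.search(text)
    if match:
        logger.info("permissive fallback matched")
        return match.group(1).strip()
    logger.warning("no pattern matched")
    return None
```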
7. Demand structured output
Why: LLMs are trained to be helpful, which means explaining their reasoning. For code generation, this produces prose you have to scroll past. Explicit format constraints eliminate the noise and use fewer computing resources (reducing environmental costs).
Code only.
No explanation before or after.
No markdown formatting or backticks.
No comments unless they clarify non-obvious logic.
8. Attach reference images for visualization style
Why: Describing visual style in words is inefficient and imprecise. “Clean, minimalist, with a blue color scheme” could mean many things. An image communicates color palette, spacing, font choices, axis formatting, and layout in one glance.
[Attach image of visualization]
Recreate this style for my data:
- X-axis: race_ethnicity (5 categories)
- Y-axis: prevalence (%) with 95% CI error bars
- Dataframe: summary_df
Match the color scheme, font style, axis formatting, and spacing.
9. Export results to CSV/XLSX: don’t let LLMs format statistics
Why: When you ask an LLM to report statistical results directly, it may hallucinate numbers, round incorrectly, or misplace decimal points. Instead, ask for code that exports results to a structured file. You then format the table yourself to match journal requirements (APA, AMA, etc.).
Run logistic regression and export results to CSV:
- Include: variable names, OR, 95% CI lower, 95% CI upper, p-value
- Round OR and CI to 2 decimal places
- Round p-value to 3 decimal places
- Save as 'logistic_results.csv'
Code only.
This gives you a clean data file you can format in Excel or import into your manuscript, with values you can verify against the model output.
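The generated export code might look like the sketch below, using statsmodels’ formula API; the outcome and predictor names (used_services, income_category, insurance, need_score) are placeholders for your own variables:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# A sketch of the export pattern above; variable names are placeholders.
result = smf.logit(
    "used_services ~ C(income_category) + C(insurance) + need_score",
    data=merged_df,
).fit(disp=False)

params = result.params.drop("Intercept")
conf = result.conf_int().drop("Intercept")

results_df = pd.DataFrame({
    "variable": params.index,
    "OR": np.exp(params).round(2).values,
    "ci_lower": np.exp(conf[0]).round(2).values,
    "ci_upper": np.exp(conf[1]).round(2).values,
    "p_value": result.pvalues.drop("Intercept").round(3).values,
})
results_df.to_csv("logistic_results.csv", index=False)
```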
10. Verify outputs: LLMs hallucinate confidently
Why: LLM-generated code often looks correct but contains subtle errors: invented variable names, wrong value codes, inappropriate methods. The code runs without errors, but the results are wrong. Verification isn’t optional.
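A few quick checks catch most of these errors: cross-tabulate recoded variables against their source, confirm derived scores stay within valid ranges, and confirm missingness propagates as specified. A sketch, assuming the variables created in earlier examples:

```python
import pandas as pd

# Quick verification checks, assuming the variables created in earlier examples.
# 1. Did the recode preserve the intended mapping?
print(pd.crosstab(merged_df["household_income"], merged_df["low_income"]))

# 2. Do derived scores stay within their valid range?
print(merged_df["phq9_total"].describe())   # expect min >= 0 and max <= 27

# 3. Does missingness propagate the way the specification required?
print(merged_df[["phq9_total", "depression_severity"]].isna().sum())
```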
