[Python] Data Analysis using LLMs (ChatGPT/Claude/Gemini)

Large language models have become useful tools for data analysis tasks, from generating code to suggesting statistical approaches. Recent research demonstrates that LLMs can effectively support data preprocessing, exploratory analysis, visualization, and statistical modeling when prompted appropriately. However, LLM-generated code achieves below 40% accuracy in biomedical data science tasks without careful prompting strategies, rising to 74% with structured approaches (reference). This gap between potential and performance highlights the importance of effective human-AI collaboration in data analysis workflows.

Most researchers already know that uploading data to LLMs raises privacy and compliance concerns. Even for publicly available data, Data Use Agreements typically restrict sharing with third parties, and much of what we work with is sensitive or restricted. But here’s what’s often overlooked: you don’t need to share your data to use LLMs for analysis. You describe your data structure (e.g., variable names, types, coding schemes), specify the tasks you need, and the LLM generates code that you run locally. The model never sees your actual data.

This guide shows how to work with LLMs this way. Rather than treating the LLM as either a black box or a replacement for analyst judgment, the approach here positions the LLM as a collaborative tool: one that can suggest approaches, generate code, and accelerate routine tasks while you maintain control over your data and methodological decisions.

Data Analysis using LLMs: Overview

* without sharing your data with LLMs

I think effective LLM use in data analysis requires balancing two modes: constraining the model when you know what you want, and inviting suggestions when you’re exploring options.

  • When to constrain: If you have a specific analysis in mind, provide exact specifications. LLMs generate responses by predicting statistically likely continuations. Without constraints, they fill ambiguity with plausible defaults that may not match your data structure, variable names, or analytic goals.
  • When to invite suggestions: If you’re uncertain about appropriate methods, let the LLM propose options. LLMs have been trained on extensive statistical and programming documentation. They can suggest analyses you might not have considered, identify potential issues with proposed approaches, and explain tradeoffs between methods.

Stage 1: Data Preparation

Data preparation includes importing, merging, subsetting, and restructuring datasets. This stage establishes the analytic dataset.

Providing context

LLMs cannot see your data. They don’t know your variable names, data types, or coding schemes. Before requesting any data manipulation, establish context:

I have two dataframes loaded in Python:
- demographics (n=5,000): participant_id, age, gender, race_ethnicity
- survey (n=4,800): participant_id, phq9_1 through phq9_9

Merge using left join on participant_id.
Keep all rows from demographics.
Name the result merged_df.

This prompt specifies what exists (dataframe names, variables), what to do (left join), and what to produce (merged_df). The model doesn’t need to guess.
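
For reference, the code that comes back from a prompt like this should be only a few lines. A minimal sketch of what correct output looks like, assuming both dataframes are already loaded as described:

import pandas as pd

# Left join keeps every row from demographics; survey columns are NaN for
# participants who have no survey record.
merged_df = demographics.merge(survey, on="participant_id", how="left")

# Quick check: row count should equal demographics if participant_id is
# unique in survey.
print(len(merged_df), len(demographics))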

Why codebooks matter

Value coding is where LLM errors frequently occur. If your income variable uses codes 1-8 for income brackets and 99 for “refused,” the model won’t know this unless you specify it:

Variable: household_income
Type: Ordinal
Values:
  1 = Less than $10,000
  2 = $10,000-$19,999
  ...
  8 = $100,000 or more
  99 = Refused (treat as missing)

Create binary variable low_income:
  1 if household_income <= 3
  0 if household_income >= 4
  NaN if household_income == 99

Providing the codebook prevents the model from treating 99 as a high income value or misinterpreting ordinal scales.
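
With the codebook in the prompt, the generated recode is short and easy to audit. A minimal sketch of what to expect, assuming the variable lives in merged_df and numpy is imported:

import numpy as np

# Recode household_income into low_income; 99 (Refused) becomes missing.
# The 99 condition comes first so it is not swept into the >= 4 branch.
income = merged_df["household_income"]
merged_df["low_income"] = np.select(
    [income == 99, income <= 3, income >= 4],
    [np.nan, 1, 0],
    default=np.nan,
)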


Stage 2: Variable Construction

Creating derived variables (composite scores, recoded categories, transformed measures) requires precise specifications when you know what you want, but it can also benefit from LLM suggestions when standard approaches exist.

Specifying transformations

For standard clinical measures, specify the scoring rules:

I have phq9_1 through phq9_9, each coded 0-3.

Create phq9_total as the sum of all 9 items.
Create depression_severity using PHQ-9 clinical cutoffs:
  0 = "Minimal" (score 0-4)
  1 = "Mild" (score 5-9)
  2 = "Moderate" (score 10-14)
  3 = "Moderately severe" (score 15-19)
  4 = "Severe" (score 20-27)

Handle missing: if any item is missing, both derived variables should be NaN.
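
A sketch of the kind of code this prompt should produce, assuming the items live in a dataframe called merged_df:

import pandas as pd

# skipna=False makes the row sum NaN if any of the nine items is missing.
phq_items = [f"phq9_{i}" for i in range(1, 10)]
merged_df["phq9_total"] = merged_df[phq_items].sum(axis=1, skipna=False)

# Standard PHQ-9 severity bands; pd.cut returns NaN where phq9_total is NaN.
bins = [-1, 4, 9, 14, 19, 27]
labels = ["Minimal", "Mild", "Moderate", "Moderately severe", "Severe"]
merged_df["depression_severity"] = pd.cut(
    merged_df["phq9_total"], bins=bins, labels=labels
)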

Asking for suggestions

When you’re unsure about standard approaches, invite the LLM to suggest options:

I have a continuous income variable (annual household income in dollars).
I want to create a categorical version for analysis.

What are common approaches for categorizing income in health research?
What are the tradeoffs between using:
- Quartiles based on my sample
- Federal Poverty Level thresholds
- Fixed dollar ranges

The LLM can explain that FPL-based categories allow comparison across studies but require household size adjustment, while sample quartiles maximize statistical power but limit generalizability. This kind of methodological guidance is where LLMs add value beyond code generation.
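
Once you settle on an approach, the request becomes a precise specification again. If you chose sample quartiles, for instance, the returned code might be a one-liner like this sketch (the dataframe and column names here are placeholders):

import pandas as pd

# Sample-based quartiles, labeled from lowest (Q1) to highest (Q4) income.
df["income_quartile"] = pd.qcut(
    df["annual_income"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)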


Stage 3: Statistical Analysis

Choosing appropriate analyses depends on variable measurement levels and research questions. This stage benefits most from the collaborative approach, using LLM suggestions when exploring options, then switching to precise specifications when executing.

Asking for analysis suggestions

When uncertain about the appropriate analysis, describe your variables and research question:

I want to understand the relationship between these variables:

DV: depression_severity (ordinal, 5 levels: minimal to severe)
IV: age_group (ordinal, 4 levels: 18-29, 30-44, 45-64, 65+)
Control: gender (binary), income (ordinal, 5 levels)

Research question: Is the distribution of depression severity different across age groups, controlling for gender and income?

What statistical approaches would be appropriate? What are the assumptions and tradeoffs?

The LLM might suggest ordinal logistic regression for the outcome with ordinal predictors, or a chi-square test if you only care about bivariate association, or cumulative link models if proportional odds assumptions hold. It can explain when each is appropriate and what to check.

Specifying analysis execution

Once you’ve chosen an approach, switch to precise specification:

Run ordinal logistic regression:
- DV: depression_severity (ordinal, 5 levels)
- IV: age_group (reference: 18-29)
- Controls: gender, income_category

Dataframe: merged_df

Output: odds ratios with 95% CI for each predictor
Check proportional odds assumption
Use Python statsmodels

Code only.
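
The code returned for this specification will vary, but a reasonable version built on statsmodels' OrderedModel looks roughly like the sketch below. Note that statsmodels has no built-in proportional odds (Brant) test, so that check has to be requested as separate code.

import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Outcome must be ordinal: numeric codes (0-4) or a pandas ordered
# categorical (pd.cut output already is) both work.
y = merged_df["depression_severity"]

# Dummy-code predictors; drop_first=True leaves 18-29 as the reference,
# assuming the age_group categories sort in the order listed above.
X = pd.get_dummies(
    merged_df[["age_group", "gender", "income_category"]],
    drop_first=True,
).astype(float)

res = OrderedModel(y, X, distr="logit").fit(method="bfgs", disp=False)

# Odds ratios with 95% CIs for the slope coefficients
# (the trailing threshold/cutpoint parameters are excluded).
k = X.shape[1]
ci = np.exp(res.conf_int().iloc[:k])
or_table = pd.DataFrame({
    "OR": np.exp(res.params.iloc[:k]),
    "CI_lower": ci.iloc[:, 0],
    "CI_upper": ci.iloc[:, 1],
})
print(or_table.round(2))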

Reference: Matching analyses to variable types

| Research Question | DV Type | IV Type | Common Approaches |
|---|---|---|---|
| Group mean differences | Continuous | Categorical (2 groups) | Independent t-test, Mann-Whitney U |
| Group mean differences | Continuous | Categorical (3+ groups) | ANOVA, Kruskal-Wallis |
| Association between categories | Categorical | Categorical | Chi-square, Fisher’s exact |
| Correlation | Continuous | Continuous | Pearson r, Spearman ρ |
| Predict binary outcome | Binary | Any | Logistic regression |
| Predict continuous outcome | Continuous | Any | Linear regression |
| Predict ordinal outcome | Ordinal | Any | Ordinal logistic regression |
| Predict count outcome | Count | Any | Poisson, negative binomial |

Examples

| Scenario | Suggested Analysis |
|---|---|
| Compare depression scores across racial/ethnic groups | One-way ANOVA or Kruskal-Wallis (check normality) |
| Predict service utilization (yes/no) from income, insurance, need | Logistic regression |
| Examine if discrimination predicts distress, controlling for demographics | Multiple linear regression |
| Compare pre/post intervention anxiety scores | Paired t-test or Wilcoxon signed-rank |
| Test if age moderates the social support-depression relationship | Moderated regression with interaction term |

Stage 4: Data Visualization

Visualization choices follow from variable types and analytic goals. LLMs can both suggest appropriate visualizations and generate customized code.

Asking for visualization suggestions

When you’re not sure what visualization fits your data:

I have the following analysis result:
- Mean PHQ-9 scores across 5 racial/ethnic groups
- With 95% confidence intervals
- Sample sizes ranging from 89 to 2,341 across groups

What visualization would effectively communicate these results for a journal article?
What should I consider regarding the unequal sample sizes?

The LLM might suggest a bar chart with error bars, note that confidence interval widths will vary with sample size (which is informative, not a problem), and recommend ordering groups by sample size or effect size rather than alphabetically.

Providing reference images

One of the most effective ways to communicate visualization style is to attach an image directly. Most LLM interfaces now support image uploads. When you find a visualization you like in a journal article, a report, or online, you can attach it and ask the LLM to replicate the style:

[Attach image of visualization you like]

Recreate this style of visualization for my data:
- X-axis: race_ethnicity (5 categories)
- Y-axis: phq9_mean with 95% CI error bars
- Dataframe: summary_stats

Match the color scheme, font style, and layout as closely as possible.

This approach is far more effective than trying to describe visual elements in words. The LLM can see the spacing, color palette, axis formatting, legend placement, and overall aesthetic, then translate those into code. If the reference image uses a specific style (e.g., minimalist with no gridlines, or a particular color gradient), the LLM can identify and replicate it.

You can also attach multiple reference images:

[Attach image 1: bar chart style you like]
[Attach image 2: color palette reference]

Create a bar chart using the layout style from image 1 and the color scheme from image 2.
My data: ...

Specifying visualization components

For precise control, specify each component:

Create a grouped bar chart:

Data:
- X-axis: age_group (4 categories)
- Y-axis: percentage with depression (moderate or above)
- Grouping: gender (2 categories)
- Dataframe: prevalence_df

Formatting:
- Figure size: (10, 6)
- Colors: #2c7bb6 for male, #d7191c for female
- Y-axis range: 0 to 40
- Y-axis label: "Prevalence (%)"
- Font: Arial, 12pt for axis labels, 14pt for title
- Error bars: 95% CI
- Legend: upper right, no frame
- Remove top and right spines

Export: PNG at 300 DPI

Use matplotlib. Code only.
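
A sketch of the kind of script this specification should yield. The prompt does not spell out the structure of prevalence_df, so the column names used below (prevalence, ci_low, ci_high) are assumptions:

import matplotlib.pyplot as plt
import numpy as np

age_groups = ["18-29", "30-44", "45-64", "65+"]
x = np.arange(len(age_groups))
width = 0.38
colors = {"Male": "#2c7bb6", "Female": "#d7191c"}

plt.rcParams["font.family"] = "Arial"
fig, ax = plt.subplots(figsize=(10, 6))

for i, gender in enumerate(["Male", "Female"]):
    sub = (prevalence_df[prevalence_df["gender"] == gender]
           .set_index("age_group").loc[age_groups])
    # Asymmetric error bars derived from the CI bounds.
    err = [sub["prevalence"] - sub["ci_low"], sub["ci_high"] - sub["prevalence"]]
    ax.bar(x + (i - 0.5) * width, sub["prevalence"], width,
           yerr=err, capsize=3, color=colors[gender], label=gender)

ax.set_xticks(x)
ax.set_xticklabels(age_groups, fontsize=12)
ax.set_ylim(0, 40)
ax.set_ylabel("Prevalence (%)", fontsize=12)
ax.set_xlabel("Age group", fontsize=12)
ax.legend(loc="upper right", frameon=False)
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)

fig.savefig("prevalence_by_age_gender.png", dpi=300, bbox_inches="tight")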

Matching visualizations to data types

| Purpose | Variable Types | Visualization Options |
|---|---|---|
| Distribution of one continuous variable | Continuous | Histogram, density plot, box plot |
| Distribution of one categorical variable | Categorical | Bar chart |
| Compare continuous across groups | Continuous × Categorical | Grouped box plot, violin plot, bar chart with error bars |
| Compare proportions across groups | Binary × Categorical | Grouped bar chart, stacked bar chart |
| Relationship between two continuous variables | Continuous × Continuous | Scatter plot, with optional regression line |
| Distribution of ordinal across groups | Ordinal × Categorical | Stacked bar chart (100%), diverging bar chart |
| Change over time | Continuous × Time | Line chart |

Some Prompting Tips

These patterns come from real data analysis workflows. Each addresses a specific failure mode in how LLMs generate code.

1. State that objects already exist, with names and structure

Why: LLMs are trained on tutorials and documentation where code is self-contained. Without explicit context, they default to generating pd.read_csv(), dummy data, or invented variable names, wasting tokens and requiring manual cleanup.

The dataframe is called merged_df (n=4,800 rows).
It's already loaded in memory.
Columns include: participant_id, age, gender, race_ethnicity, phq9_total

Don't include pd.read_csv() or any data loading code.
Don't simulate or create example data.

2. Request minimal, localized edits (not full rewrites)

Why: LLMs optimize for coherent, complete responses. When you paste code asking for a fix, rewriting everything feels more “complete” than a surgical edit. But rewrites introduce new bugs, break working logic, and make diffs unreadable.

In this code, revise only the _extract_age_data method.
Keep the function signature and return type.
Don't modify any other methods in the class.
Don't change existing imports.

For small fixes:

Change only lines 127-130.
The rest of the code is working. Don't rewrite it.

3. Specify your environment and what’s already imported

Why: LLMs don’t know if you’re in Jupyter, Colab, VS Code, or a production server. Without this context, they hedge by including all possible setup code, authentication, imports, drive mounting, most of which you don’t need.

Environment: Jupyter Notebook in Google Colab
Already imported: pandas as pd, numpy as np, matplotlib.pyplot as plt
Google Drive is mounted at /content/drive

4. Provide codebook details for variable manipulation

Why: This is where LLM errors are most dangerous. If you ask to “recode income as binary,” the LLM will guess at cutoffs, miss special codes (like 99 for “refused”), and produce code that runs without errors but generates wrong results.

Variable: household_income
Type: Ordinal (integer codes)
Current values:
  1 = Less than $10,000
  2 = $10,000-$19,999
  3 = $20,000-$29,999
  4 = $30,000-$39,999
  5 = $40,000 or more
  99 = Refused (treat as missing)

Create binary variable low_income:
  1 if household_income <= 2
  0 if household_income >= 3
  NaN if household_income == 99

5. Localize debugging with line numbers and error messages

Why: Vague debugging requests (“why doesn’t this work?”) activate the LLM’s tendency to speculate about multiple possible causes. You get a list of things that might be wrong instead of a fix for what is wrong.

Line 127 raises KeyError: 'Spanish_Total'
I verified the column exists with df.columns.tolist()

The dataframe has columns: English_Total, Spanish_Total, Other_Total
Show me the most likely cause and a minimal fix.

6. Build in fallback logic for messy data

Why: LLMs generate code for clean, well-structured data. Real-world data, especially from OCR, web scraping, or inconsistent sources, has variations that break brittle patterns. Asking for fallbacks upfront prevents code that works on your test case but fails on edge cases.

This regex needs to handle OCR variations.
If the primary pattern fails, try a more permissive pattern.
If both fail, return None instead of raising an error.

Add logging to track which pattern succeeded.
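
A sketch of what that pattern looks like in practice, using a hypothetical age-extraction task (the field name and patterns here are illustrative):

import logging
import re

logger = logging.getLogger(__name__)

# Primary pattern expects a clean "Age: 42" field; the fallback tolerates
# common OCR variations (extra whitespace, colon misread as semicolon, case).
PRIMARY = re.compile(r"Age:\s*(\d{1,3})")
FALLBACK = re.compile(r"age\s*[:;]?\s*(\d{1,3})", re.IGNORECASE)

def extract_age(text):
    match = PRIMARY.search(text)
    if match:
        logger.debug("primary pattern matched")
        return int(match.group(1))
    match = FALLBACK.search(text)
    if match:
        logger.info("fallback pattern matched")
        return int(match.group(1))
    logger.warning("no age pattern matched")
    return None  # fail soft instead of raising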

7. Demand structured output

Why: LLMs are trained to be helpful, which means explaining their reasoning. For code generation, this produces prose you have to scroll past. Explicit format constraints eliminate the noise and use fewer computing resources (reducing environmental costs).

Code only.
No explanation before or after.
No markdown formatting or backticks.
No comments unless they clarify non-obvious logic.

8. Attach reference images for visualization style

Why: Describing visual style in words is inefficient and imprecise. “Clean, minimalist, with a blue color scheme” could mean many things. An image communicates color palette, spacing, font choices, axis formatting, and layout in one glance.

[Attach image of visualization]

Recreate this style for my data:
- X-axis: race_ethnicity (5 categories)
- Y-axis: prevalence (%) with 95% CI error bars
- Dataframe: summary_df

Match the color scheme, font style, axis formatting, and spacing.

9. Export results to CSV/XLSX: don’t let LLMs format statistics

Why: When you ask an LLM to report statistical results directly, it may hallucinate numbers, round incorrectly, or misplace decimal points. Instead, ask for code that exports results to a structured file. You then format the table yourself to match journal requirements (APA, AMA, etc.).

Run logistic regression and export results to CSV:
- Include: variable names, OR, 95% CI lower, 95% CI upper, p-value
- Round OR and CI to 2 decimal places
- Round p-value to 3 decimal places
- Save as 'logistic_results.csv'

Code only.

This gives you a clean data file you can format in Excel or import into your manuscript, with values you can verify against the model output.
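
A sketch of the export step, using a hypothetical model formula (the outcome and predictor names below are placeholders for your own variables):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical model: service use predicted by income, insurance, and need.
res = smf.logit("used_services ~ income_category + insured + need_score",
                data=merged_df).fit(disp=False)

# Exponentiate coefficients and CI bounds to get odds ratios, then export.
ci = res.conf_int()
results = pd.DataFrame({
    "OR": np.exp(res.params).round(2),
    "CI_lower": np.exp(ci[0]).round(2),
    "CI_upper": np.exp(ci[1]).round(2),
    "p_value": res.pvalues.round(3),
})
results.index.name = "variable"
results.to_csv("logistic_results.csv")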

10. Verify outputs: LLMs hallucinate confidently

Why: LLM-generated code often looks correct but contains subtle errors: invented variable names, wrong value codes, inappropriate methods. The code runs without errors, but the results are wrong. Verification isn’t optional.
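
A few lines of checking catch most of these problems. A sketch, reusing names from the earlier examples (swap in your own variables):

import pandas as pd

# Cross-tabulate original codes against the derived variable to confirm the
# cutoff; pairs where low_income is NaN (the 99s) are dropped by crosstab.
print(pd.crosstab(merged_df["household_income"], merged_df["low_income"]))

# Confirm every 99 became missing.
print(merged_df.loc[merged_df["household_income"] == 99, "low_income"].isna().all())

# Confirm the scale score is missing whenever any item is missing.
phq_items = [f"phq9_{i}" for i in range(1, 10)]
print(merged_df[phq_items].isna().any(axis=1).sum(), "rows with at least one missing item")
print(merged_df["phq9_total"].isna().sum(), "rows with missing phq9_total")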
