SQL Tutorial: How to Obtain Clinical Notes in MIMIC-III datasets (for Beginners)

Why is MIMIC-III dataset useful?

MIMIC-III Clinical Notes Introduction Clinical NLP Series #NLP

For people interested in utilizing clinical notes for the research, the MIMIC-III dataset is likely to be the only publicly available dataset. Even though the MIMIC-IV dataset (which is more up-to-date) includes clinical notes, it does not include general notes. Other clinical note datasets are not open to the public (e.g., UCSF clinical data). Therefore, NLP studies on most clinical notes are conducted using the MIMIC-III dataset.

The slides in this presentation are a great help in understanding MIMIC-III: intro_to_mimic_db.pdf (pitt.edu)

In MIMIC-III, there are notes from a variety of categories, including those from nursing, radiology, physician, and social work.

To obtain access to the MIMIC-III dataset, this post explains the process in details: Getting access to MIMIC III hospital database for data science projects | by Andrew Long | Medium

How to extract clinical notes from the MIMIC-III dataset

After obtaining access to the dataset, we need to extract the dataset for us to analyze. For data extraction, my preference is Google BigQuery. For a beginner, this is the easiest way to start. Here is the tutorial from the MIMIC-III dataset on how to use Google BigQuery: BigQuery | MIMIC (mit.edu)

After logging in to the physionet, you can see the list of links to access the files on this page. To use Google BigQuery (SQL), please click request access to using Google BigQuery.

You will shortly receive an email from physionet on your Google Big Query access.

If you click the link, you can see the window of Google Big Query (Google Cloud). You can see the list of sub-datasets of MIMIC-III in the explorer window. After choosing the noteevents table in Explorer, you can see some sample notes in the preview window.

  CATEGORY='Social Work' OR CATEGORY='Case Management ';

I used the SQL codes to extract social work and case management notes. You can quickly learn how to write simple SQL code for data extraction here: SQL example statements for retrieving data from a table (iu.edu)

After extracting the selected dataset on SQL, you can save it in a different format. I recommend CSV (Google Drive) or JSONL.


Even though the notes are created in the ICU setting, it seems there are ways to extract behavioral-health-related notes. I am attaching resources for phenotyping (ref: phenotyping in Clinical NLP).

Phenotype Annotations for Patient Notes in the MIMIC-III Database v1.20.03 (physionet.org)

Another method is to utilize the ICD-9 code in the MIMIC-III dataset. Each note has an ICD-9 code in the MIMIC dataset, but you need to link the noteevents database and the ICD code from another dataset on SQL). The following lecture introduces the ICD-9 code in the MIMIC-III dataset.

ICD-9 and MIMIC-III – International Classification of Disease System | Coursera

  • November 29, 2022