Missing values in shhs1-dataset-0.7.0.cs

[-] Alexander Tataraidze +0 points · over 1 year ago


This isn't a question, just a small note. Maybe it will be useful to someone.

I need to select records in shhs1 with AHI < 5. There is no AHI variable in shhs1-dataset-0.7.0.csv, but we can calculate it as AHI = cai4p + oahi.

cai4p - https://sleepdata.org/datasets/shhs/variables/cai4p, oahi - https://sleepdata.org/datasets/shhs/variables/oahi.
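A small sketch of the selection step, on made-up numbers rather than real SHHS values (the `ahi` column name and the `toy` frame are just for illustration). In pandas, a sum with a missing term is missing, and a missing value never passes a `< 5` comparison, so records with unknown cai4p or oahi drop out of the selection silently:

```python
import numpy as np
import pandas as pd

# Toy values, NOT taken from the SHHS data.
toy = pd.DataFrame({
    'cai4p': [0.5, np.nan, 1.0],
    'oahi': [2.0, 1.0, 9.0],
})

# AHI is missing (NaN) whenever either term is missing.
toy['ahi'] = toy['cai4p'] + toy['oahi']

# NaN < 5 evaluates to False, so the second record is excluded too.
low_ahi = toy[toy['ahi'] < 5]
print(len(low_ahi))
```

This is why the count of missing values matters before filtering on AHI.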

However, there are 698 unknown oahi values and 1398 unknown cai4p values. They are defined as follows:

oahi = 60 * ( hrembp4 + hrop4 + hnrbp4 + hnrop4 + oarbp + oarop + oanbp + oanop ) / slpprdp,

cai4p = 60 * ( carbp4 + carop4 + canbp4 + canop4 ) / slpprdp.

So we would expect at least one of these component variables to be unknown whenever oahi or cai4p is unknown, but none of them are. Thus, we can calculate cai4p, oahi and AHI for every record in shhs1. Here is Python code for it:

import pandas as pd

data = pd.read_csv('shhs1-dataset-0.7.0.csv')

# Missing values in the indices as distributed
print('Number of missing values in cai4p', data['cai4p'].isnull().sum())
print('Number of missing values in oahi', data['oahi'].isnull().sum())

# Recompute both indices from their components (events per hour of sleep)
cai4p = 60 * data[['CAROP4', 'CARBP4', 'CANBP4', 'CANOP4']].sum(axis=1) / data['SlpPrdP']
oahi = 60 * data[['HREMBP4', 'HROP4', 'HNRBP4', 'HNROP4', 'OARBP', 'OAROP', 'OANBP', 'OANOP']].sum(axis=1) / data['SlpPrdP']

# Missing values in the recomputed indices
print('Number of missing values in recomputed cai4p', cai4p.isnull().sum())
print('Number of missing values in recomputed oahi', oahi.isnull().sum())

AHI = cai4p + oahi
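One caveat, offered as a sketch and not as a verified finding about the dataset: pandas `sum` skips NaN by default, so a row sum treats a missing component as 0 and comes out non-missing even when a component is unknown. Passing `skipna=False` makes a missing component propagate. The toy frame below is made up, and its lowercase column names are an assumption; the real CSV may use a different case.

```python
import numpy as np
import pandas as pd

# Toy data, NOT real SHHS values; column names are assumed.
toy = pd.DataFrame({
    'carbp4': [1.0, 0.0, np.nan],
    'carop4': [0.0, 2.0, 1.0],
    'canbp4': [1.0, 0.0, 0.0],
    'canop4': [0.0, 0.0, 1.0],
    'slpprdp': [120.0, 240.0, 360.0],  # sleep period in minutes
})
central = ['carbp4', 'carop4', 'canbp4', 'canop4']

# Default: NaN components are skipped, i.e. counted as 0.
lenient = 60 * toy[central].sum(axis=1) / toy['slpprdp']

# Strict: any NaN component makes the whole index NaN.
strict = 60 * toy[central].sum(axis=1, skipna=False) / toy['slpprdp']

print('missing with default sum:', lenient.isnull().sum())
print('missing with skipna=False:', strict.isnull().sum())
```

Running the strict version on the real file would confirm whether the components are truly complete.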

[-] mrueschman +1 point · over 1 year ago


Thanks for raising this issue -- it is an important one. There is a bit of documentation missing that would have helped you understand the missingness in cai4p and oahi. These variables have been filtered and many values have been censored from the dataset. The bigger issue is that we don't have documentation on sleepdata.org that describes which filters have been applied and to which variables. For SHHS, we are mostly in the dark because the original (filtered) analytic datasets were generated 20 years ago and I have not come across the data processing code to know exactly what was done. The task of reverse engineering all the filters and making them known somehow has been on my back burner for a while now.

Based on prior experience, I made an educated guess that cai4p was filtered by chstqual (quality of chest signal) and abdoqual (quality of abdomen signal), and this seems to be correct. The signal quality variables in SHHS1 run from 1 (lowest) to 4 (highest), and some quick tinkering led me to this formula:

if chstqual in (3,4) and abdoqual in (3,4) then cai4p_new = 60 * ( carbp4 + carop4 + canbp4 + canop4 ) / slpprdp;

cai4p_new then has 4,406 valid values and 1,398 missing values, like the cai4p variable you are working with.
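For Python users, the same filter could be sketched in pandas roughly as follows. This is an illustration on made-up values, not the original processing code, and the lowercase column names are an assumption about how the CSV is labeled.

```python
import numpy as np
import pandas as pd

# Toy data, NOT real SHHS values; column names are assumed.
toy = pd.DataFrame({
    'chstqual': [4, 2, 3, np.nan],
    'abdoqual': [3, 4, 1, 4],
    'carbp4': [1, 0, 2, 1],
    'carop4': [0, 1, 0, 0],
    'canbp4': [1, 0, 0, 0],
    'canop4': [0, 0, 1, 1],
    'slpprdp': [120, 240, 360, 120],  # sleep period in minutes
})

# Keep a record only if both effort signals are rated 3 or 4.
# A missing quality rating fails isin() and is therefore censored.
good = toy['chstqual'].isin([3, 4]) & toy['abdoqual'].isin([3, 4])

# Unfiltered central apnea index (events per hour of sleep).
raw = 60 * toy[['carbp4', 'carop4', 'canbp4', 'canop4']].sum(axis=1) / toy['slpprdp']

# where() keeps values on rows where `good` is True and sets the rest to NaN.
cai4p_new = raw.where(good)

print('valid:', cai4p_new.notnull().sum(), 'missing:', cai4p_new.isnull().sum())
```

Comparing the missing count from such a filter against the distributed cai4p is the check described above.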

These filters were applied with the mindset of only retaining AHI values where the corresponding scoring signals (e.g. effort channels for indices of central sleep apnea) were of good or better quality. I will work with my colleagues here to try to prioritize writing some documentation that describes this (currently) "hidden" filtering and/or reverse engineering some of these filters and presenting the filtering code alongside the calculation.

Thanks for checking out the site and bringing this topic to the forum!

[-] Alexander Tataraidze +0 points · over 1 year ago

Thanks a lot for the comprehensive answer!
