Inconsistency of race coding?

1 post
Adam Omidpanah · over 1 year ago


I see in the SHHS Phase 1 MOO that race is coded fairly closely to NIH standards: 1-white, 2-black, 3-American Indian/Alaska Native, 4-Asian, etc. (pp. 119, 127 of the PDF). However, in the file shhs1-dataset-0.8.0.csv, race takes only 3 levels: 1-white, 2-black, 3-other.

Have I accessed a limited use dataset? Where do I obtain proper race codings?

Is there a way to identify the parent cohort of the SHHS participants?

104 posts
mrueschman · over 1 year ago


Thanks -- good question. You also asked about versioning in your email, so I am going to post my reply here about both that issue and the race issue you note.

The idea behind our versioning is that the most recent version (0.8.0 for SHHS) would be the “latest and greatest” and would be our suggested starting point for new analyses. From 0.3.0 and onward we broke the dataset into separate CSVs per visit, which would explain why 0.2.0 has more observations in its single file than the files that came later. Also around the switch from 0.2.0 to 0.3.0 we received updated data from the dataset owner (Johns Hopkins in this case) that added more cases to our “CVD Outcomes” dataset. We took down 0.3.0 because it contained records for SHHS subjects that did not consent to share data for future research.

Yes, the race data were collapsed into 3 categories by the SHHS dataset owners, which explains the difference between 0.2.0 and 0.4.0+. Our NSRR data mimic what is posted on BioLINCC; our 0.2.0 version of the data came from a preliminary BioLINCC dataset that did not yet have the race variable change incorporated.

Technically one could look back to the older dataset (possibly merging with a newer version) to get the race variable with more fine-grained categories, but we have not carried these data forward into subsequent releases since this is how the dataset owners have immortalized the dataset on BioLINCC. My best guess is that this change was made to more closely match a quasi-standard of how race is presented in BioLINCC datasets. Most datasets that I have seen from BioLINCC have this Black/White/Other breakdown.
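The "look back and merge" idea above could be sketched roughly as follows. This is only an illustration on made-up rows: the column names `id` and `race`, the helper `attach_old_race`, and all values are hypothetical, and it assumes the participant identifier is the same in both releases, which this thread does not confirm.

```python
import csv
import io

# Toy stand-ins for an older and a newer release. The column names
# ("id", "race") and every value here are invented for illustration;
# check the real data dictionaries before adapting this.
OLD_CSV = """id,race
1,1
2,4
3,5
"""
NEW_CSV = """id,race
1,1
2,3
3,3
4,2
"""


def read_rows(text):
    """Parse CSV text into a list of dicts (all values stay strings)."""
    return list(csv.DictReader(io.StringIO(text)))


def attach_old_race(new_rows, old_rows, key="id"):
    """Left-join the older race code onto the newer rows by `key`."""
    old_race = {row[key]: row["race"] for row in old_rows}
    merged = []
    for row in new_rows:
        out = dict(row)
        # None marks participants that are absent from the older file.
        out["race_old"] = old_race.get(row[key])
        merged.append(out)
    return merged


merged = attach_old_race(read_rows(NEW_CSV), read_rows(OLD_CSV))
for row in merged:
    print(row)
```

Because the older release was a single file covering more than one visit, a real version of this would likely need to restrict the older rows to one row per participant first; the dictionary lookup here silently keeps the last row seen for each identifier.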

As for your other question about the parent cohorts: there is no way to identify the parent cohort of SHHS participants from the NSRR datasets. These links were explicitly removed by the dataset owners as part of the de-identification process when posting on BioLINCC. I believe that if you went through BioLINCC to request and obtain access to the parent cohorts (e.g., Framingham, ARIC), they may grant access to the linking codes (a lookup table with IDs across different data sources).

Hope this helps. Thanks!

7 posts
Matthew Butler · over 1 year ago

I have a follow-up on this. As Adam wrote above, there were extra race categories in earlier versions of the SHHS1 dataset. Collapsing the categories seems OK to me, but I'm more worried about the new ethnicity variable in the newer SHHS datasets. It makes it seem as though the data were collected under modern NIH conventions of asking about race (Black/White/Asian/Native American/etc.) AND ethnicity (Hisp / Non-Hisp). There were 280 with race = Hisp in 0.2.0, and there are 280 with ethnicity = Hisp in 0.8.0. There were 4907 with race = white in 0.2.0, and there are 4907 with race = white in 0.8.0. This means that in the latest dataset, all subjects coded as ethnicity = Hispanic are also coded as non-white. That is an accurate reflection of the original dataset, but not an accurate representation of how ethnicity is currently defined, where ethnic Hispanics can choose among different races. What was the reason for breaking up the race category into a race and an ethnicity category?
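One way to check this kind of overlap is to cross-tabulate the two variables and look at the white/Hispanic cell. The sketch below uses a handful of invented rows with string labels; the field names `race` and `ethnicity`, the labels, and the helper `cross_tabulate` are assumptions for illustration and do not reproduce the actual SHHS coding or counts.

```python
from collections import Counter

# Hypothetical rows standing in for participants; the labels are
# illustrative and are not the actual coded values in the dataset.
rows = [
    {"race": "white", "ethnicity": "non-hisp"},
    {"race": "white", "ethnicity": "non-hisp"},
    {"race": "black", "ethnicity": "non-hisp"},
    {"race": "other", "ethnicity": "hisp"},
    {"race": "other", "ethnicity": "hisp"},
    {"race": "other", "ethnicity": "non-hisp"},
]


def cross_tabulate(rows, a="race", b="ethnicity"):
    """Count participants in each (a, b) combination."""
    return Counter((row[a], row[b]) for row in rows)


table = cross_tabulate(rows)
for (race, ethnicity), n in sorted(table.items()):
    print(f"{race:6s} {ethnicity:9s} {n}")

# A count of zero here is the pattern described above: nobody is
# coded as both white and Hispanic. Counter returns 0 for missing keys.
hisp_white = table[("white", "hisp")]
print("white and Hispanic:", hisp_white)
```

An empty white/Hispanic cell in the real data would be consistent with the ethnicity variable having been derived from the old single race question rather than asked separately.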

104 posts
mrueschman · over 1 year ago

Matt: That's a great observation. I can imagine a new user to the SHHS data making an incorrect assumption based on how the data are now presented (as separate race/ethnicity variables).

At the least, I will have us add better descriptions to the race and ethnicity variables on NSRR that describe the recoding that was done. As for the reason, I will pose that question to Jill at the SHHS Coordinating Center, with whom we were already conversing about a separate issue.
