The Spout tool is used by the Brigham and Women's Program of Sleep Medicine to maintain data dictionaries for all of its datasets. Spout lets multiple people edit the same data dictionary simultaneously. It also provides a mechanism for enforcing data dictionary standards, and allows the group to test the robustness and accuracy of a data dictionary against the dataset it describes. Additionally, Spout is used to generate the variable graphs seen on the NSRR website.
The Spout tool takes a CSV data dictionary and deconstructs it into a series of files in the JSON file format. Each variable and each domain has its own JSON file, so a change to a variable touches only that variable's file. Spout also provides the ability to generate new tests and validations that are specific to the dataset.
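To make the one-file-per-variable idea concrete, here is a minimal Python sketch of splitting a CSV data dictionary into JSON files. It is an illustration only, not Spout's own code (Spout is a Ruby gem). The column names (id, display_name, type, units, domain), the variables folder name, and the split_dictionary helper are all assumptions made for this example.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

# Hypothetical CSV data dictionary; the column names are assumptions.
SAMPLE_CSV = """id,display_name,type,units,domain
age,Age at visit,numeric,years,
gender,Gender,choices,,gender
"""


def split_dictionary(csv_text, out_dir):
    """Write one JSON file per variable row and return the paths written."""
    variables_dir = Path(out_dir) / "variables"
    variables_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Drop empty cells so each file only holds the fields that are set.
        record = {key: value for key, value in row.items() if value}
        path = variables_dir / f"{record['id']}.json"
        path.write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
        written.append(path)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in split_dictionary(SAMPLE_CSV, tmp):
            print(path.name)
```

Because each variable lives in its own file, two contributors editing different variables never touch the same file, which keeps version-control merges simple.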
Official usage instructions and documentation for Spout are available on the Spout GitHub repository.
What problem does Spout address?
What problems exist with single file dictionaries?
How are these problems solved by Spout?
What additional features does Spout provide?
spout new <project-name>
Running Command Line Tests on the SHHS Data Dictionary
Spout is available as a Ruby gem. With Ruby installed, running the tests takes a single command: spout test. The test suite runs its validations in under two seconds for a data dictionary of 1,800+ variables and 70+ domains. The Spout command-line testing interface suppresses passing tests, shows failing tests, and provides expected solutions.
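The kind of validation described above can be sketched in Python as follows. This is not Spout's test suite: the check_domains helper, the choices type name, and the message format are assumptions chosen for illustration. The sketch checks that every variable of type choices references a domain that exists, stays silent about passing variables, and suggests the valid options for each failure.

```python
def check_domains(variables, domains):
    """Return one message per failing variable; passing variables produce nothing."""
    failures = []
    known = sorted(domains)
    for variable in variables:
        # Only variables with a fixed set of choices need a domain.
        if variable.get("type") != "choices":
            continue
        domain = variable.get("domain")
        if domain not in domains:
            failures.append(
                f"{variable['id']}: domain {domain!r} not found; "
                f"expected one of {known}"
            )
    return failures


# Hypothetical dictionary contents; "rcae" is a deliberate typo.
VARIABLES = [
    {"id": "age", "type": "numeric"},
    {"id": "gender", "type": "choices", "domain": "gender"},
    {"id": "race", "type": "choices", "domain": "rcae"},
]
DOMAINS = {"gender": ["1", "2"], "race": ["1", "2", "3"]}

if __name__ == "__main__":
    for message in check_domains(VARIABLES, DOMAINS):
        print(message)
```

Reporting only failures keeps the output short even for a dictionary with thousands of variables.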
Continuous Integration Testing on Travis-CI for the SHHS Data Dictionary
Spout is designed to be tightly integrated with GitHub and Travis-CI from the ground up. Travis-CI tests keep track of changes made by data dictionary contributors, and email team members when a failing test is introduced into the data dictionary. You can view live test results for the SHHS Data Dictionary here: https://travis-ci.org/sleepepi/shhs-data-dictionary
Running a Dataset Coverage Report on the CHAT Data Dictionary
Spout provides the spout coverage command to help visualize inconsistencies between the data dictionary and the underlying dataset. The coverage report seen above is being used to help construct a robust data dictionary for the CHAT dataset. The coverage report is an essential tool for developing data dictionaries: it gives the user feedback on problem areas in the dataset and points toward potential solutions. Our group has used the coverage report to identify data outliers, domain inconsistencies, and variable naming mistakes before creating a release candidate for our tightly coupled datasets and data dictionaries.
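The comparison behind a coverage report can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of spout coverage: the coverage function, the report keys, and the sample data are hypothetical. It flags dataset columns with no dictionary entry, dictionary variables absent from the dataset, and values that fall outside a variable's domain.

```python
import csv
import io


def coverage(dataset_csv, variables, domains):
    """Compare a CSV dataset against dictionary variables and domains."""
    reader = csv.DictReader(io.StringIO(dataset_csv))
    columns = reader.fieldnames or []
    known = {variable["id"]: variable for variable in variables}
    report = {
        "columns_missing_from_dictionary": [c for c in columns if c not in known],
        "variables_missing_from_dataset": [v for v in known if v not in columns],
        "values_outside_domain": {},
    }
    for row in reader:
        for column, value in row.items():
            domain = known.get(column, {}).get("domain")
            # Blank cells are treated as missing data, not as domain violations.
            if domain and value and value not in domains.get(domain, []):
                report["values_outside_domain"].setdefault(column, set()).add(value)
    return report


# Hypothetical dataset and dictionary used only for this example.
DATASET = """gender,age,bmi
1,34,22.1
2,51,27.4
9,47,30.0
"""
VARIABLES = [
    {"id": "gender", "type": "choices", "domain": "gender"},
    {"id": "age", "type": "numeric"},
    {"id": "height", "type": "numeric"},
]
DOMAINS = {"gender": ["1", "2"]}

if __name__ == "__main__":
    print(coverage(DATASET, VARIABLES, DOMAINS))
```

In this sample, bmi is a column with no dictionary entry, height is documented but absent from the data, and the gender value 9 is not in its domain, which mirrors the naming mistakes and domain inconsistencies mentioned above.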
Interested in using Spout for your own data dictionaries?
If you have any questions, you can email us at: firstname.lastname@example.org