
Spout - Data Dictionary Management

Spout

The Brigham and Women's Program of Sleep Medicine uses the Spout tool to maintain data dictionaries for all of its datasets. Spout lets many people edit the same data dictionary at the same time. It also provides a mechanism for enforcing data dictionary standards, and lets the group test the robustness and accuracy of a data dictionary against the dataset it describes. Additionally, Spout generates the variable graphs seen on the NSRR website.

The Spout tool takes a CSV data dictionary and deconstructs it into a series of files in the JSON file format. Each variable and each domain has its own JSON file, so a change to a variable touches only that variable's file. Spout can also generate new tests and validations that are specific to the dataset.
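
As a rough illustration of the deconstruction step, the Ruby sketch below splits a small CSV data dictionary into one JSON file per variable. It is not Spout's own importer: the column names (id, display_name, type, domain), the variables folder, and the helper split_dictionary are assumptions made for this example.

```ruby
require 'csv'
require 'json'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper: writes one JSON file per CSV row into <root>/variables.
# The column names are assumed for illustration, not taken from Spout itself.
def split_dictionary(csv_text, root)
  folder = File.join(root, 'variables')
  FileUtils.mkdir_p(folder)
  CSV.parse(csv_text, headers: true).map do |row|
    # Drop empty cells so each JSON file only lists attributes that are set.
    variable = row.to_h.reject { |_key, value| value.nil? || value.strip.empty? }
    path = File.join(folder, "#{variable['id']}.json")
    File.write(path, JSON.pretty_generate(variable) + "\n")
    path
  end
end

sample_csv = <<~ROWS
  id,display_name,type,domain
  age,Age at enrollment,integer,
  gender,Gender,choices,gender
ROWS

Dir.mktmpdir do |root|
  split_dictionary(sample_csv, root).each do |path|
    puts "#{File.basename(path)}:"
    puts File.read(path)
  end
end
```

Because every variable lives in its own small text file, an edit to one variable shows up in version control as a change to that file alone.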

Official usage instructions and documentation for Spout are available on the Spout GitHub repository.

What problems does Spout address?

  • Research groups storing their data dictionaries as Word Documents
  • Maintaining large CSV-based data dictionaries
  • Providing documentation for researchers and statisticians

What problems exist with single file dictionaries?

  • Lack of tracking changes as they are made to the data dictionary
  • Lack of ability for multiple users to make changes to improve the data dictionary
  • Inability to maintain consistency across definitions
  • Inability to test correctness of definitions in data dictionary
  • Lack of consistency in defining value domains that are used commonly across variables

How are these problems solved by Spout?

  • Allows users to work across multiple files
    • Creates a folder hierarchy of JSON files that are generally hosted in a Git repository
    • Git is exceptional at storing history for single files that are in text format (JSON is a type of text format)
  • Facilitates creating value domains for variables
  • Defines types of variables for consistency across data dictionaries
  • JSON object format is easily extensible for future needs, and can express more complex relationships
    • Ex: A variable can have multiple labels that are enumerated strings in a JSON array
  • Ability to reconstruct the JSON repository into a single CSV for sharing with statisticians and researchers
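
The reverse direction, rebuilding a single CSV from the per-variable JSON files, can be sketched in a few lines of Ruby. This is an illustration only, not Spout's exporter: the column list, the helper variables_to_csv, and the choice of ";" to join the labels array are assumptions.

```ruby
require 'csv'
require 'json'

# Assumed column order for the exported CSV (hypothetical, for illustration).
EXPORT_COLUMNS = %w[id display_name type domain labels].freeze

# Hypothetical helper: flattens parsed variable definitions into CSV text.
# Array values, such as labels, are joined with ';' so they fit in one cell.
def variables_to_csv(variables)
  CSV.generate do |csv|
    csv << EXPORT_COLUMNS
    variables.sort_by { |variable| variable['id'] }.each do |variable|
      row = EXPORT_COLUMNS.map do |column|
        value = variable[column]
        value.is_a?(Array) ? value.join(';') : value
      end
      csv << row
    end
  end
end

# Two variable definitions as they might appear in separate JSON files.
json_files = [
  '{"id": "gender", "display_name": "Gender", "type": "choices", "domain": "gender", "labels": ["demographics", "sex"]}',
  '{"id": "age", "display_name": "Age at enrollment", "type": "integer"}'
]

puts variables_to_csv(json_files.map { |text| JSON.parse(text) })
```

Attributes a variable does not define are written as empty cells, so every row has the same number of columns.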

What additional features does Spout provide?

  • Ability to add validations, written in Ruby and extensible through test placeholders: spout new <project-name>
    • spout new creates folders for testing, and provides a framework for updating Spout and for online testing with Travis-CI
  • Built-in validations for uniqueness, variable type, and domain relationships: spout test
  • Ability to verify completeness of data dictionary by comparing to a dataset: spout coverage
    • spout coverage guides a user building a data dictionary from a dataset and highlights missing variables, undefined variable domains, and other inconsistencies
  • Integration with Ruby, GitHub, and Travis-CI designed from the ground up to leverage continuous testing environments
    • Promotes high degree of confidence in reviewing changes, and merging new changes in with existing datasets
  • Integration into web frameworks to leverage tightly coupled datasets and their corresponding data dictionaries
  • Documented Standards and Best Practices for variable names, descriptions, and other attributes that are described in the data dictionary for cohesiveness across data dictionaries
  • Ability to generate simple PNG graphs for all variables defined in the dataset: spout graph
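
To make the three kinds of built-in validation more concrete (uniqueness, variable type, and domain relationships), here is a minimal Ruby sketch. It does not reproduce Spout's test suite: the helper validate_dictionary, the list of allowed types, and the shape of the variable and domain hashes are all assumptions for this example.

```ruby
# Assumed set of variable types; Spout's actual list may differ.
ALLOWED_TYPES = %w[identifier choices integer numeric string text date time file].freeze

# Hypothetical helper: returns a list of error messages (empty when valid).
def validate_dictionary(variables, domains)
  errors = []
  domain_ids = domains.map { |domain| domain['id'] }

  # Uniqueness: each variable id may appear only once.
  variables.group_by { |variable| variable['id'] }.each do |id, group|
    errors << "duplicate variable id: #{id}" if group.size > 1
  end

  variables.each do |variable|
    id = variable['id']
    type = variable['type']
    domain = variable['domain']

    # Variable type: must be one of the allowed types.
    errors << "#{id}: unknown type #{type.inspect}" unless ALLOWED_TYPES.include?(type)

    # Domain relationships: choices need a domain, and it must be defined.
    if type == 'choices' && domain.nil?
      errors << "#{id}: choices variable needs a domain"
    elsif domain && !domain_ids.include?(domain)
      errors << "#{id}: domain #{domain.inspect} is not defined"
    end
  end
  errors
end

domains = [{ 'id' => 'gender' }]
variables = [
  { 'id' => 'age', 'type' => 'integer' },
  { 'id' => 'gender', 'type' => 'choices', 'domain' => 'gender' },
  { 'id' => 'race', 'type' => 'choices', 'domain' => 'race' }
]

# Only failures are printed; variables that pass stay silent.
validate_dictionary(variables, domains).each { |message| puts "FAIL #{message}" }
```

Returning messages instead of raising on the first problem lets a single run report every failing variable at once.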

Spout in Practice

Running Command Line Tests on the SHHS Data Dictionary

Spout is available as a Ruby Gem. With Ruby installed, running the tests takes a single command: spout test. The test suite runs its validations in under two seconds for a data dictionary of 1,800+ variables and 70+ domains. The Spout command-line testing interface suppresses passing tests, shows failing tests, and provides expected solutions.

Continuous Integration Testing on Travis-CI for the SHHS Data Dictionary

Spout is designed to be tightly integrated with GitHub and Travis-CI from the ground up. Travis-CI tests keep track of changes made by data dictionary contributors, and email team members when a failing test is introduced into the data dictionary. You can view live test results for the SHHS Data Dictionary here: https://travis-ci.org/sleepepi/shhs-data-dictionary

Running a Dataset Coverage Report on CHAT Data Dictionary

Spout provides the spout coverage command to help visualize inconsistencies between the data dictionary and the underlying dataset. The coverage report is being used to help construct a robust data dictionary for the CHAT dataset. It is an essential tool for developing data dictionaries because it gives the user feedback on problem areas in the dataset and information about potential solutions. Our group has used the coverage report to identify data outliers, domain inconsistencies, and variable naming mistakes before creating a release candidate for our tightly coupled datasets and data dictionaries.
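
The kind of comparison a coverage report performs can be sketched as follows. This is a simplified illustration, not the output or implementation of spout coverage: the helper coverage_report, the report keys, and the in-memory shapes of the dictionary (variable id mapped to a domain id, domain id mapped to allowed values) are assumptions.

```ruby
require 'csv'

# Hypothetical helper: compares dataset columns and values with a dictionary.
# `variables` maps variable id => domain id (or nil);
# `domains` maps domain id => array of allowed values.
def coverage_report(dataset_csv, variables, domains)
  table = CSV.parse(dataset_csv, headers: true)
  report = {
    undefined_columns: table.headers - variables.keys, # in dataset, not in dictionary
    unused_variables: variables.keys - table.headers,  # in dictionary, not in dataset
    values_outside_domain: {}
  }
  variables.each do |id, domain_id|
    next unless domain_id && table.headers.include?(id)

    allowed = domains.fetch(domain_id, [])
    unexpected = table[id].compact.uniq - allowed
    report[:values_outside_domain][id] = unexpected unless unexpected.empty?
  end
  report
end

dataset = <<~ROWS
  id,age,gender,bmi
  1,34,m,22.1
  2,41,f,27.8
  3,29,x,
ROWS

variables = { 'id' => nil, 'age' => nil, 'gender' => 'gender', 'height' => nil }
domains = { 'gender' => %w[m f] }

coverage_report(dataset, variables, domains).each do |check, findings|
  puts "#{check}: #{findings.inspect}"
end
```

In this toy dataset the report flags a column with no dictionary entry (bmi), a dictionary variable missing from the data (height), and a gender value that falls outside its domain (x).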

Interested in using Spout for your own data dictionaries?

If you have any questions, you can email us at: support@sleepdata.org