LibGuides: Research Data Management: Collect, Organize, Document

Data Collection and Organization

Keep track of document versions either sequentially (e.g., v01, v02) or with a unique date and time (e.g., 20140403_1800).

Use: FileNm_Guidelines_20140409_v01.docx
Don’t Use: FileNm_Guidelines_20140409_Review.docx AND FileNm_Guidelines_20140409_Investigation.docx
Why? Because two years from now, you won’t remember what you meant.
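The two versioning styles above can be generated consistently rather than typed by hand. The following is a minimal sketch, not part of the original guide; the function name `versioned_name` and its parameters are hypothetical.

```python
from datetime import datetime
from typing import Optional


def versioned_name(stem: str, ext: str,
                   version: Optional[int] = None,
                   when: Optional[datetime] = None) -> str:
    """Build a file name ending in _vNN (sequential) or _YYYYMMDD_HHMM (timestamp)."""
    if version is not None:
        suffix = f"v{version:02d}"              # zero-padded: v01, v02, ...
    elif when is not None:
        suffix = when.strftime("%Y%m%d_%H%M")   # e.g. 20140403_1800
    else:
        raise ValueError("provide either a version number or a datetime")
    return f"{stem}_{suffix}.{ext}"


print(versioned_name("FileNm_Guidelines_20140409", "docx", version=1))
# FileNm_Guidelines_20140409_v01.docx
print(versioned_name("FileNm_Guidelines", "docx", when=datetime(2014, 4, 3, 18, 0)))
# FileNm_Guidelines_20140403_1800.docx
```

Zero-padding the version number keeps files in the right order when sorted alphabetically (v02 before v10).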

A good file naming system will replace an extensive folder hierarchy. Limit the number of nested folders and strive to make hierarchies as simple as possible.

Use: F:/Env/LIBR/DataMgmt_FileFormats_20140409_v01.docx
Don’t Use: F:/Environment/Library/Woodward/Data/Education/Materials/Draft/2014/04/-DataMgmt_FileFormats_20140409_v01.docx

Why? Because complex folder hierarchies are harder to navigate and offer more opportunities for filing errors. System back-ups may take longer.

From the UBC Guide for Organizing data
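One way to keep an eye on nesting is to count the folders in a path. This is an illustrative sketch only; the function names and the limit of three levels are assumptions, not a rule from the guide.

```python
from pathlib import PurePosixPath

MAX_DEPTH = 3  # assumed threshold for illustration; choose what suits your project


def folder_depth(path: str) -> int:
    """Count the folders between the drive prefix (e.g. 'F:') and the file name."""
    parts = PurePosixPath(path).parts
    # parts[0] is the drive-like prefix, parts[-1] is the file name
    return len(parts) - 2


def is_too_deep(path: str, limit: int = MAX_DEPTH) -> bool:
    """Return True when the path is nested more deeply than the chosen limit."""
    return folder_depth(path) > limit


shallow = "F:/Env/LIBR/DataMgmt_FileFormats_20140409_v01.docx"
deep = ("F:/Environment/Library/Woodward/Data/Education/Materials/"
        "Draft/2014/04/DataMgmt_FileFormats_20140409_v01.docx")

print(folder_depth(shallow), is_too_deep(shallow))  # 2 False
print(folder_depth(deep), is_too_deep(deep))        # 9 True
```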

Recommended file formats

Open (i.e., non-proprietary) file formats are preferred when possible because they can be used by anyone, which helps ensure interoperability and allows others to access and reuse your data in the future. The UK Data Service provides a table of recommended and acceptable file formats for various types of data.

Quantitative tabular data with minimal metadata.

Comma-separated values (CSV) file (.csv)
Tab-delimited file (.tab)

Qualitative textual data.

eXtensible Mark-up Language (XML) text according to an appropriate Document Type Definition (DTD) or schema (.xml).
Rich Text Format (.rtf).
Plain text data, ASCII (.txt).

Digital image data.

TIFF version 6 uncompressed (.tif)

Digital audio data.

Free Lossless Audio Codec (FLAC) (.flac).

Digital video data.

MPEG-4 (.mp4).
OGG video (.ogv, .ogg).
Motion JPEG 2000 (.mj2).

Documentation and scripts.

Rich Text Format (.rtf).
PDF/A or PDF (.pdf).
HTML (.htm).
OpenDocument Text (.odt).
R Markdown files (.rmd) (with HTML version as well).
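For the tabular formats listed above, most tools can export CSV directly. As a minimal sketch, Python's standard `csv` module can also produce it; the column names and values below are invented for illustration.

```python
import csv
import io


def to_csv(rows, fieldnames):
    """Serialize a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical example data
rows = [
    {"participant_id": "P001", "visit_date": "20140409", "score": 12},
    {"participant_id": "P002", "visit_date": "20140410", "score": 15},
]

print(to_csv(rows, ["participant_id", "visit_date", "score"]))
```

To save the result, write the returned text to a file with a .csv extension and UTF-8 encoding.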

It is important to keep track of different copies or versions of files, files held in different formats or locations, and information cross-referenced between files. This process is called 'version control'. Logical file structures, informative naming conventions, and clear indications of file versions all contribute to better use of your data during and after your research project.

File names should contain information (e.g., date stamps, participant codes, version numbers, location, etc.) that helps you sort and search for files and identify the content and right versions of files. Version control means tracking and organizing changes to your data by saving new versions of files you modified and retaining the older versions.

Good data organization practices minimize confusion when changes to data are made across time, from different locations, and by multiple people. Read more on file naming and version control at UBC Library and the UK Data Service.

Here are some recommended conventions:

Record dates in YYYYMMDD format
Use short unique identifiers
Include a summary of content in file name
Use underscores as delimiters
Keep track of document versions sequentially or by date
Make folder hierarchies simple
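The conventions above can be checked automatically. The sketch below assumes one particular pattern, Identifier_Summary_YYYYMMDD_vNN.ext, which is only one of many layouts consistent with the list; the pattern and the function name are hypothetical.

```python
import re
from datetime import datetime

# Assumed layout: underscore-delimited words, then a YYYYMMDD date, then _vNN, then an extension
NAME_PATTERN = re.compile(
    r"^[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*"   # identifier and content summary
    r"_(?P<date>\d{8})"                  # date as YYYYMMDD
    r"_v(?P<version>\d{2})"              # sequential version, e.g. v01
    r"\.[A-Za-z0-9]+$"                   # file extension
)


def follows_convention(name: str) -> bool:
    """Return True if the file name matches the assumed pattern and the date is real."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return False
    try:
        datetime.strptime(match.group("date"), "%Y%m%d")
    except ValueError:
        return False  # eight digits that are not a valid calendar date
    return True


print(follows_convention("DataMgmt_FileFormats_20140409_v01.docx"))   # True
print(follows_convention("FileNm_Guidelines_20140409_Review.docx"))   # False
```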

Data Documentation

Metadata is data about data or “documentation that describes data” (Cornell University). It is “structured data about anything that can be named, such as Web pages, books, journal articles, images, songs, products, processes, people (and their activities), research data, concepts, and services.” (DCMI website). Metadata makes it possible for others to understand how your data was collected, what it means, how to interpret it, and what it can be used for. Documentation involves recording important metadata about the dataset structure and contents.

What documentation (also known as metadata) will be needed for the data to be read, interpreted, and potentially reused correctly in the future?

Documentation involves recording important pieces of information about the dataset structure and contents. Project-level metadata can include basic information about your project (e.g., title, funder, principal investigator, other people involved in the project and their roles, etc.), research design (e.g., background, research questions, aims, artists or artwork informing your project, etc.) and methodology (e.g., description of artistic process and materials, interview guide, transcription process, etc.). Item-level metadata should include basic information about creative outputs and their documentation (e.g., creator, date, subject, copyright, file format, equipment used for documentation, etc.). This information can be entered in a _README file in the root folder of your dataset.

A README is a portable, durable way to provide information to other researchers about how to use your dataset.

A README is a guide to your dataset and is usually a plain text file to maximize its usability and long-term preservation potential. The purpose of a README is to help other researchers understand your dataset, its contents, provenance, and licensing, and how to interact with it. README files are generally named _README, _readme.txt or _read-me.md and are included as a component of a dataset.

A README complements but does not replace the metadata that data repositories ask you to provide when you deposit your data. The best practice is to record information in both the repository’s metadata and the README. The repository’s metadata will support findability within and between data repositories while the README is portable and continues to describe the dataset after it has been separated from its original context. In all cases, you should use any conventions appropriate to your discipline to record the information about your dataset.

Content from the UBC Quick Guide to Creating a README File, version 1.2 (CC-BY)

A good README guide is available from Cornell University.

Downloadable ECU-specific README templates:

Core elements of any README include:

Contact information for the researcher(s)
The use license for your data (unless that is included in a separate file)
The context of your data collection (the goal of your research)
Your data collection methods (protocols, sampling, instruments, coverage, etc.)
The structure of files
Naming conventions for files, if applicable
Your sources used
Your quality assurance work (data validation, checking)
Any data manipulations or modifications
Data confidentiality and permissions
The names of labels and variables
Explanations of codes and classifications

Content from the UBC Quick Guide to Creating a README File, version 1.2 (CC-BY)
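As a starting point, the core elements listed above can be turned into a plain text skeleton to fill in. This is a sketch only and is not one of the official templates; the section wording, the placeholder text, and the function name are assumptions.

```python
# Section headings paraphrased from the list of core README elements
README_SECTIONS = [
    "Contact information",
    "Use license",
    "Context of data collection",
    "Data collection methods",
    "Structure of files",
    "File naming conventions",
    "Sources used",
    "Quality assurance",
    "Data manipulations or modifications",
    "Confidentiality and permissions",
    "Labels and variables",
    "Codes and classifications",
]


def readme_skeleton(title: str, sections=README_SECTIONS) -> str:
    """Return a plain text README outline with one placeholder per section."""
    lines = [title, "=" * len(title), ""]
    for section in sections:
        lines.extend([section.upper(), "[to be completed]", ""])
    return "\n".join(lines)


print(readme_skeleton("Example dataset"))
```

The returned text can be saved as _readme.txt in the root folder of the dataset and completed by hand.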

What is a controlled vocabulary?

Controlled vocabularies are a kind of metadata standard that features a set of expert-curated preferred terms used for indexing or searching within a particular subject domain. Some forms of controlled vocabularies are term lists, authority files, taxonomies, and thesauri.

Controlled vocabulary terms improve search results in two ways:

by connecting synonyms (different words with the same or similar meanings) with the preferred term for a concept, and
by distinguishing homographs (words that are spelled the same but have different meanings), reducing the ambiguity of natural language.

Using controlled vocabularies in the creation of data or metadata supports accuracy, consistency, and interoperability. There are well-established vocabularies for a variety of subjects, including personal and corporate names, geographic names, topics, concepts, resource types and genres, and languages.
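The idea of connecting synonyms to a preferred term can be sketched as a simple lookup. The tiny vocabulary below is invented for illustration and is not taken from any published vocabulary; the function names are hypothetical.

```python
# Hypothetical mini vocabulary: preferred term -> accepted synonyms
VOCABULARY = {
    "Photographs": {"photos", "photo", "pictures"},
    "Motion pictures": {"films", "movies", "cinema"},
}


def build_lookup(vocabulary):
    """Map every preferred term and synonym (case-insensitively) to its preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        lookup[preferred.casefold()] = preferred
        for synonym in synonyms:
            lookup[synonym.casefold()] = preferred
    return lookup


def preferred_term(keyword, lookup):
    """Return the preferred term for a keyword, or None if it is not in the vocabulary."""
    return lookup.get(keyword.strip().casefold())


lookup = build_lookup(VOCABULARY)
print(preferred_term("Movies", lookup))     # Motion pictures
print(preferred_term(" photos ", lookup))   # Photographs
print(preferred_term("sculpture", lookup))  # None
```

Keywords that return None are candidates for manual review against the full vocabulary.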

Examples:

The Library of Congress employs several controlled vocabularies, including the widely used Library of Congress Subject Headings (LCSH).

Content from the KPU Research Data Management Guide.

Examples of Other Metadata Standards & Controlled Vocabularies

SOURCE	CONTENT	URL
Cataloging Cultural Objects (CCO)	describe, document, and catalog cultural artifacts (like art and architecture) and visual media that represent them	https://www.vraweb.org/cco
Dublin Core (DCMI Schemas)	general purpose, widely used schema that can be used in combination with metadata terms from other, compatible vocabularies in the context of application profiles	https://www.dublincore.org/specifications/dublin-core/dcmi-terms/
Getty Research Institute Vocabularies	geographic names, art & architecture, cultural objects, artist names	http://www.getty.edu/research/tools/vocabularies/
VRA Core	a data standard for the description of images and works of art and culture	https://www.loc.gov/standards/vracore/schemas.html

A metadata standard is a set of established categories you can use to describe your data. It’s recommended that you use one to help ensure your metadata is consistent, structured, and machine-readable, which is essential for depositing data in repositories and making them easily discoverable by search engines.

For more help finding a suitable metadata standard, you can contact the ECU library or reach out to the Portage DMP Coordinator at support@portagenetwork.ca.

The ECU Library uses the DataCite metadata schema, which maps to Dublin Core and DDI, two widely used general metadata standards.

Additionally, there are discipline-specific standards used by museums and galleries that may be useful to describe artworks or design objects, etc. at the item level (e.g., CCO, VRA Core). You can also explore arts-specific data repositories at re3data.org to see what metadata standards they use.

DataCite Metadata Elements

Identifier (such as a persistent DOI – Digital Object Identifier)
Title of data set
Creator (first name, last name, identifier like ORCID/ISNI/ROR, institutional affiliation, ECU's affiliation ID: https://ror.org/03k788b92)
Publisher
Subject keywords (use standardized disciplinary terminology)
Description (how and why the data was collected)
Abstract (how the dataset will be used)
Contributors (contributor type, first name, last name, identifier like ORCID/ISNI/ROR, institutional/organizational affiliation)
Date (YYYY-MM-DD)
Resource Type (image, text, spreadsheet, etc.)
Language
License (e.g., a CC license; describes how the data can be used by others)
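Before depositing, it can help to check that a metadata record covers the elements listed above. The sketch below uses field names paraphrased from this list, not the official property names of the DataCite schema, and the example record is invented.

```python
import re
from datetime import datetime

# Field names paraphrased from the element list above (not official DataCite property names)
EXPECTED_FIELDS = [
    "identifier", "title", "creator", "publisher", "subject", "description",
    "abstract", "contributors", "date", "resource_type", "language", "license",
]


def missing_fields(record):
    """Return the expected fields that are absent or empty in the record."""
    return [field for field in EXPECTED_FIELDS if not record.get(field)]


def valid_date(value):
    """Return True only for a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


# Hypothetical record; all values are placeholders
record = {
    "identifier": "",  # DOI not yet assigned
    "title": "Example interview transcripts",
    "creator": "Example, Researcher",
    "publisher": "Example University",
    "subject": ["interviews", "studio practice"],
    "description": "How and why the data was collected.",
    "abstract": "How the dataset will be used.",
    "date": "2014-04-09",
    "resource_type": "text",
    "language": "en",
    "license": "CC BY 4.0",
}

print(missing_fields(record))      # ['identifier', 'contributors']
print(valid_date(record["date"]))  # True
```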
