Glossary of Terms


All complex subjects have their own terminology that sometimes makes it hard for new people to break into the field. This sometimes includes uncommon words, but more often than not a subject will have very specific meanings for common words - the discussion of errors vs mistakes in this video is a good example of this.

This glossary is a reference of some of the uncommon terms and specific definitions of more common words that you will encounter throughout Data Tree and your broader dealings with data. 

Many of these definitions come from the course materials and experts that helped develop Data Tree. Others come from the CASRAI Dictionary. Those definitions are kindly made available under a Creative Commons Attribution 4.0 International License.




D

Data Analysis

A research data lifecycle stage that involves the techniques that produce synthesised knowledge from organised information. A process of inspecting, cleaning, transforming, and modelling data with the goal of highlighting useful information suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

- CASRAI Dictionary


Data Catalogue

A curated collection of metadata about datasets and their data elements.


- CASRAI Dictionary

Data Centre

A facility providing IT services, such as servers, massive storage, and network connectivity.

Many data centres act as a data repository for certain types of research data.


Data cleaning

Data cleaning is a continuous process that requires corrective actions throughout the data lifecycle. Data cleaning is the process of detecting and correcting corrupt or inaccurate records from a dataset. Data cleaning involves identifying, replacing, modifying, or deleting incomplete, incorrect, inaccurate, inconsistent, irrelevant, and improperly formatted data. Typically, the process involves updating, correcting, standardising, and de-duplicating records to create a single view of the data, even if they are stored in multiple disparate systems.

- CASRAI Dictionary

The most important thing to realise about data cleaning is that it is not just a one-time activity. Cleaning can (and should!) occur at every stage of the research data lifecycle.
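As a minimal sketch of the standardising and de-duplicating steps described above, the Python below cleans a small, made-up list of records. The field names (`name`, `email`) and the `clean_records` helper are hypothetical and chosen only for illustration; a real project would pick rules that suit its own data.

```python
def clean_records(records):
    """Standardise and de-duplicate a list of record dicts (illustrative only)."""
    cleaned = []
    seen_emails = set()
    for record in records:
        # Standardise: trim whitespace and normalise letter case.
        name = record.get("name", "").strip().title()
        email = record.get("email", "").strip().lower()
        # Drop incomplete records (here: records with no email address).
        if not email:
            continue
        # De-duplicate on the standardised email address.
        if email in seen_emails:
            continue
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Made-up example data containing one duplicate and one incomplete record.
raw = [
    {"name": "  ada lovelace ", "email": "Ada@Example.org"},
    {"name": "Ada Lovelace", "email": "ada@example.org "},
    {"name": "Unknown", "email": ""},
]
print(clean_records(raw))
```

Note that the two "Ada Lovelace" rows only become recognisable as duplicates after standardising, which is one reason these steps are usually done together.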


Data curation

A managed process, throughout the data lifecycle, by which data/data collections are cleansed, documented, standardised, formatted and inter-related. This includes versioning data, or forming a new collection from several data sources, annotating with metadata, adding codes to raw data (e.g., classifying a galaxy image with a galaxy type such as "spiral"). Higher levels of curation involve maintaining links with annotation and with other published materials. Thus a dataset may include a citation link to a publication whose analysis was based on the data. The goal of curation is to manage and promote the use of data from its point of creation to ensure it is fit for contemporary purpose and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Special forms of curation may be available in data repositories. The data curation process itself must be documented as part of curation. Thus curation and provenance are highly related.


 - CASRAI Dictionary


Data dictionary

A data dictionary, at its simplest, is a list and description of every variable within a dataset, including information such as the units of measurement and what the variable represents.

For more complex datasets, including multi-level or larger database structures, the data dictionary also includes descriptions of the relationships between tables, and for categorical data with a pre-defined set of possible options (e.g. an "enum" datatype in SQL, data coming from a "select" question in a survey, or other defined data such as days of the week), the data dictionary should also include the list of all possible values.
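To make this concrete, here is a small sketch of a data dictionary held as a Python structure, including the list of allowed values for a categorical variable. The dataset, the variable names and the `undocumented_values` helper are all hypothetical.

```python
# A hypothetical data dictionary for a small weather dataset.
data_dictionary = {
    "station_id": {
        "description": "Identifier of the weather station",
        "type": "string",
        "units": None,
    },
    "rainfall": {
        "description": "Total daily rainfall",
        "type": "float",
        "units": "mm",
    },
    "day": {
        "description": "Day of the week the reading was taken",
        "type": "categorical",
        "units": None,
        # For categorical data, list every possible value.
        "allowed_values": [
            "Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday",
        ],
    },
}


def undocumented_values(column, values):
    """Return the values that the data dictionary does not list for a column."""
    allowed = data_dictionary[column].get("allowed_values")
    if allowed is None:
        # The column is not categorical, so there is nothing to compare against.
        return []
    return [value for value in values if value not in allowed]


print(undocumented_values("day", ["Monday", "Funday", "Sunday"]))
```

Keeping the dictionary in a machine-readable form like this means the same document can describe the data for people and drive simple checks on it.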




Data dredging

This term defines the practice of analysing large volumes of data, seeking possible relationships without any pre-defined hypothesis or goal.

This practice is sometimes employed when working with 'big data' and often leads to premature conclusions. When working with big data, it is very easy to find statistically significant results, but it is much harder to spot if those results are actually meaningful, especially without a specific hypothesis.

Data dredging is sometimes described as "seeking more information from a dataset than it actually contains" (CASRAI).
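The following sketch illustrates why dredging is risky. It generates an outcome and many candidate variables that are all pure random noise, then counts how many candidates appear correlated with the outcome. The function names, the sample sizes and the threshold of 0.361 (roughly the two-sided 5% critical value of the correlation coefficient for 30 points) are assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(n_variables=200, n_points=30, threshold=0.361, seed=1):
    """Count random variables that look correlated with a random outcome."""
    rng = random.Random(seed)
    # The outcome is noise, so no candidate can truly be related to it.
    outcome = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_variables):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson_r(candidate, outcome)) > threshold:
            hits += 1
    return hits


hits = count_spurious()
print(f"{hits} of 200 unrelated variables passed the threshold")
```

With a 5% threshold you would expect roughly one in twenty unrelated variables to pass by chance alone, so testing enough variables without a hypothesis will usually turn up something that looks like a finding.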



Data driven

Analysis and decision making led by the numbers, facts and statistical analysis, rather than intuition or experience.

Data exploration

Data exploration, often called exploratory analysis, uses descriptive statistical methods to learn about and understand the characteristics of a dataset.

This includes exploring measures of central tendency (e.g. mean, median) and measures of spread (e.g. standard deviation, range, variance). It also might include exploring the structure of the data, for example splitting the dataset by a categorical variable, or creating visualisations to view the data in different ways.

This stage of analysis is often where a lot of data cleaning happens, as you can often spot missing data or outliers during this process. 
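A minimal sketch of this kind of exploration, using only Python's standard library, is shown below. The rainfall figures and site names are made up, and `summarise` is a hypothetical helper. The data are split by a categorical variable (the site) and summarised with measures of central tendency and spread.

```python
import statistics
from collections import defaultdict

# Made-up observations: (site, rainfall in mm).
observations = [
    ("north", 12.0), ("north", 15.0), ("north", 9.0),
    ("south", 30.0), ("south", 28.0), ("south", 250.0),
]


def summarise(values):
    """Return simple measures of central tendency and spread."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }


# Split the dataset by the categorical variable.
by_site = defaultdict(list)
for site, rainfall in observations:
    by_site[site].append(rainfall)

summaries = {site: summarise(values) for site, values in by_site.items()}
for site, summary in summaries.items():
    print(site, summary)
```

In the "south" group the mean is far above the median, which hints at a possible outlier (the 250.0 reading) that is worth checking before any further analysis.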


Data file format

The layout of a file in terms of how the data within the file are organised. A program that uses the data in a file must be able to recognise and possibly access data within the file. A particular file format is often indicated as part of a file's name by a filename extension (suffix). Conventionally, the extension is separated by a period from the name and contains three or four letters that identify the format. 

There are as many different file formats as there are different programs to process the files. Examples include: Word documents (.doc), Web text pages (.htm or .html), Web page images (.gif and .jpg), Adobe Postscript files (.ps), Adobe Acrobat files (.pdf), Executable programs (.exe), Multimedia files (.mp3 and others). 

Preferred formats are those designated by a data repository for which the digital content is maintained. If a data file is not in a preferred format, a data curator will often convert the file into a preferred format, thus ensuring that the digital content remains readable and usable. Usually, preferred formats are the de facto standard employed by a particular community. 

- CASRAI Dictionary
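As a small illustration of the filename extension convention described above, the sketch below reads the extension from a filename and looks it up in a table built from the examples in this entry. The `KNOWN_FORMATS` table and the `guess_format` helper are hypothetical and far from exhaustive.

```python
from pathlib import PurePath

# Hypothetical lookup table based on the examples listed in this entry.
KNOWN_FORMATS = {
    ".doc": "Word document",
    ".htm": "Web text page",
    ".html": "Web text page",
    ".gif": "Web page image",
    ".jpg": "Web page image",
    ".ps": "Adobe PostScript file",
    ".pdf": "Adobe Acrobat file",
    ".exe": "Executable program",
    ".mp3": "Multimedia file",
}


def guess_format(filename):
    """Guess a file's format from its extension alone."""
    # suffix returns the final extension including the period, e.g. ".pdf".
    extension = PurePath(filename).suffix.lower()
    return KNOWN_FORMATS.get(extension, "unknown format")


print(guess_format("survey_results.PDF"))
print(guess_format("notes"))
```

Because the extension is only a naming convention, a check like this is a first guess and not proof of what the file actually contains.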


Data integrity

1. In the context of data and network security: The assurance that information can only be accessed or modified by those authorized to do so. 

2. In the context of data quality: The assurance the data are clean, traceable, and fit for purpose.

- CASRAI Dictionary


Data management plan

A formal statement describing how research data will be managed and documented throughout a research project and the terms regarding the subsequent deposit of the data with a data repository for long-term management and preservation.

 - CASRAI Dictionary


Data mining

The process of analysing multivariate datasets using pattern recognition or other knowledge discovery techniques to identify potentially unknown and potentially meaningful data content, relationships, classification, or trends. Data mining parameters include: Association (looking for patterns where one event is connected to another event); Sequence or path analysis (looking for patterns where one event leads to another later event); Classification (looking for new patterns); Clustering (finding and visually documenting groups of facts not previously known); Forecasting, or predictive analytics (discovering patterns in data that can lead to reasonable predictions about the future).

- CASRAI Dictionary


Data point

One measurement, observation or element, a single member of a larger dataset.

Data quality

The reliability and application efficiency of data. It is a perception or an assessment of a dataset's fitness to serve its purpose in a given context. Aspects of data quality include: Accuracy, Completeness, Update status, Relevance, Consistency across data sources, Reliability, Appropriate presentation, Accessibility. Within an organisation, acceptable data quality is crucial to operational and transactional processes and to the reliability of analytics, business intelligence, and reporting. Data quality is affected by the way data are entered, stored and managed. Maintaining data quality requires going through the data periodically and scrubbing it. Typically this involves updating, standardising, and de-duplicating records to create a single view of the data, even if it is stored in multiple disparate systems.

- CASRAI Dictionary


Data quality assurance

Data quality assurance (DQA) is the process of verifying the reliability and overall quality of data. 

 - CASRAI Dictionary

It is a process that should ideally be planned in advance, and integrated into your entire project workflow, from creation / sourcing of data, through processing and analysis and to storage and sharing of data.

Data quality checklists can be used to help identify potential issues with a dataset before you begin your exploratory analysis work. 


Data quality checklist

A data quality checklist is a list of possible issues with a dataset. This list can be created before you start exploring your data to help streamline your data cleaning. If there are physical or logical boundaries that your data should conform to, such as humidity not being above 100%, or age always being 0 or greater, these can form part of your checklist. As such, there is often a strong relationship between a data quality checklist and the data dictionary for your dataset.
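The boundary checks mentioned above can be written down as a simple, reusable checklist. The sketch below is one possible way to do this in Python. The column names, the example rows and the `run_checklist` helper are hypothetical.

```python
# Hypothetical checklist: (description, column, function returning True if valid).
checklist = [
    ("humidity must be between 0 and 100%", "humidity", lambda v: 0 <= v <= 100),
    ("age must be 0 or greater", "age", lambda v: v >= 0),
]


def run_checklist(rows):
    """Return a list of (row index, issue description) for every failed check."""
    issues = []
    for index, row in enumerate(rows):
        for description, column, is_valid in checklist:
            value = row.get(column)
            if value is None:
                issues.append((index, f"{column} is missing"))
            elif not is_valid(value):
                issues.append((index, description))
    return issues


# Made-up rows with one problem each in rows 1, 2 and 3.
rows = [
    {"humidity": 45, "age": 30},
    {"humidity": 120, "age": 25},
    {"humidity": 60, "age": -1},
    {"humidity": None, "age": 40},
]
issues = run_checklist(rows)
for index, issue in issues:
    print(f"row {index}: {issue}")
```

The valid ranges used here are exactly the kind of information a data dictionary would record, which is why the two documents are often maintained together.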


Data Repository

1. A storage location for a collection of data that is too valuable to discard, but is only accessed occasionally.

2. An archival service providing the long-term permanent care and accessibility for digital objects with research value. The standard for such repositories is the Open Archival Information System reference model.

- CASRAI Dictionary


Data scientist

A person who has the knowledge and skills to conduct sophisticated and systematic analyses of data. A data scientist extracts insights from datasets for research or product development, and evaluates and identifies novel or strategic relationships or opportunities.

Data warehouse

Large, ordered repositories of data that can be used for analysis and reporting. In contrast to a data lake, a data warehouse is composed of data that has been cleaned, integrated with other sources, and is generally well-ordered. Data warehouses are often spoken about in relation to big data, but typically are components of more conventional systems.

Dataset

Any organised collection of data in a computational format, defined by a theme or category that reflects what is being measured/observed/monitored. 

- CASRAI Dictionary



Datetime

A standard way to express a numeric calendar date that eliminates ambiguity, acceptable formats being defined by ISO 8601. ISO 8601 is applicable whenever representation of dates in the Gregorian calendar, times in the 24-hour timekeeping system, time intervals and recurring time intervals or of the formats of these representations are included in information interchange. It includes calendar dates expressed in terms of calendar year, calendar month and calendar day of the month; ordinal dates expressed in terms of calendar year and calendar day of the year; week dates expressed in terms of calendar year, calendar week number and calendar day of the week; local time based upon the 24-hour timekeeping system; Coordinated Universal Time of day; local time and the difference from Coordinated Universal Time; combination of date and time of day; time intervals; recurring time intervals. 

- CASRAI Dictionary
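Several of the ISO 8601 representations listed above can be produced with Python's standard `datetime` module, as the sketch below shows. The example moment (5 March 2024, 14:30 UTC) and the UTC+02:00 offset are arbitrary choices made for illustration.

```python
from datetime import datetime, timedelta, timezone

# An arbitrary example moment, expressed in Coordinated Universal Time.
moment = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)

# Combination of date and time of day, with the difference from UTC.
print(moment.isoformat())            # 2024-03-05T14:30:00+00:00

# Calendar date: year, month, day of the month.
print(moment.date().isoformat())     # 2024-03-05

# Ordinal date: year and day of the year.
print(moment.strftime("%Y-%j"))      # 2024-065

# Week date: year, week number, day of the week (Monday is 1).
iso_year, iso_week, iso_weekday = moment.isocalendar()
print(f"{iso_year}-W{iso_week:02d}-{iso_weekday}")  # 2024-W10-2

# Local time and its difference from UTC (here a fixed +02:00 offset).
local = moment.astimezone(timezone(timedelta(hours=2)))
print(local.isoformat())             # 2024-03-05T16:30:00+02:00

# Parsing an ISO 8601 string gives back the same instant in time.
parsed = datetime.fromisoformat("2024-03-05T16:30:00+02:00")
print(parsed == moment)              # True
```

Written this way, "2024-03-05" cannot be misread as 3 May, which is the ambiguity the standard is designed to remove.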


Demonstrator

A one-off system, often software, that shows whether or how data can be used for a specific purpose or task.

Derived research data

Research data resulting from processing or combining 'raw' data, often reproducible but expensive.

Examples: compiled databases, text mining, aggregate census data.


Descriptive statistics

If you have a large set of data, then descriptive statistics provides graphical (e.g. boxplots) and numerical (e.g. summary tables, means, quartiles) ways to make sense of the data. The branch of statistics devoted to the exploration, summary and presentation of data is called descriptive statistics. If you need to go beyond descriptive summaries and presentations, it is usually because you want to use the data to make inferences about some larger population. Inferential statistics is the branch of statistics devoted to making such generalisations.
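As a brief sketch of a numerical summary, the Python below computes the quartiles and the five-number summary that a boxplot draws. The sample values are made up, and `statistics.quantiles` is used with its default settings; other tools may compute quartiles slightly differently.

```python
import statistics

# Made-up sample: number of rain days recorded in seven different months.
rain_days = [3, 5, 7, 8, 12, 14, 21]

# Three cut points that split the sorted data into four equal groups.
q1, q2, q3 = statistics.quantiles(rain_days, n=4)

five_number_summary = {
    "minimum": min(rain_days),
    "lower quartile": q1,
    "median": q2,
    "upper quartile": q3,
    "maximum": max(rain_days),
}
interquartile_range = q3 - q1

for name, value in five_number_summary.items():
    print(f"{name}: {value}")
print(f"interquartile range: {interquartile_range}")
```

These summaries describe only the seven values in hand. Saying anything about rainfall in other months or other places would be a question for inferential statistics.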

Digital Object Identifier

A name (not a location) for an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. A DOI is a type of Persistent Identifier (PID) issued by the International DOI Foundation. This permanent identifier is associated with a digital object that permits it to be referenced reliably even if its location and metadata undergo change over time.

- CASRAI Dictionary
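Because a DOI is a name and not a location, it is normally turned into a web address by handing it to a resolver. The sketch below builds such an address using the public resolver at https://doi.org/. The `doi_to_url` helper and its very light validation are assumptions made for illustration, not an official check of DOI syntax.

```python
from urllib.parse import quote

# The public DOI resolver; it redirects to wherever the object currently lives.
DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi):
    """Build a resolver URL for a DOI name (minimal validation only)."""
    doi = doi.strip()
    # DOI names start with the directory indicator "10." and have the
    # form prefix/suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI name: {doi!r}")
    # Percent-encode characters that are not safe in a URL path.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1000/182"))
```

If the object moves to a new location, the DOI and the resolver address stay the same, and only the record held by the resolver is updated.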


Digitisation

The process of creating digital files by scanning or otherwise converting analogue materials. The resulting digital copy, or digital surrogate, would then be classed as digital material and then subject to the same broad challenges involved in preserving access to it, as "born digital" materials.




Discrete variable

A set of data is discrete if the values belonging to it are distinct, i.e. they can be counted. Examples are the number of children in a family, the number of rain days in the month, the length (in days) of the longest dry spell in the growing season. (See also continuous variable for a more complete discussion.)

Disseminative Visualisation

Data visualisation designed as a presentational aid for disseminating information or insight, with no purpose other than communication.

Dublin Core

An initiative to create a digital "library card catalog" for the Web. Dublin Core is made up of 15 metadata elements that offer expanded cataloging information and improved document indexing for search engine programs. The 15 metadata elements used by Dublin Core are:

  • title (the name given the resource), 
  • creator (the person or organisation responsible for the content), 
  • subject (the topic covered), 
  • description (a textual outline of the content), 
  • publisher (those responsible for making the resource available), 
  • contributor (those who added to the content), 
  • date (when the resource was made available), 
  • type (a category for the content), 
  • format (how the resource is presented),
  • identifier (numerical identifier for the content such as a URL),
  • source (where the content originally derived from),
  • language (in what language the content is written), 
  • relation (how the content relates to other resources, for instance, if it is a chapter in a book), 
  • coverage (the place or time period that the content relates to), 
  • rights (a link to a copyright notice).
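A metadata record using these elements can be sketched as a simple Python dictionary. The record below describes an invented dataset, so every value in it is hypothetical, as is the `missing_elements` helper, which reports the elements that have not been filled in.

```python
# The 15 Dublin Core elements, in the order listed above.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)

# A hypothetical, partly completed record for an invented dataset.
record = {
    "title": "Daily rainfall readings (example dataset)",
    "creator": "Example Research Group",
    "subject": "rainfall",
    "description": "Made-up daily rainfall readings used for illustration.",
    "date": "2024-03-05",
    "type": "Dataset",
    "format": "text/csv",
    "language": "en",
}


def missing_elements(metadata):
    """Return the Dublin Core elements that are absent or empty."""
    return [name for name in DUBLIN_CORE_ELEMENTS if not metadata.get(name)]


print(missing_elements(record))
```

A check like this can help when preparing data for deposit, although which elements are actually required depends on the repository.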