Friday, 4 October 2024, 12:25 PM
Site: Datatree - Data Training Engaging End-users
Course: Introduction to Data Tree (Intro)
Glossary: Glossary of Terms
A

Absolute value

The absolute value is the value of a number, disregarding its sign. It is denoted by a pair of “|” signs. For example, the absolute value of –2.5 is |-2.5| = 2.5.
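Most programming languages provide the absolute value as a built-in function; in Python, for instance:

```python
# abs() returns the value of a number, disregarding its sign.
print(abs(-2.5))  # 2.5
print(abs(3))     # 3
```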

Access

The continued, ongoing usability of a digital resource, retaining all qualities of authenticity, accuracy and functionality deemed essential for the purposes for which the digital material was created and/or acquired. Users who have access can retrieve, manipulate, copy, and store copies on a wide range of hard drives and external devices.

- CASRAI Dictionary

Administrative data

Information collected primarily for administrative, and not research purposes. It includes profiles and curriculum vitae of researchers, the scope and impact of research projects, funding, citations, and research outcomes. This type of data is collected by government departments and other organisations for the purposes of registration, transaction and record keeping, usually during the delivery of a service. These data are also recognized as having research value.


CASRAI Dictionary

Aggregated data

Data that are expressed in a summary form (e.g., summary statistics).


CASRAI Dictionary

Aggregation

The bringing together of elements. Types of aggregation differ in the processes by which elements are brought together, the reason for which they are aggregated or contained as a unit, and the nature of the relations between the member parts.


CASRAI Dictionary

Algorithm

A computable set of steps to achieve a desired result.


CASRAI Dictionary

Alternative hypothesis

The alternative hypothesis, H1, is a statement of what the test is set up to establish. For example, if comparing average annual rainfall in El Nino and ordinary years then we could have:
  • H0, that the two means are equal, i.e. there is no difference between the two types of year; and
  • H1, that the two means are not equal, i.e. there is a difference between the two types of year.
The conclusion from the hypothesis test is either “Reject H0 in favour of H1.” or “Do not reject H0.” If H0 is rejected then the analysis usually continues by establishing, in this case, the extent of the difference between the two types of year.
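As an illustration only (the rainfall figures below are made up), a two-sample t statistic for comparing the means of the two types of year can be computed by hand in Python; H0 is rejected when the statistic is large in magnitude compared with the relevant critical value:

```python
import math
from statistics import mean, variance

# Hypothetical annual rainfall values (arbitrary units) for the two groups
el_nino = [1.0, 2.0, 3.0, 4.0, 5.0]
ordinary = [2.0, 3.0, 4.0, 5.0, 6.0]

# Two-sample t statistic (Welch form): the difference in sample means
# divided by the standard error of that difference.
se = math.sqrt(variance(el_nino) / len(el_nino)
               + variance(ordinary) / len(ordinary))
t = (mean(el_nino) - mean(ordinary)) / se
print(t)  # -1.0 here; compare |t| with a critical value from t tables
```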

Analogue data

Data in the form of analogue materials. RELATED TERM. Analogue materials


CASRAI Dictionary

Analogue signals

Continuous electronic signals.


CASRAI Dictionary

Analytics

The discovery of meaningful multidimensional patterns in data.


CASRAI Dictionary

Anomaly

A rule, practice, or observation that is different from what is normal or usual.


CASRAI Dictionary

Anonymity

A form of privacy that is not usually needed or wanted. There are occasions, however, when a user may want anonymity (for example, to report a crime). The need is sometimes met through the use of a site, called a remailer that re-posts a message from its own address, thus disguising the originator of the message. Unfortunately, many spam distributors also take advantage of remailers.


CASRAI Dictionary

Applied science

The application of existing scientific and professional knowledge to develop practical applications in a scientific field (e.g., actuarial science, agriculture, biology, chemistry, forestry, meteorology, physics, planetary and earth sciences), scientific regulation, or patent.


CASRAI Dictionary

Architecture

Fundamental organization of a system embodied in its components, their relationships to each other and to the environment, and the principles guiding its design and evolution. The term is not always used in normative or prescriptive ways. In some cases, the architecture may need to be flexible, and thus more of an open framework than a fixed set of components and services identical for everyone.


CASRAI Dictionary

Archive

A place or collection containing static records, documents, or other materials for long-term preservation.


CASRAI Dictionary

Archiving

A curation activity that ensures that data are properly selected, stored, and can be accessed, and for which logical and physical integrity are maintained over time, including security and authenticity.


CASRAI Dictionary

At-risk data

Data that are at risk of being lost. At-risk data include data that are not easily accessible, have been dispersed, have been separated from the research output object, are stored on a medium that is obsolete or at risk of deterioration, data that were not recorded in digital form, and digital data that are available but are not useable because they have been detached from supporting data, metadata, and information needed to use and interpret them intelligently.


CASRAI Dictionary

Average

For a numeric variable, the average is a loosely used term for a measure of location. It is usually taken to be the mean, but it can also denote the median or the mode, among other measures.
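The three common measures of location can be computed with Python's standard library; the small dataset below is made up:

```python
from statistics import mean, median, mode

data = [1, 2, 2, 3, 10]
print(mean(data))    # 3.6 -- pulled upwards by the extreme value 10
print(median(data))  # 2
print(mode(data))    # 2
```

Note how the single extreme value affects the mean but not the median or mode, which is why "average" alone can be ambiguous.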
B

Best practice

A technique or methodology that, through experience and research, has proven to reliably lead to a desired result. A commitment to using the best practices in any field is a commitment to using all the knowledge and technology at one's disposal to ensure success. The term is used frequently in the fields of health care, government administration, the education system, project management, hardware and software product development, analytical chemistry, and elsewhere.


Big data

1. An evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that have the potential to be mined for information.

2. Data that would take too much time and cost too much money to load into relational databases for analysis (typically petabytes and exabytes of data). 

3. Extensive datasets/collections/linked data primarily characterized by big volume, extensive variety, high velocity (creation and use), and/or variability that together require a scalable architecture for efficient data storage, manipulation, and analysis. In general, the size is beyond the ability of typical database software tools to capture, store, manage and analyze. It is assumed that as technology advances over time, the size of datasets that qualify as big data will increase. Also the definition can vary by sector, depending on what kind of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).


Black box

Any device whose workings are not understood by or accessible to its user.


Born digital

Digital materials which are not intended to have an analogue equivalent, either as the originating source or as a result of conversion to analogue form. 

This term is used to differentiate them from 

  1. digital materials which have been created as a result of converting analogue originals; and 
  2. digital materials, which may have originated from a digital source but have been printed to paper, e.g. some electronic records.

CASRAI Dictionary

Box Plot

A graphical representation of numerical data, based on the five-number summary and introduced by John Wilder Tukey in 1970. The diagram has a scale in one direction only. A rectangular box is drawn, extending from the lower quartile to the upper quartile, with the median shown dividing the box. ‘Whiskers’ are then drawn extending from the end of the box to the greatest and least values. Multiple boxplots, arranged side by side, can be used for the comparison of several samples.
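The five-number summary underlying a box plot can be computed with Python's standard library (the data below are hypothetical; note that different quartile conventions give slightly different values):

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Quartiles using the 'inclusive' method (linear interpolation)
q1, q2, q3 = quantiles(data, n=4, method='inclusive')

# Five-number summary: minimum, lower quartile, median, upper quartile, maximum
five_number = (min(data), q1, q2, q3, max(data))
print(five_number)  # (1, 3.0, 5.0, 7.0, 9)
```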

Bug

A coding error in a computer program which causes the program to perform in an unintended or unanticipated manner.


CASRAI Dictionary
C

Catalogue

A type of collection that describes, and points to, features of another collection.


CASRAI Dictionary

Cataloguing

An intellectual process of describing objects in accordance with accepted library principles, particularly those of subject and classification order.


CASRAI Dictionary

Categorical Variable

A variable with values that range over categories, rather than being numerical. Examples include gender (male, female), paint colour (red, white, blue), type of animal (elephant, leopard, lion). Some categorical variables are ordinal.

Causation

The capacity of one variable to influence another. The first variable may bring the second into existence or may cause the incidence of the second variable to fluctuate. RELATED TERM. Correlation



CASRAI Dictionary

Change log

A record that tracks the progress of each change from submission through review, approval, implementation and closure. The log can be managed manually by using a document or spreadsheet, or it can be managed automatically with a software or Web-based tool.


CASRAI Dictionary

Checksum

A value used to test whether a file has changed over time. A checksum is a type of metadata and an important property of a data object, allowing its identity and integrity to be verified. Also called a hash, a checksum is a small piece of data computed from the contents of a digital object that is used to verify the fixity or stability of that object. It is most commonly used to detect whether some representation of a digital object has changed over time. Checksums are associated with PIDs but can be found and tested independently of PID systems.


CASRAI Dictionary
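In practice a checksum is computed with a hash function such as SHA-256; a minimal sketch using Python's hashlib (the file contents below are invented):

```python
import hashlib

original = b"annual rainfall: 1023 mm"
tampered = b"annual rainfall: 1024 mm"

def checksum(data: bytes) -> str:
    """Return the SHA-256 checksum of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Recomputing the checksum on unchanged data always gives the same value...
assert checksum(original) == checksum(original)
# ...while any change to the content, however small, changes the checksum.
assert checksum(original) != checksum(tampered)
```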

Citable data

A type of referable data that has undergone quality assessment and can be referred to as citations in publications and as part of research objects.


CASRAI Dictionary

Climate

Long-term weather patterns for a location or area, measured in averages, maxima and minima. Typically a minimum of 30 years of weather is considered to be the basis of a climate.

Climate simulation

Using computer models and quantitative methods to represent the atmosphere, oceans, land, ice and energy budget of the Earth.

Cloud computing

A large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualised, dynamically- scalable, managed computing power, storage, platforms and services are delivered on demand to external customers over the Internet. 

Key elements:

  • it is a specialised distributed computing paradigm; 
  • it is massively scalable; 
  • it can be encapsulated as an abstract entity that delivers different levels of services to customers outside the Cloud; 
  • it is driven by economies of scale; and, 
  • the services can be dynamically configured (via virtualisation or other approaches) and delivered on demand.

CASRAI Dictionary

Cluster computing

Using multiple machines linked together and managing their collective capabilities to complete tasks. Computer clusters require a cluster management layer which handles communication between the individual nodes and coordinates work assignment.

Comma separated values

A file that contains the values in a table as a series of ASCII text lines organized so that each column value is separated by a comma from the next column's value and each row starts a new line.


CASRAI Dictionary
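Python's standard csv module reads such files directly; a small sketch with made-up data (an in-memory string stands in for a file on disk):

```python
import csv
import io

# A small CSV table: a header row, then one record per line
text = "site,rainfall_mm\nA,102\nB,87\n"

rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["rainfall_mm"])  # '102' -- note that values are read as strings
```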

Compute intensive

Any computer application that requires a lot of computation, such as meteorology programs and other scientific applications.



Computer code

1. Computer code, or source code: A series of computer instructions written in some human readable computer language, usually stored in a text file. Computer code should include explanatory comments. 

2. Machine code: Source code is 'compiled' or 'interpreted' to produce computer executable code. 

3. A code is a collection of mandatory standards, which has been codified by a governmental authority and thus become part of the law for the jurisdiction represented by that authority. Examples include the National Building Code and the National Electrical Code. 

SYNONYM. Code; Source code; Script


CASRAI Dictionary

Confidence interval

A confidence interval gives an estimated range of values that is likely to include an unknown population parameter. For example suppose a study of planting dates for maize, and the interest is in estimating the upper quartile, i.e. the date by which a farmer will be able to plant in ¾ of the years. Suppose the estimate from the sample is day 332, i.e. 27th November and the 95% confidence interval is from day 325 to 339, i.e. 20th November to 4th December. Then the interpretation is that the true upper quartile is highly likely to be within this period. The width of the confidence interval gives an idea of how uncertain we are about the unknown parameter (see precision). A very wide interval (in the example it is ± 7 days) may indicate that more data needs to be collected before an effective analysis can be undertaken.
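An approximate 95% confidence interval for a mean can be sketched with the standard library; the data are hypothetical, and the normal-approximation multiplier 1.96 is used (for small samples a t multiplier would be more appropriate):

```python
import math
from statistics import mean, stdev

data = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical sample

m = mean(data)
se = stdev(data) / math.sqrt(len(data))  # standard error of the mean

# Approximate 95% confidence interval using the normal approximation
lower, upper = m - 1.96 * se, m + 1.96 * se
print((round(lower, 2), round(upper, 2)))  # (1.61, 4.39)
```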

Confidential information

Any information obtained by a person on the understanding that they will not disclose it to others, or obtained in circumstances where it is expected that they will not disclose it. For example, the law assumes that whenever people give personal information to health professionals caring for them, it is confidential as long as it remains personally identifiable.


CASRAI Dictionary

Confidentiality

1. The duties and practices of people and organizations to ensure that individuals' personal information only flows from one entity to another according to legislated or otherwise broadly accepted norms and policies. 

2. In the context of health data: Confidentiality is breached whenever personal information is communicated that is not authorized by legislation, professional obligations, or under contractual duties.


CASRAI Dictionary

Continuous variable

A numeric variable is continuous if the observations may take any value within an interval. Variables such as height, weight and temperature are continuous. In descriptive statistics the distinction between discrete and continuous variables is not very important. The same summary measures, like mean, median and standard deviation can be used. There is often a bigger difference once inferential methods are used in the analysis. The model that is assumed to generate a discrete variable is different from the models that are appropriate for a continuous variable. Hence different parameters are estimated and used. (See also discrete variable, mixed variable.)

Copernicus

Earth Observation programme of the European Space Agency, primarily using the Sentinel series of satellites, to improve the understanding of and management of the environment.

Correlation

A statistical measure that indicates the extent to which two or more variables fluctuate together. Correlation does not imply causation. There may be, for example, an unknown factor that influences both variables similarly.


CASRAI Dictionary

Cryosphere

The part of the Earth-system where water is frozen, including glaciers and sea-ice.

Curation

The activity of managing and promoting the use of data from their point of creation to ensure that they are fit for contemporary purpose and available for discovery and reuse. For dynamic datasets this may mean continuous enrichment or updating to keep them fit for purpose. Higher levels of curation will also involve links with annotation and with other published materials.


D

Data Analysis

A research data lifecycle stage that involves the techniques that produce synthesised knowledge from organised information. A process of inspecting, cleaning, transforming, and modelling data with the goal of highlighting useful information suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

CASRAI Dictionary

Data Catalogue

A curated collection of metadata about datasets and their data elements.


CASRAI Dictionary

Data Centre

A facility providing IT services, such as servers, massive storage, and network connectivity.

Many data centres act as a data repository for certain types of research data.

Data cleaning

Data cleaning is a continuous process that requires corrective actions throughout the data lifecycle. Data cleaning is the process of detecting and correcting corrupt or inaccurate records in a dataset. It involves identifying, replacing, modifying, or deleting incomplete, incorrect, inaccurate, inconsistent, irrelevant, and improperly formatted data. Typically, the process involves updating, correcting, standardising, and de-duplicating records to create a single view of the data, even if they are stored in multiple disparate systems. 

- CASRAI Dictionary

The most important thing to realise about data cleaning is that it is not just a one-time activity. Cleaning can (and should!) occur at every stage of the research data lifecycle.

Data curation

A managed process, throughout the data lifecycle, by which data/data collections are cleansed, documented, standardised, formatted and inter-related. This includes versioning data, or forming a new collection from several data sources, annotating with metadata, adding codes to raw data (e.g., classifying a galaxy image with a galaxy type such as "spiral"). Higher levels of curation involve maintaining links with annotation and with other published materials. Thus a dataset may include a citation link to publication whose analysis was based on the data. The goal of curation is to manage and promote the use of data from its point of creation to ensure it is fit for contemporary purpose and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Special forms of curation may be available in data repositories. The data curation process itself must be documented as part of curation. Thus curation and provenance are highly related.


 - CASRAI Dictionary

Data dictionary

A data dictionary, at its simplest, is a list and description of every variable within a dataset, including information such as the units of measurement and what the variable represents.

For more complex datasets, including multi-level or larger database structures, the data dictionary also includes descriptions of the relationships between tables, and for categorical data with a pre-defined set of possible options (e.g. an "enum" datatype in SQL, data coming from a "select" question in a survey, or other defined data such as days of the week), the data dictionary should also include the list of all possible values.
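At its simplest, a data dictionary can be represented as a small lookup table; the sketch below uses made-up variables from a hypothetical weather dataset to show the kind of information each entry should carry:

```python
# A minimal data dictionary: one entry per variable, with units, a
# description, and (for categorical variables) the list of allowed values.
data_dictionary = {
    "rainfall": {"units": "mm", "description": "Daily rainfall total"},
    "max_temp": {"units": "degrees C", "description": "Daily maximum temperature"},
    "day_of_week": {
        "units": None,
        "description": "Day on which the observation was made",
        "allowed_values": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    },
}

print(data_dictionary["rainfall"]["units"])  # mm
```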



Data dredging

This term describes the practice of analysing large volumes of data, seeking possible relationships without any pre-defined hypothesis or goal. 

This practice is sometimes employed when working with 'big data' and often leads to premature conclusions. When working with big data, it is very easy to find statistically significant results, but it is much harder to spot whether those results are actually meaningful, especially without a specific hypothesis. 

Data dredging is sometimes described as "seeking more information from a dataset than it actually contains" (CASRAI).


Data driven

Analysis and decision making led by the numbers, facts and statistical analysis, rather than intuition or experience.

Data exploration

Data exploration, often called exploratory analysis, uses descriptive statistical methods to learn about and understand the characteristics of a dataset.

This includes exploring measures of central tendency (e.g. mean, median), measures of spread (standard deviation, range, variance). It also might include exploring the structure of the data, for example splitting the dataset by a categorical variable, or creating visualisations to view the data in different ways. 

This stage of analysis is often where a lot of data cleaning happens, as you can often spot missing data or outliers during this process. 

Data file format

The layout of a file in terms of how the data within the file are organised. A program that uses the data in a file must be able to recognise and possibly access data within the file. A particular file format is often indicated as part of a file's name by a filename extension (suffix). Conventionally, the extension is separated by a period from the name and contains three or four letters that identify the format. 

There are as many different file formats as there are different programs to process the files. Examples include: Word documents (.doc), Web text pages (.htm or .html), Web page images (.gif and .jpg), Adobe Postscript files (.ps), Adobe Acrobat files (.pdf), Executable programs (.exe), Multimedia files (.mp3 and others). 

Preferred formats are those designated by a data repository for which the digital content is maintained. If a data file is not in a preferred format, a data curator will often convert the file into a preferred format, thus ensuring that the digital content remains readable and usable. Usually, preferred formats are the de facto standard employed by a particular community. 

- CASRAI Dictionary

Data integrity

1. In the context of data and network security: The assurance that information can only be accessed or modified by those authorized to do so. 

2. In the context of data quality: The assurance the data are clean, traceable, and fit for purpose.

- CASRAI Dictionary

Data management plan

A formal statement describing how research data will be managed and documented throughout a research project and the terms regarding the subsequent deposit of the data with a data repository for long-term management and preservation.

 - CASRAI Dictionary

Data mining

The process of analysing multivariate datasets using pattern recognition or other knowledge discovery techniques to identify potentially unknown and potentially meaningful data content, relationships, classification, or trends. Data mining parameters include: Association (looking for patterns where one event is connected to another event); Sequence or path analysis (looking for patterns where one event leads to another later event); Classification (looking for new patterns); Clustering (finding and visually documenting groups of facts not previously known); Forecasting, or predictive analytics (discovering patterns in data that can lead to reasonable predictions about the future).

- CASRAI Dictionary

Data point

One measurement, observation or element, a single member of a larger dataset.

Data quality

The reliability and application efficiency of data. It is a perception or an assessment of a dataset's fitness to serve its purpose in a given context. Aspects of data quality include: Accuracy, Completeness, Update status, Relevance, Consistency across data sources, Reliability, Appropriate presentation, Accessibility. Within an organisation, acceptable data quality is crucial to operational and transactional processes and to the reliability of analytics, business intelligence, and reporting. Data quality is affected by the way data are entered, stored and managed. Maintaining data quality requires going through the data periodically and scrubbing them. Typically this involves updating, standardising, and de-duplicating records to create a single view of the data, even if they are stored in multiple disparate systems. 

- CASRAI Dictionary

Data quality assurance

Data quality assurance (DQA) is the process of verifying the reliability and overall quality of data. 

 - CASRAI Dictionary

It is a process that should ideally be planned in advance, and integrated into your entire project workflow, from creation / sourcing of data, through processing and analysis and to storage and sharing of data.

Data quality checklists can be used to help identify potential issues with a dataset before you begin your exploratory analysis work. 

Data quality checklist

A data quality checklist is a list of possible issues with a dataset. This list can be created before you start exploring your data to help streamline your data cleaning. If there are physical or logical boundaries that your data should conform to, such as humidity not being above 100%, or age always being 0 or greater, these can form part of your checklist. As such, there is often a strong relationship between a data quality checklist and the data dictionary for your dataset.
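A checklist like this can be encoded directly as a set of automated checks; a minimal sketch in Python, using the two hypothetical boundary rules mentioned above:

```python
# Each rule pairs a description with a test applied to one data record.
checklist = [
    ("humidity must not exceed 100%", lambda r: r["humidity"] <= 100),
    ("age must be 0 or greater",      lambda r: r["age"] >= 0),
]

def quality_issues(record):
    """Return the descriptions of every checklist rule the record fails."""
    return [desc for desc, check in checklist if not check(record)]

print(quality_issues({"humidity": 85, "age": 34}))   # [] -- record is clean
print(quality_issues({"humidity": 120, "age": -1}))  # both rules fail
```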

Data Repository

1. A storage location for a collection of data that is too valuable to discard, but is only accessed occasionally.

2. An archival service providing the long-term permanent care and accessibility for digital objects with research value. The standard for such repositories is the Open Archival Information System reference model

CASRAI Dictionary

Data scientist

A person who has the knowledge and skills to conduct sophisticated and systematic analyses of data. A data scientist extracts insights from datasets for research or product development, and evaluates and identifies novel or strategic relationships or opportunities.

Data warehouse

Large, ordered repositories of data that can be used for analysis and reporting. In contrast to a data lake, a data warehouse is composed of data that has been cleaned, integrated with other sources, and is generally well-ordered. Data warehouses are often spoken about in relation to big data, but typically are components of more conventional systems.

Dataset

Any organised collection of data in a computational format, defined by a theme or category that reflects what is being measured/observed/monitored. 

- CASRAI Dictionary


Datetime

A standard way to express a numeric calendar date that eliminates ambiguity, acceptable formats being defined by ISO 8601. ISO 8601 is applicable whenever representation of dates in the Gregorian calendar, times in the 24-hour timekeeping system, time intervals and recurring time intervals or of the formats of these representations are included in information interchange. It includes calendar dates expressed in terms of calendar year, calendar month and calendar day of the month; ordinal dates expressed in terms of calendar year and calendar day of the year; week dates expressed in terms of calendar year, calendar week number and calendar day of the week; local time based upon the 24-hour timekeeping system; Coordinated Universal Time of day; local time and the difference from Coordinated Universal Time; combination of date and time of day; time intervals; recurring time intervals. 

- CASRAI Dictionary
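Python's standard datetime module produces and parses ISO 8601 representations directly; the date below is arbitrary:

```python
from datetime import datetime, timezone

# An unambiguous ISO 8601 combined date and time, in Coordinated Universal Time
moment = datetime(2024, 10, 4, 12, 25, tzinfo=timezone.utc)
stamp = moment.isoformat()
print(stamp)  # 2024-10-04T12:25:00+00:00

# The string round-trips back to the same moment in time
assert datetime.fromisoformat(stamp) == moment
```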

Demonstrator

A one-off system, often software, that shows whether or how data can be used for a specific purpose or task.

Derived research data

Research data resulting from processing or combining 'raw' data, often reproducible but expensive.

Examples: compiled databases, text mining, aggregate census data.

Descriptive statistics

If you have a large set of data, then descriptive statistics provides graphical (e.g. boxplots) and numerical (e.g. summary tables, means, quartiles) ways to make sense of the data. The branch of statistics devoted to the exploration, summary and presentation of data is called descriptive statistics. If you need to do more than describe, summarise and present the data, the next step is to use the data to make inferences about some larger population. Inferential statistics is the branch of statistics devoted to making such generalisations.

Digital Object Identifier

A name (not a location) for an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. A DOI is a type of Persistent Identifier (PID) issued by the International DOI Foundation. This permanent identifier is associated with a digital object that permits it to be referenced reliably even if its location and metadata undergo change over time.

- CASRAI Dictionary

Digitisation

The process of creating digital files by scanning or otherwise converting analogue materials. The resulting digital copy, or digital surrogate, would then be classed as digital material and then subject to the same broad challenges involved in preserving access to it, as "born digital" materials.



Discrete variable

A set of data is discrete if the values belonging to it are distinct, i.e. they can be counted. Examples are the number of children in a family, the number of rain days in the month, the length (in days) of the longest dry spell in the growing season. (See also continuous variable for a more complete discussion.)

Disseminative Visualisation

Data visualisation designed as a presentational aid for disseminating information or insight, with no purpose other than communication.

Dublin Core

An initiative to create a digital "library card catalog" for the Web. Dublin Core is made up of 15 metadata elements that offer expanded cataloging information and improved document indexing for search engine programs. The 15 metadata elements used by Dublin Core are:

  • title (the name given the resource), 
  • creator (the person or organisation responsible for the content), 
  • subject (the topic covered), 
  • description (a textual outline of the content), 
  • publisher (those responsible for making the resource available), 
  • contributor (those who added to the content), 
  • date (when the resource was made available), 
  • type (a category for the content), 
  • format (how the resource is presented),
  • identifier (numerical identifier for the content such as a URL),
  • source (where the content originally derived from),
  • language (in what language the content is written), 
  • relation (how the content relates to other resources, for instance, if it is a chapter in a book), 
  • coverage (where the resource is physically located), 
  • rights (a link to a copyright notice).

E

e-Infrastructure

A combination and interworking of digitally-based technology (hardware and software), resources (data, services, digital libraries), communications (protocols, access rights and networks), and the people and organisational structures needed to support modern, internationally leading collaborative research, be it in the arts and humanities or the sciences. http://www.rcuk.ac.uk/research/xrcprogrammes/otherprogs/einfrastructure/

E-Research

Computationally intensive, large-scale, networked and collaborative forms of research and scholarship across all disciplines, including all of the natural and physical sciences, related applied and technological disciplines, biomedicine, social science and the digital humanities.

- CASRAI Dictionary

Earth Observation

Gathering information about the Earth's physical systems via remote sensing technologies, often satellites which look down at the Earth from their orbit.

Electromagnetic Spectrum

The range of wavelengths of electromagnetic radiation, from gamma rays with short wavelengths and high energy, to radio waves with long wavelengths and low energy. Visible light is part of the electromagnetic spectrum. Examples of the use of electromagnetic radiation: https://www.bbc.co.uk/education/guides/z66g87h/revision/3

ENIAC

Electronic Numerical Integrator And Computer, the world's first general-purpose computer; designed and built to calculate artillery firing tables in the 1940s and later used for early computerised weather prediction. https://www.thoughtco.com/history-of-the-eniac-computer-1991601

Ensemble

In weather forecasting an ensemble is a method whereby instead of making a single forecast, a set of forecasts are produced that present a range of future weather possibilities. https://www.ecmwf.int/en/about/media-centre/fact-sheet-ensemble-weather-forecasting

Environmental analytics

Analysis of data sourced from the environment, or data with an application relating to the environment.

Environmental consultant

Works on a contractual basis for private and public sector clients, addressing environmental issues such as water pollution, air quality and soil contamination. [www.sokanu.com]

Environmental research data

Individual items or records (both digital and analogue) usually obtained by measurement, observation or modelling of the natural world and the impact of humans upon it, including all necessary calibration and quality control. This includes data generated through complex systems, such as information retrieval algorithms, data assimilation techniques and the application of numerical models. However, it does not include the models themselves. 

 - NERC Data Policy


Examples of research data:

  • Model output from running a numerical climate model
  • Time series logged by environmental instrumentation
  • Conductivity-Temperature-Depth casts from oceanographic cruises
  • Groundwater chemistry and stable isotope measurements
  • Butterfly abundance observations.

Error

Error is the difference between the measured value and the ‘true value’ (NPL, 1999).  Errors can come from the measuring device itself, including bias, changes due to wear, instrument drift, electrical noise and device resolution.  Other errors can be introduced by difficulties in performing the measurement and by operator skill.  To avoid sampling error, sufficiently dense measurements in space and time should take place to make sure that full variability is captured e.g. diurnal cycles, variations across a site.

Errors can be random or systematic (NPL, 1999).  With random errors, each measurement gives a different result, so the more measurements (of the same thing) the better the estimate and the more certain the measurement becomes.  Systematic errors arise from a bias, e.g., a stretched tape measure, and more measurements do not produce a better estimate of the ‘true value’.
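The contrast between random and systematic error can be illustrated with a short simulation. This is a minimal sketch using hypothetical temperature readings: the 'true' value, the noise level and the bias are all invented for illustration.

```python
# Sketch: random error averages out, systematic error (bias) does not.
import random

random.seed(42)
true_value = 20.0  # hypothetical 'true' temperature in degrees C

# Random error: each reading scatters around the true value, so the
# more readings we average, the closer we get to the true value.
random_readings = [true_value + random.gauss(0, 0.5) for _ in range(1000)]
mean_random = sum(random_readings) / len(random_readings)

# Systematic error: a fixed bias (e.g. a sensor reading 0.3 C high)
# shifts every reading, and no amount of averaging removes it.
biased_readings = [r + 0.3 for r in random_readings]
mean_biased = sum(biased_readings) / len(biased_readings)

print(f"mean with random error only: {mean_random:.2f}")
print(f"mean with added bias:        {mean_biased:.2f}")
```

The averaged biased readings sit about 0.3 above the true value no matter how many measurements are taken, which is why systematic errors must be removed by calibration rather than repetition.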


Estimation

Estimation is the process by which sample data are used to indicate the value of an unknown quantity in a population. The results of estimation can be expressed as a single value, known as a point estimate. It is usual to also give a measure of precision of the estimate. This is called the standard error of the estimate. A range of values, known as a confidence interval can also be given.
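The three outputs described above (point estimate, standard error, confidence interval) can be computed with the Python standard library. The rainfall values below are hypothetical, and the 1.96 multiplier is the large-sample normal approximation; a t multiplier would be more appropriate for a sample this small.

```python
# Sketch: point estimate, standard error and approximate 95% confidence
# interval for a mean, using hypothetical annual rainfall totals (mm).
import statistics

rainfall = [812, 945, 1031, 876, 990, 1105, 868, 934]

n = len(rainfall)
point_estimate = statistics.mean(rainfall)        # the point estimate
se = statistics.stdev(rainfall) / n ** 0.5        # standard error of the mean

# Approximate 95% confidence interval: estimate +/- 1.96 standard errors.
ci = (point_estimate - 1.96 * se, point_estimate + 1.96 * se)

print(f"estimate = {point_estimate:.1f} mm, s.e. = {se:.1f} mm")
print(f"95% CI approx. ({ci[0]:.1f}, {ci[1]:.1f}) mm")
```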

Estimator

An estimator is a quantity calculated from the sample data, which is used to give information about an unknown quantity (usually a parameter) in the population. For example, the sample mean is an estimator of the population mean.

Exa-

Prefix denoting a factor of 10^18 or a billion billion

Experimental research data

Research data from experimental results, often reproducible, but can be expensive.

Examples: data from lab equipment, metagenomic sequences recovered from soil samples, results of a field experiment.

F

FAIR Data Principle

Data that follows the FAIR principles must be:

  • Findable
  • Accessible
  • Interoperable
  • Reusable

FAIR is a set of guiding principles for data management and stewardship designed by stakeholders representing interests in academia, industry, funding agencies and scholarly publishers. The FAIR principles define a set of core enabling conditions which, if fulfilled for a given set of data, would ensure that they remain accessible and re-usable over the long term. 

A key element of these principles is the focus on the use of structured information and persistent identifiers to enable machine discoverability and use of the data.

The full set of principles are published in the article "The FAIR Guiding Principles for scientific data management and stewardship". - DOIhttps://doi.org/10.1038/sdata.2016.18

Fair use

A legal concept that allows the reproduction of copyrighted material for certain purposes without obtaining permission and without paying a fee or royalty. Purposes permitting the application of fair use generally include review, news reporting, teaching, or scholarly research. When in doubt, the quickest and simplest thing may be to request permission of the copyright owner.

- CASRAI Dictionary

Flop

FLOating Point operation, a single calculation on a number with a decimal point i.e. not an integer. In computing, floating point operations per second (FLOPS) is a measure of computer performance, useful in fields of scientific computations that require floating-point calculations.

FTP

File Transfer Protocol, a standardised set of rules to allow upload and download of files between two computers, commonly used for exchanging files over the Internet.
G

Geostationary

A satellite whose orbit above the equator matches the Earth's rotation, so that it appears to remain stationary, viewing the same portion of the Earth's surface. Geostationary orbits are often used for TV or radio broadcasting and some meteorological satellites.

Giga-

Prefix denoting a factor of 10^9 or a billion
H

Heat map

A two-dimensional representation of data in which values are represented by colors. Heat maps communicate relationships between data values that would be much more difficult to understand if presented numerically in a spreadsheet.

- CASRAI Dictionary

Hypothesis Test

Testing hypotheses is a common part of statistical inference. To formulate a test, the question of interest is simplified into two competing hypotheses, between which we have a choice: the null hypothesis, denoted by H0, against the alternative hypothesis, denoted by H1. For example, with 50 years of annual rainfall totals, a hypothesis test could be whether the mean is different in El Nino and Ordinary years. Then usually:

  • The null hypothesis, H0, is that the two means are equal, i.e. there is no difference.
  • The alternative hypothesis, H1, is that the two means are unequal, i.e. there is a difference.

If the 50 years were considered as being of three types, El Nino, Ordinary, La Nina, then usually:

  • The null hypothesis, H0, is that all three means are equal.
  • The alternative hypothesis, H1, is that there is a difference somewhere between the means.

The hypotheses are often statements about population parameters. In the first example above it might be:

  • H0: µE = µO.
  • H1: µE ≠ µO.

The outcome of a hypothesis test is either:

  • Reject H0 in favour of H1, or
  • Do not reject H0.
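The two-mean rainfall comparison can be sketched in code. The figures below are hypothetical, and for simplicity the test statistic is compared against a normal distribution rather than the exact t distribution, so this is an approximation, not a full two-sample t-test.

```python
# Sketch: comparing mean annual rainfall in 'El Nino' vs 'Ordinary' years,
# using hypothetical data and a normal approximation to the test statistic.
from statistics import mean, stdev, NormalDist

el_nino = [1210, 1185, 1320, 1275, 1240, 1198]   # hypothetical totals (mm)
ordinary = [1050, 1120, 980, 1075, 1010, 1095]

# Difference between sample means and its standard error.
diff = mean(el_nino) - mean(ordinary)
se = (stdev(el_nino) ** 2 / len(el_nino)
      + stdev(ordinary) ** 2 / len(ordinary)) ** 0.5
z = diff / se

# Two-sided p-value under H0 (the two means are equal).
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, p = {p_value:.2g}")
```

A small p-value (see the p-value entry) would lead us to reject H0 in favour of H1.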
I

Inference

Inference is the process of deducing properties of the underlying distribution or population, by analysis of data. It is the process of making generalizations from the sample to a population.

Informatics

The science of collecting, classifying, storing, retrieving and disseminating data and/or knowledge.

Information

The aggregation of data to make coherent observations about the world, meaningful data, or data arranged or interpreted in a way to provide meaning.

- CASRAI Dictionary

It is often considered the job of the scientist, researcher or statistician to derive information from raw data.

Infrared

In the electromagnetic spectrum, the visible light region runs from violet at shorter wavelengths (higher energies) to red at longer wavelengths (lower energies). Infrared radiation has wavelengths just greater than those of red light and is emitted particularly by heated objects. For example, night vision goggles use infrared radiation.

Integer

A number which is not a fraction; a whole number.

Inter-quartile range

The interquartile range is the difference between the upper and lower quartiles. If the lower and upper quartiles are denoted by Q1 and Q3, respectively, the interquartile range is (Q3 - Q1). The phrase ‘interquartile range’ was first used by Galton in 1882.
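The quartiles and interquartile range can be computed with the Python standard library. The rainfall values below are hypothetical; note that `statistics.quantiles` defaults to the 'exclusive' method, which uses (n+1)-based positions like the rule given under the Percentile entry.

```python
# Sketch: quartiles and interquartile range with the standard library.
import statistics

rainfall = [880, 910, 945, 1002, 1030, 1061, 1120, 1185]  # hypothetical, sorted

# statistics.quantiles with n=4 returns the three quartiles Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(rainfall, n=4)
iqr = q3 - q1

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
```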

Internet of Things

Abbreviated to IoT, a broad term for devices interconnected via the internet enabling sending and receiving of data or instructions. These devices include everyday items such as home appliances or cameras. Sometimes also referred to as ‘smart devices'.

Interoperability

The capability to communicate, execute programs, or transfer data among various functional units in a useful and meaningful manner that requires the user to have little or no knowledge of the unique characteristics of those units. Foundational, syntactic, and semantic interoperability are the three necessary aspects of interoperability.

- CASRAI Dictionary

J

JASMIN

Petabyte-scale, easily accessible storage collocated with data analysis computing facilities, run by the Science and Technology Facilities Council for researchers and the science community in the UK. http://www.jasmin.ac.uk/
K

Kilo-

Prefix denoting a factor of 10^3 or a thousand
M

Machine learning

The study and practice of designing systems that can learn, adjust, and improve automatically, based on the data fed to them. This typically involves implementation of predictive and statistical algorithms that focus on 'correct' behaviour and insights as data flows through the system.

MapReduce

A big data algorithm for scheduling work on a computing cluster. The process involves splitting the problem set up, mapping it to different nodes (map), and computing over them to produce intermediate results, shuffling the results to align like sets, and then reducing the results by outputting a single value for each set (reduce).
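The map, shuffle and reduce stages can be illustrated in a few lines of single-process code. This is only a sketch of the idea with made-up input records; a real MapReduce framework distributes each stage across cluster nodes.

```python
# Sketch of map -> shuffle -> reduce on one machine (word-count style).
from collections import defaultdict

records = ["rain sun rain", "sun cloud", "rain cloud cloud"]  # hypothetical input

# Map: each record emits (key, 1) pairs.
mapped = [(word, 1) for record in records for word in record.split()]

# Shuffle: group intermediate pairs by key, aligning like sets.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: collapse each group to a single output value.
counts = {key: sum(values) for key, values in groups.items()}

print(counts)  # e.g. {'rain': 3, 'sun': 2, 'cloud': 3}
```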

Mega-

Prefix denoting a factor of 10^6 or a million

Metadata

Background or contextual data about a dataset. Literally "data about data". Metadata is required to enable someone to properly understand and interpret a main dataset.

Examples: 

  • The research questions that the data was collected to address
  • Any relevant environmental conditions affecting the main variables
  • The instruments used to collect or generate the data, including specification and calibration details.
  • The methodology used to collect or generate the data,
  • Definitions of variables, including units - also called a data dictionary


METAR

METeorological Aviation Report, a weather observation taken at a certain location, most likely an airfield, for use by pilots and weather forecasters. The METAR coding standard is agreed between civil aviation and weather authorities.

Model

Representation of a real world situation. The word “model” is used in many ways and means different things, depending on the discipline. For example a meteorologist might think of a global climate model, used for weather forecasting, while an agronomist might think of a crop simulation model, used to estimate crop growth and yields. Statistical models form the bedrock of data analysis. A statistical model is a simple description of a process that may have given rise to observed data.
N

Natural Capital

Can be defined as the world's stocks of natural assets which includes soil, water, air, flora and fauna.

Near-infrared

In the electromagnetic spectrum, near-infrared lies between red visible light and infrared. See also Infrared.

Nimbus

A programme of seven NASA missions of Earth Observation satellites, starting in 1964. Nimbus is Latin for rain cloud.

Noise

Noise in data is meaningless data or unexplained variation in data which might be due to instrument errors, corruption or other issues. Noise disguises and/or distorts the underlying data which make it harder to analyse, just as noisy environments make it more difficult to hear the sound on which you wish to focus.

Normal distribution

The normal distribution is used to model some continuous variables. It is a symmetrical bell shaped curve that is completely determined by two parameters. They are the distribution (or population) mean, μ, and the standard deviation, σ.

Numerical Variable

Refers to a variable whose possible values are numbers (as opposed to categories).
O

Observational research data

Research data captured in real time, usually unique and irreplaceable. 

Examples: Weather records, species census surveys.

Ontology

A set of terminology to describe important concepts, often specific to a particular domain or discipline. It's a way of describing a vocabulary that can be shared among practitioners in a field, to allow for easier communication and a standardised way of defining and labelling, for example when writing metadata.

Open Data

Structured data that are accessible, machine-readable, usable, intelligible, and freely shared. Open data can be freely used, re-used, built on, and redistributed by anyone - subject only, at most, to the requirement to attribute and share-alike.

-  CASRAI Dictionary

Open Science

The practice of science in such a way that others can collaborate and contribute, where research data, lab notes and other research processes are freely available, under terms that enable reuse, redistribution and reproduction of the research and its underlying data and methods. 

- FOSTER Open Science Consortium


Open Science encompasses a broad set of practices, including: 

Open Source

Open source software is software whose source code has been made freely available for re-use and modification under an Open Source license. 

There are different types of open source license, but to be truly open source they all conform to the guidelines laid out at the Open Source Initiative

Ordinal variable

An ordinal variable is a categorical variable in which the categories have an obvious order, e.g. (strongly disagree, disagree, neutral, agree, strongly agree), or (dry, trace, light rain, heavy rain).

Outlier

A data point showing an unexpected relationship or large difference to the remainder of the dataset.
P

p-value

The probability value (p-value) of a hypothesis test is the probability of getting a value of the test statistic as extreme, or more extreme, than the one observed, if the null hypothesis is true. Small p-values suggest the null hypothesis is unlikely to be true. The smaller it is, the more convincing is the evidence to reject the null hypothesis. In the pre-computer era it was common to select a particular p-value, (often 0.05 or 5%) and reject H0 if (and only if) the calculated probability was less than this fixed value. Now it is much more common to calculate the exact p-value and interpret the data accordingly.

Parameter

A parameter is a numerical value of a population, such as the population mean. The population values are often modelled from a distribution. Then the shape of the distribution depends on its parameters. For example the parameters of the normal distribution are the mean, μ and the standard deviation, σ. For the binomial distribution, the parameters are the number of trials, n, and the probability of success, θ.

Percentile

The pth percentile of a list is the number such that at least p% of the values in the list are no larger than it. So the lower quartile is the 25th percentile and the median is the 50th percentile. One definition used to give percentiles is that the pth percentile is the (p/100)×(n+1)th observation in the sorted list, interpolating between observations where necessary. For example, with 7 observations, the 25th percentile is the (25/100)×8 = 2nd observation in the sorted list. Similarly, the 20th percentile is the (20/100)×8 = 1.6th observation, i.e. 0.6 of the way between the 1st and 2nd observations.
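The (n+1)-position percentile rule can be written as a short function. This is a sketch of that one definition only; other percentile conventions exist, and the function name and data here are invented for illustration.

```python
# Sketch of the (n + 1)-position percentile rule, with linear
# interpolation between sorted observations.
def percentile(data, p):
    """Return the p-th percentile using the (p/100) * (n + 1) position."""
    xs = sorted(data)
    position = p / 100 * (len(xs) + 1)   # e.g. p=25, n=7 -> position 2.0
    lower = int(position)                # whole part of the position
    frac = position - lower              # fractional part
    if lower < 1:                        # position falls before the 1st value
        return xs[0]
    if lower >= len(xs):                 # position falls after the last value
        return xs[-1]
    # Interpolate between the lower-th and (lower+1)-th sorted values.
    return xs[lower - 1] + frac * (xs[lower] - xs[lower - 1])

data = [3, 7, 8, 5, 12, 14, 21]  # 7 hypothetical observations

print(percentile(data, 25))  # the 2nd sorted observation
print(percentile(data, 50))  # the median (4th sorted observation)
```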

Peta-

Prefix denoting a factor of 10^15 or a million billion

Physical data

Data in the form of physical samples.

Examples: Soil samples, ice cores.

Polar orbiting

A satellite orbit passing above or nearly above both poles on each orbit. Polar orbiting satellites have a lower altitude above the Earth's surface than geostationary satellites and can therefore achieve higher spatial resolution.

Population

A population is a collection of units being studied. This might be the set of all people in a country. Units can be people, places, objects, years, drugs, or many other things. The term population is also used for the infinite population of all possible results of a sequence of statistical trials, for example, tossing a coin. Much of statistics is concerned with estimating numerical properties (parameters) of an entire population from a random sample of units from the population.

Precision

Precision is a measure of how close an estimator is expected to be to the true value of a parameter. Precision is usually expressed in terms of the standard error of the estimator. Less precision is reflected by a larger standard error.

Primary Data

Data that has been created or collected first hand to answer the specific research question.

Proportion

For a variable with n observations, of which the frequency of a particular characteristic is r, the proportion is r/n. For example if the frequency of replanting was 11 times in 55 years, then the proportion was 11/55 = 0.2 of the years, or one fifth of the years. (See also percentages.)

Provenance

In the case of data, the process of tracing and recording the origins of data and its movements between databases. Data's full history, including how and why it got to its present place.

Proxy

In the case of data, other data that you may use and/or transform when you do not have a direct measurement of the data you require.
Q

Qualitative data

This is data regarding 'qualities', which do not take the form of numbers. 

Examples: interview transcripts, ethnographic materials, photos

Quantitative data

Data that is a measure of quantity. It comes in the form of numbers. 

Examples: Measurements, sample analyses, observations and model outputs.

Quartiles

There are three quartiles. To find them, first sort the list into increasing order. The first or lower quartile of a list is a number (not necessarily in the list) such that at least 1/4 of the values in the sorted list are no larger than it, and at least 3/4 are no smaller than it. The second quartile is the median, and the third or upper quartile is defined analogously, with at least 3/4 of the values no larger than it.
R

Range

The range is the difference between the maximum and the minimum values. It is a simple measure of the spread of the data.

Raw Data

Raw data are data that have not been processed for meaningful use. A raw dataset is exactly what is collected, before any data cleaning, processing or analysis has been completed. 

It is often useful to store raw data as well as the cleaned, processed data, as it can help your work to be more easily reproduced. If another researcher has your raw data and the steps you used to process and analyse it, they can recreate your results. This has to be balanced with the cost of storing raw data, and the likelihood of the raw data being useful compared to data that has undergone an initial process of data cleaning.

Reference research data

A static or organic conglomeration or collection of smaller (peer reviewed) datasets, most probably published and curated, e.g. UK Tide Gauge Network, IUCN Red List of Endangered Species

Research Data

The evidence that underpins the answer to the research question

 - UK Concordat on Open Research Data (2016)

Recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings.

 - EPSRC Policy Framework on Research Data

Data that are used as primary sources to support technical or scientific enquiry, research, scholarship, or artistic activity, and that are used as evidence in the research process and/or are commonly accepted in the research community as necessary to validate research findings and results. All other digital and non-digital content have the potential of becoming research data. Research data may be experimental data, observational data, operational data, third party data, public sector data, monitoring data, processed data, or repurposed data.


- CASRAI Dictionary

Research Data Lifecycle

A model to conceptualise the different stages through which data pass during the research process, and the data management activities that relate to those stages.

The model used throughout Data Tree has six stages, corresponding to different activities during the life of a research project. Other institutions or paradigms have slight variations on these stages, but the broad concepts are applicable no matter how you choose to categorise your research activities. 

Our model is based on the UK Data Service model from 2011, and has the following stages: 

  • Re-using Data: Often considered both the start and the end of the cycle. Your research might start by gathering secondary data, and your own research outputs might be later used by yourself or others in different sectors.
  • Creating Data: Data collection or generation activities.
  • Processing Data: The tasks of turning raw data into analysis-ready data. This includes quality control checks, data cleaning and documentation.
  • Analysing Data: Includes data visualisations and statistical analysis; tasks that involve the process of getting information out of your data.
  • Preserving Data: The tasks of putting your data into a location for long-term storage and access, such as a data repository.
  • Making Data Accessible: This includes not just ensuring your data can be accessed, but also making others aware of your data. This might include publishing in a data journal and adding appropriate licences to your preserved data.

RGB

In satellite imagery, the satellite's sensors operate in three separate channels (red, green and blue), which can be combined to give a colour image.
S

Sample

A sample is a group of units, selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions (inferences) about the population. A sample is usually used because the population is too large to study in its entirety. The sample should be representative of the population. This is best achieved by random sampling. The sample is then called a random sample.
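Drawing a simple random sample, as described above, is straightforward with the standard library. The population of numbered survey sites below is hypothetical.

```python
# Sketch: a simple random sample, drawn without replacement, from a
# hypothetical population of 100 numbered survey sites.
import random

random.seed(1)  # fixed seed so the example is repeatable

population = list(range(1, 101))        # site IDs 1..100
sample = random.sample(population, 10)  # random sample of 10 sites

print(sample)
```

Because every unit has the same chance of selection, summaries of the sample (such as its mean) can be used to estimate the corresponding population parameters.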

Sampling Distribution

A sampling distribution describes the probabilities associated with an estimator, when a random sample is drawn from a population. The random sample is considered as one of the many samples that might have been taken. Each would have given a different value for the estimator. The distribution of these different values is called the sampling distribution of the estimator. Deriving the sampling distribution is the first step in calculating a confidence interval, or in conducting a hypothesis test.

Satellite imagery

An image of part of the Earth taken using artificial satellites in orbit around the Earth. These images have a wide variety of uses.

Secondary Data

Existing data which is being reused for a purpose other than the one for which it was collected.

Sentinel satellites

A family of Earth Observation satellite missions by the European Space Agency http://m.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Overview4

Signal to noise ratio

A measure of how much useful information there is in a system. The phrase is applied generally but originated in electrical systems, where it indicates the strength of the information (signal) compared to unwanted interference (noise). A low signal to noise ratio means that it is difficult to determine the useful information.

Simulation research data

Research data generated from test models where the model and metadata may be more important than the output data from the model.

Examples: Climate or ocean circulation models.

Skew

If the distribution (or “shape”) of a variable is not symmetrical about the median or the mean it is said to be skew. The distribution has positive skewness if the tail of high values is longer than the tail of low values, and negative skewness if the reverse is true.

Smart Meter

An energy meter that can digitally send meter readings to your energy supplier and comes with an in-home display unit, allowing you to see in real time how much energy is being used in a household.

Software developer

A person who researches, designs, programs and tests computer code.

Stakeholder

Individuals, groups or organisations that have an interest or share in an undertaking or relationship and its outcome - they may be affected by it, impact or influence it, and in some way be accountable for it.

- CASRAI Dictionary

Standard deviation

The standard deviation (s.d.) is a commonly used summary measure of variation or spread of a set of data. It is a “typical” distance from the mean. Usually, about 70% of the observations are closer than 1 standard deviation from the mean and most (about 95%) are within 2 s.d. of the mean.
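The rule of thumb above (roughly 70% of observations within 1 s.d. of the mean, about 95% within 2 s.d.) can be checked on simulated data. This sketch uses hypothetical normally distributed values; the fractions hold approximately for roughly bell-shaped data, not for every dataset.

```python
# Sketch: checking the 'about 95% within 2 s.d.' rule of thumb on
# simulated, hypothetical normally distributed data.
import random
import statistics

random.seed(0)
data = [random.gauss(50, 10) for _ in range(10_000)]

m = statistics.mean(data)
s = statistics.stdev(data)

# Fraction of observations within 2 standard deviations of the mean.
within_2sd = sum(1 for x in data if abs(x - m) <= 2 * s) / len(data)

print(f"mean = {m:.1f}, s.d. = {s:.1f}, within 2 s.d.: {within_2sd:.3f}")
```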

Standard error

The standard error (s.e.) is a measure of precision. It is a key component of statistical inference. The standard error of an estimator is a measure of how close it is likely to be, to the parameter it is estimating.

Stream processing

The practice of computing over individual data items as they move through a system. This allows for real-time analysis of the data being fed to the system and is useful for time-sensitive operations using high velocity metrics.
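Processing items one at a time as they arrive can be sketched with a Python generator. The sensor values below are hypothetical, and a real streaming system would add buffering, windowing and fault tolerance.

```python
# Sketch: stream processing with generators - each reading is handled as
# it arrives, rather than after the whole dataset has been loaded.
def readings():
    """Yield a (hypothetical) stream of sensor values."""
    for value in [21.5, 22.1, 99.9, 21.8]:
        yield value

def running_max(stream):
    """Emit the maximum seen so far, updated per incoming item."""
    current = float("-inf")
    for value in stream:
        current = max(current, value)
        yield current

result = list(running_max(readings()))
print(result)  # [21.5, 22.1, 99.9, 99.9]
```

Each output value is available as soon as its input arrives, which is what makes this style suitable for time-sensitive, high-velocity data.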
T

Tera-

Prefix denoting a factor of 10^12 or a million million (also a thousand billion)
U

Urban heat island

A built-up area that is warmer than the surrounding rural areas due to human activities.
V

Variance

The variance is a measure of variability, and is often denoted by s2. In simple statistical methods the square root of the variance, s, which is called the standard deviation, is often used more. The standard deviation has the same units as the data themselves and is therefore easier to interpret. The variance becomes more useful in its own right when the contribution of different sources of variation are being assessed. This leads to a presentation called the “analysis of variance”, often written as ANOVA.

Version control

Control over time of data, computer code, software, and documents that allows for the ability to revert to a previous revision, which is critical for data traceability, tracking edits, and correcting mistakes. Version control generates a (changed) copy of a data object that is uniquely labeled with a version number. The intent is to track changes to a data object, by making versioned copies. Note that a version is different from a backup copy, which is typically a copy made at a specific point in time, or a replica. 

- CASRAI Dictionary

Version control is very popular in programming, and many coders use Git or Subversion to track changes to their scripts  and other text-based files. Other, simpler version control systems include things like MS Word's "track changes" feature, and the feature that many cloud storage facilities such as Dropbox and Google Drive have that allows users to revert to previous versions of documents stored in their systems for limited periods.


Visualisation

Representing data visually to enable understanding of its significance; to highlight patterns and trends that might otherwise be missed; to communicate data quickly and in a meaningful way.
W

Weather

Specific atmospheric conditions around us which can change minute-by-minute, day-to-day.

Weather forecast

A prediction of specific future weather conditions, such as daily maximum temperature at a location, up to several days ahead, with the estimate frequently becoming more uncertain with increasing lead-time. Weather forecasts are often based on computer simulations of the atmosphere known as NWP, Numerical Weather Prediction.

Weather observation

Also known as a weather report, is a snapshot of the weather conditions at a certain location and at a certain time. An observation may be as basic as an air temperature reading but can include wind speed and direction, visibility, humidity, precipitation, cloud cover or soil surface temperature. A long-term average of a location's weather observations e.g. 30 years, determines the location's climate.
Y

Yotta-

Prefix denoting a factor of 10^24 or a million billion billion, and the largest unit prefix in the metric system
Z

Zetta-

Prefix denoting a factor of 10^21 or a thousand billion billion