Glossary of Terms
All complex subjects have their own terminology, which can make it hard for newcomers to break into the field. Sometimes this means uncommon words, but more often a subject attaches very specific meanings to common words - the discussion of errors vs mistakes in this video is a good example.
This glossary is a reference of some of the uncommon terms and specific definitions of more common words that you will encounter throughout Data Tree and your broader dealings with data.
Many of these definitions come from the course materials and experts that helped develop Data Tree. Others come from the CASRAI Dictionary. Those definitions are kindly made available under a Creative Commons Attribution 4.0 International License.
A

Absolute value: The value of a number, disregarding its sign. It is denoted by a pair of "|" signs; for example, the absolute value (or modulus) of -2.5 is |-2.5| = 2.5.
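As a quick illustration, Python's built-in abs() function computes the absolute value of a number:

```python
# The built-in abs() returns the absolute value of a number.
print(abs(-2.5))   # 2.5
print(abs(2.5))    # 2.5
print(abs(0))      # 0
```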
Access: The continued, ongoing usability of a digital resource, retaining all qualities of authenticity, accuracy and functionality deemed essential for the purposes for which the digital material was created and/or acquired. Users who have access can retrieve, manipulate, copy, and store copies on a wide range of hard drives and external devices.

Administrative data: Information collected primarily for administrative, not research, purposes. It includes profiles and curriculum vitae of researchers, the scope and impact of research projects, funding, citations, and research outcomes. This type of data is collected by government departments and other organisations for the purposes of registration, transaction and record keeping, usually during the delivery of a service. These data are also recognised as having research value.

Aggregated data

Algorithm

Alternative hypothesis: The alternative hypothesis, H1, is a statement of what the test is set up to establish. For example, if comparing average annual rainfall in El Nino and ordinary years, H1 could be that the two means are unequal (see Hypothesis Test).

Analogue data

Analogue signals

Analytics

Anomaly

Anonymity: A form of privacy that is not usually needed or wanted. There are occasions, however, when a user may want anonymity (for example, to report a crime). The need is sometimes met through the use of a site, called a remailer, that re-posts a message from its own address, thus disguising the originator of the message. Unfortunately, many spam distributors also take advantage of remailers.

Applied science: The application of existing scientific and professional knowledge to develop practical applications in a scientific field (e.g., actuarial science, agriculture, biology, chemistry, forestry, meteorology, physics, planetary and earth sciences), scientific regulation, or patent.

Architecture: Fundamental organisation of a system embodied in its components, their relationships to each other and to the environment, and the principles guiding its design and evolution. The term is not always used in normative or prescriptive ways. In some cases, the architecture may need to be flexible: more of an open framework than a fixed set of components and services that is the same for everyone.

Archive: A place or collection containing static records, documents, or other materials for long-term preservation.

Archiving: A curation activity that ensures that data are properly selected, stored, and can be accessed, and for which logical and physical integrity are maintained over time, including security and authenticity.

At-risk data: Data that are at risk of being lost. At-risk data include data that are not easily accessible, have been dispersed, have been separated from the research output object, are stored on a medium that is obsolete or at risk of deterioration, data that were not recorded in digital form, and digital data that are available but are not useable because they have been detached from supporting data, metadata, and information needed to use and interpret them intelligently.

Average: For a numeric variable, the average is a loosely used term for a measure of location. It is usually taken to be the mean, but it can also denote the median or the mode, among other measures.
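A minimal sketch in Python, using the standard library's statistics module to compute the three most common "averages" for a small invented dataset:

```python
import statistics

# Several different measures can serve as an "average".
data = [2, 3, 3, 5, 7, 10]
print(statistics.mean(data))    # 5.0  (arithmetic mean)
print(statistics.median(data))  # 4.0  (middle value)
print(statistics.mode(data))    # 3    (most frequent value)
```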
B

Best practice: A technique or methodology that, through experience and research, has proven to reliably lead to a desired result. A commitment to using the best practices in any field is a commitment to using all the knowledge and technology at one's disposal to ensure success. The term is used frequently in the fields of health care, government administration, the education system, project management, hardware and software product development, analytical chemistry, and elsewhere.

Big data: 1. An evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that have the potential to be mined for information. 2. Data that would take too much time and cost too much money to load into relational databases for analysis (typically petabytes and exabytes of data). 3. Extensive datasets/collections/linked data primarily characterised by big volume, extensive variety, high velocity (creation and use), and/or variability that together require a scalable architecture for efficient data storage, manipulation, and analysis. In general, the size is beyond the ability of typical database software tools to capture, store, manage and analyse. It is assumed that as technology advances over time, the size of datasets that qualify as big data will increase. The definition can also vary by sector, depending on what kind of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, big data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).

Black box

Born digital: Digital materials which are not intended to have an analogue equivalent, either as the originating source or as a result of conversion to analogue form. This term is used to differentiate them from digitised materials, which are digital copies created from analogue originals (see Digitisation).

Box plot: A graphical representation of numerical data, based on the five-number summary and introduced by John Wilder Tukey in 1970. The diagram has a scale in one direction only. A rectangular box is drawn, extending from the lower quartile to the upper quartile, with the median shown dividing the box. 'Whiskers' are then drawn extending from the ends of the box to the greatest and least values. Multiple box plots, arranged side by side, can be used for the comparison of several samples.
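A minimal matplotlib sketch of side-by-side box plots, with invented sample data for illustration:

```python
import matplotlib.pyplot as plt

# Two invented samples, compared side by side.
sample_a = [12, 15, 14, 10, 18, 20, 13, 16]
sample_b = [22, 19, 25, 30, 21, 24, 27, 23]

# Each box spans the lower to upper quartile, with whiskers beyond.
plt.boxplot([sample_a, sample_b])
plt.xticks([1, 2], ["Site A", "Site B"])
plt.ylabel("Measurement")
plt.show()
```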
Bug: A coding error in a computer program which causes the program to perform in an unintended or unanticipated manner.

C

Catalogue: A type of collection that describes, and points to, features of another collection.

Cataloguing: An intellectual process of describing objects in accordance with accepted library principles, particularly those of subject and classification order.

Categorical variable: A variable with values that range over categories, rather than being numerical. Examples include gender (male, female), paint colour (red, white, blue), and type of animal (elephant, leopard, lion). Some categorical variables are ordinal.

Causation: The capacity of one variable to influence another. The first variable may bring the second into existence or may cause the incidence of the second variable to fluctuate. Related term: correlation.

Change log: Tracks the progress of each change from submission through review, approval, implementation and closure. The log can be managed manually by using a document or spreadsheet, or automatically with a software or Web-based tool.

Checksum: A value used to test whether a file has changed over time. A checksum is a type of metadata and an important property of a data object, allowing its identity and integrity to be verified. Also called a hash, a checksum is a piece of data computed from the contents of a digital object that is used to verify the fixity, or stability, of that object. It is most commonly used to detect whether some representation of a digital object has changed over time. Checksums are associated with PIDs but can be found and tested independently of PID systems.
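A short Python sketch of computing a file checksum with the standard library's hashlib module; the file name here is hypothetical:

```python
import hashlib

def file_checksum(path):
    """Compute a SHA-256 checksum of a file.

    If the file's contents change, the checksum changes too,
    so comparing checksums over time detects modification.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical file name, for illustration only:
# print(file_checksum("rainfall_2019.csv"))
```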
Citable data: A type of referable data that has undergone quality assessment and can be referred to in citations in publications and as part of research objects.

Climate

Climate simulation: Using computer models and quantitative methods to represent the atmosphere, oceans, land, ice and energy budget of the Earth.

Cloud computing: A large-scale distributed computing paradigm that is driven by economies of scale, in which a pool of abstracted, virtualised, dynamically-scalable, managed computing power, storage, platforms and services are delivered on demand to external customers over the Internet.

Cluster computing: Using multiple machines linked together and managing their collective capabilities to complete tasks. Computer clusters require a cluster management layer which handles communication between the individual nodes and coordinates work assignment.

Comma separated values: A file that contains the values in a table as a series of ASCII text lines organised so that each column value is separated by a comma from the next column's value and each row starts a new line.
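A minimal Python sketch of reading a CSV file with the standard library's csv module; the file and column names are hypothetical:

```python
import csv

# "rainfall.csv" is a hypothetical file whose first line is a header row,
# e.g.  station,rainfall_mm
with open("rainfall.csv", newline="") as f:
    reader = csv.DictReader(f)   # maps each row to a dict keyed by header
    for row in reader:
        print(row["station"], row["rainfall_mm"])
```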
Compute intensive: Any computer application that requires a lot of computation, such as meteorology programs and other scientific applications.

Computer code: 1. Computer code, or source code: a series of computer instructions written in some human-readable computer language, usually stored in a text file. Computer code should include explanatory comments. 2. Machine code: source code is 'compiled' or 'interpreted' to produce computer-executable code. 3. A code is a collection of mandatory standards, which has been codified by a governmental authority and thus become part of the law for the jurisdiction represented by that authority. Examples include the National Building Code and the National Electrical Code. Synonyms: code; source code; script.

Confidence interval: A confidence interval gives an estimated range of values that is likely to include an unknown population parameter.
For example, suppose a study of planting dates for maize, where the interest is in estimating the upper quartile, i.e. the date by which a farmer will be able to plant in ¾ of the years. Suppose the estimate from the sample is day 332, i.e. 27th November, and the 95% confidence interval is from day 325 to 339, i.e. 20th November to 4th December. Then the interpretation is that the true upper quartile is highly likely to be within this period.
The width of the confidence interval gives an idea of how uncertain we are about the unknown parameter (see precision). A very wide interval (in the example it is ± 7 days) may indicate that more data need to be collected before an effective analysis can be undertaken.
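As a simpler illustration than the quartile example above, the sketch below computes a 95% confidence interval for a mean using scipy, with invented planting-date data:

```python
import numpy as np
from scipy import stats

# Invented sample of planting dates (day of year), for illustration.
dates = np.array([320, 331, 327, 335, 341, 329, 333, 338, 325, 330])

mean = dates.mean()
sem = stats.sem(dates)  # standard error of the mean

# 95% confidence interval based on the t distribution.
low, high = stats.t.interval(0.95, df=len(dates) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean planting date: day {low:.1f} to day {high:.1f}")
```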
Confidential information: Any information obtained by a person on the understanding that they will not disclose it to others, or obtained in circumstances where it is expected that they will not disclose it. For example, the law assumes that whenever people give personal information to health professionals caring for them, it is confidential as long as it remains personally identifiable.

Confidentiality: 1. The duties and practices of people and organisations to ensure that individuals' personal information only flows from one entity to another according to legislated or otherwise broadly accepted norms and policies. 2. In the context of health data: confidentiality is breached whenever personal information is communicated in a way that is not authorised by legislation, professional obligations, or contractual duties.

Continuous variable: A numeric variable is continuous if the observations may take any value within an interval. Variables such as height, weight and temperature are continuous.
In descriptive statistics the distinction between discrete and continuous variables is not very important: the same summary measures, like the mean, median and standard deviation, can be used.
There is often a bigger difference once inferential methods are used in the analysis. The model that is assumed to generate a discrete variable is different to the models that are appropriate for a continuous variable. Hence different parameters are estimated and used. (See also discrete variable, mixed variable.)

Copernicus: Earth Observation programme of the European Space Agency, primarily using the Sentinel series of satellites, to improve understanding and management of the environment.

Correlation: A statistical measure that indicates the extent to which two or more variables fluctuate together. Correlation does not imply causation: there may be, for example, an unknown factor that influences both variables similarly.

Cryosphere: The part of the Earth system where water is frozen, including glaciers and sea ice.

Curation: The activity of managing and promoting the use of data from their point of creation to ensure that they are fit for contemporary purpose and available for discovery and reuse. For dynamic datasets this may mean continuous enrichment or updating to keep them fit for purpose. Higher levels of curation will also involve links with annotation and with other published materials.

D

Data analysis: A research data lifecycle stage that involves the techniques that produce synthesised knowledge from organised information. A process of inspecting, cleaning, transforming, and modelling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

Data catalogue

Data centre: A facility providing IT services, such as servers, massive storage, and network connectivity. Many data centres act as a data repository for certain types of research data.

Data cleaning: The process of detecting and correcting corrupt or inaccurate records in a dataset. Data cleaning involves identifying, replacing, modifying, or deleting incomplete, incorrect, inaccurate, inconsistent, irrelevant, and improperly formatted data. Typically, the process involves updating, correcting, standardising, and de-duplicating records to create a single view of the data, even if they are stored in multiple disparate systems. - CASRAI Dictionary The most important thing to realise about data cleaning is that it is not a one-time activity: it is a continuous process that requires corrective actions throughout the data lifecycle, and cleaning can (and should!) occur at every stage.

Data curation: A managed process, throughout the data lifecycle, by which data/data collections are cleansed, documented, standardised, formatted and inter-related. This includes versioning data, or forming a new collection from several data sources, annotating with metadata, and adding codes to raw data (e.g., classifying a galaxy image with a galaxy type such as "spiral"). Higher levels of curation involve maintaining links with annotation and with other published materials; thus a dataset may include a citation link to a publication whose analysis was based on the data. The goal of curation is to manage and promote the use of data from its point of creation to ensure it is fit for contemporary purpose and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Special forms of curation may be available in data repositories. The data curation process itself must be documented as part of curation; thus curation and provenance are highly related. - CASRAI Dictionary

Data dictionary: A data dictionary, at its simplest, is a list and description of every variable within a dataset, including information such as the units of measurement and what the variable represents. For more complex datasets, including multi-level or larger database structures, the data dictionary also includes descriptions of the relationships between tables, and for categorical data with a pre-defined set of possible options (e.g. an "enum" datatype in SQL, data coming from a "select" question in a survey, or other defined data such as days of the week), the data dictionary should also include the list of all possible values.
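As an illustration, a tiny data dictionary for a hypothetical rainfall dataset could be written down as a simple structure like this (all field names invented):

```python
# A minimal data dictionary: each variable is described with its type,
# units and meaning; categorical fields also list their allowed values.
data_dictionary = {
    "station_id": {"type": "string", "units": None,
                   "description": "Unique identifier for the weather station"},
    "date":       {"type": "date (ISO 8601)", "units": None,
                   "description": "Day on which the observation was made"},
    "rainfall":   {"type": "float", "units": "mm",
                   "description": "Total rainfall recorded during the day"},
    "quality":    {"type": "categorical", "units": None,
                   "description": "Quality-control flag for the reading",
                   "allowed_values": ["good", "suspect", "missing"]},
}
```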
Data dredging: The practice of analysing large volumes of data, seeking possible relationships without any pre-defined hypothesis or goal. This practice is sometimes employed when working with 'big data' and often leads to premature conclusions. When working with big data, it is very easy to find statistically significant results, but it is much harder to spot whether those results are actually meaningful, especially without a specific hypothesis. Data dredging is sometimes described as "seeking more information from a dataset than it actually contains" (CASRAI).

Data driven: Analysis and decision making led by the numbers, facts and statistical analysis, rather than intuition or experience.

Data exploration: Data exploration, often called exploratory analysis, uses descriptive statistical methods to learn about and understand the characteristics of a dataset. This includes exploring measures of central tendency (e.g. mean, median) and measures of spread (standard deviation, range, variance). It might also include exploring the structure of the data, for example splitting the dataset by a categorical variable, or creating visualisations to view the data in different ways. This stage of analysis is often where a lot of data cleaning happens, as you can often spot missing data or outliers during this process.

Data file format: The layout of a file in terms of how the data within the file are organised. A program that uses the data in a file must be able to recognise and possibly access data within the file. A particular file format is often indicated as part of a file's name by a filename extension (suffix). Conventionally, the extension is separated by a period from the name and contains three or four letters that identify the format. There are as many different file formats as there are different programs to process the files. Examples include: Word documents (.doc), Web text pages (.htm or .html), Web page images (.gif and .jpg), Adobe Postscript files (.ps), Adobe Acrobat files (.pdf), executable programs (.exe), and multimedia files (.mp3 and others). Preferred formats are those designated by a data repository for which the digital content is maintained. If a data file is not in a preferred format, a data curator will often convert the file into a preferred format, thus ensuring that the digital content remains readable and usable. Usually, preferred formats are the de facto standard employed by a particular community. - CASRAI Dictionary

Data integrity: 1. In the context of data and network security: the assurance that information can only be accessed or modified by those authorised to do so. 2. In the context of data quality: the assurance that the data are clean, traceable, and fit for purpose. - CASRAI Dictionary

Data management plan: A formal statement describing how research data will be managed and documented throughout a research project and the terms regarding the subsequent deposit of the data with a data repository for long-term management and preservation. - CASRAI Dictionary

Data mining: The process of analysing multivariate datasets using pattern recognition or other knowledge discovery techniques to identify potentially unknown and potentially meaningful data content, relationships, classifications, or trends. Data mining approaches include: association (looking for patterns where one event is connected to another event); sequence or path analysis (looking for patterns where one event leads to another later event); classification (looking for new patterns); clustering (finding and visually documenting groups of facts not previously known); and forecasting, or predictive analytics (discovering patterns in data that can lead to reasonable predictions about the future). - CASRAI Dictionary

Data point: One measurement, observation or element; a single member of a larger dataset.

Data quality: The reliability and application efficiency of data. It is a perception or an assessment of a dataset's fitness to serve its purpose in a given context. Aspects of data quality include: accuracy, completeness, update status, relevance, consistency across data sources, reliability, appropriate presentation, and accessibility. Within an organisation, acceptable data quality is crucial to operational and transactional processes and to the reliability of analytics, business intelligence, and reporting. Data quality is affected by the way data are entered, stored and managed. Maintaining data quality requires going through the data periodically and scrubbing them. Typically this involves updating, standardising, and de-duplicating records to create a single view of the data, even if they are stored in multiple disparate systems. - CASRAI Dictionary

Data quality assurance: Data quality assurance (DQA) is the process of verifying the reliability and overall quality of data. - CASRAI Dictionary It is a process that should ideally be planned in advance and integrated into your entire project workflow, from creation/sourcing of data, through processing and analysis, to storage and sharing of data. Data quality checklists can be used to help identify potential issues with a dataset before you begin your exploratory analysis work.

Data quality checklist: A data quality checklist is a list of possible issues with a dataset. This list can be created before you start exploring your data, to help streamline your data cleaning. If there are physical or logical boundaries that your data should conform to, such as humidity not being above 100%, or age always being 0 or greater, these can form part of your checklist. As such, there is often a strong relationship between a data quality checklist and the data dictionary for your dataset.

Data repository: 1. A storage location for a collection of data that is too valuable to discard but is only accessed occasionally. 2. An archival service providing long-term permanent care of, and accessibility to, digital objects with research value. The standard for such repositories is the Open Archival Information System reference model.

Data scientist: A person who has the knowledge and skills to conduct sophisticated and systematic analyses of data. A data scientist extracts insights from datasets for research or product development, and evaluates and identifies novel or strategic relationships or opportunities.

Data warehouse: Large, ordered repositories of data that can be used for analysis and reporting. In contrast to a data lake, a data warehouse is composed of data that have been cleaned, integrated with other sources, and are generally well-ordered. Data warehouses are often spoken about in relation to big data, but typically are components of more conventional systems.

Dataset: Any organised collection of data in a computational format, defined by a theme or category that reflects what is being measured/observed/monitored. - CASRAI Dictionary

Datetime: A standard way to express a numeric calendar date that eliminates ambiguity, with acceptable formats defined by ISO 8601. ISO 8601 is applicable whenever representation of dates in the Gregorian calendar, times in the 24-hour timekeeping system, time intervals and recurring time intervals, or the formats of these representations, are included in information interchange. It includes calendar dates expressed in terms of calendar year, calendar month and calendar day of the month; ordinal dates expressed in terms of calendar year and calendar day of the year; week dates expressed in terms of calendar year, calendar week number and calendar day of the week; local time based upon the 24-hour timekeeping system; Coordinated Universal Time of day; local time and the difference from Coordinated Universal Time; combined date and time of day; time intervals; and recurring time intervals. - CASRAI Dictionary
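A quick Python illustration of writing and parsing ISO 8601 datetimes with the standard library:

```python
from datetime import datetime, timezone

# An unambiguous ISO 8601 timestamp in Coordinated Universal Time.
moment = datetime(2019, 11, 27, 14, 30, tzinfo=timezone.utc)
print(moment.isoformat())   # 2019-11-27T14:30:00+00:00

# Parsing an ISO 8601 string back into a datetime object.
parsed = datetime.fromisoformat("2019-11-27T14:30:00+00:00")
print(parsed.year, parsed.month, parsed.day)   # 2019 11 27
```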
Demonstrator: A one-off system, often software, that shows whether or how data can be used for a specific purpose or task.

Derived research data: Research data resulting from processing or combining 'raw' data; often reproducible, but expensive. Examples: compiled databases, text mining, aggregate census data.

Descriptive statistics: If you have a large set of data, then descriptive statistics provides graphical (e.g. box plots) and numerical (e.g. summary tables, means, quartiles) ways to make sense of the data. The branch of statistics devoted to the exploration, summary and presentation of data is called descriptive statistics.
If you need to do more than descriptive summaries and presentations, the next step is usually to use the data to make inferences about some larger population. Inferential statistics is the branch of statistics devoted to making such generalisations.
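A minimal sketch of descriptive statistics with pandas, using a small invented dataset:

```python
import pandas as pd

# Invented dataset of daily rainfall at two sites.
df = pd.DataFrame({
    "site": ["A", "A", "A", "B", "B", "B"],
    "rainfall_mm": [0.0, 12.5, 3.2, 7.1, 0.0, 25.4],
})

# Numerical summaries: count, mean, std, min, quartiles, max.
print(df["rainfall_mm"].describe())

# Split the summary by a categorical variable.
print(df.groupby("site")["rainfall_mm"].mean())
```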
Digital Object Identifier: A name (not a location) for an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks. A DOI is a type of Persistent Identifier (PID) issued by the International DOI Foundation. This permanent identifier is associated with a digital object, permitting it to be referenced reliably even if its location and metadata undergo change over time. - CASRAI Dictionary

Digitisation: The process of creating digital files by scanning or otherwise converting analogue materials. The resulting digital copy, or digital surrogate, is then classed as digital material and is subject to the same broad challenges involved in preserving access to it as "born digital" materials.

Discrete variable: A set of data is discrete if the values belonging to it are distinct, i.e. they can be counted. Examples are the number of children in a family, the number of rain days in the month, and the length (in days) of the longest dry spell in the growing season. (See also continuous variable for a more complete discussion.)

Disseminative visualisation: Data visualisation designed as a presentational aid for disseminating information or insight, with no purpose other than communication.

Dublin Core: An initiative to create a digital "library card catalog" for the Web. Dublin Core is made up of 15 metadata elements that offer expanded cataloguing information and improved document indexing for search engine programs. The 15 elements are: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage and Rights.
E

e-Infrastructure: A combination and interworking of digitally-based technology (hardware and software), resources (data, services, digital libraries), communications (protocols, access rights and networks), and the people and organisational structures needed to support modern, internationally leading collaborative research, be it in the arts and humanities or the sciences. http://www.rcuk.ac.uk/research/xrcprogrammes/otherprogs/einfrastructure/

E-Research: Computationally intensive, large-scale, networked and collaborative forms of research and scholarship across all disciplines, including all of the natural and physical sciences, related applied and technological disciplines, biomedicine, social science and the digital humanities. - CASRAI Dictionary

Earth Observation: Gathering information about the Earth's physical systems via remote sensing technologies, often satellites which look down at the Earth from their orbit.

Electromagnetic spectrum: The range of wavelengths of electromagnetic radiation, from gamma rays with short wavelengths and high energy to radio waves with long wavelengths and low energy. Visible light is part of the electromagnetic spectrum. Examples of the use of electromagnetic radiation: https://www.bbc.co.uk/education/guides/z66g87h/revision/3

ENIAC: Electronic Numerical Integrator And Computer, the world's first general-purpose computer, designed and built to calculate artillery firing tables in the 1940s and later used for computer weather predictions. https://www.thoughtco.com/history-of-the-eniac-computer-1991601

Ensemble

Environmental analytics: Analysis of data sourced from the environment, or data with an application relating to the environment.

Environmental consultant: Works on a contractual basis for private and public sector clients, addressing environmental issues such as water pollution, air quality and soil contamination. [www.sokanu.com]

Environmental research data: Individual items or records (both digital and analogue) usually obtained by measurement, observation or modelling of the natural world and the impact of humans upon it, including all necessary calibration and quality control. This includes data generated through complex systems, such as information retrieval algorithms, data assimilation techniques and the application of numerical models. However, it does not include the models themselves. - NERC Data Policy

Error: Error is the difference between the measured value and the 'true value' (NPL, 1999). Errors can come from the measuring device itself, including bias, changes due to wear, instrument drift, electrical noise and device resolution. Other errors can be introduced by difficulties in performing the measurement and by operator skill. To avoid sampling error, sufficiently dense measurements in space and time should take place to make sure that full variability is captured, e.g. diurnal cycles, variations across a site. Errors can be random or systematic (NPL, 1999). With random errors, each measurement gives a different result, so the more measurements (of the same thing) the better the estimate and the more certain the measurement becomes. Systematic errors arise from a bias, e.g. a stretched tape measure, and more measurements do not produce a better estimate of the 'true value'.

Estimation: Estimation is the process by which sample data are used to indicate the value of an unknown quantity in a population.
The results of estimation can be expressed as a single value, known as a point estimate. It is usual to also give a measure of precision of the estimate, called the standard error of the estimate.
A range of values, known as a confidence interval, can also be given.

Estimator: An estimator is a quantity calculated from the sample data, which is used to give information about an unknown quantity (usually a parameter) in the population. For example, the sample mean is an estimator of the population mean.

Exa-: Prefix denoting a factor of 10¹⁸, or a billion billion.

Experimental research data: Research data from experimental results; often reproducible, but can be expensive. Examples: data from lab equipment, metagenomic sequences recovered from soil samples, results of a field experiment.
F

FAIR Data Principles: Data that follow the FAIR principles must be Findable, Accessible, Interoperable and Reusable. FAIR is a set of guiding principles for data management and stewardship designed by stakeholders representing interests in academia, industry, funding agencies and scholarly publishers. The FAIR principles define a set of core enabling conditions which, if fulfilled for a given set of data, would ensure that they remain accessible and re-usable over the long term. A key element of these principles is the focus on the use of structured information and persistent identifiers to enable machine discoverability and use of the data. The full set of principles is published in the article "The FAIR Guiding Principles for scientific data management and stewardship". - DOI: https://doi.org/10.1038/sdata.2016.18

Fair use: A legal concept that allows the reproduction of copyrighted material for certain purposes without obtaining permission and without paying a fee or royalty. Purposes permitting the application of fair use generally include review, news reporting, teaching, or scholarly research. When in doubt, the quickest and simplest thing may be to request permission of the copyright owner. - CASRAI Dictionary

Flop: FLOating Point operation, a single calculation on a number with a decimal point, i.e. not an integer. In computing, floating point operations per second (FLOPS) is a measure of computer performance, useful in fields of scientific computation that require floating-point calculations.

FTP: File Transfer Protocol, a standardised set of rules to allow upload and download of files between two computers, commonly used for exchanging files over the Internet.

G

Geostationary: A satellite whose orbit matches the Earth's rotation above the equator, so that it appears to remain stationary, viewing the same portion of the Earth's surface. Geostationary orbits are often used for TV or radio broadcasting and some meteorological satellites.

Giga-: Prefix denoting a factor of 10⁹, or a billion.
H

Heat map: A two-dimensional representation of data in which values are represented by colours. Heat maps communicate relationships between data values that would be much more difficult to understand if presented numerically in a spreadsheet. - CASRAI Dictionary

Hypothesis test: Testing hypotheses is a common part of statistical inference. To formulate a test, the question of interest is simplified into two competing hypotheses, between which we have a choice: the null hypothesis, denoted by H0, against the alternative hypothesis, denoted by H1.
For example, with 50 years of annual rainfall totals, a hypothesis test could be whether the mean is different in El Nino and ordinary years. Then usually:
• The null hypothesis, H0, is that the two means are equal, i.e. there is no difference.
• The alternative hypothesis, H1, is that the two means are unequal, i.e. there is a difference.
If the 50 years were considered as being of three types - El Nino, ordinary, La Nina - then usually:
• The null hypothesis, H0, is that all three means are equal.
• The alternative hypothesis, H1, is that there is a difference somewhere between the means.
The hypotheses are often statements about population parameters. In the first example above it might be:
• H0: µE = µO.
• H1: µE ≠ µO.
The outcome of a hypothesis test is either:
• Reject H0 in favour of H1, or
• Do not reject H0.
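A minimal sketch of this kind of test in Python, using a two-sample t-test from scipy on invented rainfall totals:

```python
import numpy as np
from scipy import stats

# Invented annual rainfall totals (mm) for El Nino and ordinary years.
el_nino = np.array([820, 790, 860, 905, 775, 840])
ordinary = np.array([700, 735, 690, 760, 720, 745, 710])

# Two-sample t-test of H0: the two means are equal,
# against H1: the two means are unequal.
t_stat, p_value = stats.ttest_ind(el_nino, ordinary)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")
# A small p-value is evidence against H0 (see the p-value entry).
```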
I

Inference: Inference is the process of deducing properties of the underlying distribution or population by analysis of data. It is the process of making generalisations from the sample to a population.

Informatics: The science of collecting, classifying, storing, retrieving and disseminating data and/or knowledge.

Information: The aggregation of data to make coherent observations about the world; meaningful data, or data arranged or interpreted in a way to provide meaning. - CASRAI Dictionary It is often considered the job of the scientist, researcher or statistician to derive information from raw data.

Infrared: In the electromagnetic spectrum, the visible light region runs from violet, at shorter wavelengths and higher energies, to red, at longer wavelengths and lower energies. Infrared radiation has wavelengths just longer than red light and is emitted particularly by heated objects. For example, night vision goggles detect infrared radiation.

Integer: A number which is not a fraction; a whole number.

Inter-quartile range

Internet of Things: Abbreviated to IoT, a broad term for devices interconnected via the internet, enabling the sending and receiving of data or instructions. These devices include everyday items such as home appliances or cameras. Sometimes also referred to as 'smart devices'.

Interoperability: The capability to communicate, execute programs, or transfer data among various functional units in a useful and meaningful manner that requires the user to have little or no knowledge of the unique characteristics of those units. Foundational, syntactic, and semantic interoperability are the three necessary aspects of interoperability. - CASRAI Dictionary

J

JASMIN: Petabyte-scale, easily accessible storage collocated with data analysis computing facilities, run by the Science and Technology Facilities Council for researchers and the science community in the UK. http://www.jasmin.ac.uk/

K

Kilo-: Prefix denoting a factor of 10³, or a thousand.

M

Machine learning: The study and practice of designing systems that can learn, adjust, and improve automatically, based on the data fed to them. This typically involves implementation of predictive and statistical algorithms that focus on 'correct' behaviour and insights as data flows through the system.

MapReduce: A big data algorithm for scheduling work on a computing cluster. The process involves splitting the problem up and mapping the pieces to different nodes, which compute over them to produce intermediate results (map); shuffling the results so that like keys are grouped together; and then reducing the results by outputting a single value for each group (reduce).
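A toy, single-machine sketch of the map-shuffle-reduce pattern in Python (real MapReduce frameworks distribute these steps across cluster nodes):

```python
from collections import defaultdict

# A classic illustration of the pattern: counting words.
documents = ["the cat sat", "the dog sat", "the cat ran"]

# Map: emit (key, value) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine each group into a single value.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'the': 3, 'cat': 2, 'sat': 2, 'dog': 1, 'ran': 1}
```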
Mega-: Prefix denoting a factor of 10⁶, or a million.

Metadata: Background or contextual data about a dataset; literally, "data about data". Metadata is required to enable someone to properly understand and interpret a main dataset.

METAR: METeorological Aviation Report, a weather observation taken at a certain location, most likely an airfield, for use by pilots and weather forecasters. The METAR coding standard is agreed between civil aviation and weather authorities.

Model: A representation of a real-world situation.
The word "model" is used in many ways and means different things depending on the discipline. For example, a meteorologist might think of a global climate model, used for weather forecasting, while an agronomist might think of a crop simulation model, used to estimate crop growth and yields.
Statistical models form the bedrock of data analysis. A statistical model is a simple description of a process that may have given rise to observed data.

N

Natural capital: The world's stocks of natural assets, which include soil, water, air, flora and fauna.

Near-infrared: In the electromagnetic spectrum, near-infrared lies between red visible light and infrared. See also infrared.

Nimbus: A programme of seven NASA missions of Earth Observation satellites, starting in 1964. Nimbus is Latin for rain cloud.

Noise: Noise in data is meaningless data or unexplained variation in data, which might be due to instrument errors, corruption or other issues. Noise disguises and/or distorts the underlying data, making them harder to analyse, just as noisy environments make it more difficult to hear the sound on which you wish to focus.

Normal distribution: The normal distribution is used to model some continuous variables. It is a symmetrical, bell-shaped curve that is completely determined by two parameters: the distribution (or population) mean, μ, and the standard deviation, σ.
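A quick illustration with scipy: for a normal distribution, roughly 68% of values lie within one standard deviation of the mean and roughly 95% within two:

```python
from scipy import stats

# A normal distribution is fully determined by its mean and standard deviation.
mu, sigma = 100, 15
dist = stats.norm(loc=mu, scale=sigma)

# Probability of an observation falling within 1 and 2 s.d. of the mean.
print(dist.cdf(mu + sigma) - dist.cdf(mu - sigma))          # ~0.683
print(dist.cdf(mu + 2 * sigma) - dist.cdf(mu - 2 * sigma))  # ~0.954
```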
Numerical variable: A variable whose possible values are numbers (as opposed to categories).

O

Observational research data: Research data captured in real time, usually unique and irreplaceable. Examples: weather records, species census surveys.

Ontology: A set of terminology to describe important concepts, often specific to a particular domain or discipline. It is a way of describing a vocabulary that can be shared among practitioners in a field, to allow for easier communication and a standardised way of defining and labelling, for example when writing metadata.

Open data: Structured data that are accessible, machine-readable, usable, intelligible, and freely shared. Open data can be freely used, re-used, built on, and redistributed by anyone - subject only, at most, to the requirement to attribute and share-alike. - CASRAI Dictionary

Open Science: The practice of science in such a way that others can collaborate and contribute, where research data, lab notes and other research processes are freely available, under terms that enable reuse, redistribution and reproduction of the research and its underlying data and methods. - FOSTER Open Science Consortium Open Science encompasses a broad set of practices, including open access to publications, open research data and open source software.

Open source: Open source software is software whose source code has been made freely available for re-use and modification under an open source license. There are different types of open source license, but to be truly open source they all conform to the guidelines laid out by the Open Source Initiative.

Ordinal variable: An ordinal variable is a categorical variable in which the categories have an obvious order, e.g. (strongly disagree, disagree, neutral, agree, strongly agree), or (dry, trace, light rain, heavy rain).

Outlier: A data point showing an unexpected relationship or large difference to the remainder of the dataset.

P

p-value: The probability value (p-value) of a hypothesis test is the probability of getting a value of the test statistic as extreme as, or more extreme than, the one observed, if the null hypothesis is true.
Small p-values suggest the null hypothesis is unlikely to be true. The smaller the p-value, the more convincing is the evidence to reject the null hypothesis.
In the pre-computer era it was common to select a particular p-value (often 0.05, or 5%) and reject H0 if (and only if) the calculated probability was less than this fixed value. Now it is much more common to calculate the exact p-value and interpret the data accordingly.

Parameter: A parameter is a numerical value of a population, such as the population mean. The population values are often modelled by a distribution, whose shape depends on its parameters. For example, the parameters of the normal distribution are the mean, μ, and the standard deviation, σ. For the binomial distribution, the parameters are the number of trials, n, and the probability of success, θ.

Percentile: The pth percentile of a list is the number such that at least p% of the values in the list are no larger than it. So the lower quartile is the 25th percentile and the median is the 50th percentile. One definition used to calculate percentiles is that the pth percentile is the (p/100)×(n+1)th observation, interpolating between observations where necessary. For example, with 7 observations, the 25th percentile is the (25/100)×8 = 2nd observation in the sorted list. Similarly, the 20th percentile is the (20/100)×8 = 1.6th observation, i.e. 0.6 of the way from the 1st to the 2nd observation.
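A short numpy illustration; note that numpy supports several percentile definitions, and method="weibull" corresponds to the (n+1) rule described above:

```python
import numpy as np

data = [2, 4, 4, 5, 7, 9, 11]  # 7 sorted observations

# method="weibull" (numpy >= 1.22) uses the (p/100)*(n+1) rule above.
print(np.percentile(data, 25, method="weibull"))  # 2nd observation -> 4.0
print(np.percentile(data, 20, method="weibull"))  # 1.6th observation -> 3.2
```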
Peta-: Prefix denoting a factor of 10¹⁵, or a million billion.

Physical data: Data in the form of physical samples. Examples: soil samples, ice cores.

Polar orbiting: A satellite orbit passing above, or nearly above, both poles on each orbit. Polar orbiting satellites have a lower altitude above the Earth's surface than geostationary satellites and therefore give higher-resolution views.

Population: A population is a collection of units being studied. This might be the set of all people in a country. Units can be people, places, objects, years, drugs, or many other things. The term population is also used for the infinite population of all possible results of a sequence of statistical trials, for example, tossing a coin.
Much of statistics is concerned with estimating numerical properties (parameters) of an entire population from a random sample of units from the population.

Precision: Precision is a measure of how close an estimator is expected to be to the true value of a parameter. Precision is usually expressed in terms of the standard error of the estimator: less precision is reflected by a larger standard error.

Primary data: Data that have been created or collected first hand to answer the specific research question.

Proportion: For a variable with n observations, of which the frequency of a particular characteristic is r, the proportion is r/n. For example, if the frequency of replanting was 11 times in 55 years, then the proportion was 11/55 = 0.2 of the years, or one fifth of the years. (See also percentages.)

Provenance: In the case of data, the process of tracing and recording the origins of data and their movements between databases; data's full history, including how and why it got to its present place.

Proxy: In the case of data, other data that you may use and/or transform when you do not have a direct measurement of the data you require.

Q

Qualitative data: Data regarding 'qualities', which do not take the form of numbers. Examples: interview transcripts, ethnographic materials, photos.

Quantitative data

Quartiles: There are three quartiles. To find them, first sort the list into increasing order. The first, or lower, quartile is a number (not necessarily in the list) such that at least 1/4 of the values in the sorted list are no larger than it, and at least 3/4 are no smaller than it. The second quartile is the median, and the third, or upper, quartile is defined analogously, with at least 3/4 of the values no larger than it.

R

Range: The range is the difference between the maximum and the minimum values. It is a simple measure of the spread of the data.

Raw data: Raw data are data that have not been processed for meaningful use. A raw dataset is exactly what was collected, before any data cleaning, processing or analysis has been completed. It is often useful to store raw data as well as the cleaned, processed data, as this can help your work to be more easily reproduced: if another researcher has your raw data and the steps you used to process and analyse them, they can recreate your results. This has to be balanced against the cost of storing raw data, and the likelihood of the raw data being useful compared to data that have undergone an initial process of data cleaning.

Reference research data: A static or organic conglomeration or collection of smaller (peer reviewed) datasets, most probably published and curated, e.g. the UK Tide Gauge Network, the IUCN Red List of Endangered Species.

Research data: The evidence that underpins the answer to the research question. - UK Concordat on Open Research Data (2016) Recorded factual material commonly retained by and accepted in the scientific community as necessary to validate research findings. - EPSRC Policy Framework on Research Data Data that are used as primary sources to support technical or scientific enquiry, research, scholarship, or artistic activity, and that are used as evidence in the research process and/or are commonly accepted in the research community as necessary to validate research findings and results. All other digital and non-digital content have the potential of becoming research data. Research data may be experimental data, observational data, operational data, third party data, public sector data, monitoring data, processed data, or repurposed data. - CASRAI Dictionary

Research data lifecycle: A model to conceptualise the different stages through which data pass during the research process, and the data management activities that relate to those stages. The model used throughout Data Tree has six stages, corresponding to different activities during the life of a research project, and is based on the UK Data Service model from 2011. Other institutions or paradigms have slight variations on these stages, but the broad concepts are applicable no matter how you choose to categorise your research activities.

RGB: In satellite imagery, the satellite's sensors operate in three separate channels - red, green and blue - which can be combined to give a colour image.

S

Sample: A sample is a group of units selected from a larger group (the population). By studying the sample it is hoped to draw valid conclusions (inferences) about the population.
A sample is usually used because the population is too large to study in its entirety. The sample should be representative of the population; this is best achieved by random sampling. The sample is then called a random sample.

Sampling distribution: A sampling distribution describes the probabilities associated with an estimator when a random sample is drawn from a population. The random sample is considered as one of the many samples that might have been taken; each would have given a different value for the estimator, and the distribution of these different values is called the sampling distribution of the estimator. Deriving the sampling distribution is the first step in calculating a confidence interval or in conducting a hypothesis test.
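A minimal simulation sketch of a sampling distribution: draw many random samples, compute the mean of each, and examine the spread of those means:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Simulate the sampling distribution of the mean: draw many samples
# of size 30 from a population and record each sample's mean.
population_mean, population_sd, n = 50, 10, 30
sample_means = [rng.normal(population_mean, population_sd, n).mean()
                for _ in range(10_000)]

# The spread of the sample means approximates the standard error, sd/sqrt(n).
print(np.std(sample_means))   # close to 10 / sqrt(30) = 1.83
```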
Satellite imagery: Images of parts of the Earth taken using artificial satellites in orbit around the Earth. These images have a variety of uses.

Secondary data: Existing data which are being reused for a purpose other than the one for which they were collected.

Sentinel satellites: A family of Earth Observation satellite missions by the European Space Agency. http://m.esa.int/Our_Activities/Observing_the_Earth/Copernicus/Overview4

Signal to noise ratio: A measure of how much useful information there is in a system. The phrase is applied generally but originated in electrical systems, where it indicates the strength of the information (signal) compared with unwanted interference (noise). A low signal to noise ratio means that it is difficult to pick out the useful information.

Simulation research data: Research data generated from test models, where the model and metadata may be more important than the output data from the model. Examples: climate or ocean circulation models.

Skew: If the distribution (or "shape") of a variable is not symmetrical about the median or the mean, it is said to be skew. The distribution has positive skewness if the tail of high values is longer than the tail of low values, and negative skewness if the reverse is true.

Smart meter: A kind of energy meter that can digitally send meter readings to your energy supplier and comes with an in-home display unit, so you can see in real time how much energy is being used in a household.

Software developer: A person who researches, designs, programs and tests computer code.

Stakeholder: Individuals, groups or organisations that have an interest or share in an undertaking or relationship and its outcome - they may be affected by it, impact or influence it, and in some way be accountable for it. - CASRAI Dictionary

Standard deviation: The standard deviation (s.d.) is a commonly used summary measure of variation, or spread, of a set of data. It is a "typical" distance from the mean. Usually, about 70% of the observations are closer than 1 standard deviation from the mean and most (about 95%) are within 2 s.d. of the mean.
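A quick Python illustration with the standard library's statistics module, using a small invented dataset:

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 6, 9]
mean = statistics.mean(data)   # 6.0
sd = statistics.stdev(data)    # sample standard deviation, 2.0

# Count how many observations lie within 1 s.d. of the mean.
within_one_sd = sum(1 for x in data if abs(x - mean) <= sd)
print(within_one_sd / len(data))   # 0.75, close to the "about 70%" rule
```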
Standard error

Stream processing: The practice of computing over individual data items as they move through a system. This allows for real-time analysis of the data being fed to the system and is useful for time-sensitive operations using high velocity metrics.

T

Tera-: Prefix denoting a factor of 10¹², or a thousand billion (a million million).

U

Urban heat island: A built-up area that is warmer than the surrounding rural areas due to human activities.

V

Variance: The variance is a measure of variability, often denoted by s². In simple statistical methods, the square root of the variance, s, which is called the standard deviation, is used more often; the standard deviation has the same units as the data themselves and is therefore easier to interpret. The variance becomes more useful in its own right when the contributions of different sources of variation are being assessed. This leads to a presentation called the "analysis of variance", often written as ANOVA.

Version control: Control, over time, of data, computer code, software, and documents that allows for the ability to revert to a previous revision, which is critical for data traceability, tracking edits, and correcting mistakes. Version control generates a (changed) copy of a data object that is uniquely labelled with a version number. The intent is to track changes to a data object by making versioned copies. Note that a version is different from a backup copy, which is typically a copy made at a specific point in time, or a replica. - CASRAI Dictionary Version control is very popular in programming, and many coders use Git or Subversion to track changes to their scripts and other text-based files. Other, simpler version control systems include things like MS Word's "track changes" feature, and the feature of many cloud storage services, such as Dropbox and Google Drive, that allows users to revert to previous versions of documents stored in their systems for limited periods.

Visualisation: Representing data visually to enable understanding of its significance; to highlight patterns and trends that might otherwise be missed; to communicate data quickly and in a meaningful way.

W

Weather: The specific atmospheric conditions around us, which can change minute-by-minute and day-to-day.

Weather forecast: A prediction of specific future weather conditions, such as the daily maximum temperature at a location, up to several days ahead, with the estimate frequently becoming more uncertain with increasing lead time. Weather forecasts are often based on computer simulations of the atmosphere known as NWP, Numerical Weather Prediction.

Weather observation: Also known as a weather report; a snapshot of the weather conditions at a certain location and at a certain time. An observation may be as basic as an air temperature reading but can include wind speed and direction, visibility, humidity, precipitation, cloud cover or soil surface temperature. A long-term average of a location's weather observations, e.g. over 30 years, determines the location's climate.

Y

Yotta-: Prefix denoting a factor of 10²⁴, or a million billion billion; the largest unit prefix in the metric system.

Z

Zetta-: Prefix denoting a factor of 10²¹, or a thousand billion billion.