Data cleaning is a continuous process that requires corrective actions throughout the data lifecycle. Data cleaning is the process of detecting and correcting corrupt or inaccurate records from a dataset. Data cleaning involves identifying, replacing, modifying, or deleting incomplete, incorrect, inaccurate, inconsistent, irrelevant, and improperly formatted, data. Typically, the process involves updating, correcting, standardising, and de-duplicating records to create a single view of the data, even if they are stored in multiple disparate systems.
- CASRAI Dictionary
The most important thing to realise about data cleaning is that it is not just a one-time activity. Cleaning can (and should!) occur at every stage of the research data lifecycle.