Data science is the theory and practice powering the data-driven transformations we are seeing across industry and society today. Artificial intelligence (AI), self-driving cars, and predictive analytics are just a few of the breakthroughs that have been made thanks to our ever-growing ability to collect and analyze data.
Just as with big data (see my terminology pieces here: Part I, Part II), the field of data science has developed its own lexicon, which can be confusing for beginners. But an understanding of the basic terminology and frequently used words is essential for anyone thinking about how to apply this technology. So here is Part I of my run-through of some of the technologies, phrases, and buzzwords you are likely to come across in the field of data science. (Part II will follow shortly.)
Anonymization
When carrying out scientific data analysis using personal data (data that identifies a person), anonymization refers to the process of removing or obfuscating indicators in the data that show who it specifically refers to. This is not always as simple as it sounds, as people can be identified by more than just their name. Properly anonymized data is no longer considered “personal,” and there are commonly fewer legal and ethical restrictions on how it can be used.
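As a rough illustration of removing and obfuscating identifiers, here is a minimal Python sketch. The record layout (`name`, `email`, `customer_id`, `age`) and the `anonymize` helper are hypothetical and chosen only for this example. Note that replacing an ID with a salted hash is strictly pseudonymization; real anonymization usually needs more care, because combinations of remaining fields can still identify a person.

```python
import hashlib

# Hypothetical field names, assumed only for this illustration.
DIRECT_IDENTIFIERS = {"name", "email"}

def anonymize(record, salt):
    """Drop direct identifiers, pseudonymize the ID, and coarsen the age."""
    # Remove fields that identify a person directly.
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Replace the customer ID with a truncated salted hash (a pseudonym).
    raw_id = str(cleaned.pop("customer_id"))
    digest = hashlib.sha256((salt + raw_id).encode("utf-8")).hexdigest()
    cleaned["pseudonym"] = digest[:12]

    # Generalize the exact age into a ten-year band to reduce re-identification risk.
    decade = (cleaned.pop("age") // 10) * 10
    cleaned["age_band"] = f"{decade}-{decade + 9}"
    return cleaned

record = {"name": "Ada", "email": "ada@example.org",
          "customer_id": 42, "age": 37, "city": "Leeds"}
safe_record = anonymize(record, salt="example-salt")
print(safe_record)
```

The city is left untouched here, which shows why this is harder than it sounds: in a small dataset, a city plus an age band may still point to one person.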
Algorithm
Algorithms are repeatable sets of instructions that people or machines can use to process data. Typically, algorithms are constructed by feeding data into them and adjusting variables until a desired outcome is achieved. Thanks to breakthroughs in AI such as machine learning and neural networks, machines generally do this today, as they can do it far more quickly than any human.
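To make "feeding data in and adjusting variables until a desired outcome is achieved" concrete, here is a small Python sketch. It uses gradient descent to fit a single slope; the technique, the `fit_slope` name, and the toy data are my own assumptions for illustration, not a method prescribed above.

```python
def fit_slope(xs, ys, learning_rate=0.01, steps=500):
    """Adjust one variable, w, so that w * x approximates y."""
    w = 0.0  # initial guess for the variable being adjusted
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Nudge w in the direction that reduces the error.
        w -= learning_rate * gradient
    return w

# Toy data generated by the rule y = 2 * x.
slope = fit_slope([1, 2, 3, 4], [2, 4, 6, 8])
print(round(slope, 4))
```

The loop repeats the same instructions on the same data, and the only thing that changes is the variable `w`, which settles close to 2.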
Artificial Intelligence (AI)
One way to categorize the latest wave of “intelligent” machines is as machines that are capable of performing data science for themselves. Rather than simply processing the data they are fed in the way they are told to, they can learn and adapt to become better at processing it. This is how Google Translate becomes better at understanding language, and how autonomous cars will navigate areas they have never visited before.
Bayes’ Theorem
This is a mathematical formula used to calculate the probability of one event occurring given that another event has occurred. It is a common technique used in data science to establish probabilities and outcomes that are dependent on unknown variables, and it is used to build Bayesian networks, where the principle is applied across large datasets.
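A short worked example may help. The formula is P(A | B) = P(B | A) × P(A) / P(B). The Python sketch below applies it to a made-up screening scenario; the numbers (a 1% prior, a 90% detection rate, a 5% false positive rate) and the `bayes_posterior` name are assumptions chosen purely for illustration.

```python
def bayes_posterior(prior, likelihood, false_positive_rate):
    """Return P(A | B) = P(B | A) * P(A) / P(B).

    prior               -- P(A), probability of the event before seeing evidence
    likelihood          -- P(B | A), probability of the evidence if A is true
    false_positive_rate -- P(B | not A), probability of the evidence if A is false
    """
    # P(B) via the law of total probability.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Illustrative numbers only: a condition affecting 1% of people,
# a test that detects it 90% of the time and wrongly flags 5% of healthy people.
posterior = bayes_posterior(prior=0.01, likelihood=0.9, false_positive_rate=0.05)
print(round(posterior, 3))  # 0.154
```

Even with a fairly accurate test, a positive result here implies only about a 15% chance of the condition, because the condition is rare to begin with.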
Behavioral Analytics
The use of data on a person or object’s behavior to make predictions about how that behavior might change in the future (see predictive modeling in Part II), or to determine the variables that affect it so that more favorable or efficient outcomes might be achieved.
Big Data
Big Data is the “buzzword” term that has come to represent the vast increase in the amount of data that has become available in recent years, particularly as the world has increasingly become connected through the internet. This data is distinguished from data previously available not just by its size, but also by the high speed at which it is generated and the large variation in the forms it can take. It greatly expands the potential of what can be achieved with data science, which was previously hampered by slow computer processing speeds and the difficulty of capturing accurate information in large volumes.
Citizen Data Scientist
Sometimes also referred to as an “armchair data scientist,” this is one of the growing number of people who, although not academically trained or professionally employed as data scientists, are able to use data science tools and techniques to improve the use of information in their own field of study or work. This is increasingly becoming possible thanks to the growing number of automated or “self-service” tools and platforms for data analytics.
Classification
The ability to use data (about an object, event, or anything else) to determine which of a number of predetermined groups an item belongs in. As a basic example, image recognition analysis might classify all shapes with four equal sides as squares and all shapes with three sides as triangles.
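The shape example can be sketched in a few lines of Python. This is a hand-written rule set rather than a trained model, and the `classify_shape` function and its input format (a list of side lengths) are assumptions made for illustration.

```python
def classify_shape(side_lengths):
    """Assign a shape to one of the predetermined groups from the example."""
    sides = len(side_lengths)
    if sides == 3:
        return "triangle"
    # Simplification from the basic example: four equal sides counts as a square
    # (a real classifier would also check the angles to rule out a rhombus).
    if sides == 4 and len(set(side_lengths)) == 1:
        return "square"
    # Anything else falls outside the predetermined groups.
    return "unknown"

print(classify_shape([2, 2, 2, 2]))  # square
print(classify_shape([3, 4, 5]))     # triangle
print(classify_shape([2, 3, 2, 3]))  # unknown
```

The key point is that the groups ("square", "triangle") are fixed in advance; the data only decides which group each item lands in.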
Clickstream Analytics
Analysis of the way humans interact with computers or use machinery. The name refers to recording and analyzing where a mouse is clicked on a screen (with the sequence of interactive actions taken by the user known as the “clickstream”), but it can be applied to any method of interaction that can be measured – such as manual operation of machinery using a joystick or control panel, or voice recognition.
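One simple way to analyze a clickstream is to count how often users move from one step to the next. The Python sketch below does this for a single made-up session; the page names and the `transition_counts` helper are hypothetical.

```python
from collections import Counter

def transition_counts(clickstream):
    """Count each (from_step, to_step) pair in an ordered sequence of actions."""
    # Pair every action with the one that immediately follows it.
    return Counter(zip(clickstream, clickstream[1:]))

# A made-up session: the ordered actions one user took on a website.
session = ["home", "search", "product", "search", "product", "checkout"]
counts = transition_counts(session)

for (source, target), count in counts.most_common():
    print(f"{source} -> {target}: {count}")
```

The same counting works for any measurable sequence of interactions, such as the order in which an operator uses the controls on a panel.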
Clustering
Clustering, like classification, is about grouping objects together, but it differs because it is used when there are no predetermined groups. Objects (or events) are clustered together due to similarities they share, and algorithms determine what that common relationship between them may be. Clustering is a data science technique that makes unsupervised learning possible.
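To show grouping without predetermined labels, here is a minimal Python sketch of k-means clustering on one-dimensional data. K-means is just one of many clustering methods, picked here as an assumption for illustration; the `kmeans_1d` function, the starting centers, and the data are all made up.

```python
def kmeans_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then move each center to its group's mean."""
    centers = list(centers)
    groups = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the group of its nearest center.
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
print(groups)
```

No one told the algorithm that there is a "low" group and a "high" group; it finds them from the similarities in the data, although the number of groups still has to be chosen up front in this method.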
Data Governance
Rules that establish how data should be used in order to both comply with legislation and ensure the integrity of data and data-driven initiatives.
Data Mining
The process of examining a set of data to determine relationships between variables that could affect outcomes, generally done at a large scale and by machines. Data mining is an older term used by computer scientists and in business to describe the basic function of a data scientist or a data science initiative.