Big Data Terminology: 16 Key Concepts Everyone Should Understand (Part I)

These definitions are for anyone who wants to know more about Big Data. They cover concepts of which everyone should have a general understanding.

As-a-Service Infrastructure

Data-as-a-service, software-as-a-service and platform-as-a-service all refer to the idea that data, licences to use data, or platforms for running Big Data technology can be provided “as-a-service” rather than sold as distinct products. This reduces the upfront capital investment necessary for customers to begin putting their data, or platforms, to work for them, as the provider bears the costs of setting up and hosting the infrastructure. For customers, as-a-service infrastructure can greatly reduce the initial costs and setup time for getting Big Data initiatives up and running.

Data Science

Data science is the professional field that deals with turning data into value, such as new insights or predictive models. It brings together expertise from fields including statistics, mathematics, computer science and communication, as well as domain expertise such as business knowledge. The role of data scientist has recently been voted the number 1 job in the U.S., based on current demand, salary and career opportunities.

Data Mining

Data mining is the process of discovering insights from data. Because Big Data is so large, this is generally done computationally and in an automated way, using techniques such as decision trees, clustering analysis and, more recently, other machine learning methods. Think of this as using the brute mathematical power of computers to spot patterns that would otherwise not be visible because of the size and complexity of the dataset.
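To make "spotting patterns" a little more concrete, here is a minimal sketch of clustering analysis in plain Python. The purchase amounts are made-up sample data, and the function name `kmeans_1d` is my own for illustration. Real data mining tools work on far larger and higher-dimensional datasets, but the idea of letting the computer group similar values is the same.

```python
# Hypothetical data: customer purchase amounts that fall into two rough groups.
amounts = [12.0, 15.0, 14.0, 13.0, 95.0, 102.0, 98.0, 105.0]

def kmeans_1d(values, centers, rounds=10):
    """Very small k-means for one-dimensional data (illustrative only)."""
    for _ in range(rounds):
        # Assign every value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Move each center to the average of the values assigned to it.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Start with one center at the smallest value and one at the largest.
centers, clusters = kmeans_1d(amounts, centers=[min(amounts), max(amounts)])
print(centers)   # the typical value of each group
print(clusters)  # which amounts ended up in which group
```

Nobody told the program where the boundary between "small" and "large" purchases lies. It found the two groups from the data alone.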


Hadoop

Hadoop is a framework for Big Data computing that has been released as open-source software under the Apache License, so it can be freely used by anyone. It consists of several modules and related projects, each tailored for a different vital step of the Big Data process, from file storage (Hadoop Distributed File System, HDFS) to database (HBase) to carrying out data operations (Hadoop MapReduce, see below). Due to its power and flexibility, it has become so popular that it has developed its own industry of vendors (selling tailored versions), support service providers and consultants.

Predictive Modelling

Simply, this is predicting what will happen next based on data about what has happened previously. In the age of Big Data, because there is more data around than ever before, predictions can become more and more accurate. Predictive modelling is a core component of most Big Data initiatives, which are formulated to help us choose the course of action that will lead to the most desirable outcome. The speed of modern computers and the volume of available data mean that predictions can be based on an ever-increasing number of variables, which tends to lead to more successful results.
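As a rough illustration of predicting what happens next from what happened before, the sketch below fits a straight line to a handful of past observations and uses it to estimate a new value. The spend and sales figures are invented for the example, and the helper names `fit_line` and `predict` are my own. Real predictive models usually involve many more variables and more sophisticated methods.

```python
# Hypothetical history: advertising spend (x) and the sales that followed (y).
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(spend, sales)

def predict(x):
    """Estimate the outcome for a value of x we have not seen yet."""
    return slope * x + intercept

print(predict(6.0))  # estimated sales for a spend of 6.0
```

With only one variable and five data points this is a toy, but adding more history and more variables follows the same pattern: learn the relationship from the past, then apply it to a new case.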


MapReduce

MapReduce is a computing procedure for working with large datasets. It was created in response to the difficulty of reading and analyzing very large datasets using conventional computing methodologies. As its name suggests, it consists of two procedures: mapping (turning each record into the format needed for analysis, for example labelling each person in a list according to whether they are over 21) and reducing (combining the mapped records into a result, such as counting how many people in the dataset are over 21). Because each record can be mapped independently, the work can be spread across many machines.
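The two steps can be sketched in a few lines of plain Python, using the ages example. This is not Hadoop's actual API, only an illustration of the idea. The sample people and the function names `map_person`, `shuffle` and `reduce_counts` are hypothetical, and in a real system the grouping step in the middle is handled by the framework across many machines.

```python
from collections import defaultdict

# Hypothetical sample data: (name, age) records.
people = [("Ana", 34), ("Ben", 19), ("Chloe", 52), ("Dev", 21), ("Eli", 23)]

def map_person(record):
    """Map step: turn one record into a (key, value) pair."""
    name, age = record
    key = "over_21" if age > 21 else "21_or_under"
    return (key, 1)

def shuffle(pairs):
    """Group the mapped values by key (normally done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: combine all the values for one key into one result."""
    return (key, sum(values))

mapped = [map_person(p) for p in people]
grouped = shuffle(mapped)
counts = dict(reduce_counts(k, v) for k, v in grouped.items())
print(counts)  # → {'over_21': 3, '21_or_under': 2}
```

Each call to `map_person` looks at a single record and nothing else, which is what makes it possible to split the mapping work across many computers.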


NoSQL

NoSQL refers to a family of database formats designed to hold data that does not fit neatly into the tables, rows and columns of a conventional relational database. These databases have proven very popular in Big Data applications because Big Data is often messy and unstructured, and does not easily fit into traditional database frameworks.
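To show what "does not fit into tables" can look like, here is a minimal sketch that imitates a document-style store with an ordinary Python dictionary. It is not the API of any real NoSQL product. The records, the `user:1` style keys and the `find` helper are all made up for illustration.

```python
import json

# Hypothetical records: each document carries its own set of fields,
# so there is no fixed list of columns that every record must fill in.
documents = {
    "user:1": {"name": "Ana", "email": "ana@example.com"},
    "user:2": {"name": "Ben", "tags": ["sports", "music"], "address": {"city": "Leeds"}},
    "user:3": {"name": "Chloe", "last_login": "2024-01-15"},
}

def find(store, field):
    """Return the ids of all documents that contain the given field."""
    return [doc_id for doc_id, doc in store.items() if field in doc]

print(find(documents, "tags"))                    # documents that have tags
print(json.dumps(documents["user:2"], indent=2))  # one document, nested data included
```

In a relational table, the nested address and the list of tags would need extra tables or many empty columns. Here each record simply stores what it has.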


Python

Python is a programming language that has become very popular in the Big Data space due to its ability to work well with large, unstructured datasets. (See my upcoming Part II post for a discussion of the difference between structured and unstructured data.) Python is often considered easier for a data-science beginner to learn than other languages such as R (see also Part II), and more flexible.