This is Part II of my blog series that simply distils the key terminology of Big Data (see Part I here). These are the remaining 16 key concepts that you should understand if you want to learn more about Big Data.
Structured data is data that can be arranged neatly into charts and tables consisting of rows, columns or multidimensional matrices. This is traditionally the way that computers have stored data, and information in this format can be easily processed and mined for insights. Data gathered from machines is often a good example of structured data, where various data points (speed, temperature, rate of failure, RPM, etc.) can be neatly recorded and tabulated for analysis.
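To make the idea concrete, here is a minimal Python sketch of machine data held in rows and columns and then summarized. The column names (`machine_id`, `speed_rpm`, `temperature_c`, `failed`) and all of the values are invented for illustration; they are not taken from any real dataset.

```python
import csv
import io
from statistics import mean

# Hypothetical sensor readings: every row has the same named columns,
# which is what makes the data "structured".
RAW_CSV = """machine_id,speed_rpm,temperature_c,failed
A1,1500,71.5,0
A2,1480,69.0,0
A3,1620,88.2,1
A4,1510,72.3,0
"""

def load_readings(text):
    """Parse CSV text into a list of row dictionaries with typed values."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "machine_id": row["machine_id"],
            "speed_rpm": int(row["speed_rpm"]),
            "temperature_c": float(row["temperature_c"]),
            "failed": row["failed"] == "1",
        })
    return rows

readings = load_readings(RAW_CSV)

# Because each column has a known meaning and type, simple questions
# become one-line calculations.
average_temperature = mean(r["temperature_c"] for r in readings)
failure_rate = sum(r["failed"] for r in readings) / len(readings)
hot_machines = [r["machine_id"] for r in readings if r["temperature_c"] > 80]

print(f"Average temperature: {average_temperature:.2f} C")
print(f"Failure rate: {failure_rate:.0%}")
print(f"Machines above 80 C: {hot_machines}")
```

The point of the sketch is only that tabular data can be queried directly; a real project would more likely use a database or a data-frame library.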
Unstructured data is any data that cannot be easily put into conventional charts or tables. This can include video data, pictures, recorded sounds, text written in human languages and a great deal more. This data has traditionally been far harder to draw insight from using computers, which were generally designed to read and analyze structured information. However, since it has become apparent that a huge amount of value can be locked away in this unstructured data, great efforts have been made to create applications that are capable of understanding unstructured data—for example, visual recognition and natural language processing.
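As a rough illustration of what "understanding" unstructured text can involve, the sketch below turns free-form comments into a structured table of word counts. It is a deliberately naive stand-in for natural language processing, and the comments and the stop-word list are made up for the example.

```python
import re
from collections import Counter

# Hypothetical free-text customer comments (unstructured data).
comments = [
    "The delivery was late, and the box was damaged.",
    "Great product! Delivery was fast.",
    "Box damaged again. Very late delivery.",
]

# A tiny, hand-picked list of words to ignore; real systems use far larger lists.
STOP_WORDS = {"the", "was", "and", "a", "very", "again"}

def tokenize(text):
    """Lower-case the text, split it into words and drop the stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

# The result is structured: one row per word, one column for its count.
word_counts = Counter(word for comment in comments for word in tokenize(comment))

for word, count in word_counts.most_common(4):
    print(f"{word:10} {count}")
```

Once the text has been reduced to counts like these, it can be analyzed with the same tools used for any other structured data.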
R is another programming language commonly used in Big Data, and can be thought of as more specialized than Python, being geared towards statistics. Its strength lies in its powerful handling of structured data. Like Python, it has an active community of users who are constantly expanding and adding to its capabilities by creating new libraries and extensions.
A recommendation engine is basically an algorithm, or collection of algorithms, designed to match an entity (for example, a customer) with something they are looking for. Recommendation engines used by the likes of Netflix or Amazon rely heavily on Big Data technology to gain an overview of their customers and, using predictive modelling, match them with products to buy or content to consume. The economic incentives offered by recommendation engines have been a driving force behind many commercial Big Data initiatives and developments over the last decade.
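The matching idea can be sketched in a few lines of Python. This is a toy version of one common approach (recommending what similar users have watched, with similarity measured by Jaccard overlap), not a description of how Netflix or Amazon actually do it. The users, titles and the `recommend` function are all hypothetical.

```python
# Hypothetical viewing history: user -> set of titles they have watched.
history = {
    "ana": {"Space Saga", "Robot Wars", "Deep Ocean"},
    "ben": {"Space Saga", "Robot Wars", "Mars Colony"},
    "cat": {"Baking Show", "Garden Rescue"},
    "dan": {"Space Saga", "Mars Colony", "Deep Ocean"},
}

def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, history, top_n=2):
    """Score unseen titles by how similar the users who watched them are."""
    seen = history[user]
    scores = {}
    for other, items in history.items():
        if other == user:
            continue
        similarity = jaccard(seen, items)
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked[:top_n] if score > 0]

print(recommend("ana", history))
```

Here "ana" shares two titles each with "ben" and "dan", who have both watched "Mars Colony", so that title is the one suggested. Production systems work on the same principle but with millions of users and far richer signals.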
Real time means “as it happens” and, in Big Data, specifically refers to a system or process that gives data-driven insights based on what is happening now. Recently, there has been a big push for the development of systems that are capable of processing and offering insights in real time (or near-real time), and advances in computing power, as well as development of techniques such as machine learning, have made it a reality in many applications.
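A very small sketch of the idea: instead of analyzing a finished dataset, the code below updates its answer every time a new reading arrives. The `RollingMonitor` class, the window size and the threshold are assumptions made up for this example, and the list `stream` stands in for a live feed.

```python
from collections import deque

class RollingMonitor:
    """Keep the most recent readings and flag when their average is too high."""

    def __init__(self, window_size, threshold):
        # A deque with maxlen silently drops the oldest reading when full.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def update(self, value):
        """Add one reading and return (rolling average, alert flag)."""
        self.window.append(value)
        average = sum(self.window) / len(self.window)
        return average, average > self.threshold

monitor = RollingMonitor(window_size=3, threshold=80.0)

# Stand-in for a live sensor feed; in practice readings would arrive one by one.
stream = [70.0, 75.0, 85.0, 90.0, 95.0]
results = [monitor.update(value) for value in stream]

for value, (average, alert) in zip(stream, results):
    status = "ALERT" if alert else "ok"
    print(f"reading={value:5.1f}  rolling_avg={average:6.2f}  {status}")
```

The insight (here, an alert) is available the moment the fourth reading arrives, rather than after a batch job has run over the whole dataset.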
The crucial last step of many Big Data initiatives involves getting the right information to the people who need it to make decisions, at the right time. When this step is automated, analytics is applied to the insights themselves to ensure that they are communicated in a way that is easy to understand and act on. This usually involves creating multiple reports based on the same data or insights, with each report intended for a different audience (for example, an in-depth technical analysis report for engineers and an overview of the impact on the bottom line for C-level executives).
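The "same insight, different audiences" idea might look something like the following in Python. The `insight` dictionary, its field names and its figures are invented for illustration, as are the two report functions.

```python
# Hypothetical insight produced by an earlier analytics step.
insight = {
    "machine_id": "A3",
    "failure_probability": 0.82,
    "top_signal": "temperature_c",
    "signal_value": 88.2,
    "estimated_downtime_hours": 6,
    "cost_per_downtime_hour": 1200,
}

def engineer_report(i):
    """Technical view: which signal is driving the prediction."""
    return (
        f"Machine {i['machine_id']}: failure probability "
        f"{i['failure_probability']:.0%}, driven by "
        f"{i['top_signal']} = {i['signal_value']}."
    )

def executive_report(i):
    """Business view: what the risk could cost."""
    exposure = i["estimated_downtime_hours"] * i["cost_per_downtime_hour"]
    return f"Machine {i['machine_id']} is at risk; estimated exposure is ${exposure:,}."

# One insight, several renderings, chosen by audience.
REPORTS = {"engineer": engineer_report, "executive": executive_report}

def build_report(audience, i):
    return REPORTS[audience](i)

for audience in REPORTS:
    print(f"[{audience}] {build_report(audience, insight)}")
```

Both reports are generated from exactly the same data; only the framing changes to suit the reader.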
Spark is another open-source framework like Hadoop (discussed in my Part I post), but more recently developed and better suited to cutting-edge Big Data tasks involving real-time analytics and machine learning. Unlike Hadoop, Spark does not include its own file system, though it is designed to work with Hadoop’s HDFS or a number of other options. For certain data-related processes, Spark can run at up to 100 times the speed of Hadoop, thanks to its in-memory processing capability. This means it is becoming an increasingly popular choice for projects involving deep learning, neural networks and other compute-intensive tasks.
Humans find it very hard to understand and draw insights from large amounts of text or numerical data. It can be done, but it takes time, and our concentration and attention are limited. For this reason, an effort is underway to develop computer applications that are capable of rendering information in a visual form, such as charts and graphics that highlight the most important insights from our Big Data projects. A subfield of reporting (see above), visualizing is now often an automated process, with visualizations that are customized by algorithm to be understandable to the people who need to act or take decisions based on them.