What is Data Virtualization and Why Does It Matter?

In business today, knowledge is power—just as it has always been. The most intelligent and successful businesses collect the right data, which they effectively turn into information and then into knowledge.

We now live in an increasingly data-filled and data-driven society, and businesses that can thrive in this environment—by understanding and harnessing the data that flows within and around their operations, and what it means—are likely to have a bright future.

Data virtualization is just another tool (or technique) that an organization can employ to help with this task. It’s a term that is being heard more and more frequently, but there is often confusion about precisely what it means. In this post, I want to take a look at what it is, and why it can be a powerful aid when gearing up for data-driven change.

What is virtualization?

Just as virtual reality denotes a reality that is abstracted from the “actual” reality we all live in, “virtual” data, at its most basic, is simply a dataset that is abstracted, in some way, from the actual, physical data it represents. At its most “real” level, that data usually exists as 1s and 0s encoded onto a hard drive or other storage medium somewhere.

Modern smartphones, computers and tablets all use virtualization to some extent, to make them work more in line with the way our brains expect them to behave. Files sit within folders and are grouped together according to their type. If you want to look at a picture you’ve taken with your smartphone camera, you open the photo gallery, which is a virtualized dataset based on the actual picture files stored on your phone or memory card. You probably get a little thumbnail of the image and, depending on your smartphone’s settings, you also see information like the image size and details of when and where it was taken. Working with a “virtualized” dataset like this makes it easier to search and sort to find the information you want.

Facebook is another good example. When you want to look at a photograph or video, you can access it through the virtualized environment of the Facebook app or website. You don’t need to know the physical location of the file you want to view, or anything else about it; you can simply access it by looking in the place you’d expect it to be, be that your own photo album or a group dedicated to funny cat pictures.

At the enterprise scale, data virtualization is based on the same principles. Because enterprise data is very often Big Data, it can be messy—if a company is collecting even a sliver of the data available, it will (or should) have machine data, transactional data, financial data, customer feedback data, operational data and curated external data at its fingertips. The complex nature of this data and the growing plethora of ways in which it can be leveraged for insight mean that specialist tools have become available for virtualizing it.

Why is data virtualization useful?

The main benefit is that any operations carried out on virtualized data involve only the curated, “useful” information that has been grabbed from the “actual” dataset.

For example, if the data-driven project you are currently working on involves improving the speed of rocket-powered cars, and all you need for one particular query is the time it takes a car to go from 0 to 100 mph, working on a virtualized dataset containing just that information means quicker, simpler calculations to get the answer you need.
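To make that concrete, here is a minimal sketch of the idea using SQLite. A database view stands in for the virtualization layer: queries touch only the curated columns, and the source data is never copied or modified. The table, column names and timing figures are all hypothetical, and a real virtualization platform would federate many sources rather than a single local database.

```python
import sqlite3

# Hypothetical source table: each test run carries bulky telemetry
# alongside the one figure our query actually needs.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE test_runs (
        run_id INTEGER PRIMARY KEY,
        car TEXT,
        zero_to_100_mph_s REAL,   -- time from 0 to 100 mph, in seconds
        telemetry BLOB            -- raw sensor data we don't need here
    )
""")
conn.executemany(
    "INSERT INTO test_runs (car, zero_to_100_mph_s, telemetry) VALUES (?, ?, ?)",
    [("Bloodhound", 3.6, b"..."), ("Thrust", 4.1, b"...")],
)

# The view is the "virtual" dataset: just the curated fields,
# abstracted away from the underlying table.
conn.execute("""
    CREATE VIEW acceleration AS
    SELECT car, zero_to_100_mph_s FROM test_runs
""")

fastest = conn.execute(
    "SELECT car, MIN(zero_to_100_mph_s) FROM acceleration"
).fetchone()
print(fastest)  # ('Bloodhound', 3.6)
```

Queries against the view never read the bulky `telemetry` column and never alter the source table, which is the essence of working on a curated, virtual dataset.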

Virtualization is widely seen as an aid to productivity because it means data can be accessed in a variety of ways depending on what it is used for, and this transformation can take place in the virtualization layer without affecting the source data. Large datasets do not need to be loaded entirely into memory for simple, frequently-repeated operations, again improving speed.
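The memory point can be sketched in plain Python: a generator acts as a tiny virtualization layer, transforming records on the fly as they are read, so the full dataset never has to be loaded into memory and the source data is never rewritten. The field names and conversion here are hypothetical.

```python
def mph_to_kph(rows):
    """Lazily yield rows with a derived speed field added.

    The transformation lives in this "layer"; the source rows are
    never modified, and only one row is in memory at a time.
    """
    for row in rows:
        yield {**row, "speed_kph": row["speed_mph"] * 1.609344}

# A generator expression stands in for a huge data feed.
source = ({"speed_mph": m} for m in range(1_000_000))

first = next(mph_to_kph(source))
print(first)  # {'speed_mph': 0, 'speed_kph': 0.0}
```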

Widely-used virtualization solutions (e.g., Denodo, Cisco, Delphix and Informatica) are built to interface directly with data sources, while client applications read purely from the virtualized datasets they produce.

Data virtualization also has benefits for compliance and governance. It is often used to restrict access to data based on credentials or clearance level. It also provides tools for oversight of how data is used, what types of data are most frequently accessed and what changes or transformations are being applied to data before it is put to use.
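As a rough illustration of the access-control point, the sketch below shows a virtualization layer masking fields a caller is not cleared to see. The field names, roles and record are all hypothetical; real platforms express such policies declaratively rather than in application code.

```python
# Hypothetical sensitive fields and clearance rule.
SENSITIVE_FIELDS = {"salary", "home_address"}

def virtual_record(record: dict, clearance: str) -> dict:
    """Return a virtualized view of a record.

    Callers with 'hr' clearance see everything; everyone else gets
    a copy with sensitive fields stripped. The source record is
    never modified.
    """
    if clearance == "hr":
        return dict(record)
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

employee = {"name": "A. Example", "salary": 50000, "home_address": "1 Main St"}
print(virtual_record(employee, "analyst"))  # {'name': 'A. Example'}
```

Because every consumer goes through the virtual layer, the same choke point that enforces access rules can also log who accessed what, which is where the oversight benefits come from.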

Virtual data to actual results

Overall, data virtualization has enormous potential to help businesses become more agile as well as more focused as they make the transition to becoming intelligent, data-driven organizations.

It allows disparate and often siloed datasets to be brought together and analyzed in the context of everything that can be measured and known about an organization’s operations. Conversely, it also means valueless “noise” can be filtered out and stopped from consuming increasingly valuable compute and storage resources.

The increase in speed and efficiency means that cutting-edge Big Data projects involving advanced technologies, such as predictive analytics and machine learning, are within the grasp of an ever-growing number of businesses. This is likely to continue to be a strong driver of innovation for the foreseeable future.