Written by: Kevin Jacquier
This document has been written to provide an entry-level introduction to Data Virtualization. The majority of the information in this paper is based on information gathered during a one-day class presented by Dave Wells at a TDWI Conference entitled “TDWI Data Virtualization: Solving Complex Data Integration Challenges.”
What is Data Virtualization?
There are many definitions of “Virtualization” and “Data Virtualization” that one can obtain from dictionaries, Wikipedia, and industry experts. To keep it very simple, “Data Virtualization” is providing access to data directly from one or more disparate data sources, without physically moving the data, and providing it in such a manner that the technical aspects of location, structure, and access language are transparent to the Data Consumer.
In practice, Data Virtualization makes data available for Business Analytics without the need to move all the information into a single physical database. It is important to keep in mind that this is not always possible directly from the source. There are also times when using Data Virtualization directly against the source is much more efficient than replicating the information.
There is one key aspect of Data Virtualization to keep in mind: Data Virtualization does not replace ETL; it complements it. Data Virtualization does not work well when significant transformations or complex business logic must be applied to the source data before it can be used by the Data Consumer. It works well when a Data Warehouse is its source, or when the Source Data can be accessed with minimal complexity.

Why Use Data Virtualization?
Data Virtualization removes the need to move data from Data Warehouse to Data Warehouse, or even from Data Warehouse to Data Marts, which many companies do because it is the only way to make the data available to their applications. Data Virtualization works very well when the source data is well defined and readily accessible for business logic.
Data Virtualization is primarily based on a Semantic Layer that creates views over the Source Data. These views are set in layers: 1) the Physical Layer, or Connection Views (access to the source data); 2) the Business Layer, or Integration Views (linking data from the different sources); and 3) the Application Layer, or Consumer Views (presenting data in a manner that is understandable by the Data Consumer). There is no actual ETL with Data Virtualization; it is all just a series of Views. This is both an advantage and a disadvantage of Data Virtualization. The Data Virtualization Semantic Layer sits between the Source Data and the Delivery Applications (i.e., Business Analytics).
The diagram on the following page represents these layers, as well as the Source and Data Consumer layers. This diagram was taken from a document that is available through “Composite Software” which is one of the leading Data Virtualization software companies.
Please note within the diagram above that the Data Sources can be data in any format, including “Big Data” sources such as NoSQL, Hadoop, Web Services, and internal and external Cloud data. In addition, there can be a multitude of different Data Consumers, including all the major “Business Analytics” vendors.
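To make the three view layers concrete, here is a minimal, hypothetical sketch in Python using SQLite: an attached database stands in for a second disparate source, and each layer is just a view defined over the layer below it, with no data ever copied. All table and view names here are invented for illustration; real Data Virtualization products expose this layering through their own design tools.

```python
import sqlite3

# Two stand-in "disparate sources": the main db holds orders,
# an attached in-memory db (crm) holds customer data.
con = sqlite3.connect(":memory:")
con.execute("ATTACH ':memory:' AS crm")
con.execute("CREATE TABLE sales_orders (order_id INT, cust_id INT, amount REAL)")
con.execute("CREATE TABLE crm.customers (cust_id INT, cust_name TEXT)")
con.executemany("INSERT INTO sales_orders VALUES (?, ?, ?)",
                [(1, 10, 250.0), (2, 10, 100.0), (3, 11, 75.0)])
con.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                [(10, "Acme"), (11, "Globex")])

# 1) Physical Layer / Connection Views: one view per source, no transformation.
con.execute("CREATE TEMP VIEW v_src_orders AS SELECT * FROM sales_orders")
con.execute("CREATE TEMP VIEW v_src_customers AS SELECT * FROM crm.customers")

# 2) Business Layer / Integration View: links data from the two sources.
con.execute("""CREATE TEMP VIEW v_customer_orders AS
               SELECT c.cust_name, o.amount
               FROM v_src_orders o JOIN v_src_customers c USING (cust_id)""")

# 3) Application Layer / Consumer View: presents data in consumer terms.
con.execute("""CREATE TEMP VIEW v_revenue_by_customer AS
               SELECT cust_name, SUM(amount) AS total_revenue
               FROM v_customer_orders GROUP BY cust_name""")

rows = con.execute(
    "SELECT cust_name, total_revenue FROM v_revenue_by_customer "
    "ORDER BY cust_name").fetchall()
print(rows)  # [('Acme', 350.0), ('Globex', 75.0)]
```

The consumer query at the end touches only the top-level view; the joins and source locations underneath are transparent to it, which is the point of the Semantic Layer.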
The Data Virtualization software has an Optimizer that optimizes the generated Semantic Layer SQL before sending it to the actual Data Sources. In addition, Data Virtualization software has very robust in-memory caching. This is the primary difference between what Data Virtualization software offers and some of the BI vendors' “caching” capabilities. Data Virtualization can keep a large amount of data in memory, and therefore it can provide efficient response times. Data Virtualization works very well when the Sources hold high-volume data but the Data Consumers ask for low-volume summaries. In other words, Data Virtualization should not be used as a data extraction or data dump source, but as a Business Analytics source.
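The caching idea can be sketched as follows, assuming an invented fetch callback and a time-to-live: repeated low-volume summary queries are answered from memory, so the high-volume source is touched only once per query until the cached copy goes stale. This is illustrative only; real products add size limits, eviction, and smarter query matching.

```python
import time

class QueryCache:
    """Toy sketch of the in-memory result cache found in DV servers."""

    def __init__(self, fetch_fn, ttl_seconds=300.0):
        self._fetch = fetch_fn    # callback that actually runs SQL at the source
        self._ttl = ttl_seconds   # how long a cached result set stays fresh
        self._cache = {}          # query text -> (time cached, result rows)

    def query(self, sql):
        hit = self._cache.get(sql)
        if hit is not None and time.time() - hit[0] < self._ttl:
            return hit[1]                    # served from memory, source untouched
        rows = self._fetch(sql)              # miss or stale: go to the source
        self._cache[sql] = (time.time(), rows)
        return rows

# Usage: count how many times the "source" is actually hit.
source_hits = []
def fetch_from_source(sql):
    source_hits.append(sql)                  # stand-in for a real source query
    return [("Acme", 350.0)]

cache = QueryCache(fetch_from_source)
cache.query("SELECT cust_name, SUM(amount) FROM orders GROUP BY cust_name")
cache.query("SELECT cust_name, SUM(amount) FROM orders GROUP BY cust_name")
print(len(source_hits))  # 1 -- the second request never reached the source
```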
One of the key aspects of Data Virtualization is that it makes data available and allows Data Consumers from different areas to access the data virtually, not physically. Application-specific Consumer Views can be created within the virtual space, eliminating the need to copy (materialize) the data into another application or database.
In addition to the above-mentioned capabilities of Data Virtualization software, the following features are also found in the main Data Virtualization vendors' products:
- Data Governance: Data governance functions of User Security, Data Lineage, and tracking and logging of access and use.
- Data Quality: Data Cleansing, Quality Metadata, Monitoring, and Defect prevention.
- Security: User Security can be applied once in the Data Virtualization software and then any application accessing data via Data Virtualization has this security automatically applied.
- Management Functions: Management functions of the Data Virtualization environment such as server and storage monitoring, network load monitoring, cache management, access monitoring, performance monitoring, Security Management (Domain, Group, and user levels).
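The security point can be illustrated with a small, hypothetical sketch: a row-level policy is defined once in the virtualization layer, and every consuming application's query inherits it automatically. The table, policy mapping, and function names below are invented for illustration.

```python
import sqlite3

# Stand-in source table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?)",
                [("EMEA", 100.0), ("EMEA", 50.0), ("APAC", 75.0)])

# Hypothetical policy table: defined once in the virtualization layer.
USER_REGION = {"alice": "EMEA", "bob": "APAC"}

def query_as(user, sql_body):
    """Wrap every consumer query with the user's row-level filter, so any
    application going through the virtualization layer gets it automatically."""
    wrapped = f"SELECT * FROM ({sql_body}) WHERE region = ?"
    return con.execute(wrapped, (USER_REGION[user],)).fetchall()

print(query_as("alice", "SELECT region, amount FROM orders"))
# [('EMEA', 100.0), ('EMEA', 50.0)]
```

Because the policy lives in one place, adding a new Data Consumer requires no security work in the consuming application itself.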
The following diagram from Composite Software shows their “Platform,” or Functional Areas.
Benefits of Data Virtualization
Data Virtualization allows for a very agile development environment. The primary reason is that Data Virtualization is based on the creation of Views into the data, not the actual coding of Database Objects (Tables, Views, Procedures) and ETL code needed to support Data Materialization (data warehouse/ETL). This allows for much quicker, more cost-effective development cycles than Data Materialization projects. Please note, however, that Data Virtualization is not always a replacement for Data Materialization; there are times when ETL is necessary and Data Virtualization is not the only solution. Even then, Data Virtualization can still complement the Materialized solution.
Data Virtualization provides the following Business Benefits over Data Materialization:
- Supports Fast Prototype Development
- Can be an interim solution to final ETL Project
- Quicker time-to-solutions for business
- Respond to increasing volumes and types of data
- Increased data analysis opportunities
- Information completeness
- Improved information quality
- Reduced data governance complexity
- Better able to balance time, resources, and results
- Reduced Infrastructure Costs
The following Technical Benefits are obtained through Data Virtualization over Data Materialization:
- Ease of Data Integration
- Iterative Development
- Shorter and Faster Development Cycles
- Increased Developer Productivity
- Enables Agile data integration projects
- Works with unstructured and semi-structured data
- Easy Access to cloud hosted data
- Query performance Optimization
- Less maintenance & management of integration systems
- Complements Existing ETL data integration base
- Extension and migration – not radical change
When to Use Data Virtualization
Data Virtualization can always be used when companies have data in disparate data sources or need to merge data with other data resources such as Social Media or External Data. The question really being asked here is whether Data Virtualization can be applied directly to the Source data, or whether the Source data first needs to be materialized before it can be utilized by anything, including Data Virtualization.
The following are examples of factors that make Data Virtualization a good candidate. If too many of these factors sway in the other direction, Data Materialization may be required.
- Time Urgency for Solution Implementation
- Cost / Budget limitations
- Unclear or Volatile Requirements
- Replication Restraints
- Risk Aversion of Organization
- Network Uptime
- Source System Availability
- Quality Data Available
- Source System Load
- Minor Business & Data Rules
- Availability of History in Source System
- Data Freshness requirements
- Small Data Query Result Sets
When not to Use Data Virtualization
The question here is more one of determining whether Data Materialization is required before any access to the data can be performed, regardless of whether Data Virtualization is used. Data Virtualization can always complement Materialized data; the question is whether Data Virtualization can go directly to the source, or whether the source needs to be materialized first.
The following factors would tend to influence a project to materialize data for any type of access.
- Low Availability of the Source Data (not available when needed for reporting) *
- Heavy load already on source Data *
- Poor Data Quality requiring Significant Data Cleansing
- Complex Data Transformation Requirements
- High Volume Result Sets for Data Consumers
- Data Source is Multidimensional (Cubes)
- History Not available in Source Systems
* For some of the factors that influence the need to materialize data first, it should be asked whether Data Replication could resolve the issue rather than actually Materializing the data (ETL to a Warehouse).
Data Virtualization Software
There are many Data Virtualization software vendors in the market today. The following are a few that are leading the market and have been identified by analysts as leaders in Data Virtualization. At this time, I have not researched cost or server requirements for any of these vendors.
- Composite Software
- Denodo Technologies
- IBM (InfoSphere Federation Server)
Many of these companies will be willing to provide a proof of concept before requiring a commitment to a purchase agreement.
Data Virtualization has been around for many years, but it is really becoming mainstream today, and it will become much more of the norm in the very near future. It is no longer necessary to continuously copy data from location to location as different groups need it. Data Virtualization allows an organization to make data virtually accessible to all the different groups within the organization without physically storing it over and over again. Data Virtualization will allow projects to be developed much more quickly, and to change much faster, as there is no ETL layer that requires database changes as well as code modifications.
For more information about implementing Data Virtualization in your organization, contact FYI Solutions.