Data Consolidation

Data consolidation is all about managing very large volumes of data, most commonly referred to as Big Data in 2020. While Big Data sites exist on-premises in large corporations, for the most part they are a cloud option. When choosing a cloud-based Big Data platform, the vendor comparison nearly always comes down to the Big Three: Amazon Web Services, Microsoft’s Azure and Google Cloud Platform.

Novon can provide advice and solutions, and can implement Big Data platforms, Data Lakes, Data Catalogues and Data Warehouse solutions.

But when should you accept you have a Big Data challenge? Big Data size is a constantly moving target. In 2012, Big Data ranged from a few dozen terabytes to many zettabytes of data. Now, in 2020, hundreds of terabytes to tens of zettabytes would classify as Big Data, but only if you need to query and analyse the data to help your organisation satisfy its business goals. If the data just sits there and has no ongoing relevance to the business, there is no Big Data challenge. In summary, Big Data can best be classified as data that is so large and complex that none of the traditional data management tools can store it or process it efficiently. Please see below for some of the tools that have emerged and that Novon recommends you adopt to manage your data and gain the maximum insight from it.

A data lake is a repository of data stored in its natural or raw format, usually as objects or files. A data lake is usually a single store of all enterprise data including raw copies of source system data and transformed data used for tasks such as reporting, visualisation, advanced analytics, and machine learning. A data lake can include structured data from relational databases, semi-structured data, unstructured data, and binary data. A data lake can be established on premises or in the cloud.

A data lake can contain a wide assortment of data, but companies can still run cloud analytics on the data, they can still operate a business dashboard, and they can still use the data in an app or for other processing duties. A complete data lake solution consists of both storage and processing. Data lake storage is designed for fault-tolerance, infinite scalability, and high-throughput ingestion of data with varying shapes and sizes.

Data is never thrown away because the data is stored in its raw format. This is especially useful in a Big Data environment, when you may not know in advance what insights are available from the data. Users can explore the data and create their own queries. Data lakes may be faster than traditional ETL tools, and they are certainly more flexible than a data warehouse because they can store unstructured and semi-structured data.
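Keeping raw data and deciding how to read it later is often described as schema-on-read. The following is a minimal Python sketch of that idea, not a description of any particular product. The records in RAW_EVENTS and the function names are hypothetical and exist only for illustration.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no upfront schema).
RAW_EVENTS = [
    '{"sku": "A100", "price": 19.99, "channel": "web"}',
    '{"sku": "B200", "price": 5.50}',
    '{"sku": "A100", "price": 21.00, "channel": "store", "promo": true}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: keep only the requested fields,
    filling gaps with None instead of rejecting the record."""
    for line in raw_lines:
        record = json.loads(line)
        yield {name: record.get(name) for name in fields}


def average_price_by_sku(raw_lines):
    """An ad hoc query that could be written long after the data was stored."""
    totals = {}
    for row in read_with_schema(raw_lines, ["sku", "price"]):
        count, total = totals.get(row["sku"], (0, 0.0))
        totals[row["sku"]] = (count + 1, total + row["price"])
    return {sku: total / count for sku, (count, total) in totals.items()}
```

Because nothing was discarded at load time, a later question (for example, about the "promo" field) can still be answered by reading the same raw records with a different field list.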

Typical uses for a data lake include data exploration, data analytics, artificial intelligence, and machine learning. A data lake often involves machine learning, which is a way to understand and process data using automated methods. For example, for a retailer who needs to access product information, machine learning can determine which SKUs are stored in a data lake and pull that data into an application of choice, without having to first organise the data.

A data lake can also act as the data source for a data warehouse. With this approach, the raw data is ingested into the data lake and then transformed into a structured, queryable format. Typically, this transformation uses an ELT (extract-load-transform) pipeline, where the data is ingested and transformed in place. Source data that is already relational may go directly into the data warehouse, using an ETL process, skipping the data lake.
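The ELT order of operations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production pipeline: an in-memory SQLite database stands in for the lake and the warehouse, and the table names (raw_orders, orders) and record fields (id, customer, amount) are hypothetical.

```python
import json
import sqlite3


def run_elt(raw_lines):
    """Load raw JSON lines untouched, then transform them inside the same store."""
    conn = sqlite3.connect(":memory:")

    # Extract + Load: the raw payload is stored as-is, one row per record.
    conn.execute("CREATE TABLE raw_orders (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?)", [(line,) for line in raw_lines]
    )

    # Transform: runs after loading and builds a structured, queryable table.
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    for (payload,) in conn.execute("SELECT payload FROM raw_orders").fetchall():
        rec = json.loads(payload)
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)",
            (rec["id"], rec["customer"].strip().lower(), float(rec["amount"])),
        )
    conn.commit()
    return conn
```

The raw_orders table is left in place after the transform, so the original records remain available if the structured table ever needs to be rebuilt with different rules.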

Data lake stores are often used in event streaming or IoT scenarios, because they can persist large amounts of relational and nonrelational data without transformation or schema definition. They are built to handle high volumes of small writes at low latency and are optimised for massive throughput.
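As a rough illustration of persisting events without transformation or schema definition, the Python sketch below appends each event unchanged to a date-partitioned file. The "date=YYYY-MM-DD" folder layout, the events.jsonl file name and the "timestamp" field are assumptions made for the example. Real data lake stores use their own storage layouts and APIs.

```python
import json
from pathlib import Path


def append_event(lake_root, event):
    """Append one event, unchanged, to a date-partitioned JSON Lines file."""
    # Assumes an ISO 8601 timestamp such as 2020-06-01T12:00:00.
    day = event["timestamp"][:10]
    partition = Path(lake_root) / f"date={day}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "events.jsonl"
    # Each write is a small append; no schema is checked or enforced.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event) + "\n")
    return path
```

Events with different shapes can land in the same file, which is the flexibility described above. It is also why governance and cataloguing matter later on.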

By their very definition data lakes can create significant headaches if not governed properly and if a way to search and retrieve data is not provided. Today the best way to manage and ensure good governance practices is to adopt data cataloguing across your data lakes.

One of the keys to understanding the term data lake is to think about how companies access data in the first place. It is not quite as “clean” as you would think. Sometimes, data arrives in a haphazard fashion (called unstructured data) and it’s dumped to a repository; companies don’t always know the original source of the data. Sometimes, it’s stored in a relational database used for a business app, or it’s a collection of social media data, or something that feeds a mobile app used by external customers. The main point to make here is that a data lake provides increased flexibility over how a company can use the data.

A data catalogue is a detailed inventory of all data assets in an organisation, designed to help data professionals quickly find the most appropriate data for any analytical or business purpose.

A data catalogue uses metadata (data that describes and summarises data) to create an informative and searchable inventory of all data assets within an organisation. A data catalogue typically includes capabilities for collecting and continually enriching or modifying the metadata associated with each data asset to make each asset easier to identify, evaluate, and use properly.

Connections to a wide variety of data sources: To serve as an enterprise-wide data asset inventory, a data catalogue needs to connect to all the assets in your enterprise. Look for connections to all the types of assets you have now and a commitment to building out connections going forward. Also look for a catalogue you can deploy wherever your data resides—on-premises or in a public, private, hybrid, or hybrid multi-cloud environment.

Support for quality and governance that ensures trusted data: A data catalogue should integrate seamlessly with any quality and governance programs and tools you have in place, including data quality rules, business glossaries, and workflows.

There are many classes of metadata, but Novon believes the three primary data catalogue sections you should focus on are: technical metadata, process metadata, and business metadata.

Technical metadata – describes how the data is organised and displayed to users by describing the structure of the data objects, such as tables, columns, rows, indexes, and connections. Technical metadata tells data professionals how they will need to work with the data—for example, if they can work with it as is, or if they need to transform it for analysis or integration.

Process metadata – describes the circumstances of the data asset’s creation and when, how, and by whom it has been accessed, used, updated, or changed. It should also describe who has permission to access and use the data. Process metadata provides information about the asset’s history and lineage, which can help an analyst decide if the asset is recent enough for the task at hand, if it comes from a reliable source, if it has been updated by trustworthy individuals, and so on. Process metadata can also be used to troubleshoot queries. And increasingly, process metadata is mined for information on software users or customers, such as what software they are using and the level of service they are experiencing.

Business metadata – describes the business aspects of the data asset: the business value it has to the organisation, its fitness for a particular purpose or various purposes, information about regulatory compliance, and more. Business metadata is where data professionals and line-of-business users speak the same language about data assets.
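One way to picture the three classes of metadata is as three small records attached to each catalogue entry. The Python sketch below is illustrative only. The class and field names are hypothetical, and a real catalogue product will define its own model.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class TechnicalMetadata:
    table: str
    columns: dict  # column name -> data type


@dataclass
class ProcessMetadata:
    created_by: str
    last_updated: date
    allowed_roles: list = field(default_factory=list)


@dataclass
class BusinessMetadata:
    description: str
    owner: str
    compliance_notes: str = ""


@dataclass
class CatalogueEntry:
    technical: TechnicalMetadata
    process: ProcessMetadata
    business: BusinessMetadata

    def is_fresh(self, today, max_age_days):
        """Use process metadata to decide if the asset is recent enough."""
        return (today - self.process.last_updated).days <= max_age_days

    def can_access(self, role):
        """Use process metadata to check who has permission to use the asset."""
        return role in self.process.allowed_roles
```

An analyst could call is_fresh before using an asset for a time-sensitive report, which mirrors the "recent enough for the task at hand" decision described under process metadata.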

At a minimum, a data catalogue should make it easy to find and organise all the existing metadata associated with any data asset in your organisation. It should also provide tools that enable data experts to modify, enhance and enrich that metadata with tags, associations, ratings, annotations, and any other information and context that helps users find data faster and use it with confidence.
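The find-and-enrich behaviour described above can be sketched as a toy in-memory catalogue in Python. The DataCatalogue class and its methods are hypothetical names used for illustration and do not correspond to any vendor's API.

```python
class DataCatalogue:
    """A toy in-memory catalogue: asset name -> metadata dictionary."""

    def __init__(self):
        self._assets = {}

    def register(self, name, description):
        self._assets[name] = {"description": description, "tags": set(), "ratings": []}

    def tag(self, name, *tags):
        """Enrich an asset with free-form tags (stored in lower case)."""
        self._assets[name]["tags"].update(t.lower() for t in tags)

    def rate(self, name, stars):
        """Record a user rating for an asset."""
        self._assets[name]["ratings"].append(stars)

    def search(self, term):
        """Find assets whose description contains the term or whose tags
        include it, best-rated first."""
        term = term.lower()
        hits = [
            name
            for name, meta in self._assets.items()
            if term in meta["description"].lower() or term in meta["tags"]
        ]
        return sorted(hits, key=self._average_rating, reverse=True)

    def _average_rating(self, name):
        ratings = self._assets[name]["ratings"]
        return sum(ratings) / len(ratings) if ratings else 0.0
```

Ordering results by rating is one simple way that enrichment by data experts can help other users find trusted data faster.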

Talk to Novon today about how we can help you find the best solution or implement a solution you have discovered and want to use in your organisation.

At its simplest, a data warehouse is a system used for storing and reporting on data. The data typically originates in multiple systems, then it is moved into the data warehouse for long-term storage and analysis. This storage is structured such that users from many divisions within your business can access and analyse the data according to their needs.

For the most part, a data warehouse uses a typical extract, transform, load (ETL) based engine built on three key layers: staging, data integration, and access. The staging layer or staging database stores raw data extracted from each of the disparate source data systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer, often storing this transformed data in an operational data store (ODS) database. The integrated data is then moved to yet another database, often called the data warehouse database, where the data is arranged into hierarchical groups, often called dimensions, and into facts and aggregate facts. The combination of facts and dimensions is sometimes called a star schema. The access layer helps users retrieve data.
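The staging, integration and access layers can be illustrated with a very small star schema. The Python sketch below uses an in-memory SQLite database as a stand-in for a warehouse. The table names (staging_sales, dim_product, fact_sales) and columns are hypothetical, and it assumes each product belongs to exactly one category.

```python
import sqlite3


def build_star_schema(staged_sales):
    """staged_sales: list of (sale_date, product, category, amount) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE staging_sales (sale_date TEXT, product TEXT, category TEXT, amount REAL);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product TEXT UNIQUE, category TEXT);
        CREATE TABLE fact_sales (sale_date TEXT, product_id INTEGER REFERENCES dim_product, amount REAL);
        """
    )
    # Staging layer: raw rows exactly as extracted from the source systems.
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?, ?)", staged_sales)
    # Integration layer: derive the dimension, then the facts that point at it.
    conn.execute(
        "INSERT INTO dim_product (product, category) "
        "SELECT DISTINCT product, category FROM staging_sales"
    )
    conn.execute(
        "INSERT INTO fact_sales "
        "SELECT s.sale_date, d.product_id, s.amount "
        "FROM staging_sales s JOIN dim_product d ON d.product = s.product"
    )
    conn.commit()
    return conn


def sales_by_category(conn):
    """Access layer: an aggregate over facts, grouped by a dimension attribute."""
    rows = conn.execute(
        "SELECT d.category, SUM(f.amount) FROM fact_sales f "
        "JOIN dim_product d USING (product_id) "
        "GROUP BY d.category ORDER BY d.category"
    )
    return dict(rows.fetchall())
```

The fact table holds only measures and keys, while descriptive attributes such as category live in the dimension. That separation is what lets users from different divisions slice the same facts according to their needs.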

The source data is cleansed, transformed, catalogued, and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support.

However, the means to retrieve and analyse data, to extract, transform, and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence and analytics tools, tools to extract, transform, and load data into the repository, and tools to manage and retrieve metadata.

If you are a medium business that wants best-of-breed solutions but has a relatively small IT team, we suggest you exclude the very large IT vendor products from consideration, such as Informatica, IBM and Oracle. The reasons are the cost to license, the effort to implement and maintain, and the amount of customisation required to satisfy your business requirements. However, if you are a corporation of more than 3,500 to 5,000 staff, then those same large vendors and the functionality they offer will probably best suit your requirements.

Regardless of what you may choose, we consider the following key criteria critical when comparing and analysing appropriate vendors:

  1. The solution can manage large volumes of data – Supports shared-nothing / massively parallel processing (MPP), linear scaling, data location and skewing
  2. The solution can load data continuously – Supports attaching partitions and multi-version concurrency control, provides continuous loading, and manages other data types beyond structured data
  3. The solution supports repetitive queries – Workload management, indexing, partition elimination and zone maps, caching optimisation and efficient query planning
  4. The solution can run queries that support advanced analytics, query many data types / sources, support operational BI queries and offer fully redundant system availability
  5. The solution must offer self-service BI tools which enable business users to add data and define new metrics when performing their analysis without requiring IT intervention. Traditional solutions only offer guided discovery tools which operate with a pre-set collection of data and metrics
  6. The solution must offer advanced analytics that encompasses the tools that researchers and data scientists use to create predictive and prescriptive analytical models. This includes predictive analytics, statistical modelling, data mining and Big Data analytics software
  7. The solution must be able to support near real-time data and streaming data including large data volumes
  8. The solution must support integration with various vendor solution types, such as Microsoft Office, collaboration, social interaction and storyboarding, and must be able to scale to fit your forecast growth
  9. You should consider the user skill levels and user types the solution must support, its administration / management capabilities, and the administration / management skills required
  10. There should be a plentiful supply of technical resources available in the market at a reasonable cost
  11. The vendor must have a cloud as well as an on-premises solution
  12. In addition to these criteria, a vendor should be in the top right-hand corner of Gartner’s Magic Quadrant or so close that the difference is irrelevant, should be of a size that suggests it will stay in business for at least a further 7 years, should have high-quality references that are easily obtained, and should have skilled resources readily available
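
One simple way to apply criteria like these when comparing vendors is a weighted scoring matrix. The Python sketch below is an illustration only. The weights, vendor names and scores are hypothetical placeholders, not Novon's assessment of any real product, and only three of the criteria are shown to keep the example short.

```python
def weighted_score(scores, weights):
    """Combine per-criterion scores (for example 0 to 5) into a weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"no score for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[c] * w for c, w in weights.items()) / total_weight


def rank_vendors(vendor_scores, weights):
    """Return vendor names ordered from best to worst weighted score."""
    return sorted(
        vendor_scores,
        key=lambda v: weighted_score(vendor_scores[v], weights),
        reverse=True,
    )


# Hypothetical weights and scores, for illustration only.
WEIGHTS = {"large_volumes": 3, "continuous_load": 2, "self_service_bi": 1}
VENDORS = {
    "Vendor A": {"large_volumes": 5, "continuous_load": 2, "self_service_bi": 3},
    "Vendor B": {"large_volumes": 3, "continuous_load": 5, "self_service_bi": 5},
}
```

Calling rank_vendors(VENDORS, WEIGHTS) orders the candidates by their weighted average. Adjusting the weights to reflect your own priorities is the main design decision, and it is worth agreeing on them before any vendor is scored.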