This article looks at model normalization patterns for data lakes.
The data lake can be considered the consolidation point for all of the data that is of value across different parts of the enterprise. A typical data lake contains a significant range of data repositories, and these repositories are likely to play a number of different roles for different users, for example:
- Storing raw, unprocessed data vs. storing highly processed data
- Managing normalized data vs. managing aggregated data
- Data for general use across the enterprise vs. data for a specific business purpose or a specific set of business users
- Data structures with no predefined schema vs. data structures with a predefined schema (a distinction illustrated in the sketch after this list).
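As a small, hedged illustration of that last distinction, the PySpark sketch below reads the same landed file twice: once with no predefined schema (schema-on-read, letting Spark infer types) and once against an agreed schema. The HDFS path and the order columns are illustrative assumptions, not part of any particular design.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               DateType, DecimalType)

spark = SparkSession.builder.appName("schema-roles").getOrCreate()

# No predefined schema: schema-on-read, with Spark inferring types
# from whatever the raw landed JSON happens to contain.
raw_orders = spark.read.json("hdfs:///lake/raw/orders/")

# Predefined schema: the same data read against an agreed structure,
# so type drift in the source surfaces immediately.
order_schema = StructType([
    StructField("order_id",    StringType(),       nullable=False),
    StructField("customer_id", StringType(),       nullable=False),
    StructField("order_date",  DateType(),         nullable=True),
    StructField("amount",      DecimalType(12, 2), nullable=True),
])
typed_orders = spark.read.schema(order_schema).json("hdfs:///lake/raw/orders/")
```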
The other key question in building out the data lake repositories is to what degree standardization or consistency of schema is desirable, or even necessary. It is a valid choice for an organization to decide that there is no need to enforce any degree of schema standardization across the data lake. In that case, the expectation is that whatever virtualization layer is in place can guide the different users through the array of different structures, the duplication, and the different terminology. In other cases, the decision is taken that at least some parts of the data lake must comply with some degree of standardization in their database schemas, even where those databases still do a range of different jobs and so may need to be structured differently.
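One lightweight way such a virtualization layer can paper over divergent terminology is a standardizing view. The sketch below is only one possible approach, with hypothetical source tables `crm_customers` and `billing_clients`, each using its own local column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("virtualized-vocabulary").getOrCreate()

# Map two hypothetical source tables, each with its own local terminology,
# onto one standard vocabulary, so consumers can query customer_standard
# without knowing either source's naming conventions.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW customer_standard AS
    SELECT cust_id   AS customer_id, cust_nm   AS customer_name FROM crm_customers
    UNION ALL
    SELECT client_no AS customer_id, full_name AS customer_name FROM billing_clients
""")
```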
The data models traditionally used in the construction of data warehouses would often start with a reasonably high degree of normalization, typically enough to provide the flexibility needed to represent the various business needs effectively. When these models are then transformed from such a logical, platform-independent format into a more platform-specific format, varying degrees of denormalization take place in order to produce physical models that perform well in the specific physical environment.
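As a sketch of that logical-to-physical step, the following assumes a normalized order fact and customer dimension (the table names, columns, and paths are illustrative) and folds the dimension into the fact to produce a denormalized, platform-friendly table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("denormalize").getOrCreate()

# Normalized, platform-independent shapes: an order fact and a customer
# dimension stored separately (paths and columns are assumptions).
orders    = spark.read.parquet("hdfs:///lake/modeled/orders/")
customers = spark.read.parquet("hdfs:///lake/modeled/customers/")

# Platform-specific denormalization: fold the dimension into the fact so
# downstream reads need no join, trading flexibility for scan performance.
orders_denorm = (
    orders.join(customers, on="customer_id", how="left")
          .select("order_id", "order_date", "amount",
                  "customer_id", "customer_name", "customer_region")
)
orders_denorm.write.mode("overwrite").parquet("hdfs:///lake/serving/orders_denorm/")
```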
The focus on denormalization becomes critical in the context of the data lake, and specifically in terms of any associated Hadoop/HDFS data structures. The traditional trade-off, when considering the appropriate level of normalization or denormalization for a physical, platform-specific structure, is between:
- The flexibility of storing data elements in their most granular, most atomic form, which makes it possible to address many different business issues, in many cases even as-yet-unanticipated business questions.
- The storage of data in a format close to that required by the immediate and/or known business requirements. This denormalized approach enables simpler ETL and simpler access SQL, but at the cost of flexibility, as the sketch after this list makes concrete.
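The two queries below make the trade-off concrete, assuming the illustrative tables from the earlier sketches have been registered as views. The normalized path answers the question through a join and could answer many others from the same atomic rows; the denormalized path answers this one known question with trivial SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tradeoff").getOrCreate()

# Register the modeled and serving tables from the earlier sketches as views;
# the paths are the same illustrative assumptions used above.
spark.read.parquet("hdfs:///lake/modeled/orders/").createOrReplaceTempView("orders")
spark.read.parquet("hdfs:///lake/modeled/customers/").createOrReplaceTempView("customers")
spark.read.parquet("hdfs:///lake/serving/orders_denorm/").createOrReplaceTempView("orders_denorm")

# Flexible but join-heavy: atomic rows joined on demand, a pattern that can
# serve many questions beyond this one.
by_region_flexible = spark.sql("""
    SELECT c.customer_region, SUM(o.amount) AS total_amount
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.customer_region
""")

# Simple but rigid: the pre-joined table answers the known question directly,
# but only questions that fit its baked-in shape.
by_region_simple = spark.sql("""
    SELECT customer_region, SUM(amount) AS total_amount
    FROM orders_denorm
    GROUP BY customer_region
""")
```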