Saturday, December 3, 2022
HomeBig DataWhat’s a Dimensional Mannequin and Easy methods to Implement It on the...

What’s a Dimensional Mannequin and Easy methods to Implement It on the Databricks Lakehouse Platform


Dimensional modeling is among the hottest information modeling methods for constructing a contemporary information warehouse. It permits clients to shortly develop info and dimensions primarily based on enterprise wants for an enterprise. When serving to clients within the subject, we discovered many are searching for greatest practices and implementation reference structure from Databricks.

On this article, we goal to dive deeper into the very best apply of dimensional modeling on Databricks’ Lakehouse Platform and supply a reside instance to load an EDW dimensional mannequin in real-time utilizing Delta Stay Tables.

Listed here are the high-level steps we’ll cowl on this weblog:

  1. Outline a enterprise drawback
  2. Design a dimensional mannequin
  3. Greatest practices and suggestions for dimensional modeling
  4. Implementing a dimensional mannequin in a Databricks Lakehouse
  5. Conclusion

1. Outline a enterprise drawback

Dimensional modeling is business-oriented; it all the time begins with a enterprise drawback. Earlier than constructing a dimensional mannequin, we have to perceive the enterprise drawback to resolve, because it signifies how the information asset can be introduced and consumed by finish customers. We have to design the information mannequin to help extra accessible and quicker queries.

The Enterprise Matrix is a elementary idea in Dimensional Modeling, beneath is an instance of the enterprise matrix, the place the columns are shared dimensions and rows symbolize enterprise processes. The outlined enterprise drawback determines the grain of the actual fact information and required dimensions. The important thing concept right here is that we might incrementally construct further information property with ease primarily based on the Enterprise Matrix and its shared or conformed dimensions.

A Business Matrix with Shared Dimensions and Business Processes
A Enterprise Matrix with Shared Dimensions and Enterprise Processes

Right here we assume that the enterprise sponsor want to staff to construct a report to provide insights on:

  1. What are the highest promoting merchandise to allow them to perceive product reputation
  2. What are the very best performing shops to study good retailer practices

2. Design a dimensional mannequin

Primarily based on the outlined enterprise drawback, the information mannequin design goals to symbolize the information effectively for reusability, flexibility and scalability. Right here is the high-level information mannequin that might resolve the enterprise questions above.

A diagram shows how star schema works, with facts and dimensions built in Lakehouse
Dimensional Mannequin on the Lakehouse

The design needs to be straightforward to know and environment friendly with totally different question patterns on the information. From the mannequin, we designed the gross sales reality desk to reply our enterprise questions; as you possibly can see, aside from the overseas keys (FKs) to the scale, it solely comprises the numeric metrics used to measure the enterprise, e.g. sales_amount.

We additionally designed dimension tables reminiscent of Product, Retailer, Buyer, Date that present contextual info on the actual fact information. Dimension tables are sometimes joined with reality tables to reply particular enterprise questions, reminiscent of the most well-liked merchandise for a given month, which shops are the best-performing ones for the quarter, and many others.

3. Greatest practices and suggestions for dimensional modeling

With the Databricks Lakehouse Platform, one can simply design & implement dimensional fashions, and easily construct the info and dimensions for the given topic space.

Beneath are a number of the greatest practices really helpful whereas implementing a dimensional mannequin:

  • One ought to denormalize the dimension tables. As a substitute of the third regular type or snowflake kind of mannequin, dimension tables sometimes are extremely denormalized with flattened many-to-one relationships inside a single dimension desk.
  • Use conformed dimension tables when attributes in numerous dimension tables have the identical column names and area contents. This benefit is that information from totally different reality tables may be mixed in a single report utilizing conformed dimension attributes related to every reality desk.
  • A ordinary development in dimension tables is round monitoring modifications to dimensions over time to help as-is or as-was reporting. You may simply apply the next primary methods for dealing with dimensions primarily based on totally different necessities.
    • The kind 1 approach overwrites the dimension attribute’s preliminary worth.
    • With the sort 2 approach, the commonest SCD approach, you employ it for correct change monitoring over time.

    This may be simply achieved out of the field with Delta Stay Tables implementation.

    • One can simply carry out SCD kind 1 or SCD kind 2 utilizing Delta Stay Tables utilizing APPLY CHANGES INTO
  • Major + International Key Constraints permit finish customers like yourselves to know relationships between tables.
  • Utilization of IDENTITY Columns routinely generates distinctive integer values when new rows are added. Identification columns are a type of surrogate keys. Consult with the weblog hyperlink for extra particulars.
  • Enforced CHECK Constraints to by no means fear about information high quality or information correctness points sneaking up on you.

4. Implementing a dimensional mannequin in a Databricks Lakehouse

Now, allow us to take a look at an instance of Delta Stay Tables primarily based dimensional modeling implementation:

The instance code beneath reveals us how you can create a dimension desk (dim_store) utilizing SCD Sort 2, the place change information is captured from the supply system.


-- create the gold desk
CREATE INCREMENTAL LIVE TABLE dim_store
TBLPROPERTIES ("high quality" = "gold")
COMMENT "Slowly Altering Dimension Sort 2 for retailer dimension within the gold layer";

-- retailer all modifications as SCD2
APPLY CHANGES INTO reside.dim_store
FROM STREAM(reside.silver_store)
  KEYS (store_id)
  SEQUENCE BY updated_date
  COLUMNS * EXCEPT (_rescued_data, input_file_name)
  STORED AS SCD TYPE 2;

The instance code beneath reveals us how you can create a reality desk (fact_sale), with the constraint of valid_product_id we’re ready to make sure all reality information which are loaded have a legitimate product related to it.


-- create the actual fact desk for gross sales in gold layer
CREATE STREAMING LIVE TABLE fact_sale (
  CONSTRAINT valid_store_business_key EXPECT (store_business_key IS NOT NULL) ON VIOLATION DROP ROW,
  CONSTRAINT valid_product_id EXPECT (product_id IS NOT NULL) ON VIOLATION DROP ROW
) 
TBLPROPERTIES ("high quality" = "gold", "ignoreChanges" = "true")
COMMENT "gross sales reality desk within the gold layer" AS
  SELECT
    sale.transaction_id,
    date.date_id,
    buyer.customer_id,
    product.product_id AS product_id,
    retailer.store_id,
    retailer.business_key AS store_business_key,
    sales_amount
  FROM STREAM(reside.silver_sale) sale
  INNER JOIN reside.dim_date date
  ON to_date(sale.transaction_date, 'M/d/yy') = to_date(date.date, 'M/d/yyyy') 
  -- solely be part of with the energetic clients
  INNER JOIN (SELECT * FROM reside.dim_customer WHERE __END_AT IS NULL) buyer
  ON sale.customer_id = buyer.customer_id
  -- solely be part of with the energetic merchandise
  INNER JOIN (SELECT * FROM reside.dim_product WHERE __END_AT IS NULL) product
  ON sale.product = product.SKU
  -- solely be part of with the energetic shops
  INNER JOIN (SELECT * FROM reside.dim_store WHERE __END_AT IS NULL) retailer
  ON sale.retailer = retailer.business_key

The Delta Stay Desk pipeline instance might be discovered right here. Please confer with Delta Stay Tables quickstart on how you can create a Delta Stay Desk pipeline. As seen beneath, DLT affords full visibility of the ETL pipeline and dependencies between totally different objects throughout bronze, silver, and gold layers following the lakehouse medallion structure.

End to End DLT Pipeline
Finish to Finish DLT Pipeline

Right here is an instance of how the dimension desk dim_store will get up to date primarily based on the incoming modifications. Beneath, the Retailer Brisbane Airport was up to date to Brisbane Airport V2, and with the out-of-box SCD Sort 2 help, the unique document ended on Jan 07 2022, and a brand new document was created which begins on the identical day with an open finish date (NULL) – which signifies the newest document for the Brisbane airport.

SCD Type 2 for Store Dimension
SCD Sort 2 for Retailer Dimension

For extra implementation particulars, please confer with right here for the complete pocket book instance.

5. Conclusion

On this weblog, we realized about dimensional modeling ideas intimately, greatest practices, and how you can implement them utilizing Delta Stay Tables.

Study extra about dimensional modeling at Kimball Expertise.

Get began on constructing your dimensional fashions within the Lakehouse

Strive Databricks free for 14 days.

RELATED ARTICLES

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Most Popular

Recent Comments