Data lakehouse architecture: What it is and how LakeStack implements it on AWS

The data lake category has matured decisively in 2026, and the lakehouse architecture is now the dominant pattern. Enterprise data teams can get the economics and flexibility of a data lake together with the performance and reliability of a data warehouse, without choosing between them or paying for both.

Manpreet Kour

June 12, 2026

10 min

What is data lakehouse architecture

A data lakehouse is a unified data architecture that combines the advantages of a data warehouse and a data lake. It provides high-performance, affordable, and governance-friendly storage for structured, semi-structured, and unstructured data.

Unlike traditional approaches that require separate data lake and data warehouse infrastructure, a data lakehouse brings both together in one platform. This eliminates the need for complex data copies and separate systems while supporting both business intelligence (BI) and machine learning (ML) workloads.

The architecture stores raw data in a range of formats, potentially from thousands of sources, in one central location. Integrated analytics tools can then use that data directly to train AI models and to generate reports and dashboards.

Key features that define data lakehouse architecture

The lakehouse addresses the most persistent failure modes of the original data lake design. It does so by adding transactional reliability and warehouse-grade query performance on top of cheap cloud object storage.

Three core layers of data lakehouse architecture

A data lakehouse uses the low-cost cloud object storage of a data lake to provide on-demand, scalable storage for massive volumes of data in raw form. It then adds a metadata layer over this store to provide warehouse-like performance and optimization.

The architecture consists of three core layers:

Storage layer

A low-cost object store for all raw datasets, decoupled from compute resources so that storage and compute scale independently. On AWS, this is typically Amazon S3.

Staging layer

Also called the metadata layer, this layer provides a detailed catalog of the stored data and applies management features such as indexing, caching, and access control. It is what enables schema enforcement and ACID transactions.

Semantic layer

The user-facing layer where client apps, analytics tools, and data scientists access data for experimentation and BI presentation. This is where SQL analytics and machine learning workloads converge.

This three-layer structure provides a single platform for business intelligence, predictive analytics, and generative AI workflows. By unifying siloed data, organizations avoid maintaining separate systems and complex synchronization overhead.
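The division of labor between the three layers can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any AWS service or LakeStack is implemented: `ObjectStore`, `Catalog`, and `total_by` are hypothetical names, and a dictionary stands in for the object store. What it shows is the key idea that a file only becomes part of a table once the metadata layer has validated it and recorded it.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectStore:
    """Storage layer stand-in: maps object keys to raw rows (in-memory, hypothetical)."""
    objects: dict = field(default_factory=dict)

    def put(self, key, rows):
        self.objects[key] = list(rows)


@dataclass
class Catalog:
    """Metadata layer stand-in: tracks each table's schema and committed files."""
    store: ObjectStore
    schemas: dict = field(default_factory=dict)
    files: dict = field(default_factory=dict)

    def create_table(self, name, schema):
        self.schemas[name] = schema
        self.files[name] = []

    def append(self, name, key, rows):
        schema = self.schemas[name]
        # Schema enforcement: reject the whole batch if any row does not match.
        for row in rows:
            if set(row) != set(schema):
                raise ValueError(f"columns {sorted(row)} do not match schema")
            for column, expected_type in schema.items():
                if not isinstance(row[column], expected_type):
                    raise TypeError(f"{column} must be {expected_type.__name__}")
        self.store.put(key, rows)
        # The file only becomes visible to readers once the catalog lists it.
        self.files[name].append(key)

    def scan(self, name):
        for key in self.files[name]:
            yield from self.store.objects[key]


def total_by(catalog, table, group_col, value_col):
    """Semantic layer stand-in: a BI-style aggregate over catalogued data."""
    totals = {}
    for row in catalog.scan(table):
        totals[row[group_col]] = totals.get(row[group_col], 0) + row[value_col]
    return totals


catalog = Catalog(ObjectStore())
catalog.create_table("sales", {"region": str, "amount": float})
catalog.append("sales", "sales/part-0", [{"region": "EU", "amount": 10.0},
                                         {"region": "US", "amount": 5.0}])
catalog.append("sales", "sales/part-1", [{"region": "EU", "amount": 2.5}])
print(total_by(catalog, "sales", "region", "amount"))
```

A batch with a wrongly typed `amount` raises before anything is written, so readers never see a half-valid table. Real table formats achieve the same effect with metadata files rather than Python objects.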

Medallion data architecture: The implementation pattern

Medallion architecture is a data management framework that logically organizes data into three layers within a data lakehouse: bronze, silver, and gold. Each layer contains data at a different processing stage.

It is the recommended design approach for modern lakehouse implementations.

How data flows through medallion architecture

Raw data enters the system in the bronze layer, gets cleaned and transformed in the silver layer, and then turns into fully processed data at the gold layer. As data flows through medallion architecture, each layer stores, processes, and manages the data during a different stage in its lifecycle.

Bronze layer: The foundation

The Bronze layer is the foundation of the medallion architecture, where raw data first lands. Its defining characteristic is immutability: once written, this data should never change.

Silver layer: Refined data

In the Silver layer, data is cleaned, transformed, and structured to make it more usable for analytics and reporting. Silver is refined data that has passed quality checks.

Gold layer: Curated truth

Gold is curated, business-ready data. This layer contains fully processed data that can be modeled and analyzed directly.

The medallion architecture is a multi-layered data management approach for data lakehouses, where data quality gradually improves as it moves through each layer. This progressive improvement makes data more suitable for business intelligence and machine learning applications.
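The bronze, silver, and gold stages can be sketched as three small functions. This is an illustrative sketch in plain Python under assumed data: the field names (`order_id`, `sku`, `qty`, `price`, `day`), the sample rows, and the revenue-per-day metric are all hypothetical, and a production pipeline would run as a Spark or AWS Glue job against tables rather than in-memory lists.

```python
from datetime import date

# Hypothetical raw feed; the last two rows are deliberately duplicated or bad.
raw_events = [
    {"order_id": "1001", "sku": "A-1", "qty": "2", "price": "9.99", "day": "2026-06-01"},
    {"order_id": "1002", "sku": "B-7", "qty": "1", "price": "24.50", "day": "2026-06-01"},
    {"order_id": "1002", "sku": "B-7", "qty": "1", "price": "24.50", "day": "2026-06-01"},
    {"order_id": "1003", "sku": "A-1", "qty": "oops", "price": "9.99", "day": "2026-06-02"},
]


def to_bronze(events):
    """Land records exactly as received; a tuple so later stages cannot append to it."""
    return tuple(dict(e) for e in events)


def to_silver(bronze):
    """Cast types, set aside rows that fail validation, and de-duplicate by order_id."""
    seen, clean, rejected = set(), [], []
    for row in bronze:
        try:
            typed = {
                "order_id": int(row["order_id"]),
                "sku": row["sku"],
                "qty": int(row["qty"]),
                "price": float(row["price"]),
                "day": date.fromisoformat(row["day"]),
            }
        except (KeyError, ValueError):
            rejected.append(row)
            continue
        if typed["order_id"] in seen:
            continue
        seen.add(typed["order_id"])
        clean.append(typed)
    return clean, rejected


def to_gold(silver):
    """Aggregate to a business-facing metric: revenue per day."""
    revenue = {}
    for row in silver:
        revenue[row["day"]] = revenue.get(row["day"], 0.0) + row["qty"] * row["price"]
    return {day.isoformat(): round(total, 2) for day, total in sorted(revenue.items())}


bronze = to_bronze(raw_events)
silver, rejected = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), len(rejected))
print(gold)
```

Bronze keeps all four records, including the bad one, so the silver rules can be changed and replayed later without going back to the source system.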

Apache Iceberg: The open table format for modern lakehouses

Apache Iceberg has emerged as the leading open table format in 2026, after years of development and industry competition. In practical terms, Iceberg has won the format wars.

A format that was considered less performant only a few years ago is now the industry standard, backed by major technology companies and widely adopted.

Why Iceberg data lakehouse dominates 2026

Apache Iceberg has moved beyond early adoption and into the core of enterprise data platforms. It is now widely used to support large-scale analytics, AI and ML workloads, and business-critical reporting across modern lakehouse environments.

The next generation of Amazon SageMaker is built on an open lakehouse architecture, fully compatible with Apache Iceberg. Apache Iceberg in combination with supported AWS services enables a transactional data lake, often based on storage in S3.

Key Iceberg features for enterprise deployments

ACID guarantees ensure data consistency
Time travel enables queries against historical snapshots
Schema evolution allows table structure changes without data reload
Partition evolution supports performance optimization over time
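Three of these features follow from one design decision: every commit produces a new immutable snapshot instead of modifying data in place. The toy table below illustrates that idea in plain Python. It is not the Iceberg API or file layout; `ToyTable` and `Snapshot` are hypothetical names, and partition evolution is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    columns: tuple   # schema as of this snapshot
    files: tuple     # immutable data files visible in this snapshot


class ToyTable:
    """Snapshot-based table: every commit creates a new snapshot, none are mutated."""

    def __init__(self, columns):
        self.snapshots = [Snapshot(0, tuple(columns), ())]

    @property
    def current(self):
        return self.snapshots[-1]

    def _commit(self, columns, files):
        # One list append stands in for an atomic metadata-pointer swap:
        # readers see either the old snapshot or the new one, never a mix.
        snap = Snapshot(self.current.snapshot_id + 1, tuple(columns), tuple(files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def append(self, rows):
        new_file = tuple(dict(r) for r in rows)
        return self._commit(self.current.columns, self.current.files + (new_file,))

    def add_column(self, name):
        # Schema evolution: a metadata-only change, existing files are untouched.
        return self._commit(self.current.columns + (name,), self.current.files)

    def scan(self, snapshot_id=None):
        snap = self.current if snapshot_id is None else self.snapshots[snapshot_id]
        # Rows written before a column existed read back with None for it.
        return [{c: row.get(c) for c in snap.columns} for f in snap.files for row in f]


table = ToyTable(["order_id", "amount"])
v1 = table.append([{"order_id": 1, "amount": 10.0}])
v2 = table.add_column("currency")
v3 = table.append([{"order_id": 2, "amount": 7.5, "currency": "EUR"}])

print(table.scan(snapshot_id=v1))  # time travel: the table as of the first append
print(table.scan())                # current state, with the new column
```

Reading at `v1` returns the single original row with its original two columns, while the current scan returns both rows, with `currency` set to `None` for the row written before the column was added.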

Data lakehouse market growth and enterprise adoption

The Data Lakes Market reached an estimated USD 20.18 billion in 2025 and is projected to climb to USD 24.62 billion in 2026. It is forecast to reach USD 148.50 billion by 2035, a CAGR of 23.50% over the 2026–2035 forecast window.

Cloud deployment held the majority share of the Data Lakes Market in 2025, valued at approximately USD 13.70 billion. North America dominated the Data Lakes Market with a 40.10% share in 2025.

Market trajectory by deployment type

Hybrid and multi-cloud architectures are forecast to grow at a 24.60% CAGR to 2035 as organizations pursue multi-cloud portability strategies.

Solutions captured roughly 73% of revenue in the Data Lakes Market in 2025, reflecting strong demand for integrated platforms supporting Apache Spark processing for data lake analytics. Services are poised to expand at a 26.40% CAGR through 2035 as enterprises seek managed deployment and data lake governance consulting.

A separate, more conservative estimate puts the global data lake market at USD 11.07 billion in 2025, expanding to USD 84.27 billion by 2034 at a CAGR of 25.30%.
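Because the two estimates use different baselines and horizons, it helps to recompute the growth rate that each pair of endpoints implies. The sketch below applies the standard compound annual growth rate formula, CAGR = (end / start)^(1 / years) - 1, to the dollar figures quoted above. The function name `cagr` is illustrative, and the choice of which year counts as the base is an assumption.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Endpoints quoted above, in USD billions.
first_estimate = cagr(24.62, 148.50, 2035 - 2026)   # 2026 -> 2035
second_estimate = cagr(11.07, 84.27, 2034 - 2025)   # 2025 -> 2034

print(f"{first_estimate:.1%}")
print(f"{second_estimate:.1%}")
```

Under these assumptions the second pair of endpoints reproduces the quoted rate of about 25.3%. The first pair works out to roughly 22.1%, a little below the quoted 23.50%, which may reflect a different base year or rounding in the underlying report.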

Why organizations choose data lakehouse implementation

A data lakehouse combines the openness and scalability of data lakes with the reliability and governance of data warehouses in a single platform. By keeping all data in one place and supporting both BI and advanced analytics, a lakehouse removes the need for separate systems and complex data copies.

Business benefits of data lakehouse architecture

Data lakehouses provide scalable storage and processing capabilities for modern organizations that want to avoid isolated systems for processing different workloads like machine learning and business intelligence.

The data lakehouse combines the benefits of data lakes and data warehouses and provides open, direct access to data stored in standard data formats. It offers indexing protocols optimized for machine learning and data science, plus low query latency and high reliability for BI and advanced analytics.

By combining an optimized metadata layer with validated data stored in standard formats in cloud object storage, a data lakehouse lets you work from the same data, on the same platform, across different use cases.

How LakeStack implements data lakehouse architecture on AWS

LakeStack is a no-code, AWS-native data platform that unifies data lakes, ETL pipelines, BI, and AI into one plug-and-play stack. It deploys AI-ready, governed data infrastructure directly into your own AWS account in under 2 weeks.

LakeStack's data lakehouse implementation approach

LakeStack is built on AWS and deploys directly into your own AWS account, so you own and control your data. No data leaves your account, and you maintain full governance control.

Medallion architecture in LakeStack

LakeStack implements medallion data architecture with Bronze, Silver, and Gold layers automatically configured. Raw data lands in Bronze, gets cleaned in Silver, and becomes business-ready in Gold, all managed through AWS Glue pipelines.

This progressive data refinement makes information more suitable for business intelligence and machine learning applications without manual pipeline construction.

Apache Iceberg integration

LakeStack uses Apache Iceberg as its open table format, providing:

ACID transactions for data consistency
Time travel for historical queries
Schema evolution for flexible table structure changes
Full compatibility with Amazon SageMaker's open lakehouse architecture

Apache Iceberg powers both batch and real-time data pipelines in LakeStack, enabling consistent, low-latency data access across all workloads.

AWS Glue optimization for lakehouse

LakeStack leverages AWS Glue for ETL with Iceberg-specific optimizations:

Amazon S3 table buckets for managed Iceberg table maintenance and storage optimization
Materialized views in Glue to pre-compute aggregates and boost query performance
Catalog management and federation to enable agile, cross-source analytics
Iceberg integration with AWS compute services like Athena, Redshift, and EMR Spark

Iceberg views in AWS Glue reduce data pipeline complexity, because pre-computed results can replace separate aggregation jobs, and they return results quickly.

Real-time data lakehouse implementation on AWS

Modern lakehouse implementations enable real-time data pipelines alongside traditional batch processing. In one implementation using AWS Glue and Iceberg, teams built a real-time pipeline that ingests, enriches, and aggregates streaming sales data directly into a lakehouse.

Real-time pipeline architecture pattern

HTTP Endpoint → Iceberg Bronze → Postgres Lookup → Iceberg Silver → Aggregate → Iceberg Gold

This pattern includes:

HTTP Endpoint source for continuously ingesting sales events
Iceberg Sink for writing bronze layer raw sales records
Postgres lookup enricher for getting product data
Iceberg Sink for writing silver layer enriched records
Aggregate component for stateful computation of real-time metrics
Iceberg Sink for writing gold layer sales aggregates
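The flow above can be sketched as a single event handler. This is a simplified illustration, not the implementation referenced in the text: plain Python lists stand in for the three Iceberg sinks, the `PRODUCTS` dictionary stands in for the Postgres lookup table, and the event fields (`sku`, `qty`, `price`) and the per-category metric are assumptions.

```python
from collections import defaultdict

# Stand-in for the Postgres product table (hypothetical rows).
PRODUCTS = {
    "A-1": {"name": "Widget", "category": "hardware"},
    "B-7": {"name": "Manual", "category": "books"},
}

bronze, silver = [], []   # stand-ins for the bronze and silver Iceberg sinks
gold = defaultdict(lambda: {"orders": 0, "revenue": 0.0})  # running aggregates


def handle_event(event):
    """Process one sales event end to end, as the HTTP source would deliver it."""
    bronze.append(dict(event))              # 1. raw record, stored as received

    product = PRODUCTS.get(event["sku"])    # 2. lookup enrichment
    if product is None:
        return                              # unknown SKU: kept in bronze only
    enriched = {**event, **product}
    silver.append(enriched)                 # 3. enriched record

    bucket = gold[enriched["category"]]     # 4. stateful aggregate per category
    bucket["orders"] += 1
    bucket["revenue"] += event["qty"] * event["price"]


for e in [
    {"sku": "A-1", "qty": 2, "price": 10.0},
    {"sku": "B-7", "qty": 1, "price": 24.5},
    {"sku": "Z-9", "qty": 1, "price": 1.0},
    {"sku": "A-1", "qty": 1, "price": 10.0},
]:
    handle_event(e)

print(len(bronze), len(silver))
print(dict(gold))
```

The event with the unknown SKU still lands in bronze, so it can be reprocessed once the product table catches up. In a real deployment the aggregate state would need to be checkpointed rather than held in process memory.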

LakeStack supports this same real-time pattern natively, eliminating the need to build custom pipelines from scratch.

Best practices for data lakehouse implementation on AWS

AWS re:Invent 2025 recommendations

The following practices are aimed at significantly reducing data latency and improving data consistency:

Use S3 Table Buckets for all Apache Iceberg projects if possible, and size files appropriately
Set up snapshots and partitioning to ensure performance into the future
Leverage materialized views for pre-aggregated, pre-computed results
Prioritize catalog management to enable agile, cross-source analytics
Integrate Iceberg with AWS compute services like Athena, Redshift, and EMR Spark
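To make the file-sizing advice concrete, the sketch below plans a compaction pass: it groups small data files into batches that can each be rewritten as one larger file. It is an illustration of the idea only. The 128 MB target is an assumed value rather than a figure from the source, `plan_compaction` is a hypothetical helper, and managed table maintenance is meant to do this work for you.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into rewrite batches of at most target_mb each."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target_mb:
            continue                      # already large enough, leave it alone
        if current and current_size + size > target_mb:
            groups.append(current)        # batch is full, start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # A batch with a single file would be rewritten for no gain, so drop it.
    return [g for g in groups if len(g) > 1]


plan = plan_compaction([200, 64, 60, 30, 20, 10, 5])
print(plan)
```

With the sample sizes, the 200 MB file is left alone and the six small files collapse into two rewrite batches, so a query planner has three files to open instead of seven.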

Medallion architecture best practices

When implementing medallion architecture in AWS, follow these patterns:

Use one workspace per medallion layer for clear separation
Inside each workspace, use one lakehouse per data source
Combine sources in the silver layer before curating to gold
Reserve the gold layer for business-ready data that feeds semantic models

Bonus tip: Use schemas inside lakehouses to separate domains (e.g., sales, finance) if you want fewer lakehouses.

Don't overbuild your medallion architecture

Many teams implement Bronze, Silver, and Gold without asking whether the dataset is complex enough to require three physical layers. Before automatically building three schemas, consider:

How often does the source schema change?
Are multiple teams modifying the structure?
Is this shared infrastructure across multiple domains?

If the answer to all three is no, you might not need three physically separate layers. Layers should represent responsibilities and boundaries, not habits.

AI readiness and data lakehouse architecture

78% of organizations reported using AI in 2024, up from 55% the year before. However, only 13% of organizations are AI Pacesetters, leaders who architect differently and deploy AI at the scale and speed needed to realize value.

The critical evolution of 2026 is the near-universal adoption of the lakehouse architecture, which solves the most persistent failure modes of the original data lake design. This architecture is essential for AI readiness because it:

Provides open, direct access to data stored in standard formats for ML training
Offers indexing protocols optimized for machine learning
Supports generative AI workflows from the same underlying data store
Enables consistent, low-latency data access for both batch and real-time AI pipelines

Apache Iceberg, Polaris, Parquet, and Arrow all pushed forward in 2025 with practical features that improve performance, governance, or compatibility. Together, they form a foundation for a warehouse experience on open data that enables AI workloads at scale.

LakeStack positions itself as the AI-ready data foundation you own precisely because it combines lakehouse architecture with Apache Iceberg and native AWS AI services.

When to choose data lakehouse over traditional architectures

Data lakehouse vs data warehouse

Unlike a data warehouse, which is built around structured data, a data lakehouse can also store semi-structured and unstructured data for machine learning purposes.

Data lakehouse vs data lake

The original promise of cheap, flexible storage has been fulfilled, and the architecture has evolved well beyond it. The modern data lakehouse adds transactional reliability, governance, and warehouse-grade query performance to the economics of object storage.

Start your data lakehouse implementation with LakeStack

LakeStack unifies data lakes, ETL pipelines, BI, and AI into one plug-and-play stack deployed in your AWS account. You get AI-ready, governed data infrastructure in under 2 weeks without writing code.

What you get with LakeStack

Complete data lakehouse architecture with Bronze, Silver, Gold layers
Apache Iceberg integration for ACID transactions and schema evolution
AWS Glue ETL pipelines optimized for lakehouse performance
Medallion architecture automatically configured
AI readiness with native Amazon SageMaker compatibility
Full ownership: data stays in your AWS account

Ready to implement data lakehouse architecture on AWS? LakeStack deploys the complete stack in your account so you own and control your AI-ready data foundation.