
[ClickHouse Source Code] The Write Path of Materialized Views

news 2025/7/11 15:01:35

This article walks through the source code of the write path of ClickHouse materialized views in detail, based on version v22.8.14.53-lts.

StorageMaterializedView

Let's start with the constructor of the materialized view:

    StorageMaterializedView::StorageMaterializedView(
        const StorageID & table_id_,
        ContextPtr local_context,
        const ASTCreateQuery & query,
        const ColumnsDescription & columns_,
        bool attach_,
        const String & comment)
        : IStorage(table_id_), WithMutableContext(local_context->getGlobalContext())
    {
        StorageInMemoryMetadata storage_metadata;
        storage_metadata.setColumns(columns_);
        ......
        if (!has_inner_table)
        {
            target_table_id = query.to_table_id;
        }
        else if (attach_)
        {
            /// If there is an ATTACH request, then the internal table must already be created.
            target_table_id = StorageID(getStorageID().database_name, generateInnerTableName(getStorageID()), query.to_inner_uuid);
        }
        else
        {
            /// We will create a query to create an internal table.
            auto create_context = Context::createCopy(local_context);
            auto manual_create_query = std::make_shared<ASTCreateQuery>();
            manual_create_query->setDatabase(getStorageID().database_name);
            manual_create_query->setTable(generateInnerTableName(getStorageID()));
            manual_create_query->uuid = query.to_inner_uuid;

            auto new_columns_list = std::make_shared<ASTColumns>();
            new_columns_list->set(new_columns_list->columns, query.columns_list->columns->ptr());

            manual_create_query->set(manual_create_query->columns_list, new_columns_list);
            manual_create_query->set(manual_create_query->storage, query.storage->ptr());

            InterpreterCreateQuery create_interpreter(manual_create_query, create_context);
            create_interpreter.setInternal(true);
            create_interpreter.execute();

            target_table_id = DatabaseCatalog::instance().getTable({manual_create_query->getDatabase(), manual_create_query->getTable()}, getContext())->getStorageID();
        }
    }

The code above shows that a materialized view can be set up in several ways, which fall into 3 cases:

With an explicit target table:

    create table src(id Int32) Engine=Memory();
    create table dest(id Int32) Engine=Memory();
    create materialized view mv to dest as select * from src;

In this form, target_table_id is the table_id of the dest table.

Without an explicit target table:
```
create table src(id Int32) Engine=Memory();
create materialized view mv Engine=Memory() as select * from src;
```
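In this form, a name for the target (inner) table starting with .inner. is generated first, for example .inner.5ef4ec2c-efb1-4918-bf6c-59de2edb54cf. Note that in the constructor the name comes from generateInnerTableName(getStorageID()), so it is derived from the StorageID of the materialized view itself rather than from the source table. The inner table is then created with a randomly generated UUID (query.to_inner_uuid), and its StorageID is used as target_table_id.

The third case is not really a creation syntax. It is the ATTACH path, which is taken when ClickHouse starts up or when a detached materialized view is attached again.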
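
The three cases can be summarized as a small decision function. The following is only a sketch to make the branching easier to follow, not the ClickHouse implementation: TableId, CreateViewQuery, Resolution, makeInnerTableName() and resolveTarget() are hypothetical names introduced here, and the ".inner." + view name scheme is an assumption made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for ClickHouse's StorageID.
struct TableId
{
    std::string database;
    std::string table;
};

// Hypothetical stand-in for the fields of ASTCreateQuery that the constructor inspects.
struct CreateViewQuery
{
    bool has_to_table = false;  // true when "TO dest" was given
    TableId to_table;           // only meaningful when has_to_table is true
    bool attach = false;        // true on ATTACH (server start or re-attach)
};

enum class TargetKind { ExplicitTarget, ExistingInnerTable, NewInnerTable };

struct Resolution
{
    TargetKind kind;
    TableId target;
};

// Assumed naming scheme, for illustration only.
std::string makeInnerTableName(const TableId & view)
{
    return ".inner." + view.table;
}

// Mirrors the three branches of the constructor shown above.
Resolution resolveTarget(const TableId & view, const CreateViewQuery & query)
{
    assert(!view.table.empty());

    // Case 1: "TO dest" was given, so there is no inner table.
    if (query.has_to_table)
        return {TargetKind::ExplicitTarget, query.to_table};

    TableId inner{view.database, makeInnerTableName(view)};

    // Case 3: ATTACH, the inner table must already exist.
    if (query.attach)
        return {TargetKind::ExistingInnerTable, inner};

    // Case 2: the real code runs an internal CREATE TABLE at this point.
    return {TargetKind::NewInnerTable, inner};
}
```

Calling resolveTarget() for a view named mv with has_to_table set returns the explicit target unchanged, while the other two cases both point at .inner.mv and differ only in whether the table has to be created.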

StorageMaterializedView::read

    void StorageMaterializedView::read(
        QueryPlan & query_plan,
        const Names & column_names,
        const StorageSnapshotPtr & storage_snapshot,
        SelectQueryInfo & query_info,
        ContextPtr local_context,
        QueryProcessingStage::Enum processed_stage,
        const size_t max_block_size,
        const size_t num_streams)
    {
        /// Get the target table instance.
        auto storage = getTargetTable();
        auto lock = storage->lockForShare(local_context->getCurrentQueryId(), local_context->getSettingsRef().lock_acquire_timeout);
        auto target_metadata_snapshot = storage->getInMemoryMetadataPtr();
        auto target_storage_snapshot = storage->getStorageSnapshot(target_metadata_snapshot, local_context);

        if (query_info.order_optimizer)
            query_info.input_order_info = query_info.order_optimizer->getInputOrder(target_metadata_snapshot, local_context);

        storage->read(query_plan, column_names, target_storage_snapshot, query_info, local_context, processed_stage, max_block_size, num_streams);

        if (query_plan.isInitialized())
        {
            /// Block structure that the materialized view's stream should have.
            auto mv_header = getHeaderForProcessingStage(column_names, storage_snapshot, query_info, local_context, processed_stage);
            /// Block structure of the columns the query reads from the target table.
            auto target_header = query_plan.getCurrentDataStream().header;

            /// Remove from the queried columns those that do not exist in the materialized view.
            removeNonCommonColumns(mv_header, target_header);

            /// With the Distributed engine, when the query is processed up to a given stage,
            /// the header may not contain all columns of the materialized view (e.g. with GROUP BY),
            /// so remove from mv_header the columns the query does not need.
            removeNonCommonColumns(target_header, mv_header);

            /// When mv_header and target_header have different structures, an expression step
            /// is added to the pipeline to convert between them,
            /// e.g. Decimal(38, 6) -> Decimal(16, 6), or aggregate functions such as sum.
            if (!blocksHaveEqualStructure(mv_header, target_header))
            {
                auto converting_actions = ActionsDAG::makeConvertingActions(
                    target_header.getColumnsWithTypeAndName(),
                    mv_header.getColumnsWithTypeAndName(),
                    ActionsDAG::MatchColumnsMode::Name);

                auto converting_step = std::make_unique<ExpressionStep>(query_plan.getCurrentDataStream(), converting_actions);
                converting_step->setStepDescription("Convert target table structure to MaterializedView structure");
                query_plan.addStep(std::move(converting_step));
            }

            query_plan.addStorageHolder(storage);
            query_plan.addTableLock(std::move(lock));
        }
    }

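As the code shows, a materialized view is only a logical description. The data is stored in the target table, and a read actually operates on the target table. The query may also involve converting blocks between processing stages and evaluating expressions.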
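
To make the header reconciliation in read() more concrete, here is a minimal sketch of the same idea on plain vectors. It assumes a header is just an ordered list of (name, type) pairs. Column, Header, hasColumn() and columnsNeedingConversion() are hypothetical names, and this removeNonCommonColumns() only imitates the argument order of the calls shown above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a Block header: an ordered list of (name, type).
struct Column
{
    std::string name;
    std::string type;
};
using Header = std::vector<Column>;

bool hasColumn(const Header & header, const std::string & name)
{
    return std::any_of(header.begin(), header.end(),
                       [&](const Column & c) { return c.name == name; });
}

// Erase from `target` every column whose name does not occur in `reference`.
void removeNonCommonColumns(const Header & reference, Header & target)
{
    target.erase(
        std::remove_if(target.begin(), target.end(),
                       [&](const Column & c) { return !hasColumn(reference, c.name); }),
        target.end());
}

// Names of the columns whose type differs and would therefore need a cast.
// Expects both headers to be reduced to their common columns already.
std::vector<std::string> columnsNeedingConversion(const Header & from, const Header & to)
{
    assert(from.size() == to.size());
    std::vector<std::string> result;
    for (const auto & dst : to)
        for (const auto & src : from)
            if (src.name == dst.name && src.type != dst.type)
                result.push_back(dst.name);
    return result;
}
```

With a view header of (id Int32, amount Decimal(16, 6)) and a target header of (id Int32, amount Decimal(38, 6), extra String), the first call drops extra from the target header, the second call leaves the view header unchanged, and only amount is reported as needing a conversion.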

StorageMaterializedView::write

    SinkToStoragePtr StorageMaterializedView::write(const ASTPtr & query, const StorageMetadataPtr & /*metadata_snapshot*/, ContextPtr local_context)
    {
        auto storage = getTargetTable();
        auto lock = storage->lockForShare(local_context->getCurrentQueryId(), local_context->getSettingsRef().lock_acquire_timeout);

        auto metadata_snapshot = storage->getInMemoryMetadataPtr();
        auto sink = storage->write(query, metadata_snapshot, local_context);

        sink->addTableLock(lock);
        return sink;
    }

Likewise, a write stores the data in the target table.

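It is well known that writing to the source table triggers a write to the materialized view, which in turn writes the data into the target table. Let's see how this is implemented. Every SQL statement is executed through IInterpreter and one of its InterpreterXxx implementations, which we won't go into here. A write ultimately ends up in InterpreterInsertQuery, so we start tracing from InterpreterInsertQuery::execute().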

InterpreterInsertQuery::execute()

    BlockIO InterpreterInsertQuery::execute()
    {
        ......
        std::vector<Chain> out_chains;
        if (!distributed_pipeline || query.watch)
        {
            size_t out_streams_size = 1;
            ......
            for (size_t i = 0; i < out_streams_size; ++i)
            {
                auto out = buildChainImpl(table, metadata_snapshot, query_sample_block, nullptr, nullptr);
                out_chains.emplace_back(std::move(out));
            }
        }
        ......
    }

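execute() builds the output chains through buildChainImpl(). Unless pushing to views is disabled for the table (noPushingToViews()), buildChainImpl() calls buildPushingToViewsChain(), which takes care of any materialized views attached to the table.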

buildPushingToViewsChain()

This function is very long, so only the parts relevant to this article are shown here.

    Chain buildPushingToViewsChain(
        const StoragePtr & storage,
        const StorageMetadataPtr & metadata_snapshot,
        ContextPtr context,
        const ASTPtr & query_ptr,
        bool no_destination,
        ThreadStatusesHolderPtr thread_status_holder,
        std::atomic_uint64_t * elapsed_counter_ms,
        const Block & live_view_header)
    {
        ......
        auto table_id = storage->getStorageID();
        auto views = DatabaseCatalog::instance().getDependentViews(table_id);
        ......
        std::vector<Chain> chains;

        for (const auto & view_id : views)
        {
            auto view = DatabaseCatalog::instance().tryGetTable(view_id, context);
            ......
            if (auto * materialized_view = dynamic_cast<StorageMaterializedView *>(view.get()))
            {
                ......
                StoragePtr inner_table = materialized_view->getTargetTable();
                auto inner_table_id = inner_table->getStorageID();
                auto inner_metadata_snapshot = inner_table->getInMemoryMetadataPtr();
                query = view_metadata_snapshot->getSelectQuery().inner_query;
                target_name = inner_table_id.getFullTableName();

                Block header;

                /// Get list of columns we get from select query.
                if (select_context->getSettingsRef().allow_experimental_analyzer)
                    header = InterpreterSelectQueryAnalyzer::getSampleBlock(query, select_context);
                else
                    header = InterpreterSelectQuery(query, select_context, SelectQueryOptions().analyze()).getSampleBlock();

                /// Insert only columns returned by select.
                Names insert_columns;
                const auto & inner_table_columns = inner_metadata_snapshot->getColumns();
                for (const auto & column : header)
                {
                    /// But skip columns which storage doesn't have.
                    if (inner_table_columns.hasPhysical(column.name))
                        insert_columns.emplace_back(column.name);
                }

                InterpreterInsertQuery interpreter(nullptr, insert_context, false, false, false);
                out = interpreter.buildChain(inner_table, inner_metadata_snapshot, insert_columns, thread_status_holder, view_counter_ms);
                out.addStorageHolder(view);
                out.addStorageHolder(inner_table);
            }
            else if (auto * live_view = dynamic_cast<StorageLiveView *>(view.get()))
            {
                runtime_stats->type = QueryViewsLogElement::ViewType::LIVE;
                query = live_view->getInnerQuery(); // Used only to log in system.query_views_log
                out = buildPushingToViewsChain(view, view_metadata_snapshot, insert_context, ASTPtr(), true, thread_status_holder, view_counter_ms, storage_header);
            }
            else if (auto * window_view = dynamic_cast<StorageWindowView *>(view.get()))
            {
                runtime_stats->type = QueryViewsLogElement::ViewType::WINDOW;
                query = window_view->getMergeableQuery(); // Used only to log in system.query_views_log
                out = buildPushingToViewsChain(view, view_metadata_snapshot, insert_context, ASTPtr(), true, thread_status_holder, view_counter_ms);
            }
            else
                out = buildPushingToViewsChain(view, view_metadata_snapshot, insert_context, ASTPtr(), false, thread_status_holder, view_counter_ms);
            ......
    }

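buildPushingToViewsChain() checks whether the current table has dependent views. The branches show three kinds of view: materialized views, live views, and window views. The final else covers a dependent that is none of these, i.e. an ordinary table. The construction is recursive. On first entry the current table is the ordinary source table. For each dependent materialized view, the materialized-view branch builds the chain with buildChain() on the view's target table, and that call can enter buildPushingToViewsChain() again if the target table has dependent views of its own.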
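
The recursion is easier to see on a toy model. The sketch below assumes a catalog that only records which views depend on a table and which table each view writes to. Catalog, buildPushingChain() and sinksForInsert() are hypothetical names, and live views, window views, and the no_destination flag are deliberately left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical catalog: dependent views per table, and the target table of each view.
struct Catalog
{
    std::map<std::string, std::vector<std::string>> dependent_views;  // table -> views
    std::map<std::string, std::string> view_target;                   // view -> target table
};

// Collects, depth-first, every table that one INSERT into `table` ends up writing to.
void buildPushingChain(const Catalog & catalog, const std::string & table,
                       std::vector<std::string> & sinks, int depth = 0)
{
    assert(depth < 64);      // guard against cyclic view definitions in this toy model
    sinks.push_back(table);  // the table itself receives the block

    auto it = catalog.dependent_views.find(table);
    if (it == catalog.dependent_views.end())
        return;

    for (const auto & view : it->second)
    {
        // A materialized view stores nothing itself: pushing to it means building
        // a chain for its target table, which may have dependent views of its own.
        const std::string & target = catalog.view_target.at(view);
        buildPushingChain(catalog, target, sinks, depth + 1);
    }
}

std::vector<std::string> sinksForInsert(const Catalog & catalog, const std::string & table)
{
    std::vector<std::string> sinks;
    buildPushingChain(catalog, table, sinks);
    return sinks;
}
```

For a cascade where mv1 reads src and writes dest1, and mv2 reads dest1 and writes dest2, sinksForInsert(catalog, "src") yields src, dest1, dest2 in that order, which is the "view on top of a view" situation.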

buildChainImpl

buildChain() delegates to buildChainImpl(), the method that holds the actual implementation.

    Chain InterpreterInsertQuery::buildChainImpl(
        const StoragePtr & table,
        const StorageMetadataPtr & metadata_snapshot,
        const Block & query_sample_block,
        ThreadStatusesHolderPtr thread_status_holder,
        std::atomic_uint64_t * elapsed_counter_ms)
    {
        ......
        /// We create a pipeline of several streams, into which we will write data.
        Chain out;

        /// Keep a reference to the context to make sure it stays alive until the chain is executed and destroyed
        out.addInterpreterContext(context_ptr);

        /// NOTE: we explicitly ignore bound materialized views when inserting into Kafka Storage.
        ///       Otherwise we'll get duplicates when MV reads same rows again from Kafka.
        if (table->noPushingToViews() && !no_destination)
        {
            auto sink = table->write(query_ptr, metadata_snapshot, context_ptr);
            sink->setRuntimeData(thread_status, elapsed_counter_ms);
            out.addSource(std::move(sink));
        }
        else
        {
            out = buildPushingToViewsChain(table, metadata_snapshot, context_ptr, query_ptr, no_destination, thread_status_holder, elapsed_counter_ms);
        }
        ......
    }

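buildChainImpl() takes a different path depending on whether the current table (or view) has dependent views or a target table. This is what handles views cascaded on top of views: the chain nodes are built recursively and linked together.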

    Chain InterpreterInsertQuery::buildChainImpl(
        const StoragePtr & table,
        const StorageMetadataPtr & metadata_snapshot,
        const Block & query_sample_block,
        ThreadStatusesHolderPtr thread_status_holder,
        std::atomic_uint64_t * elapsed_counter_ms)
    {
        ...
        /// We create a pipeline of several streams, into which we will write data.
        Chain out;

        /// Keep a reference to the context to make sure it stays alive until the chain is executed and destroyed
        out.addInterpreterContext(context_ptr);

        /// NOTE: we explicitly ignore bound materialized views when inserting into Kafka Storage.
        ///       Otherwise we'll get duplicates when MV reads same rows again from Kafka.
        if (table->noPushingToViews() && !no_destination)  // table->noPushingToViews() keeps materialized views from inserting data into the Kafka engine
        {
            auto sink = table->write(query_ptr, metadata_snapshot, context_ptr);
            sink->setRuntimeData(thread_status, elapsed_counter_ms);
            out.addSource(std::move(sink));
        }
        else  // Build the pushing-to-views chain for materialized view inserts. This is the key part!
        {
            out = buildPushingToViewsChain(table, metadata_snapshot, context_ptr, query_ptr, no_destination, thread_status_holder, elapsed_counter_ms);
        }
        ...
        return out;
    }

Summary

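So for a write, the source table and its materialized views are assembled into several output chains, and a view only operates on the data currently being written. It does not touch the data that already exists in the source table. Writing to the source table and to the target table forms a single pipeline that must complete in full for the write to succeed. The pipeline can of course run in parallel, which speeds up writes.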
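
The point that a view only sees the block being inserted can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not ClickHouse code: a block is a vector of integers, a view is a transform plus a pointer to its target table, and Table, MaterializedView, insertBlock() and keepEven() are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <vector>

using Block = std::vector<int>;

struct Table
{
    std::vector<int> rows;
};

// Hypothetical model of a materialized view: a SELECT transform plus a target table.
struct MaterializedView
{
    std::function<Block(const Block &)> select;
    Table * target;
};

// Push one inserted block to the source table and through every attached view.
// Only `block` is handed to the views; `source.rows` is never read back.
void insertBlock(Table & source, const std::vector<MaterializedView> & views, const Block & block)
{
    source.rows.insert(source.rows.end(), block.begin(), block.end());
    for (const auto & view : views)
    {
        assert(view.target != nullptr);
        Block transformed = view.select(block);
        view.target->rows.insert(view.target->rows.end(), transformed.begin(), transformed.end());
    }
}

// Example SELECT for the view: keep even values only.
Block keepEven(const Block & block)
{
    Block out;
    for (int v : block)
        if (v % 2 == 0)
            out.push_back(v);
    return out;
}
```

If the source table already holds rows 2 and 4 before the view is attached, inserting the block {1, 6, 8} leaves the source with five rows, while the target table receives only 6 and 8. The older rows 2 and 4 never reach the target.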

Feel free to add me on WeChat (xiedeyantu) to discuss technical questions.

