
ClickHouse MPP Database: Usage Examples for New Features

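New ClickHouse features:

Between version 22.3 and 24.3.2.23 (the latest release at the time of writing), ClickHouse has developed rapidly. Every release has added new features and made data ingestion and queries faster.
This article is based on the release posts on the ClickHouse blog and collects some of the new features that are likely to be useful in day-to-day work.

The commands below show how to try these features with Docker: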

docker run -d --name=ch -p 8123:8123 -p 9000:9000 -p 9009:9009 --ulimit nofile=262144:262144 -v D:/ch/latest/external:/external:rw -v chlatest:/var/lib/clickhouse:rw -v D:/ch/latest/logs:/var/log/clickhouse-server:rw -v D:/ch/latest/etc/clickhouse-server:/etc/clickhouse-server:rw clickhouse/clickhouse-server:24.3.2.23
docker exec -it ch bash
clickhouse-client --format_csv_delimiter=','

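The transform function

transform performs a dictionary-style replacement: a value found in array_from is replaced by the element at the same position in array_to. With four arguments, unmatched values become default; with three arguments, unmatched values are returned unchanged.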

transform(x, array_from, array_to, default)
transform(T, Array(T), Array(U), U) -> U
transform(x, array_from, array_to)

UK-house-price-dataset.csv

CREATE TABLE uk_price_paid
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
    is_new UInt8,
    duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2);

INSERT INTO uk_price_paid
WITH splitByChar(' ', postcode) AS p
SELECT
    toUInt32(price_string) AS price,
    parseDateTimeBestEffortUS(time) AS date,
    p[1] AS postcode1,
    p[2] AS postcode2,
    transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
    b = 'Y' AS is_new,
    transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
    addr1, addr2, street, locality, town, district, county
FROM file(
    'UK-house-price-dataset.csv',
    'CSV',
    'uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
);

SELECT transform(number, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], NULL) AS numbers
FROM system.numbers
LIMIT 10;
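The lookup rule behind transform can also be modelled outside ClickHouse. The following is a minimal Python sketch, not a ClickHouse client API: the helper name `transform` and the `_MISSING` sentinel are illustrative, and it assumes the behaviour described above (four-argument form falls back to `default`, three-argument form returns the input unchanged). Duplicate keys in `array_from` are not modelled.

```python
_MISSING = object()  # sentinel: tells "no default passed" apart from default=None


def transform(x, array_from, array_to, default=_MISSING):
    """Illustrative model of transform(x, array_from, array_to[, default])."""
    if len(array_from) != len(array_to):
        raise ValueError("array_from and array_to must have the same length")
    mapping = dict(zip(array_from, array_to))
    if x in mapping:
        return mapping[x]
    # Three-argument form: unmatched input is returned as-is.
    return x if default is _MISSING else default


property_types = ["terraced", "semi-detached", "detached", "flat", "other"]
print(transform("S", ["T", "S", "D", "F", "O"], property_types))  # semi-detached
print(transform("X", ["F", "L", "U"], ["freehold", "leasehold", "unknown"]))  # X
```

The sentinel is there because `NULL` (here `None`) is itself a legitimate default, as in the last query above.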

Reading files

ClickHouse can detect the file format automatically and infer the column types.

SELECT * FROM (
    WITH splitByChar(' ', postcode) AS p
    SELECT
        toUInt32(price_string) AS price,
        parseDateTimeBestEffortUS(time) AS date,
        p[1] AS postcode1,
        p[2] AS postcode2,
        transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
        b = 'Y' AS is_new,
        transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
        addr1, addr2, street, locality, town, district, county
    FROM file(
        'UK-house-price-dataset.csv',
        'CSV',
        'uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
    ) SETTINGS format_csv_delimiter=','
) LIMIT 2;

User-defined functions

You can write your own SQL functions as needed.

CREATE OR REPLACE TABLE line_changes
(
    version UInt32,
    line_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3),
    line_number UInt32,
    line_content String,
    time DateTime DEFAULT now()
)
ENGINE = MergeTree
ORDER BY time;

INSERT INTO default.line_changes (version, line_change_type, line_number, line_content) VALUES
(1, 'Add'   , 1, 'ClickHouse provides SQL'),
(2, 'Add'   , 2, 'with improvements'),
(3, 'Add'   , 3, 'that makes it more friendly for analytical tasks.'),
(4, 'Add'   , 2, 'with many extensions'),
(5, 'Modify', 3, 'and powerful improvements'),
(6, 'Delete', 1, ''),
(7, 'Add'   , 1, 'ClickHouse provides a superset of SQL');

-- add a string (str) into an array (arr) at a specific position (pos)
CREATE OR REPLACE FUNCTION add AS (arr, pos, str) -> arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos));

-- delete the element at a specific position (pos) from an array (arr)
CREATE OR REPLACE FUNCTION delete AS (arr, pos) -> arrayConcat(arraySlice(arr, 1, pos-1), arraySlice(arr, pos+1));

-- replace the element at a specific position (pos) in an array (arr)
CREATE OR REPLACE FUNCTION modify AS (arr, pos, str) -> arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos+1));
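To make the slicing arithmetic in the three functions easier to follow, here is a rough Python equivalent. It is only a sketch: `array_slice` is a hypothetical helper that imitates the 1-based, positive-offset subset of arraySlice used above, and positions are assumed to be at least 1.

```python
def array_slice(arr, offset, length=None):
    """Imitates arraySlice for a 1-based, positive offset (assumption: offset >= 1)."""
    start = offset - 1
    if length is None:
        return arr[start:]
    return arr[start:start + max(length, 0)]


def add(arr, pos, s):
    # Everything before pos, the new element, then everything from pos onwards.
    return array_slice(arr, 1, pos - 1) + [s] + array_slice(arr, pos)


def delete(arr, pos):
    # Everything before pos, then everything after pos.
    return array_slice(arr, 1, pos - 1) + array_slice(arr, pos + 1)


def modify(arr, pos, s):
    # Same as delete, with the replacement placed where the old element was.
    return array_slice(arr, 1, pos - 1) + [s] + array_slice(arr, pos + 1)


print(add(["a", "b"], 2, "x"))          # ['a', 'x', 'b']
print(delete(["a", "b", "c"], 1))       # ['b', 'c']
print(modify(["a", "b", "c"], 3, "z"))  # ['a', 'b', 'z']
```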

arrayFold

SELECT arrayFold((acc, v) -> (acc + v), [10, 20, 30], 0::UInt64) AS sum;

CREATE OR REPLACE VIEW text_version AS
WITH T1 AS (
    SELECT arrayZip(
        groupArray(line_change_type),
        groupArray(line_number),
        groupArray(line_content)) AS line_ops
    FROM (SELECT * FROM line_changes WHERE version <= {version:UInt32} ORDER BY version ASC)
)
SELECT arrayJoin(
    arrayFold((acc, v) ->
        if(v.'change_type' = 'Add',    add(acc, v.'line_nr', v.'content'),
        if(v.'change_type' = 'Delete', delete(acc, v.'line_nr'),
        if(v.'change_type' = 'Modify', modify(acc, v.'line_nr', v.'content'), []))),
        line_ops::Array(Tuple(change_type String, line_nr UInt32, content String)),
        []::Array(String))) AS lines
FROM T1;

SELECT * FROM text_version(version = 3);
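arrayFold is a left fold: it starts from an initial accumulator and applies the lambda to each element in turn. The Python sketch below replays the same change log with `functools.reduce` to show what the view computes. The names `CHANGES`, `apply_change` and `text_version` are illustrative, and the sketch assumes every position in the log is valid (out-of-range positions are not handled the way the SQL functions would handle them).

```python
from functools import reduce

# (version, change_type, line_number, line_content), copied from the INSERT above
CHANGES = [
    (1, "Add", 1, "ClickHouse provides SQL"),
    (2, "Add", 2, "with improvements"),
    (3, "Add", 3, "that makes it more friendly for analytical tasks."),
    (4, "Add", 2, "with many extensions"),
    (5, "Modify", 3, "and powerful improvements"),
    (6, "Delete", 1, ""),
    (7, "Add", 1, "ClickHouse provides a superset of SQL"),
]


def apply_change(lines, change):
    """Fold step: return a new list of lines with one change applied."""
    _, kind, pos, content = change
    lines = list(lines)  # copy, so the previous accumulator is left untouched
    if kind == "Add":
        lines.insert(pos - 1, content)
    elif kind == "Delete":
        del lines[pos - 1]
    elif kind == "Modify":
        lines[pos - 1] = content
    else:
        return []  # mirrors the [] fallback in the SQL lambda
    return lines


def text_version(version):
    """Rebuild the text as of the given version by folding over the log."""
    ops = sorted(c for c in CHANGES if c[0] <= version)
    return reduce(apply_change, ops, [])


print(reduce(lambda acc, v: acc + v, [10, 20, 30], 0))  # 60
for line in text_version(7):
    print(line)
```

Keeping the log append-only and folding over it on demand is what lets the view return the text at any version without storing each version separately.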

Parallel window functions

Window functions are now evaluated in parallel, which improves their performance substantially.

SELECT
    country,
    day,
    max(tempAvg) AS temperature,
    avg(temperature) OVER (PARTITION BY country ORDER BY day ASC ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS moving_avg_temp
FROM noaa
WHERE country != ''
GROUP BY
    country,
    date AS day
ORDER BY
    country ASC,
    day ASC
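As a reading aid for the window clause, the Python sketch below computes the same kind of moving average: for each country, the mean of the current row and up to five preceding rows in day order. The function name `moving_avg` and the sample rows are made up for illustration. Each partition is processed independently of the others, which is presumably the property that allows the work to be spread across threads.

```python
from collections import defaultdict


def moving_avg(rows, preceding=5):
    """rows: iterable of (country, day, temperature) tuples.

    Returns (country, day, temperature, moving_avg) tuples ordered by
    country and day, like ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW.
    """
    by_country = defaultdict(list)
    for country, day, temp in rows:
        by_country[country].append((day, temp))

    result = []
    for country in sorted(by_country):       # PARTITION BY country
        series = sorted(by_country[country])  # ORDER BY day ASC
        for i, (day, temp) in enumerate(series):
            window = [t for _, t in series[max(0, i - preceding): i + 1]]
            result.append((country, day, temp, sum(window) / len(window)))
    return result


sample = [
    ("FR", "2024-01-01", 10.0),
    ("FR", "2024-01-02", 20.0),
    ("FR", "2024-01-03", 30.0),
    ("DE", "2024-01-01", 5.0),
]
for row in moving_avg(sample, preceding=1):
    print(row)
```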

FINAL

With FINAL and the enable_vertical_final setting, queries against ReplacingMergeTree and AggregatingMergeTree tables can return the latest (fully merged) data quickly.

SELECT
    postcode1,
    formatReadableQuantity(avg(price))
FROM uk_property_offers FINAL
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3;

-- enable_vertical_final only affects queries that use FINAL
SELECT
    postcode1,
    formatReadableQuantity(avg(price))
FROM uk_property_offers FINAL
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3
SETTINGS enable_vertical_final = 1;
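What FINAL returns for a ReplacingMergeTree table can be pictured as "one surviving row per sorting key". The Python sketch below models only that result, not how ClickHouse computes it. The function name `final_replacing` and the columns `id`, `price` and `v` are hypothetical (the schema of uk_property_offers is not shown in this article). It assumes the documented ReplacingMergeTree rule: without a version column the most recently inserted row wins, and with one the row with the highest version wins.

```python
def final_replacing(rows, key, version=None):
    """rows: list of dicts in insertion order; key: tuple of column names.

    Keeps one row per key. If `version` names a column, the row with the
    highest version wins (ties go to the later row); otherwise the last
    inserted row wins.
    """
    latest = {}
    for row in rows:
        k = tuple(row[c] for c in key)
        current = latest.get(k)
        if current is None or version is None or row[version] >= current[version]:
            latest[k] = row
    return list(latest.values())


offers = [
    {"id": 1, "price": 100},
    {"id": 2, "price": 200},
    {"id": 1, "price": 150},  # newer offer for the same key
]
print(final_replacing(offers, key=("id",)))
# [{'id': 1, 'price': 150}, {'id': 2, 'price': 200}]
```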

Variant Type

SET allow_experimental_variant_type = 1, use_variant_as_common_type = 1;

SELECT
    map('Hello', 1, 'World', 'Mark') AS x,
    toTypeName(x) AS type
FORMAT Vertical;

Row 1:
──────
x:    {'Hello':1,'World':'Mark'}
type: Map(String, Variant(String, UInt8))

SELECT
    arrayJoin([1, true, 3.4, 'Mark']) AS value,
    toTypeName(value)

   ┌─value─┬─toTypeName(value)─────────────────────┐
1. │ true  │ Variant(Bool, Float64, String, UInt8) │
2. │ true  │ Variant(Bool, Float64, String, UInt8) │
3. │ 3.4   │ Variant(Bool, Float64, String, UInt8) │
4. │ Mark  │ Variant(Bool, Float64, String, UInt8) │
   └───────┴───────────────────────────────────────┘

String similarity functions

  • byteHammingDistance: the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols differ. In other words, it is the minimum number of substitutions required to change one string into the other. It is named after the American mathematician Richard Hamming. For example, the Hamming distance between:

    • "karolin" and "kathrin" is 3.
    • "karolin" and "kerstin" is 3.
    • "kathrin" and "kerstin" is 4.
    • 0000 and 1111 is 4.
    • 2173896 and 2233796 is 3.
  • editDistance: a way of quantifying how dissimilar two strings (e.g., words) are, measured by counting the minimum number of operations required to transform one string into the other.

  • damerauLevenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

  • jaroWinklerSimilarity: a string metric measuring the similarity between two sequences. It is a variant of the Jaro metric that gives more weight to a common prefix.

  • levenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. Unlike Damerau–Levenshtein, a transposition of two adjacent characters is not a single operation.

https://clickhouse.com/docs/en/sql-reference/functions/string-functions#dameraulevenshteindistance
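The two simplest metrics in the list above can be written down in a few lines. The following Python sketch is for illustration only: the function names are made up, and it compares Unicode code points, whereas the ClickHouse functions operate on the bytes of the string, so results can differ for non-ASCII input.

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein_distance(a, b):
    """Minimum number of insertions, deletions and substitutions (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(hamming_distance("karolin", "kathrin"))                # 3
print(levenshtein_distance("facebonk.com", "facebook.com"))  # 1
print(levenshtein_distance("ab", "ba"))                      # 2
```

The last call shows where the two edit distances part ways: swapping two adjacent characters costs 2 under Levenshtein but would count as a single transposition under Damerau–Levenshtein.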

CREATE TABLE domains
(
    `domain` String,
    `rank` Float64
)
ENGINE = MergeTree
ORDER BY domain;

INSERT INTO domains SELECT
    c2 AS domain,
    1 / c1 AS rank
FROM url('domains.csv', 'CSV');

SELECT
    domain,
    levenshteinDistance(domain, 'facebook.com') AS d1,
    damerauLevenshteinDistance(domain, 'facebook.com') AS d2,
    jaroSimilarity(domain, 'facebook.com') AS d3,
    jaroWinklerSimilarity(domain, 'facebook.com') AS d4
FROM domains
ORDER BY d1 ASC
LIMIT 10

Query id: 6f499f27-8274-4787-819a-b510322bdce3

    ┌─domain────────┬─d1─┬─d2─┬─────────────────d3─┬─────────────────d4─┐
 1. │ facebook.com  │  0 │  0 │                  1 │                  1 │
 2. │ facebonk.com  │  1 │  1 │ 0.8838383838383838 │ 0.9303030303030303 │
 3. │ fabebook.com  │  1 │  1 │  0.914141414141414 │ 0.9313131313131312 │
 4. │ facabook.com  │  1 │  1 │ 0.9444444444444443 │  0.961111111111111 │
 5. │ facobook.com  │  1 │  1 │ 0.8535353535353535 │ 0.8974747474747474 │
 6. │ facebook1.com │  1 │  1 │ 0.9743589743589745 │ 0.9846153846153847 │
 7. │ faceook.com   │  1 │  1 │ 0.9722222222222221 │ 0.9833333333333333 │
 8. │ faacebook.com │  1 │  1 │ 0.9743589743589745 │ 0.9794871794871796 │
 9. │ faceboock.com │  1 │  1 │ 0.9326923076923077 │ 0.9596153846153846 │
10. │ facebool.com  │  1 │  1 │ 0.9444444444444443 │ 0.9666666666666666 │
    └───────────────┴────┴────┴────────────────────┴────────────────────┘

Vectorized distance functions

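ClickHouse can be used as a vector database. It supports three measures of vector similarity: L2 (Euclidean) distance, cosine distance (cosineDistance) and inner product (IP).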

https://clickhouse.com/blog/clickhouse-release-24-02

WITH
    'dog' AS search_term,
    (
        SELECT vector
        FROM glove
        WHERE word = search_term
        LIMIT 1
    ) AS target_vector
SELECT word, cosineDistance(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;

WITH
    'dog' AS search_term,
    (
        SELECT vector
        FROM glove
        WHERE word = search_term
        LIMIT 1
    ) AS target_vector
SELECT
    word,
    1 - dotProduct(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;
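For reference, the three measures can be defined in plain Python as below. This is an illustrative sketch with made-up function names, not how ClickHouse implements them. It also shows why the second query is a reasonable substitute for the first: when both vectors have unit length, `1 - dot_product` equals the cosine distance (whether the glove vectors are normalized is not stated in this article).

```python
import math


def _check(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")


def l2_distance(a, b):
    """Euclidean distance between two vectors."""
    _check(a, b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    """Inner product (IP) of two vectors."""
    _check(a, b)
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 for same direction, 1 for orthogonal vectors."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return 1 - dot_product(a, b) / (norm_a * norm_b)


print(l2_distance([0, 0], [3, 4]))       # 5.0
print(dot_product([1, 2, 3], [4, 5, 6]))  # 32
print(cosine_distance([1, 0], [0, 1]))    # 1.0
```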

Adaptive asynchronous inserts

Asynchronous inserts shift data batching from the client side to the server side: data from insert queries is first placed in a buffer and written to database storage later, that is, asynchronously.
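The buffering idea can be illustrated with a toy model. The Python sketch below is not ClickHouse code: the class `AsyncInsertBuffer`, its parameters and the flush rule (flush when the buffer reaches a row limit or when the oldest buffered row has waited long enough) are assumptions chosen for illustration, and the adaptive part, where the server adjusts the wait time itself, is not modelled. Time is passed in explicitly so the behaviour is deterministic.

```python
class AsyncInsertBuffer:
    """Toy server-side buffer: many small inserts become a few larger writes."""

    def __init__(self, max_rows, timeout_s, sink):
        self.max_rows = max_rows    # flush once this many rows are buffered
        self.timeout_s = timeout_s  # or once the oldest row has waited this long
        self.sink = sink            # callable that receives each flushed batch
        self._buffer = []
        self._first_at = None

    def insert(self, rows, now):
        """Buffer rows from one insert query, then flush if a limit is hit."""
        if not self._buffer:
            self._first_at = now
        self._buffer.extend(rows)
        self.maybe_flush(now)

    def maybe_flush(self, now):
        """Write the buffer out if it is full or old enough; return True if flushed."""
        if not self._buffer:
            return False
        full = len(self._buffer) >= self.max_rows
        expired = now - self._first_at >= self.timeout_s
        if full or expired:
            self.sink(self._buffer)
            self._buffer = []
            return True
        return False


parts = []
buf = AsyncInsertBuffer(max_rows=3, timeout_s=1.0, sink=parts.append)
for i, t in enumerate([0.0, 0.1, 0.2]):
    buf.insert([i], now=t)  # three single-row inserts
print(parts)  # [[0, 1, 2]]
```

Three separate single-row inserts end up as one written batch, which is the effect the paragraph above describes.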