当前位置: 首页 > news >正文

hive中get_json_object函数不支持解析json中文key

问题

今天在 Hive 中 get_json_object 函数解析 json 串的时候,发现函数不支持解析 json 中文 key。
例如:

select get_json_object('{ "姓名":"张三" , "年龄":"18" }', '$.姓名');

我们希望的结果是得到姓名对应的值张三,而运行之后的结果为 NULL 值。

select get_json_object('{ "abc姓名":"张三" , "abc":"18" }', '$.abc姓名');

我们希望的结果是得到姓名对应的值张三,而运行之后的结果为 18

产生问题的原因

是什么原因导致的呢?我们查看 Hive 官网中 get_json_object 函数的介绍,可以发现 get_json_object 函数不能解析 json 里面中文的 key,如下图所示:
在这里插入图片描述
json 路径只能包含字符 [0-9a-z_],即不能包含 大写或特殊字符 。此外,键不能以数字开头。

那为什么 json 路径只能包含字符 [0-9a-z_] 呢?

通过查看源码我们发现 get_json_object 对应的 UDF 类的源码如下:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;import com.fasterxml.jackson.core.json.JsonReadFeature;
import com.fasterxml.jackson.databind.JavaType;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Iterators;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;/*** UDFJson.*/
@Description(name = "get_json_object",value = "_FUNC_(json_txt, path) - Extract a json object from path ",extended = "Extract json object from a json string based on json path "+ "specified, and return json string of the extracted json object. It "+ "will return null if the input json string is invalid.\n"+ "A limited version of JSONPath supported:\n"+ "  $   : Root object\n"+ "  .   : Child operator\n"+ "  []  : Subscript operator for array\n"+ "  *   : Wildcard for []\n"+ "Syntax not supported that's worth noticing:\n"+ "  ''  : Zero length string as key\n"+ "  ..  : Recursive descent\n"+ "  @   : Current object/element\n"+ "  ()  : Script expression\n"+ "  ?() : Filter (script) expression.\n"+ "  [,] : Union operator\n"+ "  [start:end:step] : array slice operator\n")//定义了一个名为UDFJson的类,继承自UDF类。
public class UDFGetJsonObjectCN extends UDF {//定义一个静态正则表达式模式,用于匹配JSON路径中的键。//匹配英文key:匹配一个或多个大写字母、小写字母、数字、下划线、连字符、冒号或空格。private static final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s]+).*");//定义一个静态正则表达式模式,用于匹配JSON路径中的索引。private static final Pattern patternIndex = Pattern.compile("\\[([0-9]+|\\*)\\]");//创建一个ObjectMapper对象,用于解析JSON字符串。private static final ObjectMapper objectMapper = new ObjectMapper();//创建一个JavaType对象,用于表示Map类型。private static final JavaType MAP_TYPE = objectMapper.getTypeFactory().constructType(Map.class);//创建一个JavaType对象,用于表示List类型。private static final JavaType LIST_TYPE = objectMapper.getTypeFactory().constructType(List.class);//静态代码块,用于配置ObjectMapper的一些特性。static {// Allows for unescaped ASCII control characters in JSON valuesobjectMapper.enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS.mappedFeature());// Enabled to accept quoting of all character backslash qooting mechanismobjectMapper.enable(JsonReadFeature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER.mappedFeature());}// An LRU cache using a linked hash map//定义了一个静态内部类HashCache,用作LRU缓存。static class HashCache<K, V> extends LinkedHashMap<K, V> {private static final int CACHE_SIZE = 16;private static final int INIT_SIZE = 32;private static final float LOAD_FACTOR = 0.6f;HashCache() {super(INIT_SIZE, LOAD_FACTOR);}private static final long serialVersionUID = 1;@Overrideprotected boolean removeEldestEntry(Map.Entry<K, V> eldest) {return size() > CACHE_SIZE;}}//声明了一个名为extractObjectCache的HashMap对象,用于缓存已提取的JSON对象。Map<String, Object> extractObjectCache = new HashCache<String, Object>();//声明了一个名为pathExprCache的HashMap对象,用于缓存已解析的JSON路径表达式。Map<String, String[]> pathExprCache = new HashCache<String, String[]>();//声明了一个名为indexListCache的HashMap对象,用于缓存已解析的JSON路径中的索引列表。Map<String, ArrayList<String>> indexListCache =new HashCache<String, ArrayList<String>>();//声明了一个名为mKeyGroup1Cache的HashMap对象,用于缓存JSON路径中的键。Map<String, String> mKeyGroup1Cache = new HashCache<String, String>();//声明了一个名为mKeyMatchesCache的HashMap对象,用于缓存JSON路径中的键是否匹配的结果。Map<String, Boolean> mKeyMatchesCache = new HashCache<String, Boolean>();//构造函数,没有参数。public UDFGetJsonObjectCN() {}/*** Extract json object from a json string based on json path specified, and* return json string of the extracted json object. It will return null if the* input json string is invalid.** A limited version of JSONPath supported: $ : Root object . : Child operator* [] : Subscript operator for array * : Wildcard for []** Syntax not supported that's worth noticing: '' : Zero length string as key* .. : Recursive descent &amp;#064; : Current object/element () : Script* expression ?() : Filter (script) expression. [,] : Union operator* [start:end:step] : array slice operator** @param jsonString*          the json string.* @param pathString*          the json path expression.* @return json string or null when an error happens.*///evaluate方法,用于提取指定路径的JSON对象并返回JSON字符串。public Text evaluate(String jsonString, String pathString) {if (jsonString == null || jsonString.isEmpty() || pathString == null|| pathString.isEmpty() || pathString.charAt(0) != '$') {return null;}int pathExprStart = 1;boolean unknownType = pathString.equals("$");boolean isRootArray = false;if (pathString.length() > 1) {if (pathString.charAt(1) == '[') {pathExprStart = 0;isRootArray = true;} else if (pathString.charAt(1) == '.') {isRootArray = pathString.length() > 2 && pathString.charAt(2) == '[';} else {return null;}}// Cache pathExprString[] pathExpr = pathExprCache.get(pathString);if (pathExpr == null) {pathExpr = pathString.split("\\.", -1);pathExprCache.put(pathString, pathExpr);}// Cache extractObjectObject extractObject = extractObjectCache.get(jsonString);if (extractObject == null) {if (unknownType) {try {extractObject = objectMapper.readValue(jsonString, LIST_TYPE);} catch (Exception e) {// Ignore exception}if (extractObject == null) {try {extractObject = objectMapper.readValue(jsonString, MAP_TYPE);} catch (Exception e) {return null;}}} else {JavaType javaType = isRootArray ? LIST_TYPE : MAP_TYPE;try {extractObject = objectMapper.readValue(jsonString, javaType);} catch (Exception e) {return null;}}extractObjectCache.put(jsonString, extractObject);}for (int i = pathExprStart; i < pathExpr.length; i++) {if (extractObject == null) {return null;}extractObject = extract(extractObject, pathExpr[i], i == pathExprStart && isRootArray);}Text result = new Text();if (extractObject instanceof Map || extractObject instanceof List) {try {result.set(objectMapper.writeValueAsString(extractObject));} catch (Exception e) {return null;}} else if (extractObject != null) {result.set(extractObject.toString());} else {return null;}return result;}//extract方法,递归地提取JSON对象。private Object extract(Object json, String path, boolean skipMapProc) {// skip MAP processing for the first path element if root is arrayif (!skipMapProc) {// Cache patternkey.matcher(path).matches()Matcher mKey = null;Boolean mKeyMatches = mKeyMatchesCache.get(path);if (mKeyMatches == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);}if (!mKeyMatches.booleanValue()) {return null;}// Cache mkey.group(1)String mKeyGroup1 = mKeyGroup1Cache.get(path);if (mKeyGroup1 == null) {if (mKey == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);if (!mKeyMatches.booleanValue()) {return null;}}mKeyGroup1 = mKey.group(1);mKeyGroup1Cache.put(path, mKeyGroup1);}json = extract_json_withkey(json, mKeyGroup1);}// Cache indexListArrayList<String> indexList = indexListCache.get(path);if (indexList == null) {Matcher mIndex = patternIndex.matcher(path);indexList = new ArrayList<String>();while (mIndex.find()) {indexList.add(mIndex.group(1));}indexListCache.put(path, indexList);}if (indexList.size() > 0) {json = extract_json_withindex(json, indexList);}return json;}//创建一个名为jsonList的AddingList对象,用于存储提取出来的JSON对象。private transient AddingList jsonList = new AddingList();//定义了一个静态内部类AddingList,继承自ArrayList<Object>,用于添加JSON对象到jsonList中。private static class AddingList extends ArrayList<Object> {private static final long serialVersionUID = 1L;@Overridepublic Iterator<Object> iterator() {return Iterators.forArray(toArray());}@Overridepublic void removeRange(int fromIndex, int toIndex) {super.removeRange(fromIndex, toIndex);}};//extract_json_withindex方法,根据JSON路径中的索引提取JSON对象。@SuppressWarnings("unchecked")private Object extract_json_withindex(Object json, ArrayList<String> indexList) {jsonList.clear();jsonList.add(json);for (String index : indexList) {int targets = jsonList.size();if (index.equalsIgnoreCase("*")) {for (Object array : jsonList) {if (array instanceof List) {for (int j = 0; j < ((List<Object>)array).size(); j++) {jsonList.add(((List<Object>)array).get(j));}}}} else {for (Object array : jsonList) {int indexValue = Integer.parseInt(index);if (!(array instanceof List)) {continue;}List<Object> list = (List<Object>) array;if (indexValue >= list.size()) {continue;}jsonList.add(list.get(indexValue));}}if (jsonList.size() == targets) {return null;}jsonList.removeRange(0, targets);}if (jsonList.isEmpty()) {return null;}return (jsonList.size() > 1) ? new ArrayList<Object>(jsonList) : jsonList.get(0);}//extract_json_withkey方法,根据JSON路径中的键提取JSON对象。@SuppressWarnings("unchecked")private Object extract_json_withkey(Object json, String path) {if (json instanceof List) {List<Object> jsonArray = new ArrayList<Object>();for (int i = 0; i < ((List<Object>) json).size(); i++) {Object json_elem = ((List<Object>) json).get(i);Object json_obj = null;if (json_elem instanceof Map) {json_obj = ((Map<String, Object>) json_elem).get(path);} else {continue;}if (json_obj instanceof List) {for (int j = 0; j < ((List<Object>) json_obj).size(); j++) {jsonArray.add(((List<Object>) json_obj).get(j));}} else if (json_obj != null) {jsonArray.add(json_obj);}}return (jsonArray.size() == 0) ? null : jsonArray;} else if (json instanceof Map) {return ((Map<String, Object>) json).get(path);} else {return null;}}
}

代码做了一些注释,我们可以发现 private final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s]+).*"); 这个就是匹配 key 的模式串,它的意思是匹配以数字、字母、_、-、:、空格 为开头的字符串。那么这个匹配模式串就决定了,get_json_object 函数无法匹配出 key 中带中文的键值对,即select get_json_object('{ "姓名":"张三" , "年龄":"18" }', '$.姓名'); 结果为 null;而 select get_json_object('{ "abc姓名":"张三" , "abc":"18" }', '$.abc姓名'); 中只能匹配以数字、字母、_、-、:、空格 为开头的字符串,所以将 abc姓名 中的 abc 当作 key 去取 value 值,所以得到的值为18

解决办法

知道问题的原因了,那么我们怎么解决这个问题呢,其实很简单,我们只需要修改代码中匹配 key 的正则表达式就可以了。
其实我们可以将 get_json_object 函数的源码拿出来重新写一个 UDF 函数就可以了。

Hive-2.1.1 版本

需要注意自己 Hive 的版本,我们以 Hive-2.1.1 版本为例,代码如下:

package com.yan.hive.udf;import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;import com.google.common.collect.Iterators;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonParser.Feature;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.map.type.TypeFactory;
import org.codehaus.jackson.type.JavaType;/*** UDFJson.**/
@Description(name = "get_json_object_cn",value = "_FUNC_(json_txt, path) - Extract a json object from path ",extended = "Extract json object from a json string based on json path "+ "specified, and return json string of the extracted json object. It "+ "will return null if the input json string is invalid.\n"+ "A limited version of JSONPath supported:\n"+ "  $   : Root object\n"+ "  .   : Child operator\n"+ "  []  : Subscript operator for array\n"+ "  *   : Wildcard for []\n"+ "Syntax not supported that's worth noticing:\n"+ "  ''  : Zero length string as key\n"+ "  ..  : Recursive descent\n"+ "  &amp;#064;   : Current object/element\n"+ "  ()  : Script expression\n"+ "  ?() : Filter (script) expression.\n"+ "  [,] : Union operator\n"+ "  [start:end:step] : array slice operator\n")
public class UDFGetJsonObjectCN extends UDF {//private final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s]+).*");private final Pattern patternKey = Pattern.compile("^([^\\[\\]]+).*");private final Pattern patternIndex = Pattern.compile("\\[([0-9]+|\\*)\\]");private static final JsonFactory JSON_FACTORY = new JsonFactory();static {// Allows for unescaped ASCII control characters in JSON valuesJSON_FACTORY.enable(Feature.ALLOW_UNQUOTED_CONTROL_CHARS);// Enabled to accept quoting of all character backslash qooting mechanismJSON_FACTORY.enable(Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER);}private static final ObjectMapper MAPPER = new ObjectMapper(JSON_FACTORY);private static final JavaType MAP_TYPE = TypeFactory.fromClass(Map.class);private static final JavaType LIST_TYPE = TypeFactory.fromClass(List.class);// An LRU cache using a linked hash mapstatic class HashCache<K, V> extends LinkedHashMap<K, V> {private static final int CACHE_SIZE = 16;private static final int INIT_SIZE = 32;private static final float LOAD_FACTOR = 0.6f;HashCache() {super(INIT_SIZE, LOAD_FACTOR);}private static final long serialVersionUID = 1;@Overrideprotected boolean removeEldestEntry(Map.Entry<K, V> eldest) {return size() > CACHE_SIZE;}}static Map<String, Object> extractObjectCache = new HashCache<String, Object>();static Map<String, String[]> pathExprCache = new HashCache<String, String[]>();static Map<String, ArrayList<String>> indexListCache =new HashCache<String, ArrayList<String>>();static Map<String, String> mKeyGroup1Cache = new HashCache<String, String>();static Map<String, Boolean> mKeyMatchesCache = new HashCache<String, Boolean>();Text result = new Text();public UDFGetJsonObjectCN() {}/*** Extract json object from a json string based on json path specified, and* return json string of the extracted json object. It will return null if the* input json string is invalid.** A limited version of JSONPath supported: $ : Root object . : Child operator* [] : Subscript operator for array * : Wildcard for []** Syntax not supported that's worth noticing: '' : Zero length string as key* .. : Recursive descent &amp;#064; : Current object/element () : Script* expression ?() : Filter (script) expression. [,] : Union operator* [start:end:step] : array slice operator** @param jsonString*          the json string.* @param pathString*          the json path expression.* @return json string or null when an error happens.*/public Text evaluate(String jsonString, String pathString) {if (jsonString == null || jsonString.isEmpty() || pathString == null|| pathString.isEmpty() || pathString.charAt(0) != '$') {return null;}int pathExprStart = 1;boolean isRootArray = false;if (pathString.length() > 1) {if (pathString.charAt(1) == '[') {pathExprStart = 0;isRootArray = true;} else if (pathString.charAt(1) == '.') {isRootArray = pathString.length() > 2 && pathString.charAt(2) == '[';} else {return null;}}// Cache pathExprString[] pathExpr = pathExprCache.get(pathString);if (pathExpr == null) {pathExpr = pathString.split("\\.", -1);pathExprCache.put(pathString, pathExpr);}// Cache extractObjectObject extractObject = extractObjectCache.get(jsonString);if (extractObject == null) {JavaType javaType = isRootArray ? LIST_TYPE : MAP_TYPE;try {extractObject = MAPPER.readValue(jsonString, javaType);} catch (Exception e) {return null;}extractObjectCache.put(jsonString, extractObject);}for (int i = pathExprStart; i < pathExpr.length; i++) {if (extractObject == null) {return null;}extractObject = extract(extractObject, pathExpr[i], i == pathExprStart && isRootArray);}if (extractObject instanceof Map || extractObject instanceof List) {try {result.set(MAPPER.writeValueAsString(extractObject));} catch (Exception e) {return null;}} else if (extractObject != null) {result.set(extractObject.toString());} else {return null;}return result;}private Object extract(Object json, String path, boolean skipMapProc) {// skip MAP processing for the first path element if root is arrayif (!skipMapProc) {// Cache patternkey.matcher(path).matches()Matcher mKey = null;Boolean mKeyMatches = mKeyMatchesCache.get(path);if (mKeyMatches == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);}if (!mKeyMatches.booleanValue()) {return null;}// Cache mkey.group(1)String mKeyGroup1 = mKeyGroup1Cache.get(path);if (mKeyGroup1 == null) {if (mKey == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);if (!mKeyMatches.booleanValue()) {return null;}}mKeyGroup1 = mKey.group(1);mKeyGroup1Cache.put(path, mKeyGroup1);}json = extract_json_withkey(json, mKeyGroup1);}// Cache indexListArrayList<String> indexList = indexListCache.get(path);if (indexList == null) {Matcher mIndex = patternIndex.matcher(path);indexList = new ArrayList<String>();while (mIndex.find()) {indexList.add(mIndex.group(1));}indexListCache.put(path, indexList);}if (indexList.size() > 0) {json = extract_json_withindex(json, indexList);}return json;}private transient AddingList jsonList = new AddingList();private static class AddingList extends ArrayList<Object> {@Overridepublic Iterator<Object> iterator() {return Iterators.forArray(toArray());}@Overridepublic void removeRange(int fromIndex, int toIndex) {super.removeRange(fromIndex, toIndex);}};@SuppressWarnings("unchecked")private Object extract_json_withindex(Object json, ArrayList<String> indexList) {jsonList.clear();jsonList.add(json);for (String index : indexList) {int targets = jsonList.size();if (index.equalsIgnoreCase("*")) {for (Object array : jsonList) {if (array instanceof List) {for (int j = 0; j < ((List<Object>)array).size(); j++) {jsonList.add(((List<Object>)array).get(j));}}}} else {for (Object array : jsonList) {int indexValue = Integer.parseInt(index);if (!(array instanceof List)) {continue;}List<Object> list = (List<Object>) array;if (indexValue >= list.size()) {continue;}jsonList.add(list.get(indexValue));}}if (jsonList.size() == targets) {return null;}jsonList.removeRange(0, targets);}if (jsonList.isEmpty()) {return null;}return (jsonList.size() > 1) ? new ArrayList<Object>(jsonList) : jsonList.get(0);}@SuppressWarnings("unchecked")private Object extract_json_withkey(Object json, String path) {if (json instanceof List) {List<Object> jsonArray = new ArrayList<Object>();for (int i = 0; i < ((List<Object>) json).size(); i++) {Object json_elem = ((List<Object>) json).get(i);Object json_obj = null;if (json_elem instanceof Map) {json_obj = ((Map<String, Object>) json_elem).get(path);} else {continue;}if (json_obj instanceof List) {for (int j = 0; j < ((List<Object>) json_obj).size(); j++) {jsonArray.add(((List<Object>) json_obj).get(j));}} else if (json_obj != null) {jsonArray.add(json_obj);}}return (jsonArray.size() == 0) ? null : jsonArray;} else if (json instanceof Map) {return ((Map<String, Object>) json).get(path);} else {return null;}}
}

需要导入的依赖,要和自己集群的版本契合,Hadoop 的版本及 Hive 的版本。

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><modelVersion>4.0.0</modelVersion><groupId>com.atguigu.hive</groupId><artifactId>hivetest</artifactId><version>1.0-SNAPSHOT</version><properties><hadoop.version>3.0.0</hadoop.version><hive.version>2.1.1</hive.version><jackson.version>1.9.2</jackson.version><guava.version>14.0.1</guava.version></properties><dependencies><dependency><groupId>org.apache.hive</groupId><artifactId>hive-exec</artifactId><version>${hive.version}</version></dependency><!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc --><dependency><groupId>org.apache.hive</groupId><artifactId>hive-jdbc</artifactId><version>${hive.version}</version></dependency><dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-common</artifactId><version>${hadoop.version}</version></dependency><dependency><groupId>com.google.guava</groupId><artifactId>guava</artifactId><version>${guava.version}</version></dependency><dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-core-asl</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-mapper-asl</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-jaxrs</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-xc</artifactId><version>${jackson.version}</version></dependency></dependencies></project>

注意: 因为上述UDF中也用到了 com.google.guavaorg.codehaus.jackson,所以这两个依赖要和 hive 版本中所用的依赖版本一致。

Hive-4.0.0 版本

package com.yan.hive.udf;import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;import com.fasterxml.jackson.core.json.JsonReadFeature;
import com.fasterxml.jackson.databind.JavaType;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Iterators;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;/*** @author Yan* @create 2023-08-05 22:21* hive解析json中文key*/
@Description(name = "get_json_object_cn",value = "_FUNC_(json_txt, path) - Extract a json object from path ",extended = "Extract json object from a json string based on json path "+ "specified, and return json string of the extracted json object. It "+ "will return null if the input json string is invalid.\n"+ "A limited version of JSONPath supported:\n"+ "  $   : Root object\n"+ "  .   : Child operator\n"+ "  []  : Subscript operator for array\n"+ "  *   : Wildcard for []\n"+ "Syntax not supported that's worth noticing:\n"+ "  ''  : Zero length string as key\n"+ "  ..  : Recursive descent\n"+ "  &amp;#064;   : Current object/element\n"+ "  ()  : Script expression\n"+ "  ?() : Filter (script) expression.\n"+ "  [,] : Union operator\n"+ "  [start:end:step] : array slice operator\n")//定义了一个名为UDFJson的类,继承自UDF类。
public class UDFGetJsonObjectCN extends UDF {//定义一个静态正则表达式模式,用于匹配JSON路径中的键。//匹配英文key:匹配一个或多个大写字母、小写字母、数字、下划线、连字符、冒号或空格。//private static final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s]+).*");//可以匹配中文,\\p{L}来匹配任意Unicode字母字符,包括中文字符:英文、数字、下划线、连字符、冒号、空格和中文字符。//private static final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s\\p{L}]+).*");//可以匹配中文,\\p{L}来匹配任意Unicode字母字符,包括中文字符,但不包含特殊字符,特殊字符需自己添加//private static final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s?%*+\\p{L}]+).*");//可以匹配中文,包含特殊字符,但不包含英文下的点(.);还有就是匹配不到路径中的索引了//private static final Pattern patternKey = Pattern.compile("^(.+).*");//可以匹配中文,包含特殊字符,不包中括号"[]",但不包含英文下的点(.);这样就可以匹配路径中的索引了private static final Pattern patternKey = Pattern.compile("^([^\\[\\]]+).*");//定义一个静态正则表达式模式,用于匹配JSON路径中的索引。private static final Pattern patternIndex = Pattern.compile("\\[([0-9]+|\\*)\\]");//创建一个ObjectMapper对象,用于解析JSON字符串。private static final ObjectMapper objectMapper = new ObjectMapper();//创建一个JavaType对象,用于表示Map类型。private static final JavaType MAP_TYPE = objectMapper.getTypeFactory().constructType(Map.class);//创建一个JavaType对象,用于表示List类型。private static final JavaType LIST_TYPE = objectMapper.getTypeFactory().constructType(List.class);//静态代码块,用于配置ObjectMapper的一些特性。static {// Allows for unescaped ASCII control characters in JSON valuesobjectMapper.enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS.mappedFeature());// Enabled to accept quoting of all character backslash qooting mechanismobjectMapper.enable(JsonReadFeature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER.mappedFeature());}// An LRU cache using a linked hash map//定义了一个静态内部类HashCache,用作LRU缓存。static class HashCache<K, V> extends LinkedHashMap<K, V> {private static final int CACHE_SIZE = 16;private static final int INIT_SIZE = 32;private static final float LOAD_FACTOR = 0.6f;HashCache() {super(INIT_SIZE, LOAD_FACTOR);}private static final long serialVersionUID = 1;@Overrideprotected boolean removeEldestEntry(Map.Entry<K, V> eldest) {return size() > CACHE_SIZE;}}//声明了一个名为extractObjectCache的HashMap对象,用于缓存已提取的JSON对象。Map<String, Object> extractObjectCache = new HashCache<String, Object>();//声明了一个名为pathExprCache的HashMap对象,用于缓存已解析的JSON路径表达式。Map<String, String[]> pathExprCache = new HashCache<String, String[]>();//声明了一个名为indexListCache的HashMap对象,用于缓存已解析的JSON路径中的索引列表。Map<String, ArrayList<String>> indexListCache =new HashCache<String, ArrayList<String>>();//声明了一个名为mKeyGroup1Cache的HashMap对象,用于缓存JSON路径中的键。Map<String, String> mKeyGroup1Cache = new HashCache<String, String>();//声明了一个名为mKeyMatchesCache的HashMap对象,用于缓存JSON路径中的键是否匹配的结果。Map<String, Boolean> mKeyMatchesCache = new HashCache<String, Boolean>();//构造函数,没有参数。public UDFGetJsonObjectCN() {}/*** Extract json object from a json string based on json path specified, and* return json string of the extracted json object. It will return null if the* input json string is invalid.** A limited version of JSONPath supported: $ : Root object . : Child operator* [] : Subscript operator for array * : Wildcard for []** Syntax not supported that's worth noticing: '' : Zero length string as key* .. : Recursive descent &amp;#064; : Current object/element () : Script* expression ?() : Filter (script) expression. [,] : Union operator* [start:end:step] : array slice operator** @param jsonString*          the json string.* @param pathString*          the json path expression.* @return json string or null when an error happens.*///evaluate方法,用于提取指定路径的JSON对象并返回JSON字符串。public Text evaluate(String jsonString, String pathString) {if (jsonString == null || jsonString.isEmpty() || pathString == null|| pathString.isEmpty() || pathString.charAt(0) != '$') {return null;}int pathExprStart = 1;boolean unknownType = pathString.equals("$");boolean isRootArray = false;if (pathString.length() > 1) {if (pathString.charAt(1) == '[') {pathExprStart = 0;isRootArray = true;} else if (pathString.charAt(1) == '.') {isRootArray = pathString.length() > 2 && pathString.charAt(2) == '[';} else {return null;}}// Cache pathExprString[] pathExpr = pathExprCache.get(pathString);if (pathExpr == null) {pathExpr = pathString.split("\\.", -1);pathExprCache.put(pathString, pathExpr);}// Cache extractObjectObject extractObject = extractObjectCache.get(jsonString);if (extractObject == null) {if (unknownType) {try {extractObject = objectMapper.readValue(jsonString, LIST_TYPE);} catch (Exception e) {// Ignore exception}if (extractObject == null) {try {extractObject = objectMapper.readValue(jsonString, MAP_TYPE);} catch (Exception e) {return null;}}} else {JavaType javaType = isRootArray ? LIST_TYPE : MAP_TYPE;try {extractObject = objectMapper.readValue(jsonString, javaType);} catch (Exception e) {return null;}}extractObjectCache.put(jsonString, extractObject);}for (int i = pathExprStart; i < pathExpr.length; i++) {if (extractObject == null) {return null;}extractObject = extract(extractObject, pathExpr[i], i == pathExprStart && isRootArray);}Text result = new Text();if (extractObject instanceof Map || extractObject instanceof List) {try {result.set(objectMapper.writeValueAsString(extractObject));} catch (Exception e) {return null;}} else if (extractObject != null) {result.set(extractObject.toString());} else {return null;}return result;}//extract方法,递归地提取JSON对象。private Object extract(Object json, String path, boolean skipMapProc) {// skip MAP processing for the first path element if root is arrayif (!skipMapProc) {// Cache patternkey.matcher(path).matches()Matcher mKey = null;Boolean mKeyMatches = mKeyMatchesCache.get(path);if (mKeyMatches == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);}if (!mKeyMatches.booleanValue()) {return null;}// Cache mkey.group(1)String mKeyGroup1 = mKeyGroup1Cache.get(path);if (mKeyGroup1 == null) {if (mKey == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);if (!mKeyMatches.booleanValue()) {return null;}}mKeyGroup1 = mKey.group(1);mKeyGroup1Cache.put(path, mKeyGroup1);}json = extract_json_withkey(json, mKeyGroup1);}// Cache indexListArrayList<String> indexList = indexListCache.get(path);if (indexList == null) {Matcher mIndex = patternIndex.matcher(path);indexList = new ArrayList<String>();while (mIndex.find()) {indexList.add(mIndex.group(1));}indexListCache.put(path, indexList);}if (indexList.size() > 0) {json = extract_json_withindex(json, indexList);}return json;}//创建一个名为jsonList的AddingList对象,用于存储提取出来的JSON对象。private transient AddingList jsonList = new AddingList();//定义了一个静态内部类AddingList,继承自ArrayList<Object>,用于添加JSON对象到jsonList中。private static class AddingList extends ArrayList<Object> {private static final long serialVersionUID = 1L;@Overridepublic Iterator<Object> iterator() {return Iterators.forArray(toArray());}@Overridepublic void removeRange(int fromIndex, int toIndex) {super.removeRange(fromIndex, toIndex);}};//extract_json_withindex方法,根据JSON路径中的索引提取JSON对象。@SuppressWarnings("unchecked")private Object extract_json_withindex(Object json, ArrayList<String> indexList) {jsonList.clear();jsonList.add(json);for (String index : indexList) {int targets = jsonList.size();if (index.equalsIgnoreCase("*")) {for (Object array : jsonList) {if (array instanceof List) {for (int j = 0; j < ((List<Object>)array).size(); j++) {jsonList.add(((List<Object>)array).get(j));}}}} else {for (Object array : jsonList) {int indexValue = Integer.parseInt(index);if (!(array instanceof List)) {continue;}List<Object> list = (List<Object>) array;if (indexValue >= list.size()) {continue;}jsonList.add(list.get(indexValue));}}if (jsonList.size() == targets) {return null;}jsonList.removeRange(0, targets);}if (jsonList.isEmpty()) {return null;}return (jsonList.size() > 1) ? new ArrayList<Object>(jsonList) : jsonList.get(0);}//extract_json_withkey方法,根据JSON路径中的键提取JSON对象。@SuppressWarnings("unchecked")private Object extract_json_withkey(Object json, String path) {if (json instanceof List) {List<Object> jsonArray = new ArrayList<Object>();for (int i = 0; i < ((List<Object>) json).size(); i++) {Object json_elem = ((List<Object>) json).get(i);Object json_obj = null;if (json_elem instanceof Map) {json_obj = ((Map<String, Object>) json_elem).get(path);} else {continue;}if (json_obj instanceof List) {for (int j = 0; j < ((List<Object>) json_obj).size(); j++) {jsonArray.add(((List<Object>) json_obj).get(j));}} else if (json_obj != null) {jsonArray.add(json_obj);}}return (jsonArray.size() == 0) ? null : jsonArray;} else if (json instanceof Map) {return ((Map<String, Object>) json).get(path);} else {return null;}}
}

依赖

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><modelVersion>4.0.0</modelVersion><groupId>com.atguigu.hive</groupId><artifactId>hivetest</artifactId><version>1.0-SNAPSHOT</version><properties><hadoop.version>3.3.1</hadoop.version><hive.version>4.0.0</hive.version><jackson.version>2.13.5</jackson.version><guava.version>22.0</guava.version></properties><dependencies><dependency><groupId>com.fasterxml.jackson</groupId><artifactId>jackson-bom</artifactId><version>${jackson.version}</version><type>pom</type><scope>import</scope></dependency><dependency><groupId>com.google.guava</groupId><artifactId>guava</artifactId><version>${guava.version}</version></dependency><dependency><groupId>com.fasterxml.jackson.core</groupId><artifactId>jackson-core</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>com.fasterxml.jackson.core</groupId><artifactId>jackson-databind</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>org.apache.hive</groupId><artifactId>hive-jdbc</artifactId><version>${hive.version}</version></dependency></dependencies></project>

Hive 版本不同,之间的依赖可能就有些许差距,如果不注意的话可能会报依赖错误。

参考文章:

get_json_object不能解析json里面中文的key

get_json_object源码

impala&hive自定义UDF解析json中文key

相关文章:

hive中get_json_object函数不支持解析json中文key

问题 今天在 Hive 中 get_json_object 函数解析 json 串的时候&#xff0c;发现函数不支持解析 json 中文 key。 例如&#xff1a; select get_json_object({ "姓名":"张三" , "年龄":"18" }, $.姓名);我们希望的结果是得到姓名对应…...

Azure VM上意外禁用NIC如何还原恢复

创建一个windows虚拟机&#xff0c;并远程连接管理员的方式打开powershell 首先查看虚拟网卡&#xff0c;netsh interface show interface 然后禁用虚拟网卡 ,netsh interface set interface Ethernet disable 去Azure虚拟机控制台&#xff0c;打开串行控制台 控制台中键入cmd,…...

神经网络简单理解:机场登机

目录 神经网络简单理解&#xff1a;机场登机 ​编辑 激活函数&#xff1a;转为非线性问题 ​编辑 激活函数ReLU 通过神经元升维&#xff08;神经元数量&#xff09;&#xff1a;提升线性转化能力 通过增加隐藏层&#xff1a;增加非线性转化能力​编辑 模型越大&#xff0c;…...

Sping源码(七)— 后置处理器

简单回顾一下上一篇文章&#xff0c;是在BeanFacroty创建完之后&#xff0c;可以通过Editor和EditorRegistrar实现对类属性的自定义扩展&#xff0c;以及忽略要自动装配的Aware接口。 本篇帖子会顺着refresh()主流程方法接着向下执行。在讲invokeBeanFactoryPostProcessors方法…...

docker导出、导入镜像、提交

导出镜像到本地&#xff0c;然后可以通过压缩包的方式传输。 导出&#xff1a;docker image save 镜像名:版本号 > /home/quxiao/javatest.tgz 导入&#xff1a;docker image load -i /home/quxiao/javatest.tgz 删除镜像就得先删除容器&#xff0c;当你每运行一次镜像&…...

shell的变量

一、什么是变量 二、变量的命名 三、查看变量的值 env显示全局变量&#xff0c;刚刚定义的root_mess是局部变量 四、变量的定义 旧版本&#xff08;7、8四个文件都加载&#xff09;和新版本&#xff08;9只加载两个etc&#xff09;不一样&#xff0c;所以su - 现在要永久生效在…...

CentOS系统环境搭建(十三)——CentOS7安装nvm

centos系统环境搭建专栏&#x1f517;点击跳转 CentOS7.9安装nvm 文章目录 CentOS7.9安装nvm1.安装2.刷新系统环境3.查看所有node4.安装Node.js版本5.查看已安装版本号6.使用指定版本7.设置默认版本8.验证 在我们的日常开发中经常会遇到这种情况&#xff1a;手上有好几个项目&…...

uniapp评论列表插件获取

从评论列表&#xff0c;回复&#xff0c;点赞&#xff0c;删除&#xff0c;留言板 - DCloud 插件市场里导入&#xff0c;并使用。 代码样式优化及接入如下&#xff1a; <template><view class"hb-comment"><!-- 阅读数-start --><view v-if&q…...

3.redis数据结构之List

List-列表类型:L&R 列表类型&#xff1a;有序、可重复 Arraylist和linkedlist的区别 Arraylist是使用数组来存储数据&#xff0c;特点&#xff1a;查询快、增删慢 Linkedlist是使用双向链表存储数据&#xff0c;特点&#xff1a;增删快、查询慢&#xff0c;但是查询链表两端…...

安装使用MySQL8遇到的问题记录

1、root密码 启动运行后 /var/log/mysqld.log 存在默认密码 2023-08-21T15:58:17.469516Z 0 [System] [MY-013169] [Server] /usr/sbin/mysqld (mysqld 8.0.34) initializing of server in progress as process 61233 2023-08-21T15:58:17.478009Z 1 [System] [MY-013576] [I…...

Mysql、Oracle 中锁表问题解决办法

MySQL中锁表问题的解决方法&#xff1a; 1. 确定锁定表的原因&#xff1a; 首先&#xff0c;需要确定是什么原因导致了表的锁定。可能的原因包括长时间的事务、大量的并发查询、表维护操作等。 2. 查看锁定信息&#xff1a; 使用以下命令可以查看当前MySQL数据库中的锁定信…...

AUTOSAR规范与ECU软件开发(实践篇)5.1 ETAS ISOLAR-A工具简介

前言 如前所述, 开发者可以先在系统级设计工具ISOLAR-A中设计软件组件框架, 包括端口接口、 端口等, 即创建各软件组件arxml描述性文件; 再将这些软件组件描述性文件导入到行为建模工具, 如Matlab/Simulink中完成内部行为建模。 亦可以先在行为建模工具中完成逻辑建模, 再…...

shell脚本——expect脚本免交互

目录 一.Here Document 1.1.定义 1.2.多行重定向 二.expect实现免交互 2.1.基础免交互改密码 2.2.expect定义 2.3.expect基本命令 2.4.expect实现免交互ssh主机 一.Here Document 1.1.定义 使用I/O重定向的方式将命令列表提供给交互式程序&#xff0c;是标准输 入的一…...

ubuntu18.04安装远程控制软件ToDest方法,针对官网指令报错情况

有时我们在家办公&#xff0c;需要控制实验室的笔记本&#xff0c;因此好用的远程控制软件会让我们的工作事半功倍&#xff01; 常用的远程控制软件有ToDesk&#xff0c;向日葵&#xff0c;以及TeamViewer&#xff0c;但是为感觉ToDesk更流畅一些&#xff0c;所以这里介绍一下…...

系统架构设计师之缓存技术:Redis持久化的两种方式-RDB和AOF

系统架构设计师之缓存技术&#xff1a;Redis持久化的两种方式-RDB和AOF...

以创新点亮前路,戴尔科技开辟数实融合新格局

编辑&#xff1a;阿冒 设计&#xff1a;沐由 2023年&#xff0c;对于戴尔科技而言是特殊的一年&#xff0c;这是戴尔科技进入中国市场第25个年头——“巧合”的是&#xff0c;这25年也是中国产业经济发展最快&#xff0c;人们工作与生活发生变化最大的四分之一个世纪。 2023年&…...

使用Pandas处理Excel文件

Excel工作表是非常本能和用户友好的&#xff0c;这使得它们非常适合操作大型数据集&#xff0c;即使是技术人员也不例外。如果您正在寻找学习使用Python在Excel文件中操作和自动化内容的地方&#xff0c;请不要再找了。你来对地方了。 在本文中&#xff0c;您将学习如何使用Pan…...

设计模式——接口隔离原则

文章目录 基本介绍应用实例应传统方法的问题和使用接口隔离原则改进 基本介绍 客户端不应该依赖它不需要的接口&#xff0c;即一个类对另一个类的依赖应该建立在最小的接口上先看一张图: 类 A 通过接口 Interface1 依赖类 B&#xff0c;类 C 通过接口 Interface1 依赖类 D&…...

黑客(网络安全)自学

想自学网络安全&#xff08;黑客技术&#xff09;首先你得了解什么是网络安全&#xff01;什么是黑客&#xff01; 网络安全可以基于攻击和防御视角来分类&#xff0c;我们经常听到的 “红队”、“渗透测试” 等就是研究攻击技术&#xff0c;而“蓝队”、“安全运营”、“安全…...

《Go 语言第一课》课程学习笔记(三)

构建模式&#xff1a;Go 是怎么解决包依赖管理问题的&#xff1f; Go 项目的布局标准是什么&#xff1f; 首先&#xff0c;对于以生产可执行程序为目的的 Go 项目&#xff0c;它的典型项目结构分为五部分&#xff1a; 放在项目顶层的 Go Module 相关文件&#xff0c;包括 go.…...

在WinForm里玩转Halcon 3D点云:从C#代码导出到完整UI显示的保姆级避坑指南

在WinForm里玩转Halcon 3D点云&#xff1a;从C#代码导出到完整UI显示的保姆级避坑指南 当工业视觉项目需要处理复杂的三维场景时&#xff0c;Halcon的3D点云处理能力往往成为开发者的首选。但将Halcon的强大算法无缝集成到C# WinForm应用中&#xff0c;却可能遭遇一系列"…...

CSS 容器查询:组件级响应式设计

CSS 容器查询&#xff1a;组件级响应式设计代码如诗&#xff0c;容器如画。让我们用容器查询的强大能力&#xff0c;创建真正自适应的组件。什么是容器查询&#xff1f; 容器查询&#xff08;Container Queries&#xff09;是 CSS 中一项革命性的特性&#xff0c;它允许我们根据…...

新手入门福音:用快马AI生成你的第一个Python版游戏账号管理工具

作为一个刚接触Python编程的新手&#xff0c;最近想尝试开发一个简单的游戏账号管理工具。这个需求其实挺常见的&#xff0c;比如我平时玩多个游戏&#xff0c;账号密码经常记混&#xff0c;如果能有个小工具统一管理就方便多了。在朋友的推荐下&#xff0c;我尝试用InsCode(快…...

pg_textsearch:革新Postgres文本搜索的现代工具

【导语&#xff1a;GitHub上的pg_textsearch是一款适用于Postgres的现代排名文本搜索工具&#xff0c;具备简单语法、可配置参数等特性&#xff0c;目前已达v1.0.0版本可用于生产环境&#xff0c;对Postgres文本搜索领域带来新变革。】pg_textsearch&#xff1a;Postgres文本搜…...

DevOps实践:如何让开发、测试、运维不再“打架”?

质量不再是孤岛在追求快速迭代的现代软件开发中&#xff0c;开发、测试与运维团队之间的隔阂与摩擦&#xff0c;常常被戏称为“部门战争”。开发团队渴望快速交付新功能&#xff0c;测试团队需要足够的时间来保障质量&#xff0c;而运维团队则首要追求系统的稳定与可靠。当发布…...

4大场景:如何用ReplaceItems脚本实现Illustrator批量设计元素智能替换

4大场景&#xff1a;如何用ReplaceItems脚本实现Illustrator批量设计元素智能替换 【免费下载链接】illustrator-scripts Adobe Illustrator scripts 项目地址: https://gitcode.com/gh_mirrors/il/illustrator-scripts 在UI设计和品牌视觉开发过程中&#xff0c;设计师…...

m4s-converter:释放B站缓存价值的格式转换利器

m4s-converter&#xff1a;释放B站缓存价值的格式转换利器 【免费下载链接】m4s-converter 一个跨平台小工具&#xff0c;将bilibili缓存的m4s格式音视频文件合并成mp4 项目地址: https://gitcode.com/gh_mirrors/m4/m4s-converter 价值对比&#xff1a;格式转换前后的效…...

在QT中将多个项目(同代码不同ui和资源文件)合并

Linux下的qt环境 我现在有三个项目&#xff0c;代码一模一样&#xff0c;只有UI文件和资源文件不同现在想要合并代码 后期好上传在git 仅需要一个分支 更好管理将随行 康养 采图三个项目代码合并 思路是这样的 将每个项目都分类打包区分开我是在康养这个项目的基础上合…...

对 OS:TEP 的 MLFQ 策略的一点思考

1.SJF 调度算法SJF 没啥好说的, 书上讲的很清楚了, SJF 就是最短任务优先原则, 其设计初衷是想解决 FIFO 的糟糕的周转时间的问题.但是, 正如书上所说, 这玩意主打一个秩序井然, 只能处理所有任务同时到队列的情况, 要是某堆进程不按这套路出牌, 那 SJF 立马完蛋, 书上就有一个…...

VS2022 + WinForms:从拖控件到写逻辑,手把手带你做出第一个C#计算器

VS2022 WinForms&#xff1a;从拖控件到写逻辑&#xff0c;手把手带你做出第一个C#计算器 第一次打开Visual Studio 2022时&#xff0c;那个闪亮的启动界面可能会让你既兴奋又不知所措。作为微软最新的集成开发环境&#xff0c;VS2022为C#开发者提供了强大的工具链&#xff0…...