Large-vs-small table join: the two tables differ in size by more than 3x. Here the large table is 100 GB and the small one 10 GB; both far exceed the broadcast variable threshold.
Count visits in the logs grouped by the same combination of dimensions:
-- Left-join query
select t1.gender, t1.age, t1.city, t1.platform, t1.site, t1.hour,
       coalesce(t2.payload, t1.payload)
from t1 left join t2
  on t1.gender = t2.gender and t1.age = t2.age and t1.city = t2.city
 and t1.platform = t2.platform and t1.site = t2.site and t1.hour = t2.hour;
Background: the two tables have exactly the same schema. Generate a hash key from the Join Keys; once the inner table is small enough to be turned into a broadcast variable, the SMJ can be converted into a BHJ.
-- Adjusted left-join query
select t1.gender, t1.age, t1.city, t1.platform, t1.site, t1.hour,
       coalesce(t2.payload, t1.payload)
from t1 left join t2 on t1.hash_key = t2.hash_key;
hash_key is the hash of the Join Keys. Since hashing is involved, hash collisions must be accounted for.
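A minimal sketch in plain Python of deriving a single hash_key from the six join keys (column values and the separator choice are assumptions; Spark itself offers builtins such as sha2 and concat_ws for this). Using a separator between columns and a cryptographic hash keeps accidental collisions negligible:

```python
import hashlib

def hash_key(*join_keys):
    # Join the key columns with a separator before hashing.
    # Without a separator, ("ab", "c") and ("a", "bc") would
    # concatenate to the same string and collide spuriously.
    raw = "\x1f".join(str(k) for k in join_keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# The same key combination yields the same hash_key,
# so the equi-join on hash_key still matches correctly.
a = hash_key("male", 25, "Beijing", "web", "home", 13)
b = hash_key("male", 25, "Beijing", "web", "home", 13)
c = hash_key("female", 25, "Beijing", "web", "home", 13)
```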
Compute the revenue contributed by all head users, sorted by revenue in descending order:
-- Query, with a filter condition on the dimension table
select sum(orders.price * orders.quantity) as revenue, users.name
from orders join users on orders.userId = users.id
where users.type = 'Head User'
group by users.name
order by revenue desc
Because the dimension table carries a filter condition, we can rely on AQE to adjust the join strategy dynamically at runtime and convert the SMJ into a BHJ (spark.sql.adaptive.enabled must be on).
The filtered dimension table must also drop below spark.sql.autoBroadcastJoinThreshold before AQE can demote the SMJ to a BHJ. And because this dynamic join adjustment happens mid-query, the Shuffle's data distribution has already been triggered by then.
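As a sketch (session name and threshold value are illustrative, not prescriptive), the relevant settings in PySpark could look like:

```python
from pyspark.sql import SparkSession

# Sketch: enable AQE so Spark can demote an SMJ to a BHJ at runtime
# once the filtered table drops below the broadcast threshold.
spark = (
    SparkSession.builder
    .appName("aqe-demo")  # illustrative name
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.autoBroadcastJoinThreshold", "10485760")  # 10 MB
    .getOrCreate()
)
```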
Prerequisites for using DPP:
the filtered dimension table must fit under spark.sql.autoBroadcastJoinThreshold;
orders must be rebuilt as a partitioned table (orders_new), partitioned on the Join Key (userId).
-- Query against the partitioned fact table
select sum(orders_new.price * orders_new.quantity) as revenue, users.name
from orders_new join users on orders_new.userId = users.id
where users.type = 'Head User'
group by users.name
order by revenue desc
To benefit from DPP, design table structures up front during data warehouse planning, combining common queries with typical scenarios: schema, partition keys, storage format, and so on.
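The DPP idea above can be illustrated with a toy, single-process sketch (all table data is made up; real Spark does this distributed at the scan level): the filter on the dimension table is evaluated first, and only the fact-table partitions whose key survives are ever read.

```python
# Dimension table with the filter column.
users = [
    {"id": 1, "name": "Alice", "type": "Head User"},
    {"id": 2, "name": "Bob",   "type": "Tail User"},
    {"id": 3, "name": "Carol", "type": "Head User"},
]

# Fact table stored as one "partition" (list of rows) per userId.
orders_new = {
    1: [{"price": 10.0, "quantity": 2}],
    2: [{"price": 99.0, "quantity": 1}],
    3: [{"price": 5.0,  "quantity": 3}],
}

# Step 1: filter the dimension table.
head_ids = {u["id"] for u in users if u["type"] == "Head User"}

# Step 2: prune — only partitions whose key passed the filter are scanned.
scanned = {uid: rows for uid, rows in orders_new.items() if uid in head_ids}

# Step 3: aggregate revenue per user over the pruned partitions only.
revenue = {
    uid: sum(r["price"] * r["quantity"] for r in rows)
    for uid, rows in scanned.items()
}
```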
When BHJ is not possible, Spark picks SMJ first. But when the two join tables are close in size and the data is evenly distributed, SHJ can be more efficient than SMJ.
Compute the revenue contributed by every user:
-- Query
select sum(orders.price * orders.quantity) as revenue,
       users.name
from orders join users on orders.userId = users.id
group by users.name
order by revenue desc
Use Join Hints to force SHJ:
-- Query with the join hint
select /*+ shuffle_hash(orders) */
       sum(orders.price * orders.quantity) as revenue, users.name
from orders join users on orders.userId = users.id
group by users.name
order by revenue desc
For SHJ to complete successfully without throwing an OOM exception, every data partition of the small table must fit in memory.
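That constraint can be seen in a toy, single-process sketch of a shuffle hash join (names and data are made up; real Spark runs this distributed): both sides are partitioned by join key, then each small-side partition is loaded into an in-memory hash table and probed by the big side — so each such partition must fit in memory.

```python
from collections import defaultdict

def shuffle(rows, key, n_parts):
    # "Shuffle": route each row to a partition by hashing its join key.
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def shuffle_hash_join(small, big, key, n_parts=4):
    small_parts = shuffle(small, key, n_parts)
    big_parts = shuffle(big, key, n_parts)
    out = []
    for p in range(n_parts):
        # Build an in-memory hash table over the small side's partition;
        # this is the piece that must fit in memory to avoid OOM.
        table = defaultdict(list)
        for row in small_parts.get(p, []):
            table[row[key]].append(row)
        # Probe the hash table with the big side's partition.
        for row in big_parts.get(p, []):
            for match in table.get(row[key], []):
                out.append({**match, **row})
    return out

users = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}]
orders = [{"id": 1, "price": 10.0}, {"id": 1, "price": 3.0},
          {"id": 3, "price": 7.0}]
joined = shuffle_hash_join(users, orders, "id")
```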