# DataLineageResearch **Repository Path**: li-zhangzheng/data-lineage-research ## Basic Information - **Project Name**: DataLineageResearch - **Description**: 数据平台建设过程中需要对数据血缘进行解析,通过对血缘数据的探索,可以快速获取数据,加快数据开发的效率。 对报表SQL进行解析获取语法树,从而进一步获取列级的血缘,生成报表页面与数据源的关系。 - **Primary Language**: Unknown - **License**: Not specified - **Default Branch**: master - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 18 - **Created**: 2022-06-22 - **Last Updated**: 2022-06-22 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # 数据血缘相似度探索 v1.0 数据平台建设过程中需要对数据血缘进行解析,通过对血缘数据的探索,可以快速获取数据,加快数据开发的效率。 对报表SQL进行解析获取语法树,从而进一步获取列级的血缘,生成报表页面与数据源的关系。 ![图1 报表血缘](pictures/1.png) 单张报表的血缘关系可以按照列级字段拆分成相应的血缘信息: ![图2 血缘关系](pictures/2.png) 基于血缘信息可以计算不同报表之间血缘的相似程度,从而避免过多报表的重复开发与数据冗余。 ![图3 血缘相似度](pictures/3.png) ## 如何获取SQL解析树与血缘信息 可以通过语法树的解析得到列级的血缘。那么怎么得到语法树呢? - [alibaba-druid](https://github.com/alibaba/druid/wiki/Druid_SQL_AST) - [antlr4](https://github.com/antlr/antlr4) 可以通过以上两种工具的使用,得到语法树,那么剩下的问题就是如何将语法树解析为血缘信息。 本版本使用一个Java POJO对象 LineageColumn 对报表中的列级血缘进行存储: name | desc ---|--- targetColumnName | 目标字段,即SELECT的列 sourceDbName | 字段来源DB sourceTableName | 字段来源表 sourceColumnName | 字段来源列 expression | 表达式 isEnd | 是否结束标识 通过对语法树的头结点进行递归获取子树,最终获得树的叶子节点即为最终的血缘信息,以SQL为例 ``` select user_id as uid ,user_name as uname from ( select user_id, concat("test",user_name) as user_name from user )t ``` 需要经过两层递归,过程为 ![图4 血缘关系](pictures/4.png) 最终血缘信息为 ``` --- 输出列uid 数据来源于user表的user_id 列 uid from:{"expression":"user_id","isEnd":true,"sourceTableName":"user","targetColumnName":"user_id"} -- 输出列uname 数据来源于常量字段test及 user表的user_name 列 uname from:{"expression":"concat('test', user_name)","isEnd":true,"sourceTableName":"user","targetColumnName":"user_name"} uname from:{"expression":"concat('test', user_name)","isEnd":true,"sourceTableName":"user","targetColumnName":"'test'"} ``` ## 如何计算血缘信息相似度 version 1.0 对于报表A与报表B,分别解析两张报表SQL可以获取到血缘信息A与血缘信息B。 根据血缘信息A与血缘信息B的LineageColumn对象可以计算出血缘信息之间的相似度: 血缘信息相似度初步定义如下: ![图5 相似度](pictures/7.png) ![图6 相似度](pictures/6.png) ## 相似度测试结果 version 1.0 基于已爬取的76张可以被正确解析的SQL脚本, 分别生成脚本对应的血缘数据, 并对血缘数据之间两两计算相似度,可以得到2850个无重复两两组合,计算出相似度分布如下图: ![图7 相似度](pictures/myplot.png) x轴为相似度范围,纵轴为2850个组合中在该相似度范围内的组合数。 #### 以 相似度为 92.86% 的两张报表血缘为例: ``` -- 报表A select date_format(imp_date,'%Y-%m-%d') as imp_date, case extra when 'icon' then '主动' when 'push' then 'push推送' when 'weixin' then '微信' when 'mobileQQpush' then '手Q' else 'other' end as extra,sum(value2) as uv from t_1 where imp_date>='startDate' and imp_date<='endDate' and indicator in ('indicator') and subin=0 and extra in ('icon','push','weixin','mobileQQpush') group by imp_date,case extra when 'icon' then '主动' when 'push' then 'push推送' when 'weixin' then '微信' when 'mobileQQpush' then '手Q' else 'other' end order by imp_date ``` ``` -- 报表B select imp_date, case extra when 'icon' then '主动' when 'push' then 'push推送' when 'weixin' then '微信' when 'mobileQQpush' then '手Q' end as extra,sum(value1) as pv from t_1 where imp_date>='startDate' and imp_date<='endDate' and indicator in ('indicator') and subin=0 and extra<>'other' group by imp_date,extra order by imp_date ``` 相似度构成如下: 字段数量相似度:1.0 字段来源DB相似度:1.0 字段来源表相似度:1.0 字段来源列相似度:0.71 由于表名中默认不含有库名信息,所以默认在同一个库中,即字段来源DB相似度为1,可以看出 报表A和报表B之间的差别在于uv字段的字段来源列不同,而其他的大部分血缘信息包括表名库名等信息都是完全一致的。 *** #### 以 相似度为 50.83% 的两张报表血缘为例: ``` -- 报表A select date_format(imp_date,'%Y-%m-%d') as imp_date, case extra when 'icon' then '主动' when 'push' then 'push推送' when 'weixin' then '微信' when 'mobileQQpush' then '手Q' else 'other' end as extra,sum(value2) as uv from t_1 where imp_date>='startDate' and imp_date<='endDate' and indicator in ('indicator') and subin=0 and extra in ('icon','push','weixin','mobileQQpush') group by imp_date,case extra when 'icon' then '主动' when 'push' then 'push推送' when 'weixin' then '微信' when 'mobileQQpush' then '手Q' else 'other' end order by imp_date ``` ``` -- 报表B select imp_date, case when c1 = 1 and c2 = 1 and c3 = 1 then '顶部下拉刷新' when c1 = 1 and c2 = 1 and c3 = 2 then '频道tab点击刷新' when c1 = 1 and c2 = 1 and c3 = 3 then '中部主动刷新p' when c1 = 1 and c2 = 1 and c3 = 4 then '底部tab栏点击刷新' when c1 = 1 and c2 = 1 and c3 = 9999 then '顶部主动刷新' when c1 = 1 and c2 = 2 and c3 = 9999 then '底部主动刷新' when c1 = 1 and c2 = 9999 and c3 = 9999 then '主动刷新' when c1 = 2 and c2 = 9999 and c3 = 9999 then '自动刷新' when c1 = 9999 and c2 = 9999 and c3 = 9999 then '汇总' end mode , sum(uv) uv from t_2 where imp_date>='startDate' and imp_date<='endDate' and ssitename='9999' and ostype = 'ostype' and (ista is null or ista = '-') group by imp_date,c1,c2,c3 ``` 相似度构成如下: 字段数量相似度:1.0 字段来源DB相似度:1.0 字段来源表相似度:0.0 字段来源列相似度:0.1 由于两张报表所使用的数据源表完全不同,因此字段来源表相似度为0,但是两张报表的字段来源列中存在相同的字段命名 imp_date ,因此存在一定的字段来源列相似度。 *** #### 以 相似度为 4.79% 的两张报表血缘为例: ``` select imp_date, case devtype when 1 then 'all' when 2 then 'android' when 3 then 'ipad' when 4 then 'iphone' else devtype end as devtype, max(case indicator when 105 then value end ) value_1, max(case indicator when 105 then concat(rate1,'%') end ) rate1_1, max(case indicator when 105 then concat(rate2,'%') end ) rate2_1, max(case indicator when 106 then concat(value,'%') end ) value_2, max(case indicator when 106 then concat(rate1,'%') end ) rate1_2, max(case indicator when 106 then concat(rate2,'%') end ) rate2_2 from ( select imp_date,indicator,devtype,value,rate1,rate2 from t_1 where imp_date>='startDate' and imp_date<='endDate' and indicator in ('indicator') and subin=0 and devtype in (devtype) ) t group by imp_date,devtype order by imp_date desc,devtype ``` ``` select a.imp_date as imp_date, dau, pv, uv, avg_pv, video_pv, video_uv,avg_video_pv,pic_pv, pic_uv,avg_pic_pv, play_vv, play_uv,avg_play_vv,play_video_vv, play_video_uv,avg_play_video_vv,play_pic_vv, play_pic_uv,avg_play_pic_vv, video_click_vv, video_click_uv from( select distinct imp_date, pv, uv, ROUND((pv) / (uv) , 2) AS avg_pv, video_pv, video_uv, ROUND((video_pv) / (video_uv) , 2) AS avg_video_pv, pic_pv, pic_uv, ROUND((pic_pv) / (pic_uv) , 2) AS avg_pic_pv, (play_video_vv+play_pic_vv) as play_vv, play_uv, ROUND((play_video_vv+play_pic_vv) / (play_uv) , 2) AS avg_play_vv, play_video_vv, play_video_uv, ROUND((play_video_vv) / (play_video_uv) , 2) AS avg_play_video_vv, play_pic_vv, play_pic_uv, ROUND((play_pic_vv) / (play_pic_uv) , 2) AS avg_play_pic_vv, video_click_vv, video_click_uv from t_2 where imp_date>='startDate' and imp_date <= 'endDate' and (aboutista is null or aboutista = 'all') and (site_name = 'all') and (activefrom is null or activefrom='all') ) a join ( select imp_date,value as dau from t_3 where indicator='103' and devtype = 1 and imp_date>='startDate' and imp_date <= 'endDate' )b on a.imp_date = b.imp_date order by imp_date desc ``` 相似度构成如下: 字段数量相似度:0.06 字段来源DB相似度:0 字段来源表相似度:0.0 字段来源列相似度:0.1 由于两张报表的字段来源DB和字段来源表均不同,因此相似度在很低的一个取值范围内。 *** ### 测试代码 测试报表SQL数据文件路径:[src/test/resources/sql_data](src/test/resources/sql_data) 测试入口java脚本路径:[src/test/java/com/lan/lineage/druid/LineageCalculator.java](src/test/java/com/lan/lineage/druid/LineageCalculator.java)