# JDProcess

**Repository Path**: delphi_xk/JDProcess

## Basic Information

- **Project Name**: JDProcess
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2018-01-22
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

## Step1. 从csv,txt导入Spark建表
- header文件以“,”分隔，data文件以“\t”分隔，列数需要相同
- dbHome为数据仓库元数据存储的本机地址，迁移时需要拷贝整个目录，dataPath为数据文件路径，headerPath为表头文件路径(HDFS路径),fileNames为若干文件名用“,”拼接的字符串，例
- joinType为join操作类型，可取值left,right,full
```shell
dbHome=/export/grid/03/qiuyujiang/hyzs/dbHome/
dataPath=/user/hyzs/data/20180121/
headerPath=/user/hyzs/header/
fileNames=dmr_rec_a_ords_hyzs_s_d,dmr_rec_a_feas_hyzs_s_d
joinType=right
resTable=all_data

/soft/spark/bin/spark-submit --master yarn-client \
--class com.hyzs.spark.sql.JDDataProcess \
--num-executors 19 --driver-memory 10G --executor-memory 22G --executor-cores 3 \
--jars /soft/spark/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar \
--conf "spark.sql.shuffle.partitions=1500" \
--conf "spark.driver.extraJavaOptions=-Dderby.system.home=${dbHome}" \
--conf "spark.processJob.dataPath=${dataPath}" \
--conf "spark.processJob.headerPath=${headerPath}" \
--conf "spark.processJob.fileNames=${fileNames}" \
--conf "spark.processJob.joinType=${joinType}" \
--conf "spark.processJob.resultTable=${resTable}" \ 
/export/grid/03/qiuyujiang/hyzs/libs/DataProcess.jar [args0] [args1] [args2]
```
-------
参数名 | 取值 | 含义
--- | --- | --- 
**args0** | import_business | 导入订单表（有额外流水表处理，默认为第一张表）
 | import_info | 导入标签信息表（无其他处理，默认第二张表开始的之后所有表）
 | import_all | 同时执行前两步操作
 | skip_import | 不执行导入操作
**args1** | join | 执行合表操作
 | skip_join | 跳过合表操作
**args2** | train | 生成train数据集
 | test | 生成test数据集

## Step2. 执行Spark任务，生成Libsvm文件

### 2.1**训练模型**
- 会生成train, valid, test三个数据集
- 以下路径为HDFS路径，可自定义
- 结果会在 hdfs://${basePath}/output/point/ 下生成三个任务目录m1, m2, m3（暂定的三个任务），每个目录下会生成对应train，valid，test目录

```shell
dbHome=/export/grid/03/qiuyujiang/hyzs/dbHome/
basePath=/user/hyzs

for task in m1 m2 m3
do
  for data_type in train valid test
  do
  result_path=${basePath}/output/point/${task}/${data_type}/
  table_name=hyzs.${task}_${data_type}
  obj_file=${basePath}/output/point/${task}/train/hyzs.${task}_train.obj
  label_table=hyzs.${task}_label

  /soft/spark/bin/spark-submit \
  --class huacloud.convertLibSVM.ConvertLibSvmFF \
  --master yarn-client \
  --num-executors 19 --driver-memory 10G --executor-memory 22G \
  --conf 'spark.driver.maxResultSize=2048' \
  --conf "spark.driver.extraJavaOptions=-Dderby.system.home=${dbHome}" \
  --jars /soft/spark/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  -v /export/grid/03/qiuyujiang/hyzs/libs/convertLibSVM.jar \
  "{'data_type':'"${data_type}"','table_name':'"${table_name}"',
  'result_path':'"${result_path}"','client_no':'user_id_md5','obj_file':'"${obj_file}"'}"

  sleep 5s

  /soft/spark/bin/spark-submit \
  --class huacloud.convertLibSVM.LibsvmffToLimsvm \
  --master yarn-client \
  --num-executors 19 --driver-memory 10G --executor-memory 22G \
  --conf "spark.driver.extraJavaOptions=-Dderby.system.home=${dbHome}" \
  --jars /soft/spark/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar \
  -v /export/grid/03/qiuyujiang/hyzs/libs/convertLibSVM.jar \
  "{'data_type':'"${data_type}"','libsvmff_file':'"${result_path}${table_name}".libsvmff',
  'label_table':'"${label_table}"','result_path':'"${result_path}"',
  'client_no_colume':'user_id_md5','label_colume':'label'}"

  sleep 5s

  done
done

```

### 2.2**预测数据**
- 只需生成test数据

```shell
dbHome=/export/grid/03/qiuyujiang/hyzs/dbHome/
basePath=/user/hyzs
data_type=test

for task in m1 m2 m3
do
    result_path=${basePath}/output/point/${task}/${data_type}/
    table_name=hyzs.${task}_${data_type}
    obj_file=${basePath}/output/point/${task}/train/hyzs.${task}_train.obj
    label_table=hyzs.${task}_label

    /soft/spark/bin/spark-submit \
    --class huacloud.convertLibSVM.ConvertLibSvmFF \
    --master yarn-client \
    --num-executors 19 --driver-memory 10G --executor-memory 22G \
    --conf 'spark.driver.maxResultSize=2048' \
    --jars /soft/spark/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar \
    -v /export/home/hcfruser/convertLibSVM.jar \
    "{'data_type':'"${data_type}"','table_name':'"${table_name}"',
    'result_path':'"${result_path}"','client_no':'user_id_md5','obj_file':'"${obj_file}"'}"

    sleep 5s

    /soft/spark/bin/spark-submit \
    --class huacloud.convertLibSVM.LibsvmffToLimsvm \
    --master yarn-client \
    --num-executors 19 --driver-memory 10G --executor-memory 22G \
    --jars /soft/spark/lib/hadoop-lzo-0.4.20-SNAPSHOT.jar \
    -v /export/home/hcfruser/convertLibSVM.jar \
    "{'data_type':'"${data_type}"','libsvmff_file':'"${result_path}${table_name}".libsvmff',
    'label_table':'"${label_table}"','result_path':'"${result_path}"',
    'client_no_colume':'user_id_md5','label_colume':'label'}"

    sleep 5s
done
```


## 3 建模预测，并上传结果

### 3.1 **训练模型**

- 将HDFS上生成文件拖到本地**localBasePath**下
- 将模型预测结果上传**hbasePath**
- sample_path为训练libsvm文件
- validation_path为validation文件路径

```shell
localBasePath=$1
hbasePath=$2

echo "localBasePath=${localBasePath}"
echo "hbasePath=${hbasePath}"

if [ ! -d ${localBasePath} ]; then
  mkdir ${localBasePath}
fi

hdfs dfs -get /user/hyzs/output/point/* ${localBasePath}
for task in m1 m2 m3
do
  if [ ! -d "${localBasePath}/${task}/result" ]; then
      mkdir ${localBasePath}/${task}/result
  fi

  python /export/grid/03/qiuyujiang/hyzs/hyzs_shell/js_run/train.py \
    --params '{"sample_path":"'${localBasePath}'/'${task}'/train/hyzs.'${task}'_train.libsvm",
    "sample_auc_path":"'${localBasePath}'/'${task}'/result/log.gbt-train-auc",
    "sample_name_path":"'${localBasePath}'/'${task}'/train/hyzs.'${task}'_train.name",
    "sample_index_path":"'${localBasePath}'/'${task}'/train/hyzs.'${task}'_train.index",
    "validation_path":"'${localBasePath}'/'${task}'/valid/hyzs.'${task}'_valid.libsvm",
    "validation_auc_path":"'${localBasePath}'/'${task}'/result/log.gbt-valid-auc",
    "features_path":"'${localBasePath}'/'${task}'/result/'${task}'.feature",
    "tree_path":"'${localBasePath}'/'${task}'/result/'${task}'.tree",
    "feature_explain_path":"'${localBasePath}'/'${task}'/result/'${task}'.features-epl",
    "model_store_path":"'${localBasePath}'/'${task}'/result/'${task}'.model",
    "objective":"reg:linear","eval_metric":"auc","bst:eta":"0.01","subsample":"0.5","bst:max_depth":"5","gamma":"0.5"}'

  python /export/grid/03/qiuyujiang/hyzs/hyzs_shell/js_run/predict.py \
    --path_test ${localBasePath}/${task}/test/hyzs.${task}_test.libsvm \
    --path_pred ${localBasePath}/${task}/result/hyzs.${task}_test.pred \
    --path_model ${localBasePath}/${task}/result/${task}.model

  python /export/grid/03/qiuyujiang/hyzs/hyzs_shell/concat_index.py \
    ${localBasePath}/${task}/test/hyzs.${task}_test.index \
    ${localBasePath}/${task}/result/hyzs.${task}_test.pred \
    ${localBasePath}/${task}/result/hyzs.${task}_test.result

  sed -i "s/$/&,${task}/g" ${localBasePath}/${task}/result/hyzs.${task}_test.result
  hdfs dfs -put ${localBasePath}/${task}/result/hyzs.${task}_test.result ${hbasePath}

done

echo model training step 3 finished
```

### 3.2 **预测数据**
- 只需生成test数据，根据模型文件，输入需要预测的文件，得到预测结果

```shell

localBasePath=$1
hbasePath=$2

echo "localBasePath=${localBasePath}"
echo "hbasePath=${hbasePath}"

if [ ! -d ${localBasePath} ]; then
  mkdir ${localBasePath}
fi

hdfs dfs -get /user/hyzs/output/point/* ${localBasePath}

for task in m1 m2 m3
do
  if [ ! -d "${localBasePath}/${task}/result" ]; then
    mkdir ${localBasePath}/${task}/result
  fi

  python /export/grid/03/qiuyujiang/hyzs/hyzs_shell/js_run/predict.py \
    --path_test ${localBasePath}/${task}/test/hyzs.${task}_test.libsvm \
    --path_pred ${localBasePath}/${task}/result/hyzs.${task}_test.pred \
    --path_model ${localBasePath}/${task}/result/${task}.model

  python /export/grid/03/qiuyujiang/hyzs/hyzs_shell/concat_index.py \
    ${localBasePath}/${task}/test/hyzs.${task}_test.index \
    ${localBasePath}/${task}/result/hyzs.${task}_test.pred \
    ${localBasePath}/${task}/result/hyzs.${task}_test.result

  sed -i "s/$/&,${task}/g" ${localBasePath}/${task}/result/hyzs.${task}_test.result
  hdfs dfs -put ${localBasePath}/${task}/result/hyzs.${task}_test.result ${hbasePath}

done

echo model training step 3 finished

```


## 4 结果说明

- 结果文件在对应模型的result目录下的.result文件
- 结果文件格式为: id, prediction, hbaseCol

-------
建模任务名 | HBase列名
--- | ---
consume | m1
value | m2
risk | m3
class1 | c1
class2 | c2
class3 | c3
class4 | c4
class5 | c5
class6 | c6
class7 | c7