diff --git a/.gitignore b/.gitignore
index f5f4ccba45b462f91932967f32d6e62396b20439..0bdf2b68aad5b0ab2f7b6da79ed5e620ef5396c2 100644
--- a/.gitignore
+++ b/.gitignore
@@ -3,4 +3,7 @@ kperf.data.*
env.sh
hostfile
.vscode
-test.*
\ No newline at end of file
+test.*
+porting*
+HPC-info*
+tmp
\ No newline at end of file
diff --git a/README.en.md b/README.en.md
deleted file mode 100644
index 562604bd3c5379032163046529af39e46e706d8d..0000000000000000000000000000000000000000
--- a/README.en.md
+++ /dev/null
@@ -1,36 +0,0 @@
-# hpcrunner
-
-#### Description
-openEuler High Performance Computing(HPC) Runner, provides universal portal for hpc users and developers.
-
-#### Software Architecture
-Software architecture description
-
-#### Installation
-
-1. xxxx
-2. xxxx
-3. xxxx
-
-#### Instructions
-
-1. xxxx
-2. xxxx
-3. xxxx
-
-#### Contribution
-
-1. Fork the repository
-2. Create Feat_xxx branch
-3. Commit your code
-4. Create Pull Request
-
-
-#### Gitee Feature
-
-1. You can use Readme\_XXX.md to support different languages, such as Readme\_en.md, Readme\_zh.md
-2. Gitee blog [blog.gitee.com](https://blog.gitee.com)
-3. Explore open source project [https://gitee.com/explore](https://gitee.com/explore)
-4. The most valuable open source project [GVP](https://gitee.com/gvp)
-5. The manual of Gitee [https://gitee.com/help](https://gitee.com/help)
-6. The most popular members [https://gitee.com/gitee-stars/](https://gitee.com/gitee-stars/)
diff --git a/README.md b/README.md
index 239053b5b3028f1c99fafc7723a5d542ab98a938..2cb97f9facd3572b7fc794f90b9eec92068b063e 100644
--- a/README.md
+++ b/README.md
@@ -1,21 +1,40 @@
-# HPCRunner : 贾维斯辅助系统
-### 项目背景
+# HPCRunner : 贾维斯智能助手
+## ***给每个HPC应用一个温暖的家***
-因为HPC应用的特殊性,其环境配置、编译、运行、CPU/GPU性能采集分析的门槛比较高,导致迁移和调优的工作量大,不同的人在不同的机器上跑同样的软件和算例基本上是重头开始,费时费力,而且很多情况下需要同时部署ARM/X86两套环境进行验证,增加了很多的重复性工作。
+### 项目背景
+因为HPC应用的复杂性,其依赖安装、环境配置、编译、运行、CPU/GPU性能采集分析的门槛比较高,导致迁移和调优的工作量大,不同的人在不同的机器上跑同样的应用和算例基本上是重头开始,费时费力,而且很多情况下需要同时部署鲲鹏/X86两套环境进行验证,增加了很多的重复性工作,无法聚焦软件算法优化。
-### 解决方案
+### 项目特色
-- 提供支持ARM/X86的统一接口,一键生成环境脚本、一键编译、一键运行、一键性能采集、一键Benchmark等功能.
+- 支持鲲鹏/X86,一键下载依赖,一键安装依赖、采用业界权威依赖目录结构管理海量依赖,自动生成module file
+- 根据HPC配置一键生成环境脚本、一键编译、一键运行、一键性能采集、一键Benchmark.
- 所有配置仅用一个文件记录,HPC应用部署到不同的机器仅需修改配置文件.
- 日志管理系统自动记录HPC应用部署过程中的所有信息.
-- 常用HPC工具软件开箱即用,提供GCC/毕昇/icc版本,支持一键module加载.
-- 软件本身开箱即用,仅依赖Python环境.
+- 常用HPC工具软件开箱即用.
+- 软件本身无需编译开箱即用,仅依赖Python环境.
- (未来) 集成HPC领域常用性能调优手段、核心算法.
- (未来) 集群性能分析工具.
- (未来) 智能调优.
- (未来) HPC应用[容器化](https://catalog.ngc.nvidia.com/orgs/hpc/containers/quantum_espresso).
+### 目录结构
+
+| 目录/文件 | 说明 | 备注 |
+| --------- | ---------------------------------- | -------- |
+| benchmark | 矩阵运算、OpenMP、MPI、P2P性能测试 | |
+| doc | 文档 | |
+| downloads | 存放依赖库源码包/压缩包 | |
+| examples | 性能小实验 | |
+| package | 存放安装脚本和FAQ | |
+| software | 依赖库二进制仓库 | 自动生成 |
+| src | 贾维斯源码 | |
+| templates | 常用HPC应用的配置模板 | |
+| test | 贾维斯测试用例 | |
+| workload | 常用HPC应用的算例合集 | |
+| init.sh | 贾维斯初始化文件 | |
+| jarvis | 贾维斯启动入口 | |
+
### 已验证HPC应用
分子动力学领域:
@@ -36,60 +55,145 @@
- [x] OpenFOAM
+
### 使用说明
1.下载包解压之后初始化
-`source init.sh`
+```
+source init.sh
+```
+
+2.修改data.config或者套用现有模板,各配置项说明如下所示:
+
+| 配置项 | 说明 | 示例 |
+| :----------: | :--------------------------------------------------------- | :----------------------------------------------------------- |
+| [SERVER] | 服务器节点列表,多节点时用于自动生成hostfile,每行一个节点 | 11.11.11.11 |
+| [DOWNLOAD] | 每行一个软件的版本和下载链接,默认下载到downloads目录 | cmake/3.16.4 https://cmake.org/files/v3.16/cmake-3.16.4.tar.gz |
+| [DEPENDENCY] | HPC应用依赖安装脚本 | ./jarvis -install gcc/9.3.1 com<br>module use ./software/modulefiles<br>module load gcc9 |
+| [ENV]        | HPC应用编译运行环境配置 | source env.sh |
+| [APP]        | HPC应用信息,包括应用名、构建路径、二进制路径、算例路径 | app_name = CP2K<br>build_dir = /home/cp2k-8.2/<br>binary_dir = /home/CP2K/cp2k-8.2/bin/<br>case_dir = /home/CP2K/cp2k-8.2/benchmarks/QS/ |
+| [BUILD]      | HPC应用构建脚本 | make -j 128 |
+| [CLEAN]      | HPC应用编译清理脚本 | make -j 128 clean |
+| [RUN]        | HPC应用运行配置,包括前置命令、应用命令和节点个数 | run = mpi<br>binary = cp2k.psmp H2O-256.inp<br>nodes = 1 |
+| [BATCH]      | HPC应用批量运行命令 | #!/bin/bash<br>nvidia-smi -pm 1<br>nvidia-smi -ac 1215,1410 |
+| [PERF] | 性能工具额外参数 | |
+
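+将上表各示例合并起来,可以得到一个示意性的data.config片段(应用、路径与版本均为示例,请按实际环境修改):
+
+```
+[SERVER]
+11.11.11.11
+
+[DOWNLOAD]
+cmake/3.16.4 https://cmake.org/files/v3.16/cmake-3.16.4.tar.gz
+
+[DEPENDENCY]
+./jarvis -install gcc/9.3.1 com
+module use ./software/modulefiles
+module load gcc9
+
+[ENV]
+source env.sh
+
+[APP]
+app_name = CP2K
+build_dir = /home/cp2k-8.2/
+binary_dir = /home/CP2K/cp2k-8.2/bin/
+case_dir = /home/CP2K/cp2k-8.2/benchmarks/QS/
+
+[BUILD]
+make -j 128
+
+[CLEAN]
+make -j 128 clean
+
+[RUN]
+run = mpi
+binary = cp2k.psmp H2O-256.inp
+nodes = 1
+```
+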
+3.一键下载依赖(仅针对无需鉴权的链接,否则需要自行下载)
+
+```
+./jarvis -d
+```
+
+4.安装单个依赖
+
+```
+./jarvis -install [name/version/other] [option]
+```
+
+option支持列表如下所示
+
+| 选项值 | 解释 | 安装目录 |
+| ------------------ | ----------------------------- | ----------------------- |
+| gcc | 使用当前gcc进行编译 | software/libs/gcc |
+| gcc+mpi | 使用当前gcc+当前mpi进行编译 | software/libs/gcc/mpi |
+| clang(bisheng) | 使用当前clang进行编译 | software/libs/clang |
+| clang(bisheng)+mpi | 使用当前clang+当前mpi进行编译 | software/libs/clang/mpi |
+| nvc | 使用当前nvc进行编译 | software/libs/nvc |
+| nvc+mpi | 使用当前nvc+当前mpi进行编译 | software/libs/nvc/mpi |
+| icc | 使用当前icc进行编译 | software/libs/icc |
+| icc+mpi | 使用当前icc+当前mpi进行编译 | software/libs/icc/mpi |
+| com | 安装编译器 | software/compiler |
+| any | 安装工具软件 | software/compiler/utils |
+
+注意,如果软件为MPI通信软件(如hmpi、openmpi),会安装到software/mpi目录
+
+(eg: ./jarvis -install fftw/3.3.8 gcc)
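+
+下面再给出几个选项组合的示意(包名/版本以package目录中实际提供的安装脚本为准):
+
+```
+./jarvis -install bisheng/2.1.0 com       # 安装编译器到 software/compiler
+./jarvis -install openblas/0.3.18 gcc     # 使用当前gcc编译,安装到 software/libs/gcc
+./jarvis -install fftw/3.3.8 gcc+mpi      # 使用当前gcc+当前mpi编译,安装到 software/libs/gcc/mpi
+./jarvis -install openmpi/4.1.2 gcc       # MPI通信软件,安装到 software/mpi
+```
+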
+5.一键安装所有依赖
+
+```
+./jarvis -dp
+```
+
+6.一键生成环境变量(脱离贾维斯运行才需要执行)
+
+```
+./jarvis -e && source ./env.sh
+```
+
+7.一键编译
+
+```
+./jarvis -b
+```
+
+8.一键运行
+
+```
+./jarvis -r
+```
-2.修改data.config(ARM)或者data.X86.config(X86)
+9.一键性能采集(perf)
-3.一键生成环境变量(或者python3 jarvis.py)
+```
+./jarvis -p
+```
-`./jarvis.py -e`
-`source env.sh`
-4.一键编译
+10.一键Kperf性能采集(生成TopDown)
-`./jarvis.py -b`
+```
+./jarvis -kp
+```
-5.一键运行
+11.一键GPU性能采集(需安装nsys、ncu)
-`./jarvis.py -r`
+```
+./jarvis -gp
+```
-6.一键性能采集(perf)
+12.一键输出服务器信息(包括CPU、网卡、OS、内存等)
-`./jarvis.py -p`
+```
+./jarvis -i
+```
-7.一键GPU性能采集(使用nsys、ncu)
+13.一键服务器性能评测(包括MPI、OMP、P2P等)
-`./jarvis.py -gp`
+```
+./jarvis -bench all #运行所有benchmark
+./jarvis -bench mpi #运行MPI benchmark
+./jarvis -bench omp #运行OMP benchmark
+./jarvis -bench gemm #运行矩阵运算 benchmark
+```
-8.一键输出服务器信息(包括CPU、网卡、OS、内存等)
+14.切换配置
-`./jarvis.py -i`
+```
+./jarvis -use XXX.config
+```
-9.切换配置
+15.其它功能查看(网络检测)
-`./jarvis.py -use data.XXX.config`
+```
+./jarvis -h
+```
-10.其它功能查看(多线程下载、网络检测)
-`./jarvis.py -h`
### 欢迎贡献
-贾维斯项目欢迎您的热情参与!
+贾维斯项目欢迎您的专业技能和热情参与!
-小的改进或修复总是值得赞赏的;先从文档开始可能是一个很好的起点。如果您正在考虑对源代码的更大贡献,请先提交issue讨论。
+小的改进或修复总是值得赞赏的;先从文档开始可能是一个很好的起点。如果您正在考虑对源代码的更大贡献,请先提交一个issue或者在maillist进行讨论。
编写代码并不是为贾维斯做出贡献的唯一方法。您还可以:
-- 贡献小而精的工具(小于10MB>)
+- 贡献安装脚本
- 帮助我们测试新的HPC应用
-- 开发教程、演示和其他教育材料
+- 开发教程、演示
- 为我们宣传
- 帮助新的贡献者加入
-请添加OpenEuler SIG微信群了解更多HPC迁移调优知识
+请添加openEuler HPC SIG微信群了解更多HPC迁移调优知识

\ No newline at end of file
diff --git a/benchmark/README.md b/benchmark/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a9c072d60856f80821ded643066be7d129b0e8ec
--- /dev/null
+++ b/benchmark/README.md
@@ -0,0 +1,4 @@
+# benchmark
+- gemm: BLAS (OpenBLAS/KML) and MPI GEMM performance
+- mpi: MPI_Reduce average micro-benchmark
+- omp: OpenMP PI calculation
+- p2p: GPU p2p connectivity and bandwidth check
+
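+A typical (illustrative) way to drive them, assuming a compiler and MPI are already in PATH:
+
+```
+cd gemm && bash run.sh            # OpenBLAS/KML GEMM and MPI block-matmul timing
+cd ../mpi && bash run.sh          # MPI_Reduce average example
+cd ../omp && bash run.sh          # OpenMP PI micro-benchmark
+cd ../p2p && make && ./p2pTest    # GPU P2P latency/bandwidth (CUDA required)
+```
+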
diff --git a/benchmark/gemm/Makefile b/benchmark/gemm/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..b41aae52cf40d2f9ad7a9d1680a9c082b900d15a
--- /dev/null
+++ b/benchmark/gemm/Makefile
@@ -0,0 +1,19 @@
+CC = mpic++
+CCFLAGS = -O2 -fopenmp
+OPENBLAS_PATH = ${JARVIS_LIBS}/gcc9/openblas/0.3.18
+OPENBLAS_INC = -I ${OPENBLAS_PATH}/include
+OPENBLAS_LDFLAGS = -L ${OPENBLAS_PATH}/lib -lopenblas
+
+KML_PATH = /usr/local/kml
+KML_INC = -I ${KML_PATH}/include
+KML_LDFLAGS = -L ${KML_PATH}/lib/kblas/omp -lkblas
+all: gemm
+
+gemm: gemm.cpp
+ ${CC} ${CCFLAGS} ${OPENBLAS_INC} gemm.cpp -o gemm ${OPENBLAS_LDFLAGS}
+
+gemm-kml: gemm.cpp
+ ${CC} -DUSE_KML ${CCFLAGS} ${KML_INC} gemm.cpp -o gemm-kml ${KML_LDFLAGS}
+
+clean:
+ rm -f gemm gemm-kml
diff --git a/benchmark/gemm/gemm.cpp b/benchmark/gemm/gemm.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..53b0974cd0897137db3f9de1319189abb9fad9a6
--- /dev/null
+++ b/benchmark/gemm/gemm.cpp
@@ -0,0 +1,224 @@
+#include <iostream>
+#include <cstdlib>
+#include <cmath>
+#include <sys/time.h>
+#include "mpi.h"
+#ifdef USE_KML
+ #include "kblas.h"
+#else
+  #include <cblas.h>
+#endif
+using namespace std;
+
+void randMat(int rows, int cols, float *&Mat) {
+ Mat = new float[rows * cols];
+ for (int i = 0; i < rows; i++)
+ for (int j = 0; j < cols; j++)
+ Mat[i * cols + j] = 1.0;
+}
+
+void openmp_sgemm(int m, int n, int k, float *&leftMat, float *&rightMat,
+ float *&resultMat) {
+ // rightMat is transposed
+#pragma omp parallel for
+ for (int row = 0; row < m; row++) {
+ for (int col = 0; col < k; col++) {
+ resultMat[row * k + col] = 0.0;
+ for (int i = 0; i < n; i++) {
+ resultMat[row * k + col] +=
+ leftMat[row * n + i] * rightMat[col * n + i];
+ }
+ }
+ }
+ return;
+}
+
+void blas_sgemm(int m, int n, int k, float *&leftMat, float *&rightMat,
+ float *&resultMat) {
+ cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans, m, k, n, 1.0, leftMat,
+ n, rightMat, n, 0.0, resultMat, k);
+}
+
+void mpi_sgemm(int m, int n, int k, float *&leftMat, float *&rightMat,
+ float *&resultMat, int rank, int worldsize, bool blas) {
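+  // Decomposition sketch: rank 0 holds the full matrices, transposes rightMat
+  // once, then sends row blocks of leftMat and (transposed) column blocks of
+  // rightMat to a rowBlock x colBlock grid of ranks; every rank multiplies its
+  // own tile (OpenMP or BLAS) and rank 0 gathers the tiles back into resultMat.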
+ int rowBlock = sqrt(worldsize);
+ if (rowBlock * rowBlock > worldsize)
+ rowBlock -= 1;
+ int colBlock = rowBlock;
+
+ int rowStride = m / rowBlock;
+ int colStride = k / colBlock;
+
+  worldsize = rowBlock * colBlock; // we abandon the extra processes,
+                                   // so it is best to run with a square number of processes.
+
+ float *res;
+
+ if (rank == 0) {
+ float *buf = new float[k * n];
+ // transpose right Mat
+ for (int r = 0; r < n; r++) {
+ for (int c = 0; c < k; c++) {
+ buf[c * n + r] = rightMat[r * k + c];
+ }
+ }
+
+ for (int r = 0; r < k; r++) {
+ for (int c = 0; c < n; c++) {
+ rightMat[r * n + c] = buf[r * n + c];
+ }
+ }
+
+ MPI_Request sendRequest[2 * worldsize];
+ MPI_Status status[2 * worldsize];
+ for (int rowB = 0; rowB < rowBlock; rowB++) {
+ for (int colB = 0; colB < colBlock; colB++) {
+ rowStride = (rowB == rowBlock - 1) ? m - (rowBlock - 1) * (m / rowBlock)
+ : m / rowBlock;
+ colStride = (colB == colBlock - 1) ? k - (colBlock - 1) * (k / colBlock)
+ : k / colBlock;
+ int sendto = rowB * colBlock + colB;
+ if (sendto == 0)
+ continue;
+ MPI_Isend(&leftMat[rowB * (m / rowBlock) * n], rowStride * n, MPI_FLOAT,
+ sendto, 0, MPI_COMM_WORLD, &sendRequest[sendto]);
+ MPI_Isend(&rightMat[colB * (k / colBlock) * n], colStride * n,
+ MPI_FLOAT, sendto, 1, MPI_COMM_WORLD,
+ &sendRequest[sendto + worldsize]);
+ }
+ }
+ for (int rowB = 0; rowB < rowBlock; rowB++) {
+ for (int colB = 0; colB < colBlock; colB++) {
+ int recvfrom = rowB * colBlock + colB;
+ if (recvfrom == 0)
+ continue;
+ MPI_Wait(&sendRequest[recvfrom], &status[recvfrom]);
+ MPI_Wait(&sendRequest[recvfrom + worldsize],
+ &status[recvfrom + worldsize]);
+ }
+ }
+ res = new float[(m / rowBlock) * (k / colBlock)];
+ } else {
+ if (rank < worldsize) {
+ MPI_Status status[2];
+ rowStride = ((rank / colBlock) == rowBlock - 1)
+ ? m - (rowBlock - 1) * (m / rowBlock)
+ : m / rowBlock;
+ colStride = ((rank % colBlock) == colBlock - 1)
+ ? k - (colBlock - 1) * (k / colBlock)
+ : k / colBlock;
+ if (rank != 0) {
+ leftMat = new float[rowStride * n];
+ rightMat = new float[colStride * n];
+ }
+ if (rank != 0) {
+ MPI_Recv(leftMat, rowStride * n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
+ &status[0]);
+ MPI_Recv(rightMat, colStride * n, MPI_FLOAT, 0, 1, MPI_COMM_WORLD,
+ &status[1]);
+ }
+ res = new float[rowStride * colStride];
+ }
+ }
+ MPI_Barrier(MPI_COMM_WORLD);
+
+ if (rank < worldsize) {
+ rowStride = ((rank / colBlock) == rowBlock - 1)
+ ? m - (rowBlock - 1) * (m / rowBlock)
+ : m / rowBlock;
+ colStride = ((rank % colBlock) == colBlock - 1)
+ ? k - (colBlock - 1) * (k / colBlock)
+ : k / colBlock;
+ if (!blas)
+ openmp_sgemm(rowStride, n, colStride, leftMat, rightMat, res);
+ else
+ blas_sgemm(rowStride, n, colStride, leftMat, rightMat, res);
+ }
+ MPI_Barrier(MPI_COMM_WORLD);
+
+ if (rank == 0) {
+ MPI_Status status;
+ float *buf = new float[(m - (rowBlock - 1) * (m / rowBlock)) *
+ (k - (colBlock - 1) * (k / colBlock))];
+ float *temp_res;
+ for (int rowB = 0; rowB < rowBlock; rowB++) {
+ for (int colB = 0; colB < colBlock; colB++) {
+ rowStride = (rowB == rowBlock - 1) ? m - (rowBlock - 1) * (m / rowBlock)
+ : m / rowBlock;
+ colStride = (colB == colBlock - 1) ? k - (colBlock - 1) * (k / colBlock)
+ : k / colBlock;
+ int recvfrom = rowB * colBlock + colB;
+ if (recvfrom != 0) {
+ temp_res = buf;
+ MPI_Recv(temp_res, rowStride * colStride, MPI_FLOAT, recvfrom, 0,
+ MPI_COMM_WORLD, &status);
+ } else {
+ temp_res = res;
+ }
+ for (int r = 0; r < rowStride; r++)
+ for (int c = 0; c < colStride; c++)
+ resultMat[rowB * (m / rowBlock) * k + colB * (k / colBlock) +
+ r * k + c] = temp_res[r * colStride + c];
+ }
+ }
+ } else {
+ rowStride = ((rank / colBlock) == rowBlock - 1)
+ ? m - (rowBlock - 1) * (m / rowBlock)
+ : m / rowBlock;
+ colStride = ((rank % colBlock) == colBlock - 1)
+ ? k - (colBlock - 1) * (k / colBlock)
+ : k / colBlock;
+ if (rank < worldsize)
+ MPI_Send(res, rowStride * colStride, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
+ }
+ MPI_Barrier(MPI_COMM_WORLD);
+
+ return;
+}
+
+int main(int argc, char *argv[]) {
+ if (argc != 5) {
+ cout << "Usage: " << argv[0] << " M N K use-blas\n";
+ exit(-1);
+ }
+
+ int rank;
+ int worldSize;
+ MPI_Init(&argc, &argv);
+
+ MPI_Comm_size(MPI_COMM_WORLD, &worldSize);
+ MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+
+ int m = atoi(argv[1]);
+ int n = atoi(argv[2]);
+ int k = atoi(argv[3]);
+ int blas = atoi(argv[4]);
+
+ float *leftMat, *rightMat, *resMat;
+
+ struct timeval start, stop;
+ if (rank == 0) {
+ randMat(m, n, leftMat);
+ randMat(n, k, rightMat);
+ randMat(m, k, resMat);
+ }
+ gettimeofday(&start, NULL);
+ mpi_sgemm(m, n, k, leftMat, rightMat, resMat, rank, worldSize, blas);
+ gettimeofday(&stop, NULL);
+ if (rank == 0) {
+ cout << "mpi matmul: "
+ << (stop.tv_sec - start.tv_sec) * 1000.0 +
+ (stop.tv_usec - start.tv_usec) / 1000.0
+ << " ms" << endl;
+
+ for (int i = 0; i < m; i++) {
+ for (int j = 0; j < k; j++)
+ if (int(resMat[i * k + j]) != n) {
+ cout << resMat[i * k + j] << "error\n";
+ exit(-1);
+ }
+ }
+ }
+ MPI_Finalize();
+}
diff --git a/benchmark/gemm/run.sh b/benchmark/gemm/run.sh
new file mode 100644
index 0000000000000000000000000000000000000000..4452ca479124203b951bb9e480b789f0baa88287
--- /dev/null
+++ b/benchmark/gemm/run.sh
@@ -0,0 +1,30 @@
+flags="**************"
+armRun(){
+ mpi_cmd="mpirun --allow-run-as-root -x OMP_NUM_THREADS=4 -mca btl ^vader,tcp,openib,uct -np 16"
+ echo "${flags}benching openblas gemm, best 405ms${flags}"
+ make
+ ${mpi_cmd} ./gemm 4024 4024 4024 1
+ echo "${flags}benching kml gemm, best 216ms${flags}"
+ make gemm-kml
+ ${mpi_cmd} ./gemm-kml 4024 4024 4024 1
+ echo "${flags}benching MPI perf, best 1855ms${flags}"
+ ${mpi_cmd} ./gemm 4024 4024 4024 0
+}
+
+x86Run(){
+ mpi_cmd="mpirun -genv OMP_NUM_THREADS=4 -n 16"
+ echo "${flags}benching openblas gemm, best 405ms${flags}"
+ make
+ ${mpi_cmd} ./gemm 4024 4024 4024 1
+ echo "${flags}benching MKL gemm, best 216ms${flags}"
+ make gemm-mkl
+ ${mpi_cmd} ./gemm-mkl 4024 4024 4024 1
+ echo "${flags}benching MPI perf, best 1855ms${flags}"
+ ${mpi_cmd} ./gemm 4024 4024 4024 0
+}
+# check Arch
+if [ x$(arch) = xaarch64 ];then
+ armRun
+else
+ x86Run
+fi
\ No newline at end of file
diff --git a/benchmark/mpi/reduce_avg.c b/benchmark/mpi/reduce_avg.c
new file mode 100644
index 0000000000000000000000000000000000000000..05a576be7505a36a5a0ff7a4ee575a243752c3d1
--- /dev/null
+++ b/benchmark/mpi/reduce_avg.c
@@ -0,0 +1,74 @@
+// Author: Wes Kendall
+// Copyright 2013 www.mpitutorial.com
+// This code is provided freely with the tutorials on mpitutorial.com. Feel
+// free to modify it for your own use. Any distribution of the code must
+// either provide a link to www.mpitutorial.com or keep this header intact.
+//
+// Program that computes the average of an array of elements in parallel using
+// MPI_Reduce.
+//
+#include <stdio.h>
+#include <stdlib.h>
+#include <time.h>
+#include <assert.h>
+#include <mpi.h>
+
+// Creates an array of random numbers. Each number has a value from 0 - 1
+float *create_rand_nums(int num_elements) {
+ float *rand_nums = (float *)malloc(sizeof(float) * num_elements);
+ assert(rand_nums != NULL);
+ int i;
+ for (i = 0; i < num_elements; i++) {
+ rand_nums[i] = (rand() / (float)RAND_MAX);
+ }
+ return rand_nums;
+}
+
+int main(int argc, char** argv) {
+ if (argc != 2) {
+ fprintf(stderr, "Usage: avg num_elements_per_proc\n");
+ exit(1);
+ }
+
+ int num_elements_per_proc = atoi(argv[1]);
+
+ MPI_Init(NULL, NULL);
+
+ int world_rank;
+ MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
+ int world_size;
+ MPI_Comm_size(MPI_COMM_WORLD, &world_size);
+
+ // Create a random array of elements on all processes.
+ srand(time(NULL)*world_rank); // Seed the random number generator to get different results each time for each processor
+ float *rand_nums = NULL;
+ rand_nums = create_rand_nums(num_elements_per_proc);
+
+ // Sum the numbers locally
+ float local_sum = 0;
+ int i;
+ for (i = 0; i < num_elements_per_proc; i++) {
+ local_sum += rand_nums[i];
+ }
+
+ // Print the random numbers on each process
+ printf("Local sum for process %d - %f, avg = %f\n",
+ world_rank, local_sum, local_sum / num_elements_per_proc);
+
+ // Reduce all of the local sums into the global sum
+ float global_sum;
+ MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0,
+ MPI_COMM_WORLD);
+
+ // Print the result
+ if (world_rank == 0) {
+ printf("Total sum = %f, avg = %f\n", global_sum,
+ global_sum / (world_size * num_elements_per_proc));
+ }
+
+ // Clean up
+ free(rand_nums);
+
+ MPI_Barrier(MPI_COMM_WORLD);
+ MPI_Finalize();
+}
\ No newline at end of file
diff --git a/benchmark/mpi/run.sh b/benchmark/mpi/run.sh
new file mode 100644
index 0000000000000000000000000000000000000000..265b300cb6673a879bd2fb961545f7b925b83da2
--- /dev/null
+++ b/benchmark/mpi/run.sh
@@ -0,0 +1,2 @@
+mpicc reduce_avg.c -o avg
+mpirun -n 2 --allow-run-as-root ./avg 2
\ No newline at end of file
diff --git a/benchmark/omp/Makefile b/benchmark/omp/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..254eec75896942942a0232b7a51b80a26a7687e2
--- /dev/null
+++ b/benchmark/omp/Makefile
@@ -0,0 +1,17 @@
+CC = gcc
+CCFLAGS = -fopenmp -O2
+NVCFLAGS =
+
+all: caclPI
+
+caclPI: caclPI.cpp
+ ${CC} ${CCFLAGS} caclPI.cpp -o caclPI
+
+gramSchmidt_gpu: gramSchmidt_gpu.c
+ nvc -mp=gpu -Minfo=mp -lm gramSchmidt_gpu.c -o gramSchmidt_gpu
+
+gramSchmidt_gpu_f90: gramSchmidt_gpu.F90
+ nvfortran -mp=gpu -Minfo=mp -lm gramSchmidt_gpu.F90 -o gramSchmidt_gpu_f90
+
+clean:
+ rm -rf caclPI gramSchmidt_gpu gramSchmidt_gpu_f90
diff --git a/benchmark/omp/caclPI.cpp b/benchmark/omp/caclPI.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..b68de200a488065f1e258f36d4f79dceceace29d
--- /dev/null
+++ b/benchmark/omp/caclPI.cpp
@@ -0,0 +1,24 @@
+
+#include <stdio.h>
+#include <omp.h>
+
+#define NUM_THREADS 32
+static long num_steps = 100000000;
+
+int main ()
+{
+ int i;
+ double x, pi, sum = 0.0, step, start_time,end_time;
+ step = 1.0/(double) num_steps;
+ omp_set_num_threads(NUM_THREADS);
+ start_time=omp_get_wtime();
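+ // midpoint-rule integration of 4/(1+x*x) over [0,1]; the exact value of this integral is PI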
+ #pragma omp parallel for reduction(+ : sum) private(x)
+ for (i=1;i<= num_steps; i++){
+ x = (i-0.5)*step;
+ sum = sum + 4.0/(1.0+x*x);
+ }
+ pi = step * sum;
+ end_time=omp_get_wtime();
+ printf("Pi = %16.15f\n Running time:%.3f s \n", pi, end_time - start_time);
+ return 0;
+}
diff --git a/benchmark/omp/gramSchmidt_gpu.F90 b/benchmark/omp/gramSchmidt_gpu.F90
new file mode 100644
index 0000000000000000000000000000000000000000..aa1afd6d6d5abc1de4799fd4e576671d34b6c0d1
--- /dev/null
+++ b/benchmark/omp/gramSchmidt_gpu.F90
@@ -0,0 +1,34 @@
+! @@name: target_data.3f
+! @@type: F-free
+! @@compilable: yes
+! @@linkable: no
+! @@expect: success
+! @@version: omp_4.0
+subroutine gramSchmidt(Q,rows,cols)
+ integer :: rows,cols, i,k
+ double precision :: Q(rows,cols), tmp
+ !$omp target data map(Q)
+ do k=1,cols
+ tmp = 0.0d0
+ !$omp target map(tofrom: tmp)
+ !$omp parallel do reduction(+:tmp)
+ do i=1,rows
+ tmp = tmp + (Q(i,k) * Q(i,k))
+ end do
+ !$omp end target
+
+ tmp = 1.0d0/sqrt(tmp)
+
+ !$omp target
+ !$omp parallel do
+ do i=1,rows
+ Q(i,k) = Q(i,k)*tmp
+ enddo
+ !$omp end target
+ end do
+ !$omp end target data
+end subroutine
+
+! Note: The variable tmp is now mapped with tofrom, for correct
+! execution with 4.5 (and pre-4.5) compliant compilers. See Devices Intro.
+
\ No newline at end of file
diff --git a/benchmark/omp/gramSchmidt_gpu.c b/benchmark/omp/gramSchmidt_gpu.c
new file mode 100644
index 0000000000000000000000000000000000000000..9cae585ffa7df5c1a8aeca46504c2bd0ddb88d38
--- /dev/null
+++ b/benchmark/omp/gramSchmidt_gpu.c
@@ -0,0 +1,59 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <math.h>
+
+#define COLS 1000
+#define ROWS 1000
+#define FLOAT_T float
+
+FLOAT_T *getFinput(int scale)
+{
+ FLOAT_T *input;
+ if ((input = (FLOAT_T *)malloc(sizeof(FLOAT_T) * scale)) == NULL)
+ {
+ fprintf(stderr, "Out of Memory!!\n");
+ exit(1);
+ }
+ for (int i = 0; i < scale; i++)
+ {
+ input[i] = ((FLOAT_T)rand() / (FLOAT_T)RAND_MAX) - 0.5;
+ }
+ return input;
+}
+
+FLOAT_T **get2Darr(int M, int N)
+{
+ FLOAT_T **input;
+ input = (FLOAT_T **)malloc(M * sizeof(FLOAT_T *));
+ for (int i = 0; i < M; i++)
+ {
+ input[i] = (FLOAT_T *)malloc(N * sizeof(FLOAT_T));
+ }
+ return input;
+}
+
+void gramSchmidt_gpu(FLOAT_T **Q)
+{
+ int cols = COLS;
+ #pragma omp target data map(Q[0:ROWS][0:cols])
+ for(int k=0; k < cols; k++)
+ {
+ double tmp = 0.0;
+ #pragma omp target map(tofrom: tmp)
+ #pragma omp parallel for reduction(+:tmp)
+ for(int i=0; i < ROWS; i++)
+ tmp += (Q[i][k] * Q[i][k]);
+ tmp = 1/sqrt(tmp);
+ #pragma omp target
+ #pragma omp parallel for
+ for(int i=0; i < ROWS; i++)
+ Q[i][k] *= tmp;
+ }
+}
+
+int main()
+{
+ FLOAT_T **Q = get2Darr(ROWS, COLS);
+ gramSchmidt_gpu(Q);
+    return 0;
+}
diff --git a/benchmark/omp/run.sh b/benchmark/omp/run.sh
new file mode 100644
index 0000000000000000000000000000000000000000..3784d96de306443521a0c6fb52abe11e6cc122f9
--- /dev/null
+++ b/benchmark/omp/run.sh
@@ -0,0 +1,18 @@
+flags="**************"
+armRun(){
+ echo "${flags}benching omp perf, best 0.023ms${flags}"
+ make
+ ./caclPI
+ make gramSchmidt_gpu
+ ./gramSchmidt_gpu
+}
+
+x86Run(){
+ armRun
+}
+# check Arch
+if [ x$(arch) = xaarch64 ];then
+ armRun
+else
+ x86Run
+fi
\ No newline at end of file
diff --git a/benchmark/p2p/Makefile b/benchmark/p2p/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..e9c55f7d0572b79b09fb31340e90e2864ec2e83f
--- /dev/null
+++ b/benchmark/p2p/Makefile
@@ -0,0 +1,67 @@
+##
+ # =====================================================================================
+ #
+ # Filename: Makefile
+ #
+ # Description: This microbenchmark is to obtain the latency & uni/bi-directional
+ # bandwidth for PCI-e, NVLink-V1 in NVIDIA P100 DGX-1 and NVLink-V2 in
+ # V100 DGX-1. Please see our IISWC-18 paper titled "Tartan: Evaluating
+ # Modern GPU Interconnect via a Multi-GPU Benchmark Suite". The
+ # Code is modified from the p2pBandwidthLatencyTest app in
+ # NVIDIA CUDA-SDK. Please follow NVIDIA's EULA for end usage.
+ #
+ # Version: 1.0
+ # Created: 01/24/2018 02:12:31 PM
+ # Revision: none
+ # Compiler: GNU-Make
+ #
+ # Author: Ang Li, PNNL
+ # Website: http://www.angliphd.com
+ #
+ # =====================================================================================
+##
+
+
+################################################################################
+#
+# Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
+#
+# NOTICE TO USER:
+#
+# This source code is subject to NVIDIA ownership rights under U.S. and
+# international Copyright laws.
+#
+# NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
+# CODE FOR ANY PURPOSE. IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
+# IMPLIED WARRANTY OF ANY KIND. NVIDIA DISCLAIMS ALL WARRANTIES WITH
+# REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
+# MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
+# IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
+# OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
+# OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
+# OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE
+# OR PERFORMANCE OF THIS SOURCE CODE.
+#
+# U.S. Government End Users. This source code is a "commercial item" as
+# that term is defined at 48 C.F.R. 2.101 (OCT 1995), consisting of
+# "commercial computer software" and "commercial computer software
+# documentation" as such terms are used in 48 C.F.R. 12.212 (SEPT 1995)
+# and is provided to the U.S. Government only as a commercial end item.
+# Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through
+# 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
+# source code with only those rights set forth herein.
+#
+################################################################################
+#
+# Makefile project only supported on Mac OS X and Linux Platforms)
+#
+################################################################################
+
+include shared.mk
+
+p2pTest: p2pBandwidthLatencyTest.cu
+ $(NVCC) $(NVCC_FLAGS) $^ -o $@
+
+clean:
+ rm -f p2pTest
+
diff --git a/benchmark/p2p/p2pBandwidthLatencyTest.cu b/benchmark/p2p/p2pBandwidthLatencyTest.cu
new file mode 100644
index 0000000000000000000000000000000000000000..e3aaec08c278e74bb47a98b5a474d515f525164a
--- /dev/null
+++ b/benchmark/p2p/p2pBandwidthLatencyTest.cu
@@ -0,0 +1,653 @@
+/*
+ * =====================================================================================
+ *
+ * Filename: p2pBandwidthLatencyTest.cu
+ *
+ * Description: This microbenchmark is to obtain the latency & uni/bi-directional
+ * bandwidth for PCI-e, NVLink-V1 in NVIDIA P100 DGX-1 and NVLink-V2 in
+ * V100 DGX-1. Please see our IISWC-18 paper titled "Tartan: Evaluating
+ * Modern GPU Interconnects via a Multi-GPU Benchmark Suite". The
+ * Code is modified from the p2pBandwidthLatencyTest app in
+ * NVIDIA CUDA-SDK. Please follow NVIDIA's EULA for end usage.
+ *
+ * Version: 1.0
+ * Created: 01/24/2018 02:12:31 PM
+ * Revision: none
+ * Compiler: nvcc
+ *
+ * Author: Ang Li, PNNL
+ * Website: http://www.angliphd.com
+ *
+ * =====================================================================================
+ */
+
+/*
+ * Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
+ *
+ * Please refer to the NVIDIA end user license agreement (EULA) associated
+ * with this source code for terms and conditions that govern your use of
+ * this software. Any use, reproduction, disclosure, or distribution of
+ * this software and related documentation outside the terms of the EULA
+ * is strictly prohibited.
+ *
+ */
+
+#define ASCENDING
+
+#include <cstdio>
+#include <vector>
+
+using namespace std;
+
+const char *sSampleName = "P2P (Peer-to-Peer) GPU Bandwidth Latency Test";
+
+//Macro for checking cuda errors following a cuda launch or api call
+#define cudaCheckError() { \
+ cudaError_t e=cudaGetLastError(); \
+ if(e!=cudaSuccess) { \
+ printf("Cuda failure %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(e)); \
+ exit(EXIT_SUCCESS); \
+ } \
+ }
+__global__ void delay(int * null) {
+ float j=threadIdx.x;
+ for(int i=1;i<10000;i++)
+ j=(j+1)/j;
+
+ if(threadIdx.x == j) null[0] = j;
+}
+
+void checkP2Paccess(int numGPUs)
+{
+ for (int i=0; i buffers(numGPUs);
+ vector start(numGPUs);
+ vector stop(numGPUs);
+
+ for (int d=0; d bandwidthMatrix(numGPUs*numGPUs);
+
+ for (int i=0; i=0; k--)
+#endif
+ {
+ cudaDeviceCanAccessPeer(&src2route,i,k);
+ cudaDeviceCanAccessPeer(&route2dst,k,j);
+ if (src2route && route2dst)
+ {
+ routingnode = k;
+ break;
+ }
+ }
+ cudaDeviceEnablePeerAccess(routingnode,0 );
+ cudaCheckError();
+ cudaSetDevice(routingnode);
+ cudaDeviceEnablePeerAccess(j,0 );
+ cudaSetDevice(i);
+ }
+ }
+
+ cudaDeviceSynchronize();
+ cudaCheckError();
+
+ if (routingrequired)
+ {
+ delay<<<1,1>>>(NULL);
+ cudaEventRecord(start[i]);
+ for (int r=0; r>>(NULL);
+ cudaEventRecord(start[i]);
+
+ for (int r=0; r buffers(numGPUs);
+ vector start(numGPUs);
+ vector stop(numGPUs);
+ vector stream0(numGPUs);
+ vector stream1(numGPUs);
+
+ for (int d=0; d bandwidthMatrix(numGPUs*numGPUs);
+
+ for (int i=0; i=0; k--)
+#endif
+ {
+ cudaDeviceCanAccessPeer(&src2route,i,k);
+ cudaDeviceCanAccessPeer(&route2dst,k,j);
+ if (src2route && route2dst)
+ {
+ routingnode = k;
+ break;
+ }
+ }
+ cudaSetDevice(i);
+ cudaDeviceEnablePeerAccess(routingnode,0 );
+ cudaCheckError();
+ cudaSetDevice(routingnode);
+ cudaDeviceEnablePeerAccess(i,0 );
+ cudaCheckError();
+ cudaDeviceEnablePeerAccess(j,0 );
+ cudaCheckError();
+ cudaSetDevice(j);
+ cudaDeviceEnablePeerAccess(routingnode,0 );
+ cudaSetDevice(i);
+ cudaCheckError();
+ }
+ }
+
+ cudaSetDevice(i);
+ cudaDeviceSynchronize();
+ cudaCheckError();
+
+ if (routingrequired)
+ {
+ delay<<<1,1>>>(NULL);
+ cudaEventRecord(start[i]);
+ for (int r=0; r>>(NULL);
+ cudaEventRecord(start[i]);
+
+ for (int r=0; r buffers(numGPUs);
+ vector start(numGPUs);
+ vector stop(numGPUs);
+
+ for (int d=0; d latencyMatrix(numGPUs*numGPUs);
+
+ for (int i=0; i=0; k--)
+#endif
+ {
+ cudaDeviceCanAccessPeer(&src2route,i,k);
+ cudaDeviceCanAccessPeer(&route2dst,k,j);
+ if (src2route && route2dst)
+ {
+ routingnode = k;
+ break;
+ }
+ }
+ cudaSetDevice(i);
+ cudaDeviceEnablePeerAccess(routingnode,0 );
+ cudaCheckError();
+ cudaSetDevice(routingnode);
+ cudaDeviceEnablePeerAccess(j,0 );
+ cudaCheckError();
+ cudaSetDevice(i);
+ }
+ }
+ cudaDeviceSynchronize();
+ cudaCheckError();
+
+
+ if (routingrequired)
+ {
+ delay<<<1,1>>>(NULL);
+ cudaEventRecord(start[i]);
+
+ for (int r=0; r>>(NULL);
+ cudaEventRecord(start[i]);
+
+ for (int r=0; r%d=>%d,(access:%d,routingrequired:%d\n",i,routingnode,j,access, routingrequired);
+ cudaCheckError();
+ cudaDeviceDisablePeerAccess(routingnode);
+ cudaCheckError();
+ cudaSetDevice(routingnode);
+ cudaDeviceDisablePeerAccess(j);
+ cudaCheckError();
+ cudaSetDevice(i);
+ }
+ }
+ }
+ }
+
+ printf(" D\\D");
+
+ for (int j=0; j
+
+
+
+Software Download:
+
+
+ X86
+ ARM
+ bisheng 2.1.0
+
+
\ No newline at end of file
diff --git a/examples/cuda/Makefile b/examples/cuda/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..81e7d3ad511e20b7aadde1e4e792ff03001ddc33
--- /dev/null
+++ b/examples/cuda/Makefile
@@ -0,0 +1,12 @@
+ARCH=sm_80
+NVCC_FLAGS = -arch=$(ARCH) -O3
+CUDA_DIR = /usr/local/cuda/
+# CUDA compiler
+NVCC = $(CUDA_DIR)/bin/nvcc
+all: cuda
+
+cuda: cuda.cu
+ $(NVCC) $(NVCC_FLAGS) $^ -o $@.o
+
+clean:
+ rm -f cuda.o
\ No newline at end of file
diff --git a/examples/cuda/cuda.cu b/examples/cuda/cuda.cu
new file mode 100644
index 0000000000000000000000000000000000000000..c8aace498be089ce805579e61bff0d288933e59e
--- /dev/null
+++ b/examples/cuda/cuda.cu
@@ -0,0 +1,102 @@
+// nvcc cuda_hello.cu -o hello.o
+#include <stdio.h>
+#define MAX_DEVICE 2
+#define RTERROR(status, s) \
+ if (status != cudaSuccess) \
+ { \
+ printf("%s %s\n", s, cudaGetErrorString(status)); \
+ cudaDeviceReset(); \
+ exit(-1); \
+ }
+
+//HelloFromGPU<<<1, 5>>>();
+__global__ void HelloFromGPU(void)
+{
+ printf("Hello from GPU\n");
+}
+
+int getDeviceCount() {
+ cudaError_t status;
+ int gpuCount = 0;
+ status = cudaGetDeviceCount(&gpuCount);
+ RTERROR(status, "cudaGetDeviceCount failed");
+ if (gpuCount == 0)
+ {
+ printf("No CUDA-capable devices found, exiting.\n");
+ cudaDeviceReset();
+ exit(-1);
+ }
+ return gpuCount;
+}
+
+cudaDeviceProp getProps(int device)
+{
+ cudaDeviceProp deviceProp;
+ cudaGetDeviceProperties(&deviceProp, device);
+ return deviceProp;
+}
+
+void cudaGetSetDevice(){
+ cudaError_t status;
+ int device = 0;
+ status = cudaGetDevice(&device);
+ RTERROR(status, "Error fetching current GPU");
+ status = cudaSetDevice(device);
+ RTERROR(status, "Error setting CUDA device");
+ cudaDeviceSynchronize();
+}
+
+void isSupportP2P(int gpuCount)
+{
+ int uvaOrdinals[MAX_DEVICE];
+ int uvaCount = 0;
+ int i, j;
+ cudaDeviceProp prop;
+ for (i = 0; i < gpuCount; ++i)
+ {
+ cudaGetDeviceProperties(&prop, i);
+ if (prop.unifiedAddressing)
+ {
+ uvaOrdinals[uvaCount] = i;
+ printf(" GPU%d \"%15s\"\n", i, prop.name);
+ uvaCount += 1;
+ }
+ else
+ printf(" GPU%d \"%15s\" NOT UVA capable\n", i, prop.name);
+ }
+ int canAccessPeer_ij, canAccessPeer_ji;
+ for (i = 0; i < uvaCount; ++i)
+ {
+ for (j = i + 1; j < uvaCount; ++j)
+ {
+ cudaDeviceCanAccessPeer(&canAccessPeer_ij, uvaOrdinals[i], uvaOrdinals[j]);
+ cudaDeviceCanAccessPeer(&canAccessPeer_ji, uvaOrdinals[j], uvaOrdinals[i]);
+ if (canAccessPeer_ij * canAccessPeer_ji)
+ {
+ printf(" GPU%d and GPU%d: YES\n", uvaOrdinals[i], uvaOrdinals[j]);
+ }
+ else
+ {
+ printf(" GPU%d and GPU%d: NO\n", uvaOrdinals[i], uvaOrdinals[j]);
+ }
+ }
+ }
+}
+
+int main(void)
+{
+ // get GPU Number
+ int gpuCount = getDeviceCount();
+ printf("gpucount:%d\n", gpuCount);
+ // get SM Number
+ cudaDeviceProp deviceProp = getProps(0);
+ printf("SM number:%d\n", deviceProp.multiProcessorCount);
+ // get Mode info
+ if (deviceProp.computeMode == cudaComputeModeDefault)
+ {
+ printf("GPU is in Compute Mode.\n");
+ }
+ // get P2P support info
+ isSupportP2P(gpuCount);
+ return 0;
+}
diff --git a/examples/false_sharing/Makefile b/examples/false_sharing/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..3039d437ef86aa8994e3a650ebeb4d189f98b195
--- /dev/null
+++ b/examples/false_sharing/Makefile
@@ -0,0 +1,10 @@
+CC = gcc
+LDLIBS = -lnuma -lpthread
+binary = false_sharing.exe
+source = false_sharing_example.c
+.PHONY : clean
+
+$(binary) : $(source)
+ $(CC) $(LDLIBS) -o $@ $<
+clean :
+ -rm $(binary) $(objects)
diff --git a/examples/false_sharing/ReadMe.txt b/examples/false_sharing/ReadMe.txt
new file mode 100644
index 0000000000000000000000000000000000000000..d009b54cae33977c0cbdc73496f00d4b16b0c8ec
--- /dev/null
+++ b/examples/false_sharing/ReadMe.txt
@@ -0,0 +1,35 @@
+install numactl-devel, in order to use numa.h
+1.rpm -ivh numactl-devel-2.0.13-4.ky10.x86_64.rpm
+compile
+2.make
+start perf...
+3.perf c2c record ./false_sharing.exe 2
+start report...
+4.perf c2c report -NN -g -c pid,iaddr --stdio
+ Load Local HITM : 2010 【too High, false_sharing is detected】
+ Load Remote HITM : 1315
+ Load Remote HIT : 0
+ Load Local DRAM : 71
+ Load Remote DRAM : 1881
+ Load MESI State Exclusive : 1881
+ Load MESI State Shared : 71
+ Load LLC Misses : 3267
+ LLC Misses to Local DRAM : 2.2%
+ LLC Misses to Remote DRAM : 57.6%
+ LLC Misses to Remote cache (HIT) : 0.0%
+ LLC Misses to Remote cache (HITM) : 40.3%
+compile no false_sharing code
+7.gcc -g false_sharing_example.c -pthread -lnuma -DNO_FALSE_SHARING -o no_false_sharing.exe
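+(record again before step 8, e.g. "perf c2c record ./no_false_sharing.exe 2", since the report needs data collected from the new binary)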
+8.perf c2c report -NN -g -c pid,iaddr --stdio
+ Load Local HITM : 6【normal, false_sharing is erased】
+ Load Remote HITM : 486
+ Load Remote HIT : 0
+ Load Local DRAM : 1
+ Load Remote DRAM : 498
+ Load MESI State Exclusive : 498
+ Load MESI State Shared : 1
+ Load LLC Misses : 985
+ LLC Misses to Local DRAM : 0.1%
+ LLC Misses to Remote DRAM : 50.6%
+ LLC Misses to Remote cache (HIT) : 0.0%
+ LLC Misses to Remote cache (HITM) : 49.3%
\ No newline at end of file
diff --git a/examples/false_sharing/false_sharing_example.c b/examples/false_sharing/false_sharing_example.c
new file mode 100644
index 0000000000000000000000000000000000000000..900f1ee17f5b0f32a0f812b49864bce037966a21
--- /dev/null
+++ b/examples/false_sharing/false_sharing_example.c
@@ -0,0 +1,268 @@
+/*
+ * This is an example program to show false sharing between
+ * numa nodes.
+ *
+ * It can be compiled two ways:
+ * gcc -g false_sharing_example.c -pthread -lnuma -o false_sharing.exe
+ * gcc -g false_sharing_example.c -pthread -lnuma -DNO_FALSE_SHARING -o no_false_sharing.exe
+ *
+ * The -DNO_FALSE_SHARING macro reduces the false sharing by expanding the shared data
+ * structure into two different cachelines, (and it runs faster).
+ *
+ * The usage is:
+ * ./false_sharing.exe
+ * ./no_false_sharing.exe
+ *
+ * The program will make half the threads writer threads and half reader
+ * threads. It will pin those threads in round-robin format to the
+ * different numa nodes in the system.
+ *
+ * For example, on a system with 4 numa nodes:
+ * ./false_sharing.exe 2
+ * 12165 mticks, reader_thd (thread 6), on node 2 (cpu 144).
+ * 12403 mticks, reader_thd (thread 5), on node 1 (cpu 31).
+ * 12514 mticks, reader_thd (thread 4), on node 0 (cpu 96).
+ * 12703 mticks, reader_thd (thread 7), on node 3 (cpu 170).
+ * 12982 mticks, lock_th (thread 0), on node 0 (cpu 1).
+ * 13018 mticks, lock_th (thread 1), on node 1 (cpu 24).
+ * 13049 mticks, lock_th (thread 3), on node 3 (cpu 169).
+ * 13050 mticks, lock_th (thread 2), on node 2 (cpu 49).
+ *
+ * # ./no_false_sharing.exe 2
+ * 1918 mticks, reader_thd (thread 4), on node 0 (cpu 96).
+ * 2432 mticks, reader_thd (thread 7), on node 3 (cpu 170).
+ * 2468 mticks, reader_thd (thread 6), on node 2 (cpu 146).
+ * 3903 mticks, reader_thd (thread 5), on node 1 (cpu 40).
+ * 7560 mticks, lock_th (thread 0), on node 0 (cpu 1).
+ * 7574 mticks, lock_th (thread 2), on node 2 (cpu 145).
+ * 7602 mticks, lock_th (thread 3), on node 3 (cpu 169).
+ * 7625 mticks, lock_th (thread 1), on node 1 (cpu 24).
+ *
+ */
+
+#define _MULTI_THREADED
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <pthread.h>
+#include <sched.h>
+#include <sys/types.h>
+#include <numa.h>
+
+/*
+ * A thread on each numa node seems to provoke cache misses
+ */
+#define LOOP_CNT (5 * 1024 * 1024)
+
+#if defined(__x86_64__) || defined(__i386__)
+static __inline__ uint64_t rdtsc() {
+ unsigned hi, lo;
+ __asm__ __volatile__ ( "rdtsc" : "=a"(lo), "=d"(hi));
+ return ( (uint64_t)lo) | ( ((uint64_t)hi) << 32);
+}
+
+#elif defined(__aarch64__)
+static __inline__ uint64_t rdtsc(void)
+{
+ uint64_t val;
+
+ /*
+ * According to ARM DDI 0487F.c, from Armv8.0 to Armv8.5 inclusive, the
+ * system counter is at least 56 bits wide; from Armv8.6, the counter
+ * must be 64 bits wide. So the system counter could be less than 64
+ * bits wide and it is attributed with the flag 'cap_user_time_short'
+ * is true.
+ */
+ asm volatile("mrs %0, cntvct_el0" : "=r" (val));
+
+ return val;
+}
+#endif
+
+
+/*
+ * Create a struct where reader fields share a cacheline with the hot lock field.
+ * Compiling with -DNO_FALSE_SHARING inserts padding to avoid that sharing.
+ */
+typedef struct _buf {
+ long lock0;
+ long lock1;
+ long reserved1;
+#if defined(NO_FALSE_SHARING)
+ long pad[5]; // to keep the 'lock*' fields on their own cacheline.
+#else
+ long pad[1]; // to provoke false sharing.
+#endif
+ long reader1;
+ long reader2;
+ long reader3;
+ long reader4;
+} buf __attribute__((aligned (64)));
+
+buf buf1;
+buf buf2;
+
+volatile int wait_to_begin = 1;
+struct thread_data *thread;
+int max_node_num;
+int num_threads;
+char * lock_thd_name = "lock_th";
+char * reader_thd_name = "reader_thd";
+
+#define checkResults(string, val) { \
+ if (val) { \
+ printf("Failed with %d at %s", val, string); \
+ exit(1); \
+ } \
+}
+
+struct thread_data {
+ pthread_t tid;
+ long tix;
+ long node;
+ char *name;
+};
+
+/*
+ * Bind a thread to the specified numa node.
+*/
+void setAffinity(void *parm) {
+ volatile uint64_t rc, j;
+ int node = ((struct thread_data *)parm)->node;
+ char *func_name = ((struct thread_data *)parm)->name;
+
+ numa_run_on_node(node);
+ pthread_setname_np(pthread_self(),func_name);
+}
+
+/*
+ * Thread function to simulate the false sharing.
+ * The "lock" threads will test-n-set the lock field,
+ * while the reader threads will just read the other fields
+ * in the struct.
+ */
+extern void *read_write_func(void *parm) {
+
+ int tix = ((struct thread_data *)parm)->tix;
+ uint64_t start, stop, j;
+ char *thd_name = ((struct thread_data *)parm)->name;
+
+ // Pin each thread to a numa node.
+ setAffinity(parm);
+
+ // Wait for all threads to get created before starting.
+ while(wait_to_begin) ;
+
+ start = rdtsc();
+ for(j=0; j\n", argv[0] );
+ printf( "where \"n\" is the number of threads per node\n");
+ exit(1);
+ }
+
+ if ( numa_available() < 0 )
+ {
+ printf( "NUMA not available\n" );
+ exit(1);
+ }
+
+ int thread_cnt = atoi(argv[1]);
+
+ max_node_num = numa_max_node();
+ if ( max_node_num == 0 )
+ max_node_num = 1;
+ int node_cnt = max_node_num + 1;
+
+ // Use "thread_cnt" threads per node.
+ num_threads = (max_node_num +1) * thread_cnt;
+
+ thread = malloc( sizeof(struct thread_data) * num_threads);
+
+ // Create the first half of threads as lock threads.
+ // Assign each thread a successive round robin node to
+ // be pinned to (later after it gets created.)
+ //
+ for (i=0; i<=(num_threads/2 - 1); i++) {
+ thread[i].tix = i;
+ thread[i].node = i%node_cnt;
+ thread[i].name = lock_thd_name;
+ rc = pthread_create(&thread[i].tid, NULL, read_write_func, &thread[i]);
+ checkResults("pthread_create()\n", rc);
+ usleep(500);
+ }
+
+ // Create the second half of threads as reader threads.
+ // Assign each thread a successive round robin node to
+ // be pinned to (later after it gets created.)
+ //
+ for (i=((num_threads/2)); i<(num_threads); i++) {
+ thread[i].tix = i;
+ thread[i].node = i%node_cnt;
+ thread[i].name = reader_thd_name;
+ rc = pthread_create(&thread[i].tid, NULL, read_write_func, &thread[i]);
+ checkResults("pthread_create()\n", rc);
+ usleep(500);
+ }
+
+ // Sync to let threads start together
+ usleep(500);
+ wait_to_begin = 0;
+
+ for (i=0; i length:
- print(f"You don't have {nodes} nodes, only {length} nodes available!")
- sys.exit()
- if nodes <= 1:
- return
- gen_nodes = '\n'.join(self.avail_ips_list[:nodes])
- print(f"HOSTFILE\n{gen_nodes}\nGENERATED.")
- self.write_file('hostfile', gen_nodes)
-
- # single run
- def run(self):
- print(f"start run {Data.app_name}")
- nodes = int(Data.run_cmd['nodes'])
- self.gen_hostfile(nodes)
- run_cmd = self.hpc_data.get_run_cmd()
- self.exe.exec_raw(run_cmd)
-
- def batch_run(self):
- batch_file = 'Batch_run.sh'
- print(f"start batch run {Data.app_name}")
- batch_content = f'''
-cd {Data.case_dir}
-{Data.batch_cmd}
-'''
- with open(batch_file, 'w') as f:
- f.write(batch_content)
- run_cmd = f'''
-chmod +x {batch_file}
-./{batch_file}
-'''
- self.exe.exec_raw(run_cmd)
-
- def change_yum_repo(self):
- print(f"start yum repo change")
- repo_cmd = '''
-cp ./config/yum/*.repo /etc/yum.repos.d/
-yum clean all
-yum makecache
-'''
- self.exe.exec_raw(repo_cmd)
-
- def get_pid(self):
- #get pid
- pid_cmd = f'pidof {Data.binary_file}'
- result = self.exe.exec_popen(pid_cmd)
- if len(result) == 0:
- print("failed to get pid.")
- sys.exit()
- else:
- pid_list = result[0].split(' ')
- return pid_list[0].strip()
-
- def perf(self):
- print(f"start perf {Data.app_name}")
- #get pid
- pid = self.get_pid()
- #start perf && analysis
- perf_cmd = f'''
-perf record -a -g -p {pid}
-perf report -i perf.data -F period,sample,overhead,symbol,dso,comm -s overhead --percent-limit 0.1% --stdio
-'''
- self.exe.exec_raw(perf_cmd)
-
- def gen_wget_url(self, out_dir='./downloads', url=''):
- head = "wget --no-check-certificate"
- out_para = "-P"
- if not os.path.exists(out_dir):
- os.makedirs(out_dir)
- download_url = f'{head} {out_para} {out_dir} {url}'
- return download_url
-
- def download(self):
- print(f"start download")
- for url in self.download_list:
- download_url = self.gen_wget_url(url=url)
- os.popen(download_url)
-
- def get_arch(self):
- arch = 'arm'
- if not self.isARM:
- arch = 'X86'
- return arch
-
- def get_cur_time(self):
- return re.sub(' |:', '-', self.tool.get_time_stamp())
-
- def gpu_perf(self):
- print(f"start gpu perf")
- run_cmd = self.hpc_data.get_run()
- gperf_cmd = f'''
-cd {Data.case_dir}
-nsys profile -y 5s -d 100s -o nsys-{self.get_arch()}-{self.get_cur_time()} {run_cmd}
- '''
- self.exe.exec_raw(gperf_cmd)
-
- def ncu_perf(self, kernel):
- print(f"start ncu perf")
- run_cmd = self.hpc_data.get_run()
- ncu_cmd = f'''
- cd {Data.case_dir}
- ncu --export ncu-{self.get_arch()}-{self.get_cur_time()} --import-source=yes --set full --kernel-name {kernel} --launch-skip 1735 --launch-count 1 {run_cmd}
- '''
- self.exe.exec_raw(ncu_cmd)
-
- def switch_config(self, config_file):
- print(f"Switch config file to {config_file}")
- with open(Data.meta_file, 'w') as f:
- f.write(config_file.strip())
- print("Successfully switched.")
-
- def main(self):
- if self.args.version:
- print("V1.0")
-
- if self.args.info:
- self.get_machine_info()
-
- if self.args.env:
- self.env()
-
- if self.args.clean:
- self.clean()
-
- if self.args.build:
- self.build()
-
- if self.args.run:
- self.run()
-
- if self.args.perf:
- self.perf()
-
- if self.args.rbatch:
- self.batch_run()
-
- if self.args.download:
- self.download()
-
- if self.args.gpuperf:
- self.gpu_perf()
-
- if self.args.ncuperf:
- self.ncu_perf(self.args.ncuperf[0])
-
- if self.args.use:
- self.switch_config(self.args.use[0])
-
- if self.args.network:
- self.check_network()
-
- if self.args.yum:
- self.change_yum_repo()
-
- data_list = self.args.compare
- if data_list and len(data_list) == 2:
- print(f"start compare {Data.app_name}")
- self.compare(data_list[0], data_list[1])
-
-if __name__ == '__main__':
- HPCRunner().main()
diff --git a/package/bisheng/1.3.3/install.sh b/package/bisheng/1.3.3/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..118c96b8b082dbd679d71042ae02df4bccd876ea
--- /dev/null
+++ b/package/bisheng/1.3.3/install.sh
@@ -0,0 +1,5 @@
+#!/bin/bash
+#download bisheng-compiler-1.3.3-aarch64-linux.tar.gz from https://mirrors.huaweicloud.com/kunpeng/archive/compiler/bisheng_compiler/
+set -e
+cd ${JARVIS_TMP}
+tar xzvf ${JARVIS_DOWNLOAD}/bisheng-compiler-1.3.3-aarch64-linux.tar.gz -C $1 --strip-components=1
\ No newline at end of file
diff --git a/package/bisheng/2.1.0/install.sh b/package/bisheng/2.1.0/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..717c1e1931552d3b44b27886383823ea757884d4
--- /dev/null
+++ b/package/bisheng/2.1.0/install.sh
@@ -0,0 +1,6 @@
+#!/bin/bash
+#download from https://mirrors.huaweicloud.com/kunpeng/archive/compiler/bisheng_compiler/bisheng-compiler-2.1.0-aarch64-linux.tar.gz
+set -e
+cd ${JARVIS_TMP}
+yum -y install libatomic libstdc++ libstdc++-devel
+tar xzvf ${JARVIS_DOWNLOAD}/bisheng-compiler-2.1.0-aarch64-linux.tar.gz -C $1 --strip-components=1
\ No newline at end of file
diff --git a/package/boost/1.72.0/install.sh b/package/boost/1.72.0/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..d95b972fb1aba8b46c13fe644acbdb7b754059e5
--- /dev/null
+++ b/package/boost/1.72.0/install.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/boost_1_72_0.tar.gz
+cd boost_1_72_0
+./bootstrap.sh
+./b2 install --prefix=$1
\ No newline at end of file
diff --git a/package/cmake/3.20.5/install.sh b/package/cmake/3.20.5/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..fb01ef8d0b6c2904f4b859ffa3d6bb0a719d6add
--- /dev/null
+++ b/package/cmake/3.20.5/install.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/cmake-3.20.5-linux-aarch64.tar.gz -C $1 --strip-components=1
\ No newline at end of file
diff --git a/package/fftw/3.3.10/install.sh b/package/fftw/3.3.10/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..d732a7178dc63acb02e3b52415c86737fc976a0f
--- /dev/null
+++ b/package/fftw/3.3.10/install.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+rm -rf fftw-3.3.10
+tar -xvf ${JARVIS_DOWNLOAD}/fftw-3.3.10.tar.gz
+cd fftw-3.3.10
+./configure --prefix=$1 MPICC=mpicc --enable-shared --enable-threads --enable-openmp --enable-mpi
+make -j install
\ No newline at end of file
diff --git a/package/fftw/3.3.8/install.sh b/package/fftw/3.3.8/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..df0242ae8d092bbf939326aa64b2252cd9a50485
--- /dev/null
+++ b/package/fftw/3.3.8/install.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/fftw-3.3.8.tar.gz
+cd fftw-3.3.8
+./configure --prefix=$1 MPICC=mpicc --enable-shared --enable-threads --enable-openmp --enable-mpi
+make -j install
\ No newline at end of file
diff --git a/package/gcc/9.3.1/install.sh b/package/gcc/9.3.1/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..c4ac9f88adb8a095ec8536fa73fa38d4757c8132
--- /dev/null
+++ b/package/gcc/9.3.1/install.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+set -e
+cd ${JARVIS_TMP}
+tar -xzvf ${JARVIS_DOWNLOAD}/gcc-9.3.1-2021.03-aarch64-linux.tar.gz -C $1 --strip-components=1
\ No newline at end of file
diff --git a/package/gmp/6.2.0/install.sh b/package/gmp/6.2.0/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..650efaa20e84ad399ec41d42f714e89060effc8a
--- /dev/null
+++ b/package/gmp/6.2.0/install.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/gmp-6.2.0.tar.xz
+cd gmp-6.2.0
+./configure --prefix=$1
+make -j
+make install
\ No newline at end of file
diff --git a/package/gsl/2.6/install.sh b/package/gsl/2.6/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..40948323997731d10268d05df597e81f51d39599
--- /dev/null
+++ b/package/gsl/2.6/install.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/gsl-2.6.tar.gz
+cd gsl-2.6
+./configure --prefix=$1
+make -j
+make install
diff --git a/package/hmpi/1.1.0/gcc/install.sh b/package/hmpi/1.1.0/gcc/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..254a5d9e3d8a84c33970a2eb70a1e7c395265068
--- /dev/null
+++ b/package/hmpi/1.1.0/gcc/install.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/Hyper-MPI_1.1.0_aarch64_CentOS7.6_GCC9.3_MLNX-OFED4.9.tar.gz -C $1 --strip-components=1
\ No newline at end of file
diff --git a/package/hmpi/1.1.1/install.sh b/package/hmpi/1.1.1/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..0a1bd108c7b5fd9bd3a40d0fcb29e516ab4e1a0f
--- /dev/null
+++ b/package/hmpi/1.1.1/install.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+yum install -y perl-Data-Dumper autoconf automake libtool binutils
+rm -rf hmpi-1.1.1-huawei hucx-1.1.1-huawei xucg-1.1.1-huawei
+unzip ${JARVIS_DOWNLOAD}/hucx-1.1.1-huawei.zip
+unzip ${JARVIS_DOWNLOAD}/xucg-1.1.1-huawei.zip
+unzip ${JARVIS_DOWNLOAD}/hmpi-1.1.1-huawei.zip
+\cp -rf xucg-1.1.1-huawei/* hucx-1.1.1-huawei/src/ucg/
+sleep 3
+cd hucx-1.1.1-huawei
+./autogen.sh
+./contrib/configure-opt --prefix=$1/hucx CFLAGS="-DHAVE___CLEAR_CACHE=1" --disable-numa
+for file in `find . -name Makefile`;do sed -i "s/-Werror//g" $file;done
+for file in `find . -name Makefile`;do sed -i "s/-implicit-function-declaration//g" $file;done
+make -j64
+make install
+cd ../hmpi-1.1.1-huawei
+./autogen.pl
+./configure --prefix=$1 --with-platform=contrib/platform/mellanox/optimized --enable-mpi1-compatibility --with-ucx=$1/hucx
+make -j64
+make install
diff --git a/package/hmpi/FAQ.md b/package/hmpi/FAQ.md
new file mode 100644
index 0000000000000000000000000000000000000000..f4b848f92c6ccfd4cfb2ab1eb5b96871745da14b
--- /dev/null
+++ b/package/hmpi/FAQ.md
@@ -0,0 +1,7 @@
+Q:hucx/src/ucs/arch/aarch64/cpu.h:259:20:error: redefinition of 'ucs_arch_clear_cache'
+
+A:报错原因为该函数在其他地方已经被声明过了,无需重复声明, 应将src/ucs/arch/aarch64/cpu.h 中位于259–271行的函数注释或者删除掉
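+
+一个示意性的处理方式(行号请以实际使用的hucx源码为准):
+
+```
+sed -i '259,271 s|^|// |' src/ucs/arch/aarch64/cpu.h
+```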
+
+Q: builtin.c: 969:21: error: comparison of array 'builtin_op->steps' not equal to a null pointer is always true
+
+A: builtin_op->steps不可能为空,该判断多余,直接删除即可
\ No newline at end of file
diff --git a/package/kgcc/10.3.1/install.sh b/package/kgcc/10.3.1/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..79fe13a6fed626317b220dfabdbfdec96336d7c3
--- /dev/null
+++ b/package/kgcc/10.3.1/install.sh
@@ -0,0 +1,5 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xzvf ${JARVIS_DOWNLOAD}/gcc-10.3.1-2021.09-aarch64-linux.tar.gz -C $1 --strip-components=1
\ No newline at end of file
diff --git a/package/kgcc/9.3.1/install.sh b/package/kgcc/9.3.1/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..300c534ea7bcf1fc720a4d38c18ea662bb9e4663
--- /dev/null
+++ b/package/kgcc/9.3.1/install.sh
@@ -0,0 +1,5 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xzvf ${JARVIS_DOWNLOAD}/gcc-9.3.1-2021.03-aarch64-linux.tar.gz -C $1 --strip-components=1
\ No newline at end of file
diff --git a/package/kml/1.4.0/bisheng/install.sh b/package/kml/1.4.0/bisheng/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..1eb7fc187bf7bc69e20e6f1125ef43c19789965e
--- /dev/null
+++ b/package/kml/1.4.0/bisheng/install.sh
@@ -0,0 +1,52 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+rpm -e boostkit-kml
+rpm --force --nodeps -ivh ${JARVIS_ROOT}/package/kml/1.4.0/bisheng/*.rpm
+# generate full lapack
+netlib=${JARVIS_DOWNLOAD}/lapack-3.9.1.tar.gz
+klapack=/usr/local/kml/lib/libklapack.a
+kservice=/usr/local/kml/lib/libkservice.a
+echo $netlib
+echo $klapack
+
+# build netlib lapack
+rm -rf netlib
+mkdir netlib
+cd netlib
+tar zxvf $netlib
+mkdir build
+cd build
+cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_POSITION_INDEPENDENT_CODE=ON ../lapack-3.9.1
+make -j
+cd ../..
+
+cp netlib/build/lib/liblapack.a liblapack_adapt.a
+
+# get symbols defined both in klapack and netlib lapack
+nm -g liblapack_adapt.a | grep 'T ' | grep -oP '\K\w+(?=_$)' | sort | uniq > netlib.sym
+nm -g $klapack | grep 'T ' | grep -oP '\K\w+(?=_$)' | sort | uniq > klapack.sym
+comm -12 klapack.sym netlib.sym > comm.sym
+
+objcopy -W dsecnd_ -W second_ liblapack_adapt.a
+
+# add _netlib_ postfix to symbols in liblapack_adapt.a (e.g. dgetrf_netlib_)
+while read sym; do \
+ if ! nm liblapack_adapt.a | grep -qe " T ${sym}_\$"; then \
+ continue; \
+ fi; \
+ ar x liblapack_adapt.a $sym.f.o; \
+ mv $sym.f.o ${sym}_netlib.f.o; \
+ objcopy --redefine-sym ${sym}_=${sym}_netlib_ ${sym}_netlib.f.o; \
+ ar d liblapack_adapt.a ${sym}.f.o; \
+ ar ru liblapack_adapt.a ${sym}_netlib.f.o; \
+ rm ${sym}_netlib.f.o; \
+done < comm.sym
+
+# (optional) build a full lapack shared library
+clang -o libklapack_full.so -shared -fPIC -Wl,--whole-archive $klapack liblapack_adapt.a $kservice -Wl,--no-whole-archive -fopenmp -lpthread -lgfortran -lm
+
+\cp libklapack_full.so /usr/local/kml/lib/
+echo "Generated liblapack_adapt.a and libklapack_full.so"
+exit 0
\ No newline at end of file
diff --git a/package/kml/1.4.0/gcc/install.sh b/package/kml/1.4.0/gcc/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..084bbef519643fa66dd02980336af8ad07cbc617
--- /dev/null
+++ b/package/kml/1.4.0/gcc/install.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+rpm -e boostkit-kml
+rpm --force --nodeps -ivh ${JARVIS_ROOT}/package/kml/1.4.0/gcc/*.rpm
+cp -rf ${JARVIS_ROOT}/package/kml/1.4.0/gcc/libklapack_full.so /usr/local/kml/lib
\ No newline at end of file
diff --git a/package/lapack/3.8.0/install.sh b/package/lapack/3.8.0/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..dc4942aa0ec363708b714490d5897f070c314dc3
--- /dev/null
+++ b/package/lapack/3.8.0/install.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/lapack-3.8.0.tgz
+cd lapack-3.8.0
+cp make.inc.example make.inc
+make -j
+mkdir $1/lib/
+cp *.a $1/lib/
\ No newline at end of file
diff --git a/package/libint/2.6.0/install.sh b/package/libint/2.6.0/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..d48e3eb3e4f831057a288f085f6377684c774f65
--- /dev/null
+++ b/package/libint/2.6.0/install.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+export GCC_LIBS=${JARVIS_LIBS}/kgcc9
+tar -xvf ${JARVIS_DOWNLOAD}/libint-2.6.0.tar.gz
+cd libint-2.6.0
+./autogen.sh
+mkdir build
+cd build
+export LDFLAGS="-L${GCC_LIBS}/gmp/6.2.0/lib -L${GCC_LIBS}/boost/1.72.0/lib"
+export CPPFLAGS="-I${GCC_LIBS}/gmp/6.2.0/include -I${GCC_LIBS}/boost/1.72.0/include"
+../configure CXX=mpicxx --enable-eri=1 --enable-eri2=1 --enable-eri3=1 --with-max-am=4 --with-eri-max-am=4,3 --with-eri2-max-am=6,5 --with-eri3-max-am=6,5 --with-opt-am=3 --enable-generic-code --disable-unrolling --with-libint-exportdir=libint_cp2k_lmax4
+make export
+tar -xvf libint_cp2k_lmax4.tgz
+cd libint_cp2k_lmax4
+./configure --prefix=$1 CC=mpicc CXX=mpicxx FC=mpifort --enable-fortran --enable-shared
+make -j 32
+make install
diff --git a/package/libvori/21.04.12/install.sh b/package/libvori/21.04.12/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..9f782329a895fc05ca5b86f2a17aedceee0ee674
--- /dev/null
+++ b/package/libvori/21.04.12/install.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xzvf ${JARVIS_DOWNLOAD}/libvori-210412.tar.gz
+cd libvori-210412
+mkdir build
+cd build
+cmake .. -DCMAKE_INSTALL_PREFIX=$1
+make -j
+make install
+
diff --git a/package/libxc/5.1.4/install.sh b/package/libxc/5.1.4/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..cc4d52340b5470b8358dd0c6a8f1c5fcd8e7b765
--- /dev/null
+++ b/package/libxc/5.1.4/install.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/libxc-5.1.4.tar.gz
+cd libxc-5.1.4
+./configure FC=gfortran CC=gcc --prefix=$1
+make -j
+make install
diff --git a/package/openblas/0.3.18/install.sh b/package/openblas/0.3.18/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..d475d9e78dd32c9d39a627f87615b6e00937e43f
--- /dev/null
+++ b/package/openblas/0.3.18/install.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xzvf ${JARVIS_DOWNLOAD}/OpenBLAS-0.3.18.tar.gz
+cd OpenBLAS-0.3.18
+make -j
+make PREFIX=$1 install
diff --git a/package/openmpi/4.1.2/gpu/install.sh b/package/openmpi/4.1.2/gpu/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..06ff675a366cc66c07b8b6bb3dc13521965d4161
--- /dev/null
+++ b/package/openmpi/4.1.2/gpu/install.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+#install ucx
+tar -xvf ${JARVIS_DOWNLOAD}/ucx-1.12.0.tar.gz
+cd ucx-1.12.0
+./autogen.sh
+./contrib/configure-release --prefix=$1/ucx
+make -j8
+make install
+#install openmpi
+tar -xvf ${JARVIS_DOWNLOAD}/openmpi-4.1.2.tar.gz
+cd openmpi-4.1.2
+CPP=cpp CC=nvc CFLAGS='-DNDEBUG -O1 -nomp -fPIC -fno-strict-aliasing -tp=haswell' CXX=nvc++ CXXFLAGS='-DNDEBUG -O1 -nomp -fPIC -finline-functions -tp=haswell' F77=nvfortran F90=nvfortran FC=nvfortran FCFLAGS='-O1 -nomp -fPIC -tp=haswell' FFLAGS='-fast -Mipa=fast,inline -tp=haswell' LDFLAGS=-Wl,--as-needed ./configure --prefix=$1 --disable-debug --disable-getpwuid --disable-mem-debug --disable-mem-profile --disable-memchecker --disable-static --enable-mca-no-build=btl-uct,op-avx --enable-mpi1-compatibility --enable-oshmem --with-cuda=/usr/local/cuda --with-ucx=$1/ucx
+make -j8
+make install
+
+export LIBRARY_PATH=$1/lib:$LIBRARY_PATH
+export PATH=$1/bin:$PATH
+# recommended UCX runtime settings for CUDA-aware runs (export these before mpirun):
+# UCX_IB_PCI_RELAXED_ORDERING=on
+# UCX_MAX_RNDV_RAILS=1
+# UCX_MEMTYPE_CACHE=n
+# UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda
+# UCX_TLS=rc_v,sm,cuda_copy,cuda_ipc,gdr_copy (or UCX_TLS=all)
diff --git a/package/openmpi/4.1.2/install.sh b/package/openmpi/4.1.2/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..6b93da002460d131b4bd8651555e802edb349b31
--- /dev/null
+++ b/package/openmpi/4.1.2/install.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/openmpi-4.1.2.tar.gz
+cd openmpi-4.1.2
+./configure CC=gcc CXX=g++ FC=gfortran --prefix=$1 --enable-pretty-print-stacktrace --enable-orterun-prefix-by-default --with-knem=/opt/knem-1.1.4.90mlnx1/ --with-hcoll=/opt/mellanox/hcoll/ --with-cma --with-ucx --enable-mpi1-compatibility
+make -j install
diff --git a/package/plumed/2.6.2/install.sh b/package/plumed/2.6.2/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..a75c28b31617c98ccd5263497882ce6078b923d7
--- /dev/null
+++ b/package/plumed/2.6.2/install.sh
@@ -0,0 +1,9 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/plumed-2.6.2.tgz
+cd plumed-2.6.2
+./configure CXX=mpicxx CC=mpicc FC=mpifort --prefix=$1 --enable-external-blas --enable-gsl --enable-external-lapack LDFLAGS=-L/home/HT3/HPCRunner2/package/lapack/3.8.0/lapack-3.8.0/ LIBS="-lrefblas -llapack"
+make -j
+make install
diff --git a/package/python3/3.7.10/install.sh b/package/python3/3.7.10/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..49f88a7350acfa2fb6f1ed98e91a77665b8070de
--- /dev/null
+++ b/package/python3/3.7.10/install.sh
@@ -0,0 +1,12 @@
+#!/bin/bash
+# https://repo.huaweicloud.com/python/3.7.10/Python-3.7.10.tgz
+set -x
+set -e
+cd ${JARVIS_TMP}
+yum -y install zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gcc make libffi-devel
+tar -zxvf ${JARVIS_DOWNLOAD}/Python-3.7.10.tgz
+cd Python-3.7.10
+./configure --prefix=${JARVIS_COMPILER}/python3
+make
+make install
+ln -sf ${JARVIS_COMPILER}/python3/bin/python3.7 /usr/local/bin/python3
\ No newline at end of file
diff --git a/package/scalapack/2.1.0/install.sh b/package/scalapack/2.1.0/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..e79a4709e95a28c04cf4abdbc0db79914a88ccc4
--- /dev/null
+++ b/package/scalapack/2.1.0/install.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/scalapack-2.1.0.tgz
+cd scalapack-2.1.0
+cp SLmake.inc.example SLmake.inc
+make -j
+mkdir $1/lib
+cp *.a $1/lib
diff --git a/package/scalapack/2.1.0/kml/install.sh b/package/scalapack/2.1.0/kml/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..26da61aa5d306a2a6c53101f40a0af2fd4e4c70a
--- /dev/null
+++ b/package/scalapack/2.1.0/kml/install.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+rm -rf scalapack-2.1.0
+tar -xvf ${JARVIS_DOWNLOAD}/scalapack-2.1.0.tgz
+cd scalapack-2.1.0
+rm -rf build
+mkdir build
+cd build
+cmake -DCMAKE_INSTALL_PREFIX=$1 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_SHARED_LIBS=ON -DBLAS_LIBRARIES=/usr/local/kml/lib/kblas/omp/libkblas.so -DLAPACK_LIBRARIES=/usr/local/kml/lib/libklapack_full.so -DCMAKE_C_COMPILER=mpicc -DCMAKE_Fortran_COMPILER=mpif90 ..
+make -j
+make install
\ No newline at end of file
diff --git a/package/spglib/1.16.0/install.sh b/package/spglib/1.16.0/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..c1877ea7d4ab322b7725445cc7e1b0672fe782b4
--- /dev/null
+++ b/package/spglib/1.16.0/install.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+tar -xvf ${JARVIS_DOWNLOAD}/spglib-1.16.0.tar.gz
+cd spglib-1.16.0
+mkdir build
+cd build
+cmake .. -DCMAKE_INSTALL_PREFIX=$1
+make -j
+make install
diff --git a/package/tau/2.30.0/install.sh b/package/tau/2.30.0/install.sh
new file mode 100644
index 0000000000000000000000000000000000000000..e7803c978cc4ab8af3535a8f7b68b3479a84530a
--- /dev/null
+++ b/package/tau/2.30.0/install.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+set -x
+set -e
+cd ${JARVIS_TMP}
+# install PDT
+tar -zxvf ${JARVIS_DOWNLOAD}/pdt.tgz
+cd pdtoolkit-3.25.1/
+./configure -GNU -prefix=$1/PDT
+make -j install
+# install TAU, using tau with external package
+tar -zxvf ${JARVIS_DOWNLOAD}/tau-2.30.0.tar.gz
+cd tau-2.30.0/
+./configure -openmp -bfd=download -unwind=download -mpi -pdt=$1/PDT/ -pdt_c++=g++
+make install
+export PATH=$1/tau-2.30.0/arm64_linux/bin:$PATH
+
+#usage: mpirun --allow-run-as-root -np 128 -x OMP_NUM_THREADS=1 --mca btl ^openib tau_exec vasp_std
+#pprof
diff --git a/software/compiler/bisheng/2.1.0/installed b/software/compiler/bisheng/2.1.0/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/compiler/bisheng/2.1.0/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/software/compiler/gcc/9.3.1/installed b/software/compiler/gcc/9.3.1/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/compiler/gcc/9.3.1/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/software/compiler/kgcc/10.3.1/installed b/software/compiler/kgcc/10.3.1/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/compiler/kgcc/10.3.1/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/software/compiler/kgcc/9.3.1/installed b/software/compiler/kgcc/9.3.1/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/compiler/kgcc/9.3.1/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/software/compiler/python3/installed b/software/compiler/python3/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/compiler/python3/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/software/libs/bisheng2/openblas/0.3.18/installed b/software/libs/bisheng2/openblas/0.3.18/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/libs/bisheng2/openblas/0.3.18/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/software/libs/gcc9/fftw/3.3.8/installed b/software/libs/gcc9/fftw/3.3.8/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/libs/gcc9/fftw/3.3.8/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/software/libs/nvc/installed b/software/libs/nvc/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/libs/nvc/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/software/moduledeps/gcc9-openmpi4/scalapack/2.1.0 b/software/moduledeps/gcc9-openmpi4/scalapack/2.1.0
new file mode 100644
index 0000000000000000000000000000000000000000..c0af1c6e6d9bab59c55af2d401a7bd2111603148
--- /dev/null
+++ b/software/moduledeps/gcc9-openmpi4/scalapack/2.1.0
@@ -0,0 +1,5 @@
+#%Module1.0#####################################################################
+set rootdir $::env(JARVIS_ROOT)
+set version 2.1.0
+
+prepend-path LD_LIBRARY_PATH $rootdir/software/libs/gcc9/openmpi4/scalapack/2.1.0
diff --git a/software/moduledeps/gcc9/openblas/0.3.18 b/software/moduledeps/gcc9/openblas/0.3.18
new file mode 100644
index 0000000000000000000000000000000000000000..509b37ca869a54aa8bf89dc996f2a7516979c336
--- /dev/null
+++ b/software/moduledeps/gcc9/openblas/0.3.18
@@ -0,0 +1,12 @@
+#%Module1.0#####################################################################
+set rootdir $::env(JARVIS_ROOT)
+set prefix $rootdir/software/libs/gcc9/openblas/0.3.18
+set version 0.3.18
+
+prepend-path PATH $prefix/bin
+prepend-path INCLUDE $prefix/include
+prepend-path LD_LIBRARY_PATH $prefix/lib
+
+setenv OPENBLAS_DIR $prefix
+setenv OPENBLAS_LIB $prefix/lib
+setenv OPENBLAS_INC $prefix/include
diff --git a/software/modulefiles/gcc9/9.3.1 b/software/modulefiles/gcc9/9.3.1
new file mode 100644
index 0000000000000000000000000000000000000000..2a2a3f888552d5a362768e9400e1b6126da73b99
--- /dev/null
+++ b/software/modulefiles/gcc9/9.3.1
@@ -0,0 +1,10 @@
+#%Module1.0#####################################################################
+set rootdir $::env(JARVIS_ROOT)
+set prefix $rootdir/software/compiler/gcc/9.3.1
+set version 9.3.1
+
+prepend-path PATH $prefix/bin
+prepend-path MANPATH $prefix/share/man
+prepend-path INCLUDE $prefix/include
+prepend-path LD_LIBRARY_PATH $prefix/lib64
+prepend-path MODULEPATH $rootdir/software/moduledeps/gcc9
diff --git a/software/mpi/openmpi4-gcc9/4.1.2/installed b/software/mpi/openmpi4-gcc9/4.1.2/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/mpi/openmpi4-gcc9/4.1.2/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/software/utils/cmake/3.20.5/installed b/software/utils/cmake/3.20.5/installed
new file mode 100644
index 0000000000000000000000000000000000000000..c227083464fb9af8955c90d2924774ee50abb547
--- /dev/null
+++ b/software/utils/cmake/3.20.5/installed
@@ -0,0 +1 @@
+0
\ No newline at end of file
diff --git a/src/analysis.py b/src/analysis.py
new file mode 100644
index 0000000000000000000000000000000000000000..1492550b3db69d68f1cc61ebeb532a68183e256f
--- /dev/null
+++ b/src/analysis.py
@@ -0,0 +1,640 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+import platform
+import sys
+import os
+import re
+from glob import glob
+
+from data import Data
+from tool import Tool
+from execute import Execute
+from machine import Machine
+from bench import Benchmark
+
+from enum import Enum
+
+class SType(Enum):
+ COMPILER = 1
+ MPI = 2
+ UTIL = 3
+ LIB = 4
+
+class Install:
+ def __init__(self):
+ self.hpc_data = Data()
+ self.exe = Execute()
+ self.tool = Tool()
+ self.ROOT = os.getcwd()
+ self.PACKAGE_PATH = os.path.join(self.ROOT, 'package')
+ self.COMPILER_PATH = os.path.join(self.ROOT, 'software/compiler')
+ self.LIBS_PATH = os.path.join(self.ROOT, 'software/libs')
+ self.MODULE_DEPS_PATH = os.path.join(self.ROOT, 'software/moduledeps')
+ self.MODULE_FILES = os.path.join(self.ROOT, 'software/modulefiles')
+ self.MPI_PATH = os.path.join(self.ROOT, 'software/mpi')
+ self.UTILS_PATH = os.path.join(self.ROOT, 'software/utils')
+
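+    # returns the major version only, e.g. '9' from 'gcc version 9.3.1'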
+ def get_version_info(self, info):
+ return re.search( r'(\d+)\.(\d+)\.',info).group(1)
+
+ # some command don't generate output, must redirect to a tmp file
+ def get_cmd_output(self, cmd):
+ tmp_path = os.path.join(self.ROOT, 'tmp')
+ tmp_file = os.path.join(tmp_path, 'tmp.txt')
+ self.tool.mkdirs(tmp_path)
+ cmd += f' &> {tmp_file}'
+ self.exe.exec_popen(cmd, False)
+ info_list = self.tool.read_file(tmp_file).split('\n')
+ return info_list
+
+ def get_gcc_info(self):
+ gcc_info_list = self.get_cmd_output('gcc -v')
+ gcc_info = gcc_info_list[-1].strip()
+ version = self.get_version_info(gcc_info)
+ name = 'gcc'
+ if 'kunpeng' in gcc_info.lower():
+ name = 'kgcc'
+ return {"cname": name, "cmversion": version}
+
+ def get_clang_info(self):
+ clang_info_list = self.get_cmd_output('clang -v')
+ clang_info = clang_info_list[0].strip()
+ version = self.get_version_info(clang_info)
+ name = 'clang'
+ if 'bisheng' in clang_info.lower():
+ name = 'bisheng'
+ return {"cname": name, "cmversion": version}
+
+ def get_nvc_info(self):
+ return {"cname": "cuda", "cmversion": "11"}
+
+ def get_icc_info(self):
+ return {"cname": "icc", "cmversion": "11"}
+
+ def get_mpi_info(self):
+ mpi_info_list = self.get_cmd_output('mpirun -version')
+ mpi_info = mpi_info_list[0].strip()
+ name = 'openmpi'
+ version = self.get_version_info(mpi_info)
+ hmpi_info = self.get_cmd_output('ompi_info | grep "MCA coll: ucx"')[0]
+ if hmpi_info != "":
+ name = 'hmpi'
+ version = re.search( r'Component v(\d+)\.(\d+)\.',hmpi_info).group(1)
+ return {"name": name, "version": version}
+
+ def check_software_path(self, software_path):
+ abs_software_path = os.path.join(self.PACKAGE_PATH, software_path)
+ if not os.path.exists(abs_software_path):
+            print(f"{software_path} does not exist. Are you sure the software lies in the package dir?")
+ return False
+ return abs_software_path
+
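+    # valid 2nd argument of `jarvis -install`: <compiler>, <compiler>+MPI, COM (install a compiler itself) or ANY (generic utility); case-insensitive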
+ def check_compiler_mpi(self, compiler_list, compiler_mpi_info):
+ no_compiler = ["COM","ANY"]
+ is_valid = False
+ compiler_mpi_info = compiler_mpi_info.upper()
+ valid_list = []
+ for compiler in compiler_list:
+ valid_list.append(compiler)
+ valid_list.append(f'{compiler}+MPI')
+ valid_list += no_compiler
+ for valid_para in valid_list:
+ if compiler_mpi_info == valid_para:
+ is_valid = True
+ break
+ if not is_valid:
+            print(f"compiler or mpi info error, only {'/'.join(valid_list).lower()} is supported")
+ return False
+ return compiler_mpi_info
+
+ def get_used_compiler(self, compiler_mpi_info):
+ return compiler_mpi_info.split('+')[0]
+
+ def get_software_type(self,software_name, compiler_mpi_info):
+ if self.is_mpi_software(software_name):
+ return SType.MPI
+ if compiler_mpi_info == "COM":
+ return SType.COMPILER
+ elif compiler_mpi_info == "ANY":
+ return SType.UTIL
+ else:
+ return SType.LIB
+
+ def get_suffix(self, software_info_list):
+ if len(software_info_list) == 3:
+ return software_info_list[2]
+ return ""
+
+ def get_software_info(self, software_path, compiler_mpi_info):
+ software_info_list = software_path.split('/')
+ software_name = software_info_list[0]
+ software_version = software_info_list[1]
+ software_main_version = self.get_main_version(software_version)
+ software_type = self.get_software_type(software_name, compiler_mpi_info)
+ software_info = {
+ "sname":software_name,
+ "sversion": software_version,
+ "mversion": software_main_version,
+ "type" : software_type,
+ "suffix": self.get_suffix(software_info_list)
+ }
+ if software_type == SType.LIB or software_type == SType.MPI:
+ software_info["is_use_mpi"] = self.is_contained_mpi(compiler_mpi_info)
+ software_info["use_compiler"] = self.get_used_compiler(compiler_mpi_info)
+ return software_info
+
+ def get_compiler_info(self, compilers, compiler_mpi_info):
+ compiler_info = {"cname":None, "cmversion": None}
+ for compiler, info_func in compilers.items():
+ if compiler in compiler_mpi_info:
+ compiler_info = info_func()
+ return compiler_info
+
+ def get_main_version(self, version):
+ return version.split('.')[0]
+
+ def is_mpi_software(self, software_name):
+ mpis = ['hmpi', 'openmpi', 'hpcx']
+ return software_name in mpis
+
+ def add_mpi_path(self, software_info, install_path):
+ if not software_info['is_use_mpi']:
+ return install_path
+ mpi_info = self.get_mpi_info()
+        if mpi_info["version"] is None:
+ print("MPI not found!")
+ return False
+ mpi_str = mpi_info["name"]+mpi_info["version"]
+ print("Use MPI: "+mpi_str)
+ install_path = os.path.join(install_path, mpi_str)
+ return install_path
+
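+    # install layout under software/:
+    #   compiler -> software/compiler/<name>/<version>
+    #   util     -> software/utils/<name>/<version>
+    #   mpi      -> software/mpi/<name><major>-<compiler><major>/<version>
+    #   lib      -> software/libs/<compiler><major>[/<mpi><major>]/<name>/<version>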
+ def get_install_path(self, software_info, env_info):
+ suffix = software_info['suffix']
+ sversion = software_info['sversion']
+ stype = software_info['type']
+ cname = env_info['cname']
+ if suffix != "":
+ software_info['sname'] += '-' + suffix
+ sname = software_info['sname']
+ if stype == SType.MPI:
+ return os.path.join(self.MPI_PATH, f"{sname}{self.get_main_version(sversion)}-{cname}{env_info['cmversion']}", sversion)
+ if stype == SType.COMPILER:
+ install_path = os.path.join(self.COMPILER_PATH, f'{sname}/{sversion}')
+ elif stype == SType.UTIL:
+ install_path = os.path.join(self.UTILS_PATH, f'{sname}/{sversion}')
+ else:
+ install_path = os.path.join(self.LIBS_PATH, cname+env_info['cmversion'])
+ # get mpi name and version
+ install_path = self.add_mpi_path(software_info, install_path)
+ install_path = os.path.join(install_path, f'{sname}/{sversion}')
+ return install_path
+
+ def is_contained_mpi(self, compiler_mpi_info):
+ return "MPI" in compiler_mpi_info
+
+ def get_files(self, abs_path):
+ file_list = [d for d in glob(abs_path+'/**', recursive=True)]
+ return file_list
+
+ def get_module_file_content(self, install_path, sversion):
+ module_file_content = ''
+ file_list = self.get_files(install_path)
+ bins_dir_type = ["bin"]
+ libs_dir_type = ["libs", "lib", "lib64"]
+ incs_dir_type = ["include"]
+ bins_dir = []
+ libs_dir = []
+ incs_dir = []
+ bins_str = ''
+ libs_str = ''
+ incs_str = ''
+ for file in file_list:
+ if not os.path.isdir(file):
+ continue
+ last_dir = file.split('/')[-1]
+ if last_dir in bins_dir_type:
+ bins_dir.append(file.replace(install_path, "$prefix"))
+ elif last_dir in libs_dir_type:
+ libs_dir.append(file.replace(install_path, "$prefix"))
+ elif last_dir in incs_dir_type:
+ incs_dir.append(file.replace(install_path, "$prefix"))
+ if len(bins_dir) >= 1:
+ bins_str = "prepend-path PATH "+':'.join(bins_dir)
+ if len(libs_dir) >= 1:
+ libs_str = "prepend-path LD_LIBRARY_PATH "+':'.join(libs_dir)
+ if len(incs_dir) >= 1:
+ incs_str = "prepend-path INCLUDE " + ':'.join(incs_dir)
+ module_file_content = f'''#%Module1.0#####################################################################
+set prefix {install_path}
+set version {sversion}
+
+{bins_str}
+{libs_str}
+{incs_str}
+'''
+ return module_file_content
+
+ def get_installed_file_path(self, install_path):
+ return os.path.join(install_path, "installed")
+
+ def is_installed(self, install_path):
+ installed_file_path = self.get_installed_file_path(install_path)
+ if not os.path.exists(installed_file_path):
+ return False
+ if not self.tool.read_file(installed_file_path) == "1":
+ return False
+ return True
+
+ def set_installed_status(self, install_path):
+ installed_file_path = self.get_installed_file_path(install_path)
+ self.tool.write_file(installed_file_path, "1")
+
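+    # module files: compilers and utils go to software/modulefiles, MPI and libs go to
+    # software/moduledeps/<compiler>[-<mpi>], mirroring the install layout above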
+ def gen_module_file(self, install_path, software_info, env_info):
+ sname = software_info['sname']
+ sversion = software_info['sversion']
+ stype = software_info['type']
+ cname = env_info['cname']
+ cmversion = env_info['cmversion']
+ software_str = sname + self.get_main_version(sversion)
+ module_file_content = self.get_module_file_content(install_path, sversion)
+ if not self.is_installed(install_path):
+ return
+ if stype == SType.MPI:
+ compiler_str = cname + cmversion
+ module_path = os.path.join(self.MODULE_DEPS_PATH, compiler_str ,software_str)
+ attach_module_path = os.path.join(self.MODULE_DEPS_PATH, compiler_str+'-'+software_str)
+ self.tool.mkdirs(attach_module_path)
+ module_file_content += f"\nprepend-path MODULEPATH {attach_module_path}"
+ else:
+ if stype == SType.COMPILER:
+ module_path = os.path.join(self.MODULE_FILES, software_str)
+ attach_module_path = os.path.join(self.MODULE_DEPS_PATH, software_str)
+ self.tool.mkdirs(attach_module_path)
+ module_file_content += f"\nprepend-path MODULEPATH {attach_module_path}"
+ elif stype == SType.UTIL:
+ module_path = os.path.join(self.MODULE_FILES, sname)
+ else:
+ compiler_str = cname + cmversion
+ if software_info['is_use_mpi']:
+ mpi_info = self.get_mpi_info()
+ mpi_str = mpi_info['name'] + self.get_main_version(mpi_info['version'])
+ module_path = os.path.join(self.MODULE_DEPS_PATH, f"{compiler_str}-{mpi_str}" ,sname)
+ else:
+ module_path = os.path.join(self.MODULE_DEPS_PATH, compiler_str, sname)
+ self.tool.mkdirs(module_path)
+ module_file = os.path.join(module_path, sversion)
+ self.tool.write_file(module_file, module_file_content)
+ print(f"module file {module_file} successfully generated")
+
+ def install_package(self, abs_software_path, install_path):
+ install_script = 'install.sh'
+ install_script_path = os.path.join(abs_software_path, install_script)
+ print("start installing..."+ abs_software_path)
+ if not os.path.exists(install_script_path):
+            print("install script does not exist, skipping...")
+ return
+ self.tool.mkdirs(install_path)
+ if self.is_installed(install_path):
+ print("already installed, skipping...")
+ return
+ install_cmd = f'''
+source ./init.sh
+cd {abs_software_path}
+chmod +x {install_script}
+./{install_script} {install_path}
+'''
+ result = self.exe.exec_raw(install_cmd)
+ if result:
+            print(f"installed to {install_path} successfully")
+ self.set_installed_status(install_path)
+ else:
+ print("install failed")
+ sys.exit()
+
+ def install(self, software_path, compiler_mpi_info):
+ self.tool.prt_content("INSTALL " + software_path)
+ compilers = {"GCC":self.get_gcc_info, "CLANG":self.get_clang_info,
+ "NVC":self.get_nvc_info, "ICC":self.get_icc_info,
+ "BISHENG":self.get_clang_info}
+
+ # software_path should exists
+ abs_software_path = self.check_software_path(software_path)
+ if not abs_software_path: return
+ compiler_mpi_info = self.check_compiler_mpi(compilers.keys(), compiler_mpi_info)
+ if not compiler_mpi_info: return
+ software_info = self.get_software_info(software_path, compiler_mpi_info)
+ stype = software_info['type']
+ # get compiler name and version
+ env_info = self.get_compiler_info(compilers, compiler_mpi_info)
+ if stype == SType.LIB or stype == SType.MPI:
+ cmversion = env_info['cmversion']
+            if cmversion is None:
+                print(f"The specified {software_info['use_compiler']} compiler was not found!")
+ return False
+ else:
+ print(f"Use Compiler: {env_info['cname']} {cmversion}")
+
+ # get install path
+ install_path = self.get_install_path(software_info, env_info)
+ if not install_path: return
+ # get install script
+ self.install_package(abs_software_path, install_path)
+ # gen module file
+ self.gen_module_file( install_path, software_info, env_info)
+
+ def install_depend(self):
+ depend_file = 'depend_install.sh'
+        print(f"start installing dependency of {Data.app_name}")
+ depend_content = f'''
+{Data.dependency}
+'''
+ self.tool.write_file(depend_file, depend_content)
+ run_cmd = f'''
+chmod +x {depend_file}
+./{depend_file}
+'''
+ self.exe.exec_raw(run_cmd)
+
+class Env:
+ def __init__(self):
+ self.hpc_data = Data()
+ self.tool = Tool()
+ self.ROOT = os.getcwd()
+ self.exe = Execute()
+
+ def env(self):
+ print(f"set environment {Data.app_name}")
+ env_file = os.path.join(self.ROOT, Data.env_file)
+ self.tool.write_file(env_file, Data.module_content)
+ print(f"ENV FILE {Data.env_file} GENERATED.")
+ self.exe.exec_raw(f'chmod +x {Data.env_file}')
+
+class Build:
+ def __init__(self):
+ self.hpc_data = Data()
+ self.exe = Execute()
+
+ def clean(self):
+ print(f"start clean {Data.app_name}")
+ clean_cmd=self.hpc_data.get_clean_cmd()
+ self.exe.exec_raw(clean_cmd)
+
+ def build(self):
+ print(f"start build {Data.app_name}")
+ build_cmd = self.hpc_data.get_build_cmd()
+ self.exe.exec_raw(build_cmd)
+
+class Run:
+ def __init__(self):
+ self.hpc_data = Data()
+ self.exe = Execute()
+ self.tool = Tool()
+ self.ROOT = os.getcwd()
+ self.avail_ips_list = self.tool.gen_list(Data.avail_ips)
+
+ def gen_hostfile(self, nodes):
+ length = len(self.avail_ips_list)
+ if nodes > length:
+ print(f"You don't have {nodes} nodes, only {length} nodes available!")
+ sys.exit()
+ if nodes <= 1:
+ return
+ gen_nodes = '\n'.join(self.avail_ips_list[:nodes])
+ print(f"HOSTFILE\n{gen_nodes}\nGENERATED.")
+ self.tool.write_file('hostfile', gen_nodes)
+
+ # single run
+ def run(self):
+ print(f"start run {Data.app_name}")
+ nodes = int(Data.run_cmd['nodes'])
+ self.gen_hostfile(nodes)
+ run_cmd = self.hpc_data.get_run_cmd()
+ self.exe.exec_raw(run_cmd)
+
+ def batch_run(self):
+ batch_file = 'batch_run.sh'
+ batch_file_path = os.path.join(self.ROOT, batch_file)
+ print(f"start batch run {Data.app_name}")
+ batch_content = f'''
+cd {Data.case_dir}
+{Data.batch_cmd}
+'''
+ self.tool.write_file(batch_file_path, batch_content)
+ run_cmd = f'''
+chmod +x {batch_file}
+./{batch_file}
+'''
+ self.exe.exec_raw(run_cmd)
+
+class Perf:
+ def __init__(self):
+ self.hpc_data = Data()
+ self.exe = Execute()
+ self.tool = Tool()
+ self.isARM = platform.machine() == 'aarch64'
+
+ def get_pid(self):
+ #get pid
+ pid_cmd = f'pidof {Data.binary_file}'
+ result = self.exe.exec_popen(pid_cmd)
+ if len(result) == 0:
+ print("failed to get pid.")
+ sys.exit()
+ else:
+ pid_list = result[0].split(' ')
+ mid = int(len(pid_list)/2)
+ return pid_list[mid].strip()
+
+ def perf(self):
+ print(f"start perf {Data.app_name}")
+ #get pid
+ pid = self.get_pid()
+ #start perf && analysis
+ perf_cmd = f'''
+perf record {Data.perf_para} -a -g -p {pid}
+perf report -i ./perf.data -F period,sample,overhead,symbol,dso,comm -s overhead --percent-limit 0.1% --stdio
+'''
+ self.exe.exec_raw(perf_cmd)
+
+ def get_arch(self):
+ arch = 'arm'
+ if not self.isARM:
+ arch = 'X86'
+ return arch
+
+ def get_cur_time(self):
+ return re.sub(' |:', '-', self.tool.get_time_stamp())
+
+ def gpu_perf(self):
+ print(f"start gpu perf")
+ run_cmd = self.hpc_data.get_run()
+ gperf_cmd = f'''
+cd {Data.case_dir}
+nsys profile -y 5s -d 100s {Data.nsys_para} -o nsys-{self.get_arch()}-{self.get_cur_time()} {run_cmd}
+ '''
+ self.exe.exec_raw(gperf_cmd)
+
+ def ncu_perf(self, kernel):
+ print(f"start ncu perf")
+ run_cmd = self.hpc_data.get_run()
+ ncu_cmd = f'''
+ cd {Data.case_dir}
+ ncu --export ncu-{self.get_arch()}-{self.get_cur_time()} {Data.ncu_para} --import-source=yes --set full --kernel-name {kernel} --launch-skip 1735 --launch-count 1 {run_cmd}
+ '''
+ self.exe.exec_raw(ncu_cmd)
+
+class Download:
+ def __init__(self):
+ self.hpc_data = Data()
+ self.exe = Execute()
+ self.tool = Tool()
+ self.ROOT = os.getcwd()
+ self.download_list = self.tool.gen_list(Data.download_info)
+ self.download_path = os.path.join(self.ROOT, 'downloads')
+
+ def check_network(self):
+ print(f"start network checking")
+ network_test_cmd='''
+wget --spider -T 5 -q -t 2 www.baidu.com; echo $?
+curl -s -o /dev/null www.baidu.com; echo $?
+ '''
+ self.exe.exec_raw(network_test_cmd)
+
+ def change_yum_repo(self):
+ print(f"start yum repo change")
+ repo_cmd = '''
+cp ./templates/yum/*.repo /etc/yum.repos.d/
+yum clean all
+yum makecache
+'''
+ self.exe.exec_raw(repo_cmd)
+
+ def gen_wget_url(self, out_dir='./downloads', url=''):
+ head = "wget --no-check-certificate"
+ out_para = "-P"
+ download_url = f'{head} {out_para} {out_dir} {url}'
+ return download_url
+
+ def download(self):
+ print(f"start download")
+ url_links = []
+ self.tool.mkdirs(self.download_path)
+ download_flag = False
+ # create directory
+ for url_info in self.download_list:
+ url_list = url_info.split(' ')
+ if len(url_list) != 2:
+ continue
+ software_info = url_list[0].strip()
+ url_link = url_list[1].strip()
+ url_links.append(url_link)
+ # create software directory
+ software_path = os.path.join(self.ROOT, 'package', software_info)
+ self.tool.mkdirs(software_path)
+ # create install script
+ install_script = os.path.join(software_path, "install.sh")
+ self.tool.mkfile(install_script)
+ # start download
+ for url in url_links:
+ download_flag = True
+ filename = os.path.basename(url)
+ file_path = os.path.join(self.download_path, filename)
+ if os.path.exists(file_path):
+ self.tool.prt_content(f"FILE {filename} already DOWNLOADED")
+ continue
+ download_url = self.gen_wget_url(self.download_path, url)
+ self.tool.prt_content("DOWNLOAD " + filename)
+            self.exe.exec_raw(download_url)
+ if not download_flag:
+ print("The download list is empty!")
+
+
+class Test:
+ def __init__(self):
+ self.exe = Execute()
+ self.ROOT = os.getcwd()
+ self.test_dir = os.path.join(self.ROOT, 'test')
+
+ def test(self):
+ run_cmd = f'''
+cd {self.test_dir}
+./test-qe.sh
+cd {self.test_dir}
+./test-util.sh
+'''
+ self.exe.exec_raw(run_cmd)
+
+class Config:
+ def __init__(self):
+ self.exe = Execute()
+ self.tool = Tool()
+ self.ROOT = os.getcwd()
+
+ def switch_config(self, config_file):
+ print(f"Switch config file to {config_file}")
+ meta_path = os.path.join(self.ROOT, Data.meta_file)
+ self.tool.write_file(meta_path, config_file.strip())
+ print("Successfully switched.")
+
+class Analysis:
+ def __init__(self):
+ self.jmachine = Machine()
+ self.jtest = Test()
+ self.jdownload = Download()
+ self.jbenchmark = Benchmark()
+ self.jperf = Perf()
+ self.jrun = Run()
+ self.jbuild = Build()
+ self.jenv = Env()
+ self.jinstall = Install()
+ self.jconfig = Config()
+
+ def get_machine_info(self):
+ self.jmachine.output_machine_info()
+
+ def bench(self, bench_case):
+ self.jbenchmark.output_bench_info(bench_case)
+
+ def switch_config(self, config_file):
+ self.jconfig.switch_config(config_file)
+
+ def test(self):
+ self.jtest.test()
+
+ def download(self):
+ self.jdownload.download()
+
+ def check_network(self):
+ self.jdownload.check_network()
+
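+    def change_yum_repo(self):
+        # delegate to Download so that the -yum option in jarvis.py resolves
+        self.jdownload.change_yum_repo()
+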
+ def gpu_perf(self):
+ self.jperf.gpu_perf()
+
+ def ncu_perf(self, kernel):
+ self.jperf.ncu_perf(kernel)
+
+ def perf(self):
+ self.jperf.perf()
+
+ def kperf(self):
+ self.jperf.kperf()
+
+ def run(self):
+ self.jrun.run()
+
+ def batch_run(self):
+ self.jrun.batch_run()
+
+ def clean(self):
+ self.jbuild.clean()
+
+ def build(self):
+ self.jbuild.build()
+
+ def env(self):
+ self.jenv.env()
+
+ def install(self,software_path, compiler_mpi_info):
+ self.jinstall.install(software_path, compiler_mpi_info)
+
+ def install_deps(self):
+ self.jinstall.install_depend()
diff --git a/src/bench.py b/src/bench.py
new file mode 100644
index 0000000000000000000000000000000000000000..96f55d70c9ac7b4912c28041dd66b919b57b132e
--- /dev/null
+++ b/src/bench.py
@@ -0,0 +1,26 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+import platform
+import os
+from glob import glob
+
+from execute import Execute
+
+class Benchmark:
+ def __init__(self):
+ self.isARM = platform.machine() == 'aarch64'
+ self.ROOT = os.getcwd()
+ self.exe = Execute()
+ self.RUN_FILE = 'run.sh'
+ self.ALL = 'all'
+
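+    # run ./benchmark/<case>/run.sh for the given case; pass 'all' to run every benchmark directory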
+ def output_bench_info(self, bench_case):
+ bench_path = os.path.join(self.ROOT, 'benchmark')
+ file_list = [d for d in glob(bench_path+'/**', recursive=False)]
+ for file in file_list:
+ cur_bench_case = os.path.basename(file)
+ run_file = os.path.join(file, self.RUN_FILE)
+ if os.path.isdir(file) and os.path.exists(run_file):
+ cmd = f"cd {file} && chmod +x {self.RUN_FILE} && ./{self.RUN_FILE}"
+                if bench_case == self.ALL or cur_bench_case == bench_case:
+ self.exe.exec_raw(cmd)
diff --git a/data.py b/src/data.py
similarity index 75%
rename from data.py
rename to src/data.py
index 116348ff86a07c15e82a8a2998f614e750f72538..31bb4d2bd2b3a73e997613057648e35031339528 100644
--- a/data.py
+++ b/src/data.py
@@ -3,10 +3,13 @@
import os
import platform
+from tool import Tool
+
class Data:
# Hardware Info
avail_ips=''
# Dependent Software environment Info
+ dependency = ''
module_content=''
env_file = 'env.sh'
# Application Info
@@ -23,33 +26,36 @@ class Data:
batch_cmd = ''
#Other Info
meta_file = '.meta'
- download_urls = '''
-https://www.cp2k.org/static/downloads/libxc-5.1.4.tar.gz
-https://www.cp2k.org/static/downloads/fftw-3.3.8.tar.gz
-'''
-
+ root_path = os.getcwd()
+ download_info = ''
+ #perf info
+ kperf_para = ''
+ perf_para = ''
+ nsys_para = ''
+ ncu_para = ''
+ def get_abspath(self, relpath):
+ return os.path.join(Data.root_path, relpath)
+
def __init__(self):
self.isARM = platform.machine() == 'aarch64'
+ self.tool = Tool()
self.data_process()
def get_file_name(self):
file_name = 'data.config'
if not os.path.exists(Data.meta_file):
- if not self.isARM:
- file_name = 'data.X86.config'
return file_name
- with open(Data.meta_file, encoding='utf-8') as file_obj:
- contents = file_obj.read()
- return contents.strip()
+ return self.tool.read_file(Data.meta_file)
def get_data_config(self):
file_name = self.get_file_name()
- with open(file_name, encoding='utf-8') as file_obj:
+ file_path = self.get_abspath(file_name)
+ with open(file_path, encoding='utf-8') as file_obj:
contents = file_obj.read()
return contents.strip()
- def is_empty(self, content):
- return len(content) == 0 or content.isspace() or content == '\n'
+    def is_empty(self, text):
+        return len(text) == 0 or text.isspace() or text == '\n'
def read_rows(self, rows, start_row):
data = ''
@@ -81,6 +87,12 @@ https://www.cp2k.org/static/downloads/fftw-3.3.8.tar.gz
Data.build_dir = data['build_dir']
Data.binary_dir = data['binary_dir']
Data.case_dir = data['case_dir']
+
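+    # [PERF] keys in data.config: kperf / perf / nsys / ncu -> extra CLI options passed to each profiler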
+ def set_perf_info(self, data):
+ Data.kperf_para = data['kperf']
+ Data.perf_para = data['perf']
+ Data.nsys_para = data['nsys']
+ Data.ncu_para = data['ncu']
def split_two_part(self, data):
split_list = data.split(' ', 1)
@@ -95,10 +107,15 @@ https://www.cp2k.org/static/downloads/fftw-3.3.8.tar.gz
rows = contents.split('\n')
rowIndex = 0
data = {}
+ perf_data = {}
while rowIndex < len(rows):
row = rows[rowIndex].strip()
if row == '[SERVER]':
rowIndex, Data.avail_ips = self.read_rows(rows, rowIndex+1)
+ elif row == '[DOWNLOAD]':
+ rowIndex, Data.download_info = self.read_rows(rows, rowIndex+1)
+ elif row == '[DEPENDENCY]':
+ rowIndex, Data.dependency = self.read_rows(rows, rowIndex+1)
elif row == '[ENV]':
rowIndex, Data.module_content = self.read_rows(rows, rowIndex+1)
elif row == '[APP]':
@@ -112,6 +129,9 @@ https://www.cp2k.org/static/downloads/fftw-3.3.8.tar.gz
rowIndex, Data.run_cmd = self.read_rows_kv(rows, rowIndex+1)
elif row == '[BATCH]':
rowIndex, Data.batch_cmd = self.read_rows(rows, rowIndex+1)
+ elif row == '[PERF]':
+ rowIndex, perf_data = self.read_rows_kv(rows, rowIndex+1)
+ self.set_perf_info(perf_data)
else:
rowIndex += 1
Data.binary_file, Data.binary_para = self.split_two_part(Data.run_cmd['binary'])
@@ -121,9 +141,14 @@ https://www.cp2k.org/static/downloads/fftw-3.3.8.tar.gz
cd {Data.build_dir}
{Data.clean_cmd}
'''
+ def get_env(self):
+ return f'''
+./jarvis -e
+source ./{Data.env_file}'''
def get_build_cmd(self):
return f'''
+{self.get_env()}
cd {Data.build_dir}
{Data.build_cmd}
'''
@@ -141,6 +166,7 @@ cd {Data.build_dir}
def get_run_cmd(self):
return f'''
+{self.get_env()}
cd {Data.case_dir}
{self.get_run()}
'''
\ No newline at end of file
diff --git a/src/execute.py b/src/execute.py
new file mode 100644
index 0000000000000000000000000000000000000000..19e6b50f283d2bfb08a61d6fc1946e8d5782162b
--- /dev/null
+++ b/src/execute.py
@@ -0,0 +1,62 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+import os
+import logging
+from asyncio.log import logger
+from datetime import datetime
+from tool import Tool
+
+LOG_FORMAT = "%(asctime)s - %(levelname)s - %(message)s"
+DATE_FORMAT = "%m/%d/%Y %H:%M:%S %p"
+logging.basicConfig(filename='runner.log', level=logging.DEBUG, format=LOG_FORMAT, datefmt=DATE_FORMAT)
+
+class Execute:
+ def __init__(self):
+ self.cur_time = ''
+ self.end_time = ''
+ self.tool = Tool()
+ self.flags = '*' * 80
+ self.end_flag = 'END: '
+
+ # tools function
+ def join_cmd(self, arrs):
+ return " && ".join(arrs)
+
+ def print_cmd(self, cmd):
+ print(self.flags)
+ self.cur_time = self.tool.get_time_stamp()
+ print(f"RUNNING at {self.cur_time}:\n{cmd}")
+ logging.info(cmd)
+ print(self.flags)
+
+ # Execute, get output and don't know whether success or not
+ def exec_popen(self, cmd, isPrint=True):
+ if isPrint:
+ self.print_cmd(cmd)
+ output = os.popen(cmd).readlines()
+ return output
+
+ def get_duration(self):
+ time_1_struct = datetime.strptime(self.cur_time, "%Y-%m-%d %H:%M:%S")
+ time_2_struct = datetime.strptime(self.end_time, "%Y-%m-%d %H:%M:%S")
+ seconds = (time_2_struct - time_1_struct).seconds
+ return seconds
+
+ # Execute, get whether success or not
+ def exec_list(self, cmds):
+ cmd = self.join_cmd(cmds)
+ if not cmd.startswith('echo'):
+ self.print_cmd(cmd)
+ state = os.system(cmd)
+ self.end_time = self.tool.get_time_stamp()
+ print(f"total time used: {self.get_duration()}s")
+ logger.info(self.end_flag + cmd)
+ if state:
+ print(f"failed at {self.end_time}:{state}".upper())
+ return False
+ else:
+            print(f"successfully executed at {self.end_time}, congratulations!!!".upper())
+ return True
+
+ def exec_raw(self, rows):
+ return self.exec_list(self.tool.gen_list(rows))
\ No newline at end of file
diff --git a/src/jarvis.py b/src/jarvis.py
new file mode 100644
index 0000000000000000000000000000000000000000..5b03d64df3d1a3d54541cec285c74609fe490fbe
--- /dev/null
+++ b/src/jarvis.py
@@ -0,0 +1,105 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+import argparse
+
+from data import Data
+from analysis import Analysis
+
+class Jarvis:
+ def __init__(self):
+ self.analysis = Analysis()
+ # Argparser set
+        parser = argparse.ArgumentParser(description=f'please run me in the case directory; used to compile/clean/run/compare {Data.app_name}',
+ usage='%(prog)s [-h] [--build] [--clean] [...]')
+ parser.add_argument("-v","--version", help=f"get version info", action="store_true")
+ parser.add_argument("-use","--use", help="Switch config file...", nargs=1)
+ parser.add_argument("-i","--info", help=f"get machine info", action="store_true")
+ #accept software_name/version GCC/GCC+MPI/CLANG/CLANG+MPI
+ parser.add_argument("-install","--install", help=f"install dependency", nargs=2)
+ # dependency install
+ parser.add_argument("-dp","--depend", help=f"{Data.app_name} dependency install", action="store_true")
+ parser.add_argument("-e","--env", help=f"set environment {Data.app_name}", action="store_true")
+ parser.add_argument("-b","--build", help=f"compile {Data.app_name}", action="store_true")
+ parser.add_argument("-cls","--clean", help=f"clean {Data.app_name}", action="store_true")
+ parser.add_argument("-r","--run", help=f"run {Data.app_name}", action="store_true")
+ parser.add_argument("-p","--perf", help=f"auto perf {Data.app_name}", action="store_true")
+ parser.add_argument("-kp","--kperf", help=f"auto kperf {Data.app_name}", action="store_true")
+ # GPU perf
+ parser.add_argument("-gp","--gpuperf", help="GPU perf...", action="store_true")
+
+ # NCU perf
+ parser.add_argument("-ncu","--ncuperf", help="NCU perf...", nargs=1)
+ parser.add_argument("-c","--compare", help=f"compare {Data.app_name}", nargs=2)
+ # batch run
+ parser.add_argument("-rb","--rbatch", help=f"run batch {Data.app_name}", action="store_true")
+ # batch download
+ parser.add_argument("-d","--download", help="Batch Download...", action="store_true")
+ parser.add_argument("-net","--network", help="network checking...", action="store_true")
+ #change yum repo to aliyun
+ parser.add_argument("-yum","--yum", help="yum repo changing...", action="store_true")
+ # start benchmark test
+ parser.add_argument("-bench","--benchmark", help="start benchmark test...", nargs=1)
+ # start test
+ parser.add_argument("-t","--test", help="start Jarvis test...", action="store_true")
+ self.args = parser.parse_args()
+
+ def main(self):
+ if self.args.version:
+ print("V1.0")
+
+ if self.args.info:
+ self.analysis.get_machine_info()
+
+ if self.args.install:
+ self.analysis.install(self.args.install[0], self.args.install[1])
+
+ if self.args.env:
+ self.analysis.env()
+
+ if self.args.clean:
+ self.analysis.clean()
+
+ if self.args.build:
+ self.analysis.build()
+
+ if self.args.run:
+ self.analysis.run()
+
+ if self.args.perf:
+ self.analysis.perf()
+
+ if self.args.kperf:
+ self.analysis.kperf()
+
+ if self.args.depend:
+ self.analysis.install_deps()
+
+ if self.args.rbatch:
+ self.analysis.batch_run()
+
+ if self.args.download:
+ self.analysis.download()
+
+ if self.args.gpuperf:
+ self.analysis.gpu_perf()
+
+ if self.args.ncuperf:
+ self.analysis.ncu_perf(self.args.ncuperf[0])
+
+ if self.args.use:
+ self.analysis.switch_config(self.args.use[0])
+
+ if self.args.network:
+ self.analysis.check_network()
+
+ if self.args.yum:
+ self.analysis.change_yum_repo()
+
+ if self.args.benchmark:
+ self.analysis.bench(self.args.benchmark[0])
+
+ if self.args.test:
+ self.analysis.test()
+
+if __name__ == '__main__':
+ Jarvis().main()
diff --git a/src/machine.py b/src/machine.py
new file mode 100644
index 0000000000000000000000000000000000000000..7e40db494628a8e420859baa1ce883f3dc805c18
--- /dev/null
+++ b/src/machine.py
@@ -0,0 +1,27 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+from execute import Execute
+from tool import Tool
+
+class Machine:
+ def __init__(self):
+ self.exe = Execute()
+ self.tool = Tool()
+ self.info2cmd = {
+ 'CHECK network adapter':'nmcli d',
+ 'CHECK Machine Bits':'getconf LONG_BIT',
+ 'CHECK OS':'cat /proc/version && uname -a',
+ 'CHECK GPU': 'lspci | grep -i nvidia',
+ 'CHECK Total Memory':'cat /proc/meminfo | grep MemTotal',
+ 'CHECK Total Disk Memory':'fdisk -l | grep Disk',
+ 'CHECK CPU info': 'cat /proc/cpuinfo | grep "processor" | wc -l && lscpu && dmidecode -t 4'
+ }
+
+ def get_info(self, content, cmd):
+ self.tool.prt_content(content)
+ self.exe.exec_raw(cmd)
+
+ def output_machine_info(self):
+ print("get machine info")
+ for key, value in self.info2cmd.items():
+ self.get_info(key, value)
diff --git a/src/tool.py b/src/tool.py
new file mode 100644
index 0000000000000000000000000000000000000000..7cd3b62251641058efc57e3732ec8ff07144bed0
--- /dev/null
+++ b/src/tool.py
@@ -0,0 +1,36 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+import time
+import os
+
+class Tool:
+ def __init__(self):
+ pass
+
+ def prt_content(self, content):
+ flags = '*' * 30
+ print(f"{flags}{content}{flags}")
+
+ def gen_list(self, data):
+ return data.strip().split('\n')
+
+ def get_time_stamp(self):
+ return time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
+
+ def read_file(self, filename):
+ content = ''
+ with open(filename, encoding='utf-8') as f:
+ content = f.read().strip()
+ return content
+
+ def write_file(self, filename, content=""):
+ with open(filename,'w') as f:
+ f.write(content)
+
+ def mkdirs(self, path):
+ if not os.path.exists(path):
+ os.makedirs(path)
+
+ def mkfile(self, path, content=''):
+ if not os.path.exists(path):
+ self.write_file(path, content)
diff --git a/templates/data.CP2K.X86.config b/templates/CP2K/8.2/data.CP2K.X86.cpu.config
similarity index 100%
rename from templates/data.CP2K.X86.config
rename to templates/CP2K/8.2/data.CP2K.X86.cpu.config
diff --git a/templates/CP2K/8.2/data.CP2K.arm.cpu.config b/templates/CP2K/8.2/data.CP2K.arm.cpu.config
new file mode 100644
index 0000000000000000000000000000000000000000..66c49b6a3a1472f2d74f65c963d5dd3dfa1eab43
--- /dev/null
+++ b/templates/CP2K/8.2/data.CP2K.arm.cpu.config
@@ -0,0 +1,66 @@
+[SERVER]
+11.11.11.11
+
+[ENV]
+source /home/kpgcc-ompi.env
+export LIBRARY_PATH=/home/cp2k/EXTRA/gsl/lib:$LIBRARY_PATH
+export LD_LIBRARY_PATH=/home/cp2k/EXTRA/gsl/lib:$LD_LIBRARY_PATH
+export CPATH=/usr/local/cuda/include:$CPATH
+
+[APP]
+app_name = CP2K
+build_dir = /home/cp2k/CP2K/cp2k-8.2/
+binary_dir = /home/cp2k/CP2K/cp2k-8.2/exe/local-cpu/
+case_dir = /home/cp2k/CP2K/cp2k-8.2/benchmarks/QS/
+
+[BUILD]
+make -j 128 ARCH=local-cpu VERSION=psmp
+
+[CLEAN]
+make -j 128 ARCH=local-cpu VERSION=psmp clean
+
+[RUN]
+run = numactl -C 0-63 mpirun --allow-run-as-root -np 64 -map-by ppr:64:node:pe=1 -bind-to core -x OMP_NUM_THREADS=1
+binary = cp2k.psmp H2O-256.inp
+nodes = 1
+
+[BATCH]
+#!/bin/bash
+
+logfile=cp2k.H2O-256.inp.log
+
+nvidia-smi -pm 1
+nvidia-smi -ac 1215,1410
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 32C*GPU===" >> $logfile
+mpirun -np 32 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 32C*2GPU===" >> $logfile
+mpirun -np 32 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0,1 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 64C*GPU===" >> $logfile
+mpirun -np 64 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 64C*2GPU===" >> $logfile
+mpirun -np 32 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0,1 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 128C*GPU===" >> $logfile
+mpirun -np 128 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 128C*2GPU===" >> $logfile
+mpirun -np 128 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0,1 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
diff --git a/templates/CP2K/8.2/data.CP2K.arm.gpu.config b/templates/CP2K/8.2/data.CP2K.arm.gpu.config
new file mode 100644
index 0000000000000000000000000000000000000000..2012254a25a02c42d0f5972fed052c8eeccff1fe
--- /dev/null
+++ b/templates/CP2K/8.2/data.CP2K.arm.gpu.config
@@ -0,0 +1,98 @@
+[SERVER]
+11.11.11.11
+
+[DOWNLOAD]
+libint/2.6.0 https://github.com/evaleev/libint/archive/v2.6.0.tar.gz
+libXC/5.1.4 https://www.cp2k.org/static/downloads/libxc-5.1.4.tar.gz
+fftw/3.3.8 https://www.cp2k.org/static/downloads/fftw-3.3.8.tar.gz
+lapack/3.8.0 https://www.cp2k.org/static/downloads/lapack-3.8.0.tgz
+scalapack/2.1.0 https://www.cp2k.org/static/downloads/scalapack-2.1.0.tgz
+cmake/3.16.4 https://cmake.org/files/v3.16/cmake-3.16.4.tar.gz
+
+[DEPENDENCY]
+./jarvis -install kgcc/9.3.1 com
+module purge
+module use ./software/modulefiles
+module load kgcc9/9.3.1
+export CC=`which gcc`
+export CXX=`which g++`
+export FC=`which gfortran`
+./jarvis -install openmpi/4.1.2 gcc
+module load openmpi4/4.1.2
+./jarvis -install gmp/6.2.0 gcc
+./jarvis -install boost/1.72.0 gcc
+./jarvis -install libint/2.6.0 gcc+mpi
+./jarvis -install fftw/3.3.8 gcc+mpi
+./jarvis -install openblas/0.3.18 gcc
+module load openblas/0.3.18
+./jarvis -install scalapack/2.1.0 gcc+mpi
+./jarvis -install spglib/1.16.0 gcc
+./jarvis -install libxc/5.1.4 gcc
+./jarvis -install gsl/2.6 gcc
+module load gsl/2.6
+./jarvis -install plumed/2.6.2 gcc+mpi
+./jarvis -install libvori/21.04.12 gcc
+
+[ENV]
+module purge
+module load kgcc9/9.3.1
+module load openmpi4/4.1.2
+module load gsl/2.6
+
+[APP]
+app_name = CP2K
+build_dir = /home/HT3/HPCRunner2/cp2k-8.2/
+binary_dir = /home/HT3/HPCRunner2/cp2k-8.2/exe/local-cuda/
+case_dir = /home/HT3/HPCRunner2/cp2k-8.2/benchmarks/QS/
+
+[BUILD]
+make -j 128 ARCH=local-cuda VERSION=psmp
+
+[CLEAN]
+make -j 128 ARCH=local-cuda VERSION=psmp clean
+
+[RUN]
+run = numactl -C 0-63 mpirun --allow-run-as-root -x CUDA_VISIBLE_DEVICES=0,1 -np 64 -x OMP_NUM_THREADS=1
+binary = cp2k.psmp H2O-256.inp
+nodes = 1
+
+[BATCH]
+#!/bin/bash
+
+logfile=cp2k.H2O-256.inp.log
+
+nvidia-smi -pm 1
+nvidia-smi -ac 1215,1410
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 32C*GPU===" >> $logfile
+mpirun -np 32 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 32C*2GPU===" >> $logfile
+mpirun -np 32 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0,1 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 64C*GPU===" >> $logfile
+mpirun -np 64 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 64C*2GPU===" >> $logfile
+mpirun -np 32 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0,1 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 128C*GPU===" >> $logfile
+mpirun -np 128 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
+echo 3 > /proc/sys/vm/drop_caches
+echo "===run 128C*2GPU===" >> $logfile
+mpirun -np 128 -genv OMP_NUM_THREADS=1 -genv CUDA_VISIBLE_DEVICES=0,1 exe/local-cuda/cp2k.psmp benchmarks/QS/H2O-256.inp >> $logfile 2>&1
+
diff --git a/templates/data.amber.config b/templates/amber/20/data.amber.arm.gpu.config
similarity index 100%
rename from templates/data.amber.config
rename to templates/amber/20/data.amber.arm.gpu.config
diff --git a/templates/data.openfoam.config b/templates/openfoam/1960/data.openfoam.arm.cpu.config
similarity index 100%
rename from templates/data.openfoam.config
rename to templates/openfoam/1960/data.openfoam.arm.cpu.config
diff --git a/templates/openfoam/1960/data.openfoam.arm.cpu.opt.config b/templates/openfoam/1960/data.openfoam.arm.cpu.opt.config
new file mode 100644
index 0000000000000000000000000000000000000000..25abc6f8b8218dcdc6c328106aafbc4a3a062d70
--- /dev/null
+++ b/templates/openfoam/1960/data.openfoam.arm.cpu.opt.config
@@ -0,0 +1,34 @@
+[SERVER]
+11.11.11.11
+
+[DEPENDENCY]
+./jarvis -install bisheng/2.1.0 com
+module use ./software/modulefiles
+module load bisheng2
+./jarvis -install hmpi/1.1.1 clang
+module load hmpi1/1.1.1
+
+[ENV]
+# add compiler/MPI environment
+source /home/Jarvis3-4/HPCRunner/opt-OpenFOAM/opt_codes/OpenFOAM-v1906/etc/bashrc
+module use ./software/modulefiles
+module load bisheng2
+module load hmpi1/1.1.1
+
+[APP]
+app_name = OpenFOAM
+build_dir = /home/Jarvis3-4/HPCRunner/opt-OpenFOAM/opt_codes/OpenFOAM-v1906/
+binary_dir =
+case_dir = /home/Jarvis3-4/HPCRunner/case/openfoam/audi/
+
+[BUILD]
+source /home/Jarvis3-4/HPCRunner/opt-OpenFOAM/opt_codes/OpenFOAM-v1906/etc/bashrc
+./Allwmake -j 64
+
+[CLEAN]
+rm -rf build
+
+[RUN]
+run = mpirun --allow-run-as-root -x PATH -x LD_LIBRARY_PATH -x WM_PROJECT_DIR -x WM_PROJECT_USER_DIR -np 128
+binary = pisoFoam -parallel 2
+nodes = 1
diff --git a/templates/qe/6.4/data.qe.test.config b/templates/qe/6.4/data.qe.test.config
new file mode 100644
index 0000000000000000000000000000000000000000..59254e0e8b21c6888b0d364119129e0c3721e7a5
--- /dev/null
+++ b/templates/qe/6.4/data.qe.test.config
@@ -0,0 +1,40 @@
+[SERVER]
+11.11.11.11
+
+[DEPENDENCY]
+./jarvis -install kgcc/9.3.1 com
+module purge
+module use ./software/modulefiles
+module load kgcc9/9.3.1
+export CC=`which gcc`
+export CXX=`which g++`
+export FC=`which gfortran`
+./jarvis -install openmpi/4.1.2/ gcc
+module load openmpi4/4.1.2
+#test if mpi is normal
+./jarvis -bench mpi
+
+[ENV]
+module purge
+module use ./software/modulefiles
+module load kgcc9
+module load openmpi4/4.1.2
+
+[APP]
+app_name = QE
+build_dir = /tmp/q-e-qe-6.4.1/
+binary_dir = /tmp/q-e-qe-6.4.1/bin/
+case_dir = /tmp/qe-test
+
+[BUILD]
+./configure F90=gfortran F77=gfortran MPIF90=mpifort MPIF77=mpifort CC=mpicc FCFLAGS="-O3" CFLAGS="-O3" --with-scalapack=no --enable-openmp
+make -j 96 pwall
+make install
+
+[CLEAN]
+make clean
+
+[RUN]
+run = mpirun --allow-run-as-root -x OMP_NUM_THREADS=1 -mca coll ^hcoll -mca btl ^vader,tcp,openib,uct -np 128
+binary = pw.x -input test_3.in
+nodes = 1
\ No newline at end of file
diff --git a/templates/qe/6.4/data.qe.test.opt.config b/templates/qe/6.4/data.qe.test.opt.config
new file mode 100644
index 0000000000000000000000000000000000000000..4b6d44762ffb73fffe94ead7fdb2d2ebd675a534
--- /dev/null
+++ b/templates/qe/6.4/data.qe.test.opt.config
@@ -0,0 +1,46 @@
+[SERVER]
+11.11.11.11
+
+[DEPENDENCY]
+./jarvis -install bisheng/2.1.0 com
+module purge
+module use ./software/modulefiles
+module load bisheng2/2.1.0
+export CC=`which clang`
+export CXX=`which clang++`
+export FC=`which flang`
+./jarvis -install hmpi/1.1.1 bisheng
+module load hmpi1/1.1.1
+./jarvis -bench mpi
+./jarvis -install kml/1.4.0/bisheng bisheng
+
+[ENV]
+source /etc/profile
+module purge
+module use ./software/modulefiles
+module load bisheng2/2.1.0
+export CC=`which clang`
+export CXX=`which clang++`
+export FC=`which flang`
+module load hmpi1/1.1.1
+export BLAS_LIBS="-L/usr/local/kml/lib/kblas/omp -lkblas"
+export LAPACK_LIBS="-L/usr/local/kml/lib/ -lklapack_full"
+
+[APP]
+app_name = QE
+build_dir = /tmp/q-e-qe-6.4.1/
+binary_dir = /tmp/q-e-qe-6.4.1/bin/
+case_dir = /tmp/qe-test/
+
+[BUILD]
+./configure F90=flang F77=flang MPIF90=mpifort MPIF77=mpifort CC=mpicc FCFLAGS="-O3" CFLAGS="-O3" --with-scalapack=no --enable-openmp
+make -j 96 pwall
+make install
+
+[CLEAN]
+make clean
+
+[RUN]
+run = mpirun --allow-run-as-root -x OMP_NUM_THREADS=1 -np 128
+binary = pw.x -input test_3.in
+nodes = 1
diff --git a/templates/qe/6.4/qe.block.opt.config b/templates/qe/6.4/qe.block.opt.config
new file mode 100644
index 0000000000000000000000000000000000000000..6eee58f4acc01efffaa31eae0e0c6ea730e492d9
--- /dev/null
+++ b/templates/qe/6.4/qe.block.opt.config
@@ -0,0 +1,56 @@
+[SERVER]
+11.11.11.11
+
+[DEPENDENCY]
+./jarvis -install bisheng/2.1.0 com
+module purge
+module use ./software/modulefiles
+module load bisheng2/2.1.0
+export CC=`which clang`
+export CXX=`which clang++`
+export FC=`which flang`
+./jarvis -install hmpi/1.1.1 bisheng
+module load hmpi1/1.1.1
+./jarvis -install cmake/3.20.5 bisheng
+module load cmake/3.20.5
+./jarvis -install kml/1.4.0/bisheng bisheng
+./jarvis -install scalapack/2.1.0/kml bisheng
+./jarvis -install fftw/3.3.10 bisheng
+module load fftw/3.3.10 scalapack/2.1.0 cmake/3.20.5
+#modify the CMakeLists.txt of fortran_single: lines 10, 74 and 75
+./jarvis -install block-davidson/3.14 bisheng
+module load block-davidson/3.14
+
+[ENV]
+source /etc/profile
+module purge
+module use ./software/modulefiles
+module load bisheng2/2.1.0
+export CC=`which clang`
+export CXX=`which clang++`
+export FC=`which flang`
+module load hmpi1/1.1.1
+module load fftw/3.3.10 scalapack/2.1.0 block-davidson/3.14
+export BLAS_LIBS="-L/usr/local/kml/lib/kblas/omp -lkblas"
+export LAPACK_LIBS="-L/usr/local/kml/lib -lklapack_full"
+export SCALAPACK_LIBS="-L/home/fang/HT1/HPCRunner-master/software/libs/bisheng2/scalapack/2.1.0/lib/ -lscalapack"
+
+[APP]
+app_name = QE
+build_dir = /home/fang/HT1/HPCRunner-master/q-e-qe-6.4.1/
+binary_dir = /home/fang/HT1/HPCRunner-master/q-e-qe-6.4.1/bin
+case_dir = /home/fang/HT1/HPCRunner-master/workload/QE/GRIR443/
+
+[BUILD]
+# add tunning/QE/6.4/q-e-6.4.blockmesh.patch here
+./configure F90=flang F77=flang MPIF90=mpifort MPIF77=mpifort CC=mpicc FCFLAGS="-O3" CFLAGS="-O3" --with-scalapack=yes --enable-openmp
+make -j 96 pw
+make install
+
+[CLEAN]
+make clean
+
+[RUN]
+run = mpirun --allow-run-as-root -x OMP_NUM_THREADS=1 -np 128
+binary = pw.x -input grir443.in
+nodes = 1
diff --git a/templates/qe/6.5/data.qe.X86.cpu.config b/templates/qe/6.5/data.qe.X86.cpu.config
new file mode 100644
index 0000000000000000000000000000000000000000..22bf2f482097b971310769a8d46204d8ff20f88b
--- /dev/null
+++ b/templates/qe/6.5/data.qe.X86.cpu.config
@@ -0,0 +1,29 @@
+[SERVER]
+11.11.11.11
+
+[ENV]
+#add oneapi(include icc/mpi)
+source /workspace/cc/env/intel2021.4/setvars.sh
+# add cmake
+module use ./modules
+module add icc/cmake
+export LAPACK_LIBS="$MKLROOT/lib/intel64/libmkl_intel_lp64.a $MKLROOT/lib/intel64/libmkl_core.a"
+export BLAS_LIBS="$MKLROOT/lib/intel64/libmkl_sequential.a $MKLROOT/lib/intel64/libmkl_blacs_intelmpi_lp64.a -Wl,--end-group"
+
+[APP]
+app_name = QE
+build_dir = /home/csouser/HPCRunner/q-e-qe-6.5/
+binary_dir = /home/csouser/HPCRunner/q-e-qe-6.5/bin/
+case_dir = /home/csouser/HPCRunner/qe_large/
+
+[BUILD]
+./configure F90=ifort F77=ifort MPIF90=mpiifort MPIF77=mpiifort CC=mpicc FCFLAGS="-O3" CFLAGS="-O3" --with-scalapack=no
+make -j 40 pwall install
+
+[CLEAN]
+make clean
+
+[RUN]
+run = mpirun -n 40
+binary = pw.x -nk 8 -input scf.in
+nodes = 1
\ No newline at end of file
diff --git a/templates/qe/6.5/data.qe.arm.cpu.config b/templates/qe/6.5/data.qe.arm.cpu.config
new file mode 100644
index 0000000000000000000000000000000000000000..918aabcbd06b381d96b1fb623b4d88e85a4cd22f
--- /dev/null
+++ b/templates/qe/6.5/data.qe.arm.cpu.config
@@ -0,0 +1,29 @@
+[SERVER]
+11.11.11.11
+
+[ENV]
+source /etc/profile
+module use /opt/modulefile/
+module load gcc-9.3.1
+module load openmpi-4.1.1
+export BLAS_LIBS="-L/usr/local/kml/lib/kblas/omp -lkblas"
+export LAPACK_LIBS="-L/usr/local/kml/lib/ -lklapack_full"
+
+[APP]
+app_name = QE
+build_dir = /home/Jarvis3-4/HPCRunner/q-e-qe-6.5/
+binary_dir = /home/Jarvis3-4/HPCRunner/q-e-qe-6.5/bin/
+case_dir = /home/Jarvis3-4/HPCRunner/workload/QE/qe-large/
+
+[BUILD]
+./configure F90=gfortran F77=gfortran MPIF90=mpifort MPIF77=mpifort CC=mpicc FCFLAGS="-O3" CFLAGS="-O3" --with-scalapack=no --enable-openmp
+make -j 96 pwall
+make install
+
+[CLEAN]
+make clean
+
+[RUN]
+run = mpirun --allow-run-as-root -mca btl ^vader,tcp,openib,uct -np 128
+binary = pw.x -nk 8 -input scf.in
+nodes = 1
\ No newline at end of file
diff --git a/templates/qe/6.5/data.qe.arm.cpu.opt.config b/templates/qe/6.5/data.qe.arm.cpu.opt.config
new file mode 100644
index 0000000000000000000000000000000000000000..bd5d524380d8fea1da70a70161c6603573aa2e95
--- /dev/null
+++ b/templates/qe/6.5/data.qe.arm.cpu.opt.config
@@ -0,0 +1,46 @@
+[SERVER]
+11.11.11.11
+
+[DEPENDENCY]
+./jarvis -install bisheng/2.1.0 com
+module purge
+module use ./software/modulefiles
+module load bisheng2/2.1.0
+export CC=`which clang`
+export CXX=`which clang++`
+export FC=`which flang`
+./jarvis -install hmpi/1.1.1 bisheng
+module load hmpi1/1.1.1
+./jarvis -install kml/1.4.0/bisheng bisheng
+
+[ENV]
+source /etc/profile
+module purge
+module use ./software/modulefiles
+module load bisheng2/2.1.0
+export CC=`which clang`
+export CXX=`which clang++`
+export FC=`which flang`
+module load hmpi1/1.1.1
+export BLAS_LIBS="-L/usr/local/kml/lib/kblas/omp -lkblas"
+export LAPACK_LIBS="-L/usr/local/kml/lib/ -lklapack_full"
+
+[APP]
+app_name = QE
+build_dir = /tmp/q-e-qe-6.5/
+binary_dir = /tmp/q-e-qe-6.5/bin/
+case_dir = /tmp/qe-test/
+
+[BUILD]
+./configure F90=flang F77=flang MPIF90=mpifort MPIF77=mpifort CC=mpicc FCFLAGS="-O3" CFLAGS="-O3" --with-scalapack=no --enable-openmp
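+# configure may still write gfortran into make.inc; force flang so the whole build uses the bisheng toolchain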
+sed -i "s/gfortran/flang/g" make.inc
+make -j 96 pwall
+make install
+
+[CLEAN]
+make clean
+
+[RUN]
+run = mpirun --allow-run-as-root -x OMP_NUM_THREADS=1 -np 128
+binary = pw.x -input test_3.in
+nodes = 1
diff --git a/templates/qe/6.8/data.qe.arm.cpu.config b/templates/qe/6.8/data.qe.arm.cpu.config
new file mode 100644
index 0000000000000000000000000000000000000000..bbe0749e04a9b100d0969a80344fb49321f8b24d
--- /dev/null
+++ b/templates/qe/6.8/data.qe.arm.cpu.config
@@ -0,0 +1,37 @@
+[SERVER]
+11.11.11.11
+
+[DEPENDENCY]
+./jarvis -install kgcc/9.3.1 com
+module use ./software/modulefiles
+module load kgcc9
+./jarvis -install hmpi/1.1.0/gcc gcc
+module load hmpi1/1.1.0
+./jarvis -install kml/1.4.0/gcc gcc
+
+[ENV]
+source /etc/profile
+module use ./software/modulefiles
+module load kgcc9
+module load hmpi1/1.1.0
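+# assumption: KML (installed above via ./jarvis -install kml/1.4.0/gcc) provides kblas/klapack under /usr/local/kml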
+export BLAS_LIBS="-L/usr/local/kml/lib/kblas/omp -lkblas"
+export LAPACK_LIBS="-L/usr/local/kml/lib/ -lklapack_full"
+
+[APP]
+app_name = QE
+build_dir = /tmp/q-e-qe-6.8/
+binary_dir = /tmp/q-e-qe-6.8/bin/
+case_dir = /tmp/qe-large/
+
+[BUILD]
+./configure F90=gfortran F77=gfortran MPIF90=mpifort MPIF77=mpifort CC=mpicc FCFLAGS="-O3" CFLAGS="-O3" --with-scalapack=no --enable-openmp
+make -j 96 pwall
+make install
+
+[CLEAN]
+make clean
+
+[RUN]
+run = mpirun --allow-run-as-root -mca btl ^vader,tcp,openib,uct -np 128
+binary = pw.x -nk 8 -input scf.in
+nodes = 1
\ No newline at end of file
diff --git a/templates/data.qe.gpu.config b/templates/qe/6.8/data.qe.arm.gpu.config
similarity index 97%
rename from templates/data.qe.gpu.config
rename to templates/qe/6.8/data.qe.arm.gpu.config
index 5d00bfe8af6fbecfb104583c9c344d8c8051f4e5..60b78c182f58ef19bb66d11afff73d970b41f476 100644
--- a/templates/data.qe.gpu.config
+++ b/templates/qe/6.8/data.qe.arm.gpu.config
@@ -21,7 +21,7 @@ module load nvhpc/21.9
app_name = QE
build_dir = /home/HPCRunner-master/q-e-qe-6.8/
binary_dir = /home/HPCRunner-master/q-e-qe-6.8/bin/
-case_dir = /home/HPCRunner-master/jiancong/
+case_dir = /home/HPCRunner-master/qe-large/
[BUILD]
./configure --with-cuda=yes --with-cuda-runtime=11.4 --with-cuda-cc=80 --enable-openmp --with-scalapack=no
diff --git a/templates/data.vasp.config b/templates/vasp/5.4.4/data.vasp.arm.cpu.config
similarity index 100%
rename from templates/data.vasp.config
rename to templates/vasp/5.4.4/data.vasp.arm.cpu.config
diff --git a/templates/data.vasp6.1.gpu.x86.config b/templates/vasp/6.1.0/data.vasp.x86.gpu.config
similarity index 100%
rename from templates/data.vasp6.1.gpu.x86.config
rename to templates/vasp/6.1.0/data.vasp.x86.gpu.config
diff --git a/templates/yum/aliyun-Centos-7.repo b/templates/yum/aliyun-Centos-7.repo
new file mode 100644
index 0000000000000000000000000000000000000000..df18245ddb57fed48bf1dee61c24d0159d054312
--- /dev/null
+++ b/templates/yum/aliyun-Centos-7.repo
@@ -0,0 +1,62 @@
+# CentOS-Base.repo
+#
+# The mirror system uses the connecting IP address of the client and the
+# update status of each mirror to pick mirrors that are updated to and
+# geographically close to the client. You should use this for CentOS updates
+# unless you are manually picking other mirrors.
+#
+# If the mirrorlist= does not work for you, as a fall back you can try the
+# remarked out baseurl= line instead.
+#
+#
+
+[base]
+name=CentOS-$releasever - Base - mirrors.aliyun.com
+failovermethod=priority
+baseurl=http://mirrors.aliyun.com/centos/$releasever/os/$basearch/
+ http://mirrors.aliyuncs.com/centos/$releasever/os/$basearch/
+ http://mirrors.cloud.aliyuncs.com/centos/$releasever/os/$basearch/
+gpgcheck=1
+gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
+
+#released updates
+[updates]
+name=CentOS-$releasever - Updates - mirrors.aliyun.com
+failovermethod=priority
+baseurl=http://mirrors.aliyun.com/centos/$releasever/updates/$basearch/
+ http://mirrors.aliyuncs.com/centos/$releasever/updates/$basearch/
+ http://mirrors.cloud.aliyuncs.com/centos/$releasever/updates/$basearch/
+gpgcheck=1
+gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
+
+#additional packages that may be useful
+[extras]
+name=CentOS-$releasever - Extras - mirrors.aliyun.com
+failovermethod=priority
+baseurl=http://mirrors.aliyun.com/centos/$releasever/extras/$basearch/
+ http://mirrors.aliyuncs.com/centos/$releasever/extras/$basearch/
+ http://mirrors.cloud.aliyuncs.com/centos/$releasever/extras/$basearch/
+gpgcheck=1
+gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
+
+#additional packages that extend functionality of existing packages
+[centosplus]
+name=CentOS-$releasever - Plus - mirrors.aliyun.com
+failovermethod=priority
+baseurl=http://mirrors.aliyun.com/centos/$releasever/centosplus/$basearch/
+ http://mirrors.aliyuncs.com/centos/$releasever/centosplus/$basearch/
+ http://mirrors.cloud.aliyuncs.com/centos/$releasever/centosplus/$basearch/
+gpgcheck=1
+enabled=0
+gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
+
+#contrib - packages by Centos Users
+[contrib]
+name=CentOS-$releasever - Contrib - mirrors.aliyun.com
+failovermethod=priority
+baseurl=http://mirrors.aliyun.com/centos/$releasever/contrib/$basearch/
+ http://mirrors.aliyuncs.com/centos/$releasever/contrib/$basearch/
+ http://mirrors.cloud.aliyuncs.com/centos/$releasever/contrib/$basearch/
+gpgcheck=1
+enabled=0
+gpgkey=http://mirrors.aliyun.com/centos/RPM-GPG-KEY-CentOS-7
diff --git a/templates/yum/hw-Centos-7.repo b/templates/yum/hw-Centos-7.repo
new file mode 100644
index 0000000000000000000000000000000000000000..4e43bbc6094d09ab211147221db6756ab854c370
--- /dev/null
+++ b/templates/yum/hw-Centos-7.repo
@@ -0,0 +1,43 @@
+# CentOS-Base.repo
+#
+# The mirror system uses the connecting IP address of the client and the
+# update status of each mirror to pick mirrors that are updated to and
+# geographically close to the client. You should use this for CentOS updates
+# unless you are manually picking other mirrors.
+#
+# If the mirrorlist= does not work for you, as a fall back you can try the
+# remarked out baseurl= line instead.
+#
+#
+
+[base]
+name=CentOS-$releasever - Base
+#mirrorlist=http://mirrors.tools.huawei.com/?release=$releasever&arch=$basearch&repo=os
+baseurl=http://mirrors.tools.huawei.com/centos/$releasever/os/$basearch/
+gpgcheck=1
+gpgkey=http://mirrors.tools.huawei.com/centos/RPM-GPG-KEY-CentOS-7
+
+#released updates
+[updates]
+name=CentOS-$releasever - Updates
+# mirrorlist=http://mirrors.tools.huawei.com/?release=$releasever&arch=$basearch&repo=updates
+baseurl=http://mirrors.tools.huawei.com/centos/$releasever/updates/$basearch/
+gpgcheck=1
+gpgkey=http://mirrors.tools.huawei.com/centos/RPM-GPG-KEY-CentOS-7
+
+#additional packages that may be useful
+[extras]
+name=CentOS-$releasever - Extras
+# mirrorlist=http://mirrors.tools.huawei.com/?release=$releasever&arch=$basearch&repo=extras
+baseurl=http://mirrors.tools.huawei.com/centos/$releasever/extras/$basearch/
+gpgcheck=1
+gpgkey=http://mirrors.tools.huawei.com/centos/RPM-GPG-KEY-CentOS-7
+
+#additional packages that extend functionality of existing packages
+[centosplus]
+name=CentOS-$releasever - Plus
+# mirrorlist=http://mirrors.tools.huawei.com/?release=$releasever&arch=$basearch&repo=centosplus
+baseurl=http://mirrors.tools.huawei.com/centos/$releasever/centosplus/$basearch/
+gpgcheck=1
+enabled=0
+gpgkey=http://mirrors.tools.huawei.com/centos/RPM-GPG-KEY-CentOS-7
\ No newline at end of file
diff --git a/templates/yum/kylin_aarch64.repo b/templates/yum/kylin_aarch64.repo
new file mode 100644
index 0000000000000000000000000000000000000000..e298fcb2586baa607ad7b4121618c69a11e76180
--- /dev/null
+++ b/templates/yum/kylin_aarch64.repo
@@ -0,0 +1,22 @@
+###Kylin Linux Advanced Server 10 - os repo###
+
+[ks10-adv-os]
+name = Kylin Linux Advanced Server 10 - Os
+baseurl = http://update.cs2c.com.cn:8080/NS/V10/V10SP2/os/adv/lic/base/$basearch/
+gpgcheck = 1
+gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-kylin
+enabled = 1
+
+[ks10-adv-updates]
+name = Kylin Linux Advanced Server 10 - Updates
+baseurl = http://update.cs2c.com.cn:8080/NS/V10/V10SP2/os/adv/lic/updates/$basearch/
+gpgcheck = 1
+gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-kylin
+enabled = 1
+
+[ks10-adv-addons]
+name = Kylin Linux Advanced Server 10 - Addons
+baseurl = http://update.cs2c.com.cn:8080/NS/V10/V10SP2/os/adv/lic/addons/$basearch/
+gpgcheck = 1
+gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-kylin
+enabled = 0
diff --git a/test/test-qe-opt.sh b/test/test-qe-opt.sh
new file mode 100644
index 0000000000000000000000000000000000000000..1ae031bc47bcb8f0eda191e6c95d71044da3fe8c
--- /dev/null
+++ b/test/test-qe-opt.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+# back to root
+cd ..
+# extract qe source code
+rm -rf /tmp/q-e-qe-6.4.1
+tar xzvf ./downloads/q-e-qe-6.4.1.tar.gz -C /tmp/
+# copy workload
+cp -rf ./workload/QE/qe-test /tmp
+# copy templates
+cp -rf ./templates/qe/6.4/data.qe.test.opt.config ./
+# switch to config
+./jarvis -use data.qe.test.opt.config
+# install dependency
+./jarvis -dp
+# generate environment
+./jarvis -e
+# environment setup
+source env.sh
+# build
+./jarvis -b
+# run
+./jarvis -r
+# perf
+./jarvis -p
+# kperf
+./jarvis -kp
+# gpu nsysperf
+./jarvis -gp
\ No newline at end of file
diff --git a/test/test-qe.sh b/test/test-qe.sh
new file mode 100644
index 0000000000000000000000000000000000000000..0248590ef415193ab7fed8e01f0dd500ee7252c4
--- /dev/null
+++ b/test/test-qe.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+# back to root
+cd ..
+# extract qe source code
+tar xzvf ./downloads/q-e-qe-6.4.1.tar.gz -C /tmp/
+# copy workload
+cp -rf ./workload/QE/qe-test /tmp
+# copy templates
+cp -rf ./templates/qe/6.4/data.qe.test.config ./
+# switch to config
+./jarvis -use data.qe.test.config
+# install dependency
+./jarvis -dp
+# generate environment
+./jarvis -e
+# environment setup
+source env.sh
+# build
+./jarvis -b
+# run
+./jarvis -r
+# perf
+./jarvis -p
+# kperf
+./jarvis -kp
+# gpu nsysperf
+./jarvis -gp
\ No newline at end of file
diff --git a/test/test-util.sh b/test/test-util.sh
new file mode 100644
index 0000000000000000000000000000000000000000..210a8fb5e941adb78d36bbfeee6f91593667357f
--- /dev/null
+++ b/test/test-util.sh
@@ -0,0 +1,8 @@
+#!/bin/bash
+cd ..
+# check machine info
+./jarvis -i
+# gpu nsysperf
+./jarvis -gp
+# benchmark
+./jarvis -bench all
\ No newline at end of file
diff --git a/workloads/ReadMe.md b/workloads/ReadMe.md
new file mode 100644
index 0000000000000000000000000000000000000000..5f19fe69f65129bc1d1c90d9002ebfc99f97ab3f
--- /dev/null
+++ b/workloads/ReadMe.md
@@ -0,0 +1 @@
+Small-scale test cases for common HPC applications, typically under 1 MB each.
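+For example (see test/test-qe.sh): workload/QE/qe-test, the small QE input set copied to /tmp for the smoke test.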
\ No newline at end of file