# Google Play Store Data Analysis with Spark

**Repository Path**: xiaojiang999/sparkDemo

## Basic Information

- **Project Name**: Google Play Store Data Analysis with Spark (基于spark的谷歌应用商店数据分析)
- **Description**: Uses Hadoop HDFS on Linux to store the raw and result data, Spark to clean and analyze the data, and Python libraries in JupyterLab for visualization
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2024-07-17
- **Last Updated**: 2024-10-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: Python, Spark, Linux, hadoop, Jupyter-notebook

## README

**Google Play Store Data Analysis with Spark**

I. Experiment Environment

Data processing and analysis: Ubuntu, Hadoop, Spark, Scala

Visualization: Python (pyecharts, numpy, pandas, snapshot-selenium)

Notes: JupyterLab (development environment), selenium (4.0 or later)

- pyecharts: creates charts and data visualizations
- numpy: array handling, data processing, and numerical computation
- pandas: data manipulation and analysis
- snapshot-selenium: generates snapshots of web pages
- JupyterLab: provides an integrated development environment
- selenium: browser automation; the project uses version 4.0 or later to avoid ChromeDriver version mismatches

II. Dataset

A public dataset obtained from the Google Play Store: data.csv

![Image description](https://foruda.gitee.com/images/1724946528838771227/421cdadd_11604426.png "01.png")

III. Data Preprocessing

Preprocessing removes duplicate entries, handles abnormal values, and drops unnecessary fields.

First, upload data.csv to HDFS, the Hadoop distributed file system:

`hdfs dfs -put data.csv`

![Image description](https://foruda.gitee.com/images/1724946579828193896/059504bd_11604426.png "02.png")

Then start Spark with `spark-shell` and preprocess the data.csv stored in HDFS:

- Read the CSV file in spark-shell: `var raw = spark.read.option("header", true).csv("data.csv")`
- Remove duplicate rows: `println("total rows: " + raw.count() +", distinct rows: " + raw.distinct().count())`, then `raw = raw.distinct()`
- Remove rows whose App field is duplicated: `raw.groupBy("App").count.filter("count > 1").count()`, then `var distinct = raw.dropDuplicates("App")`
- Drop unnecessary fields: `var df = distinct.drop("Size", "Last Updated", "Current Ver", "Genres", "Android Ver")`

Handle abnormal values and convert data types field by field:

- *Category*: `df.select("Category").distinct().collectAsList()`, then `df = df.filter(not(col("Category").contains("1.9")))`
- *Rating*: `df.select("Rating").distinct().collectAsList()`, `df.groupBy("Rating").count().sort($"Rating".desc).show()`, then `df = df.filter(col("Rating").contains(".") ||
col("Rating") === "NaN").withColumn("Rating", col("Rating").cast("float"))`
- *Reviews*: `df.filter(col("Reviews").isNull || col("Reviews") === "").count()`, `df.filter(col("Reviews").rlike("[^0-9]{1,}")).count()`, then `df = df.withColumn("Reviews", col("Reviews").cast("int"))`
- *Installs*: `df = df.withColumn("Installs", regexp_replace(col("Installs"), "[^0-9]", "")).withColumn("Installs", col("Installs").cast("int"))`
- *Type*: `df = df.filter(not(col("Type") === "NaN"))`
- *Price*: `df = df.withColumn("Price", regexp_replace(col("Price"), "[$]", "").cast("float"))`

IV. Data Analysis

1. Group the apps by category, then count each group to get the number of apps per category (category_count.csv):

   `df.groupBy("Category").count().sort(col("count").desc).toDF().write.option("header", true).csv("results/category_count.csv")`

2. App ratings range from 0 to 5; count the apps that fall into each rating interval (rating_distrib.csv):

   `df.withColumn("Rating",when($"Rating" >= 4.5, lit("[4.5, 5.0]")).when($"Rating">=4 && $"Rating"<4.5, lit("[4.0, 4.5)")).when($"Rating">=3.5 && $"Rating"<4, lit("[3.5, 4.0)")).when($"Rating">=3 && $"Rating"<3.5, lit("[3.0, 3.5)")).otherwise(lit("[0,3)"))).groupBy("Rating").count().sort($"Rating".desc).write.option("header", true).csv("results/rating_distrib.csv")`

3. Review-count distribution: split the review counts into tiers, then count the apps in each tier (reviews_distrib.csv):

   `df.withColumn("Reviews", when($"Reviews" >= 1000000, lit("1M+")).when($"Reviews">= 10000 && $"Reviews"<1000000, lit("10K+")).when($"Reviews">=100 && $"Reviews"<10000, lit("100+")).when($"Reviews">0 && $"Reviews"<100, lit("0+")).otherwise(lit("0"))).groupBy("Reviews").count().sort($"Reviews".desc).write.option("header", true).csv("results/reviews_distrib.csv")`

4. Install-count distribution: likewise split the install counts into tiers, then count the apps in each tier (installs_distrib.csv). The 100K+ tier uses `>=` so that apps with exactly 100,000 installs are not dropped into the "0" tier:

   `df.withColumn("Installs",when($"Installs" >= 100000000, lit("100M+")).when($"Installs">= 100000 && $"Installs"<100000000, lit("100K+")).when($"Installs">=100 && $"Installs"<100000, lit("100+")).when($"Installs">0 && $"Installs"<100,
lit("0+")).otherwise(lit("0"))).groupBy("Installs").count().sort($"Installs".desc).write.option("header", true).csv("results/installs_distrib.csv")`

5. Get the apps with 100 million or more installs (top_installs.csv):

   `df.select($"App", $"Installs").where($"Installs" >= 100000000).write.option("header", true).csv("results/top_installs.csv")`

6. Get the top 5 apps by installs in each category (top_5_install_each_category.csv):

   `df.createTempView("view")`

   `spark.sql("select App, Installs from " + "(select *, row_number()" + " over (partition by Category order by Installs desc) " + "as rn from view) as tmp where tmp.rn <= 5").write.option("header",true).csv("results/top_5_install_each_category.csv")`

7. Compare free and paid apps on rating, review count, and install count (free_vs_paid.csv):

   `df.na.drop().groupBy("Type").agg(round(avg("Rating"), 1) as "Rating",sum("Reviews") as "Reviews", sum("Installs") as "Installs").write.option("header", true).csv("results/free_vs_paid.csv")`

8. Relationship between the price of paid apps and their review and install counts (price_reviews_installs.csv):

   `df.filter($"Type" === "Paid").groupBy("Price").agg(sum("Installs") as "Installs", sum("Reviews") as "Reviews").sort($"Price".asc).write.option("header", true).csv("results/price_reviews_installs.csv")`

9. Relationship between review count and install count (reviews_installs.csv):

   `df.groupBy("Installs").agg(sum($"Reviews") as "Reviews").sort($"Installs".asc).write.option("header", true).csv("results/reviews_installs.csv")`

10. Relationship between user rating and review and install counts (rating_reviews_installs.csv):

    `df.groupBy("Rating").agg(sum($"Reviews") as "Reviews", sum($"Installs") as "Installs").sort($"Rating".asc).write.option("header", true).csv("results/rating_reviews_installs.csv")`

![Image description](https://foruda.gitee.com/images/1724946638985130653/7bc360c0_11604426.png "03.png")

V. Visualization

First fetch the analysis results from HDFS to the local machine with `hdfs dfs -get results` (mind the download directory) and place them in the project directory.

![Image description](https://foruda.gitee.com/images/1724946688051669399/6e47c666_11604426.png "04.png")

The visualization is written in Python and run in JupyterLab, which provides a convenient integrated development environment: the required libraries can be installed with a single command, and no configuration files need to be set up.

Install JupyterLab with `pip install jupyterlab`, start it with `jupyter lab`, then install the required libraries from a Notebook: `pip install
pyecharts numpy pandas snapshot-selenium`. Selenium 4 also needs to be installed, which the project uses to resolve ChromeDriver version mismatches: `pip install selenium==4.4.3`.

![Image description](https://foruda.gitee.com/images/1724946737781461858/61328d3f_11604426.png "05.png")

To export the charts as images, selenium captures the rendered web page, which requires the ChromeDriver browser driver (this project uses Google Chrome). Download a ChromeDriver that matches your Chrome version and place it in the project directory. Drivers for version 114 and earlier: [https://registry.npmmirror.com/binary.html?path=chromedriver/](https://registry.npmmirror.com/binary.html?path=chromedriver/); drivers for version 115 and later: [Chrome for Testing availability (googlechromelabs.github.io)](https://googlechromelabs.github.io/chrome-for-testing/#canary). Unzip the download and place it in the project directory.

![Image description](https://foruda.gitee.com/images/1724946798866285403/3e87ba4b_11604426.png "06.png")

With the setup complete, write the visualization code; in this project it is visualization.py in the project directory:

```
import glob
import numpy as np
import pandas as pd
from pyecharts.charts import Bar, Scatter, Pie, WordCloud
from pyecharts.components import Table
from pyecharts import options as opts
from pyecharts.globals import CurrentConfig, NotebookType, SymbolType, ThemeType

CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB

# Tool for exporting charts as images
from pyecharts.render import make_snapshot
# Render the images with snapshot-selenium
from snapshot_selenium import snapshot

import os

if not os.path.exists("img"):
    os.mkdir("img")

# Number of apps in each category
category_count = pd.read_csv(glob.glob('results/category_count.csv/*.csv')[0])
bar_category_count = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.LIGHT))
    .add_xaxis(category_count['Category'].tolist()[:10])
    .add_yaxis("", category_count['count'].tolist()[:10])
    .set_global_opts(xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
                     title_opts={'text': 'App数量最多的10个类别'})
)
# bar_category_count.render_notebook()
make_snapshot(snapshot, bar_category_count.render(), "img/category_count.png")

# App rating distribution
rating_dist = pd.read_csv(glob.glob('results/rating_distrib.csv/*.csv')[0])
pie_rating_dist = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.INFOGRAPHIC))
    .add("", [list(z) for z in zip(rating_dist['Rating'].tolist(), rating_dist['count'].tolist())])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{d}%"))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="用户评分分布"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="80%", orient="vertical")
    )
)
# pie_rating_dist.render_notebook()
make_snapshot(snapshot, pie_rating_dist.render(), "img/rating_dist.png")

# App review-count distribution
review_dist = pd.read_csv(glob.glob('results/reviews_distrib.csv/*.csv')[0])
pie_review_dist = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.INFOGRAPHIC))
    .add("", [list(z) for z in zip(review_dist['Reviews'].tolist(), review_dist['count'].tolist())])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{d}"))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="评论数量分布"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="80%", orient="vertical")
    )
)
# pie_review_dist.render_notebook()
make_snapshot(snapshot, pie_review_dist.render(), "img/review_dist.png")

# App install-count distribution
install_dist = pd.read_csv(glob.glob('results/installs_distrib.csv/*.csv')[0])
pie_install_dist = (
    Pie(init_opts=opts.InitOpts(theme=ThemeType.INFOGRAPHIC))
    .add("", [list(z) for z in zip(install_dist['Installs'].tolist(), install_dist['count'].tolist())])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{d}"))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="安装次数分布"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="80%", orient="vertical")
    )
)
# pie_install_dist.render_notebook()
make_snapshot(snapshot, pie_install_dist.render(), "img/install_dist.png")

# Apps with 100 million or more installs
top_install = pd.read_csv(glob.glob('results/top_installs.csv/*.csv')[0])
top_install_list = list(top_install.itertuples(index=False, name=None))
wc_top_install = WordCloud()
wc_top_install.add('', top_install_list, shape=SymbolType.DIAMOND, word_size_range=[10, 25])
# wc_top_install.render_notebook()
make_snapshot(snapshot, wc_top_install.render(), "img/top_install.png")

# Top 5 apps by installs in each category
top_5_install_each_category = pd.read_csv(glob.glob('results/top_5_install_each_category.csv/*.csv')[0])
top_5_install_each_category_list = list(top_5_install_each_category.itertuples(index=False, name=None))
wc_top_5_install = WordCloud()
wc_top_5_install.add('', top_5_install_each_category_list, shape=SymbolType.DIAMOND, word_size_range=[10, 25])
wc_top_5_install.render_notebook()
make_snapshot(snapshot, wc_top_5_install.render(), 'img/top_5_each_category.png')

# Free vs. paid apps: rating, review count, and install count
free_vs_paid = pd.read_csv(glob.glob('results/free_vs_paid.csv/*.csv')[0])
table_free_vs_paid = Table()
headers = list(free_vs_paid.columns)
rows = [list(free_vs_paid.loc[index]) for index in free_vs_paid.index]
table_free_vs_paid.add(headers, rows)
# table_free_vs_paid.add(free_vs_paid.columns.tolist(), free_vs_paid.values)
table_free_vs_paid.set_global_opts(
    title_opts=opts.ComponentTitleOpts(title="免费应用与付费应用对比")
)
table_free_vs_paid.render_notebook()

# Paid-app price vs. review and install counts (square-root scaled)
price_reviews_installs = pd.read_csv(glob.glob('results/price_reviews_installs.csv/*.csv')[0])
bar_price_reviews_installs = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add_xaxis(price_reviews_installs['Price'].tolist())
    .add_yaxis("Reviews", np.sqrt(price_reviews_installs['Reviews']).tolist())
    .add_yaxis("Installs", np.sqrt(price_reviews_installs['Installs']).tolist())
    .set_global_opts(xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-90)),
                     title_opts={'text': '价格与评论数和安装量之间的关系'})
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
)
# bar_price_reviews_installs.render_notebook()
make_snapshot(snapshot, bar_price_reviews_installs.render(), 'img/price_reviews_installs.png')

# Review count vs. install count (both log-transformed)
reviews_installs = pd.read_csv(glob.glob('results/reviews_installs.csv/*.csv')[0])
bar_reviews_installs = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add_xaxis(np.log(reviews_installs['Installs']).tolist())
    .add_yaxis("", np.log(reviews_installs['Reviews']).tolist())
    .set_global_opts(xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-90)),
                     title_opts={'text': '安装量和评论数之间的关系'})
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
)
# bar_reviews_installs.render_notebook()
make_snapshot(snapshot, bar_reviews_installs.render(), 'img/reviews_installs.png')

# User rating vs. review and install counts (log-transformed)
rating_reviews_installs = pd.read_csv(glob.glob('results/rating_reviews_installs.csv/*.csv')[0])
bar_rating_reviews_installs = (
    Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
    .add_xaxis(rating_reviews_installs['Rating'].tolist())
    .add_yaxis("Reviews", np.log(rating_reviews_installs['Reviews']).tolist())
    .add_yaxis("Installs", np.log(rating_reviews_installs['Installs']).tolist())
    .set_global_opts(xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-90)),
                     title_opts={'text': '用户评分与安装量和评论数之间的关系'})
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
)
# bar_rating_reviews_installs.render_notebook()
make_snapshot(snapshot, bar_rating_reviews_installs.render(), 'img/rating_reviews_installs.png')
```

Then run `python visualization.py` in a JupyterLab Terminal. An img directory appears in the project, containing the chart images.

![Image description](https://foruda.gitee.com/images/1724946882946335253/a4426ac5_11604426.png "07.png")

![Image description](https://foruda.gitee.com/images/1724946922922948692/5126ca6f_11604426.png "08.png")

The run produces the img directory and a render.html page. The page appears because each chart's `render()` call writes the chart's HTML to render.html by default, and the snapshot is then taken from that file. (Note that google.py in the project is only used to test whether the Chrome driver works.)

VI. Results

1. The 10 categories with the most apps

   ![Image description](https://foruda.gitee.com/images/1724946985812663725/a8ebd3d8_11604426.png "09.png")

   Because there are many categories, only the top 10 are shown. The three largest categories are Family, Game, and Tools.

2. App rating distribution

   ![Image description](https://foruda.gitee.com/images/1724947032977456443/0480982e_11604426.png "10.png")

   The rating distribution chart shows that about 80% of apps are rated 4.0 or higher, which suggests that app quality is generally high and users are largely satisfied.

3. App review-count distribution

   ![Image description](https://foruda.gitee.com/images/1724947078183056374/4904b6ce_11604426.png "11.png")

   Most apps have fewer than 1 million reviews, and the large majority have between 100 and 1 million.

4. App install-count distribution
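   ![Image description](https://foruda.gitee.com/images/1724947120832237898/c06872ce_11604426.png "12.png")

   The install distribution chart shows that very few apps exceed 100 million installs. The tier from 100 to 100,000 installs is the largest, followed by the tier from 100,000 to 100 million.

5. Apps with more than 100 million installs

   ![Image description](https://foruda.gitee.com/images/1724947155932793261/1baed9d6_11604426.png "13.png")

   The word cloud shows that many of the apps with more than 100 million installs are Google products. These apps are also everyday applications, such as social media apps and games.

6. Top 5 apps by installs in each category

   ![Image description](https://foruda.gitee.com/images/1724947195595274940/2399424f_11604426.png "14.png")

   The word cloud shows that Google products also make up the majority of the top 5 apps by installs across categories.

7. Free vs. paid apps: rating, review count, and install count

   ![Image description](https://foruda.gitee.com/images/1724947239717842728/e5ba89cd_11604426.png "15.png")

   The table shows that free apps have more installs and reviews than paid apps, mainly because there are far more free apps. Paid apps nevertheless score slightly higher overall, which suggests that users are also fairly satisfied with paid apps.

8. Relationship between paid-app price and review and install counts

   ![Image description](https://foruda.gitee.com/images/1724947285106537627/ab68167d_11604426.png "16.png")

   Apps priced at $0.99 have the most installs, followed by $6.99 and $2.99. Apps under $10 account for the vast majority of paid-app installs and reviews.

9. Relationship between review count and install count

   ![Image description](https://foruda.gitee.com/images/1724947324576363019/b4c98ae0_11604426.png "17.png")

   Because review and install counts span a very wide range, both axes are log-transformed to present the result more clearly. The chart shows a positive correlation: apps with more installs also have more reviews.

10. Relationship between user rating and review and install counts

    ![Image description](https://foruda.gitee.com/images/1724947358395140688/54340ba5_11604426.png "18.png")

    Likewise, the y-axis here is log-transformed. The chart shows that review and install counts increase as the user rating rises.

VII. Summary

The project runs in pseudo-distributed mode. Hadoop HDFS stores the raw and result files, Scala code in spark-shell performs the analysis and writes the results to HDFS, the results are then fetched to the local machine through Hadoop, and finally they are visualized.

-----------------------------------------

If you like this project, you can support me by buying me a coffee!

![Image description](show/img/wx.jpg)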
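
As a supplementary illustration of two steps described earlier, the sketch below re-expresses the *Installs* cleanup from the preprocessing section and the rating intervals from the analysis section in plain Python. It is not part of the project's code: the function names `clean_installs` and `rating_bucket`, the sample rows, and the separate `unrated` label for missing ratings are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def clean_installs(raw):
    """Keep only the digits of an Installs string such as '10,000+'."""
    digits = re.sub(r"[^0-9]", "", raw)
    # No digits left: treat the value as missing (an assumption of this sketch).
    return int(digits) if digits else None


def rating_bucket(rating):
    """Map a numeric rating to the interval labels used for rating_distrib.csv."""
    if rating is None or math.isnan(rating):
        return "unrated"  # sketch-only label; the Scala code has no such bucket
    if rating >= 4.5:
        return "[4.5, 5.0]"
    if rating >= 4.0:
        return "[4.0, 4.5)"
    if rating >= 3.5:
        return "[3.5, 4.0)"
    if rating >= 3.0:
        return "[3.0, 3.5)"
    return "[0,3)"


# Hypothetical sample rows: (App, Rating, Installs)
sample_rows = [
    ("app_a", 4.7, "1,000,000+"),
    ("app_b", 4.1, "10,000+"),
    ("app_c", 2.9, "500+"),
    ("app_d", float("nan"), "0"),
]

# Count apps per rating interval and clean the install strings.
bucket_counts = Counter(rating_bucket(rating) for _, rating, _ in sample_rows)
installs = [clean_installs(raw) for _, _, raw in sample_rows]

print(dict(bucket_counts))
print(installs)
```

Checking the interval boundaries on a handful of rows like this can serve as a quick sanity test before running the Spark job on the full dataset.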