通过sql和dataframe查询的结果不一样

来源:9-11 -需求一统计功能实现

catcher_1224

2017-08-26

我通过sql和dataframe查询的结果不一致,老师帮忙看下吧,代码如下:

def topNStatistics(sparkSession: SparkSession, dataFrame: DataFrame, day: String) = {
    import sparkSession.implicits._
    //dataFrame.filter($"day" === day && $"category" === "video").groupBy($"id", $"day").count().sort($"count".desc).show()
    dataFrame.filter($"day" === day && $"category" === "video").groupBy($"id", $"day").agg(count($"id").as("times")).sort($"times".desc).show(5)
    dataFrame.createOrReplaceTempView("access")
    sparkSession.sql(s"select id,day,count(1) as times from access where day = '$day' and category = 'video' group by id, day sort by times desc" ).show(5)
}

结果如下:

+-------------------------------+--------+----+---------------+--------+----+-------+----+--------+
|url                            |category|id  |ip             |province|city|time   |flow|day     |
+-------------------------------+--------+----+---------------+--------+----+-------+----+--------+
|http://www.imooc.com/code/1852 |code    |1852|117.35.88.11   |陕西省     |西安市 |0:01:02|2345|20161110|
|http://www.imooc.com/code/2053 |code    |2053|211.162.33.31  |福建省     |厦门市 |0:01:02|331 |20161110|
|http://www.imooc.com/code/3500 |code    |3500|116.22.196.70  |广东省     |广州市 |0:01:02|54  |20161110|
|http://www.imooc.com/code/547  |code    |547 |119.130.229.90 |广东省     |广州市 |0:01:02|54  |20161110|
|http://www.imooc.com/code/2053 |code    |2053|211.162.33.31  |福建省     |厦门市 |0:01:02|2954|20161110|
|http://www.imooc.com/video/8701|video   |8701|106.39.41.166  |北京市     |北京市 |0:01:02|54  |20161110|
|http://www.imooc.com/code/75   |code    |75  |125.119.9.35   |浙江省     |杭州市 |0:01:02|2152|20161110|
|http://www.imooc.com/video/9819|video   |9819|125.122.216.102|浙江省     |杭州市 |0:01:02|54  |20161110|
|http://www.imooc.com/code/75   |code    |75  |125.119.9.35   |浙江省     |杭州市 |0:01:02|358 |20161110|
|http://www.imooc.com/video/6335|video   |6335|116.231.219.242|上海市     |上海市 |0:01:02|54  |20161110|
|http://www.imooc.com/code/939  |code    |939 |120.85.249.51  |广东省     |广州市 |0:01:02|54  |20161110|
|http://www.imooc.com/code/1852 |code    |1852|117.35.88.11   |陕西省     |西安市 |0:01:02|2640|20161110|
|http://www.imooc.com/video/7629|video   |7629|61.136.143.62  |湖北省     |宜昌市 |0:01:02|54  |20161110|
|http://www.imooc.com/code/2053 |code    |2053|211.162.33.31  |福建省     |厦门市 |0:01:02|2202|20161110|
|http://www.imooc.com/video/3237|video   |3237|39.186.247.142 |浙江省     |杭州市 |0:01:02|2957|20161110|
|http://www.imooc.com/code/75   |code    |75  |125.119.9.35   |浙江省     |杭州市 |0:01:02|66  |20161110|
|http://www.imooc.com/code/5515 |code    |5515|113.100.63.27  |广东省     |清远市 |0:01:02|54  |20161110|
|http://www.imooc.com/code/3036 |code    |3036|171.43.210.54  |湖北省     |武汉市 |0:01:02|67  |20161110|
|http://www.imooc.com/code/1852 |code    |1852|117.35.88.11   |陕西省     |西安市 |0:01:02|996 |20161110|
|http://www.imooc.com/video/7053|video   |7053|59.42.206.247  |广东省     |广州市 |0:01:02|54  |20161110|
+-------------------------------+--------+----+---------------+--------+----+-------+----+--------+
only showing top 20 rows

root
 |-- url: string (nullable = true)
 |-- category: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- province: string (nullable = true)
 |-- city: string (nullable = true)
 |-- time: string (nullable = true)
 |-- flow: long (nullable = true)
 |-- day: integer (nullable = true)

+----+--------+-----+
|  id|     day|times|
+----+--------+-----+
|1459|20161110| 2159|
|6701|20161110| 1426|
|1803|20161110| 1381|
|1412|20161110| 1226|
|3520|20161110| 1186|
+----+--------+-----+
only showing top 5 rows

+-----+--------+-----+
|   id|     day|times|
+-----+--------+-----+
|12302|20161110|  861|
| 1741|20161110|  558|
|  243|20161110|  295|
| 6850|20161110|  200|
| 6841|20161110|  117|
+-----+--------+-----+
only showing top 5 rows

耗时:16秒

数据是用的课程中提供的5G的数据文件,然后从中取出前1.2G左右测试的

写回答

3回答

Michael_PK

2017-08-27

你这出来的id顺序都不同,你同意按id降序试试

0
3
catcher_1224
非常感谢!
2017-08-27
共3条回复

Michael_PK

2017-08-27

你明确指定几个id查,看结果,

0
0

catcher_1224

提问者

2017-08-27

没效果,还是那样,我试着show(100)发现通过sql的排序方式很乱,如下图

+-----+--------+-----+
|   id|     day|times|
+-----+--------+-----+
|12302|20161110|  861|
| 1741|20161110|  558|
|  243|20161110|  295|
| 6850|20161110|  200|
| 6841|20161110|  117|
| 8038|20161110|   70|
|12141|20161110|   68|
| 6767|20161110|   58|
| 2198|20161110|   47|
| 4644|20161110|   46|
|10768|20161110|   29|
| 5648|20161110|   25|
|13018|20161110|   23|
| 2883|20161110|   19|
| 7911|20161110|    5|
|11830|20161110|    4|
|12820|20161110|    3|
| 8314|20161110|    2|
| 1018|20161110|    2|
|  884|20161110|    2|
| 6402|20161110|    2|
|13246|20161110|    1|
| 5433|20161110|  496|
| 8913|20161110|  348|
| 8597|20161110|  311|
| 2397|20161110|  188|
| 8288|20161110|  154|
| 7874|20161110|  145|
| 6545|20161110|   95|
| 8615|20161110|   83|
| 1053|20161110|   67|
|12580|20161110|   61|
|10086|20161110|   61|
| 6531|20161110|   48|
| 9447|20161110|   31|
|  797|20161110|   27|
|  526|20161110|   26|
| 2704|20161110|   25|
|10509|20161110|   23|
| 2121|20161110|   11|
| 8937|20161110|   11|
|11701|20161110|    2|
| 5115|20161110|    1|
| 4017|20161110|  693|
|  775|20161110|  297|
| 2632|20161110|  182|
| 1903|20161110|  121|
|10071|20161110|   98|
|   93|20161110|   86|
| 2802|20161110|   81|
| 3457|20161110|   79|
|10465|20161110|   71|
| 5214|20161110|   41|
|  357|20161110|   34|
| 1551|20161110|   31|
|11058|20161110|   26|
| 7861|20161110|   20|
|12951|20161110|   19|
| 4643|20161110|   16|
| 9393|20161110|   14|
| 5523|20161110|   11|
| 3653|20161110|    9|
| 6894|20161110|    8|
| 4216|20161110|    5|
| 9646|20161110|    3|
| 3387|20161110|    3|
|10886|20161110|    1|
| 1459|20161110| 2159|
|10632|20161110|  258|
| 4307|20161110|   83|
| 5254|20161110|   80|
| 3775|20161110|   75|
| 2560|20161110|   55|
| 4046|20161110|   28|
| 9911|20161110|   28|
| 5424|20161110|   23|
|13004|20161110|   22|
|13009|20161110|   21|
| 4205|20161110|   18|
| 5564|20161110|   12|
|12841|20161110|   12|
| 2816|20161110|   10|
|12584|20161110|    4|
| 5144|20161110|    2|
| 3744|20161110|    1|
| 4685|20161110|    1|
|  353|20161110|  676|
| 3354|20161110|  364|
|11549|20161110|  308|
|  777|20161110|  295|
| 5534|20161110|  205|
| 6159|20161110|  148|
| 1987|20161110|  113|
| 8820|20161110|  112|
| 5521|20161110|   75|
|12505|20161110|   57|
| 2006|20161110|   43|
| 7427|20161110|   31|
| 9073|20161110|   26|
|10221|20161110|   23|
+-----+--------+-----+
only showing top 100 rows


0
0

以慕课网日志分析为例 进入大数据Spark SQL的世界

快速转型大数据:Hadoop,Hive,SparkSQL步步为赢

1644 学习 · 1129 问题

查看课程