通过sql和dataframe查询的结果不一样
来源:9-11 -需求一统计功能实现
catcher_1224
2017-08-26
我通过sql和dataframe查询的结果不一致,老师帮忙看下吧,代码如下:
def topNStatistics(sparkSession: SparkSession, dataFrame: DataFrame, day: String) = { import sparkSession.implicits._ //dataFrame.filter($"day" === day && $"category" === "video").groupBy($"id", $"day").count().sort($"count".desc).show() dataFrame.filter($"day" === day && $"category" === "video").groupBy($"id", $"day").agg(count($"id").as("times")).sort($"times".desc).show(5) dataFrame.createOrReplaceTempView("access") sparkSession.sql(s"select id,day,count(1) as times from access where day = '$day' and category = 'video' group by id, day sort by times desc" ).show(5) }
结果如下:
+-------------------------------+--------+----+---------------+--------+----+-------+----+--------+ |url |category|id |ip |province|city|time |flow|day | +-------------------------------+--------+----+---------------+--------+----+-------+----+--------+ |http://www.imooc.com/code/1852 |code |1852|117.35.88.11 |陕西省 |西安市 |0:01:02|2345|20161110| |http://www.imooc.com/code/2053 |code |2053|211.162.33.31 |福建省 |厦门市 |0:01:02|331 |20161110| |http://www.imooc.com/code/3500 |code |3500|116.22.196.70 |广东省 |广州市 |0:01:02|54 |20161110| |http://www.imooc.com/code/547 |code |547 |119.130.229.90 |广东省 |广州市 |0:01:02|54 |20161110| |http://www.imooc.com/code/2053 |code |2053|211.162.33.31 |福建省 |厦门市 |0:01:02|2954|20161110| |http://www.imooc.com/video/8701|video |8701|106.39.41.166 |北京市 |北京市 |0:01:02|54 |20161110| |http://www.imooc.com/code/75 |code |75 |125.119.9.35 |浙江省 |杭州市 |0:01:02|2152|20161110| |http://www.imooc.com/video/9819|video |9819|125.122.216.102|浙江省 |杭州市 |0:01:02|54 |20161110| |http://www.imooc.com/code/75 |code |75 |125.119.9.35 |浙江省 |杭州市 |0:01:02|358 |20161110| |http://www.imooc.com/video/6335|video |6335|116.231.219.242|上海市 |上海市 |0:01:02|54 |20161110| |http://www.imooc.com/code/939 |code |939 |120.85.249.51 |广东省 |广州市 |0:01:02|54 |20161110| |http://www.imooc.com/code/1852 |code |1852|117.35.88.11 |陕西省 |西安市 |0:01:02|2640|20161110| |http://www.imooc.com/video/7629|video |7629|61.136.143.62 |湖北省 |宜昌市 |0:01:02|54 |20161110| |http://www.imooc.com/code/2053 |code |2053|211.162.33.31 |福建省 |厦门市 |0:01:02|2202|20161110| |http://www.imooc.com/video/3237|video |3237|39.186.247.142 |浙江省 |杭州市 |0:01:02|2957|20161110| |http://www.imooc.com/code/75 |code |75 |125.119.9.35 |浙江省 |杭州市 |0:01:02|66 |20161110| |http://www.imooc.com/code/5515 |code |5515|113.100.63.27 |广东省 |清远市 |0:01:02|54 |20161110| |http://www.imooc.com/code/3036 |code |3036|171.43.210.54 |湖北省 |武汉市 |0:01:02|67 |20161110| |http://www.imooc.com/code/1852 |code |1852|117.35.88.11 |陕西省 |西安市 |0:01:02|996 |20161110| |http://www.imooc.com/video/7053|video |7053|59.42.206.247 |广东省 |广州市 |0:01:02|54 |20161110| +-------------------------------+--------+----+---------------+--------+----+-------+----+--------+ only showing top 20 rows root |-- url: string (nullable = true) |-- category: string (nullable = true) |-- id: long (nullable = true) |-- ip: string (nullable = true) |-- province: string (nullable = true) |-- city: string (nullable = true) |-- time: string (nullable = true) |-- flow: long (nullable = true) |-- day: integer (nullable = true) +----+--------+-----+ | id| day|times| +----+--------+-----+ |1459|20161110| 2159| |6701|20161110| 1426| |1803|20161110| 1381| |1412|20161110| 1226| |3520|20161110| 1186| +----+--------+-----+ only showing top 5 rows +-----+--------+-----+ | id| day|times| +-----+--------+-----+ |12302|20161110| 861| | 1741|20161110| 558| | 243|20161110| 295| | 6850|20161110| 200| | 6841|20161110| 117| +-----+--------+-----+ only showing top 5 rows 耗时:16秒
数据是用的课程中提供的5G的数据文件,然后从中取出前1.2G左右测试的
写回答
3回答
-
你这出来的id顺序都不同,你同意按id降序试试
032017-08-27 -
Michael_PK
2017-08-27
你明确指定几个id查,看结果,
00 -
catcher_1224
提问者
2017-08-27
没效果,还是那样,我试着show(100)发现通过sql的排序方式很乱,如下图
+-----+--------+-----+ | id| day|times| +-----+--------+-----+ |12302|20161110| 861| | 1741|20161110| 558| | 243|20161110| 295| | 6850|20161110| 200| | 6841|20161110| 117| | 8038|20161110| 70| |12141|20161110| 68| | 6767|20161110| 58| | 2198|20161110| 47| | 4644|20161110| 46| |10768|20161110| 29| | 5648|20161110| 25| |13018|20161110| 23| | 2883|20161110| 19| | 7911|20161110| 5| |11830|20161110| 4| |12820|20161110| 3| | 8314|20161110| 2| | 1018|20161110| 2| | 884|20161110| 2| | 6402|20161110| 2| |13246|20161110| 1| | 5433|20161110| 496| | 8913|20161110| 348| | 8597|20161110| 311| | 2397|20161110| 188| | 8288|20161110| 154| | 7874|20161110| 145| | 6545|20161110| 95| | 8615|20161110| 83| | 1053|20161110| 67| |12580|20161110| 61| |10086|20161110| 61| | 6531|20161110| 48| | 9447|20161110| 31| | 797|20161110| 27| | 526|20161110| 26| | 2704|20161110| 25| |10509|20161110| 23| | 2121|20161110| 11| | 8937|20161110| 11| |11701|20161110| 2| | 5115|20161110| 1| | 4017|20161110| 693| | 775|20161110| 297| | 2632|20161110| 182| | 1903|20161110| 121| |10071|20161110| 98| | 93|20161110| 86| | 2802|20161110| 81| | 3457|20161110| 79| |10465|20161110| 71| | 5214|20161110| 41| | 357|20161110| 34| | 1551|20161110| 31| |11058|20161110| 26| | 7861|20161110| 20| |12951|20161110| 19| | 4643|20161110| 16| | 9393|20161110| 14| | 5523|20161110| 11| | 3653|20161110| 9| | 6894|20161110| 8| | 4216|20161110| 5| | 9646|20161110| 3| | 3387|20161110| 3| |10886|20161110| 1| | 1459|20161110| 2159| |10632|20161110| 258| | 4307|20161110| 83| | 5254|20161110| 80| | 3775|20161110| 75| | 2560|20161110| 55| | 4046|20161110| 28| | 9911|20161110| 28| | 5424|20161110| 23| |13004|20161110| 22| |13009|20161110| 21| | 4205|20161110| 18| | 5564|20161110| 12| |12841|20161110| 12| | 2816|20161110| 10| |12584|20161110| 4| | 5144|20161110| 2| | 3744|20161110| 1| | 4685|20161110| 1| | 353|20161110| 676| | 3354|20161110| 364| |11549|20161110| 308| | 777|20161110| 295| | 5534|20161110| 205| | 6159|20161110| 148| | 1987|20161110| 113| | 8820|20161110| 112| | 5521|20161110| 75| |12505|20161110| 57| | 2006|20161110| 43| | 7427|20161110| 31| | 9073|20161110| 26| |10221|20161110| 23| +-----+--------+-----+ only showing top 100 rows
00
相似问题