Prediction(1)Data Collection

sillycat

浏览: 2492773 次
性别:
来自: 成都

最近访客更多访客>>

huageng520

learnmore

u012363178

ymgjava

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

Summary

Prediction(1)Data Collection

All the data are in JSON format in S3 buckets.

We can verify and view the JSON data on this online tool.
http://www.jsoneditoronline.org/

I try to do the implementation on zeppelin which is really a useful tool.

Some important codes are as follow:
val date_pattern = "2015/08/{17,18,19,20}" //week1
//val date_pattern = "2015/08/{03,04,05,06,07,08,09}" //week2
//val date_pattern = "2015/{07/27,07/28,07/29,07/30,07/31,08/01,08/02}"
//val date_pattern = "2015/07/29"

val clicks = sqlContext.jsonFile(s"s3n://mybucket/click/${date_pattern}/*/*")

That codes can follow the pattern and load all the files.

clicks.registerTempTable("clicks")
//applications.printSchema

The can register the data as a table and print out the schema of the JSON data.

val jobs = sc.textFile("s3n://mybucket/jobs/publishers/xxx.xml.gz")
import sqlContext.implicits._
val jobsDF = jobs.toDF()

This can load all the text files in zip format and convert that to and Dataframe

%sql
select SUBSTR(timestamp,0,10), job_id, count(*) from applications group by SUBSTR(timestamp,0,10), job_id

%sql will give us the ability to write SQLs and display that data below in graph.

val clickDF = sqlContext.sql("select SUBSTR(timestamp,0,10) as click_date, job_id, count(*) as count from clicks where SUBSTR(timestamp,0,10)='2015-08-20' group by SUBSTR(timestamp,0,10), job_id")

import org.apache.spark.sql.functions._

val clickFormattedDF = clickDF.orderBy(asc("click_date"),desc("count"))

These command will do the query and sorting for us on Dataframe.

val appFile = "s3n://mybucket/date_2015_08_20"
clickFormattedDF.printSchema
sc.parallelize(clickFormattedDF.collect, 1).saveAsTextFile(appFile)

writes the data back to S3.

Here is the place to check the hadoop cluster
http://localhost:9026/cluster

And once we start that spark context, we can visit this URL to get the status on spark
http://localhost:4040/

References:
http://www.jsoneditoronline.org/
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
https://gist.github.com/bigsnarfdude/d9c0ceba1aa8c1cfa4e5
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.DataFrame

分享到：

PHP(7)RESTful Framework - Lumen - Settin ... | Mail Server Solution(7)Meteor and Email/ ...

2015-08-25 06:10
浏览 762
评论(0)
分类:编程语言
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论