`
sillycat
  • 浏览: 2492773 次
  • 性别: Icon_minigender_1
  • 来自: 成都
社区版块
存档分类
最新评论

Prediction(1)Data Collection

 
阅读更多
Prediction(1)Data Collection

All the data are in JSON format in S3 buckets.

We can verify and view the JSON data on this online tool.
http://www.jsoneditoronline.org/

I try to do the implementation on zeppelin which is really a useful tool.

Some important codes are as follow:
val date_pattern = "2015/08/{17,18,19,20}" //week1
//val date_pattern = "2015/08/{03,04,05,06,07,08,09}"   //week2
//val date_pattern = "2015/{07/27,07/28,07/29,07/30,07/31,08/01,08/02}"
//val date_pattern = "2015/07/29"

val clicks = sqlContext.jsonFile(s"s3n://mybucket/click/${date_pattern}/*/*")

That codes can follow the pattern and load all the files.

clicks.registerTempTable("clicks")
//applications.printSchema

The can register the data as a table and print out the schema of the JSON data.

val jobs = sc.textFile("s3n://mybucket/jobs/publishers/xxx.xml.gz")
import sqlContext.implicits._
val jobsDF = jobs.toDF()

This can load all the text files in zip format and convert that to and Dataframe

%sql
select SUBSTR(timestamp,0,10), job_id, count(*) from applications  group by SUBSTR(timestamp,0,10), job_id

%sql will give us the ability to write SQLs and display that data below in graph.

val clickDF = sqlContext.sql("select SUBSTR(timestamp,0,10) as click_date, job_id, count(*) as count from clicks where SUBSTR(timestamp,0,10)='2015-08-20'  group by SUBSTR(timestamp,0,10), job_id")

import org.apache.spark.sql.functions._

val clickFormattedDF = clickDF.orderBy(asc("click_date"),desc("count"))

These command will do the query and sorting for us on Dataframe.

val appFile = "s3n://mybucket/date_2015_08_20"
clickFormattedDF.printSchema
sc.parallelize(clickFormattedDF.collect, 1).saveAsTextFile(appFile)

writes the data back to S3.

Here is the place to check the hadoop cluster
http://localhost:9026/cluster

And once we start that spark context, we can visit this URL to get the status on spark
http://localhost:4040/

References:
http://www.jsoneditoronline.org/
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
https://gist.github.com/bigsnarfdude/d9c0ceba1aa8c1cfa4e5
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.DataFrame
分享到:
评论

相关推荐

Global site tag (gtag.js) - Google Analytics