Prediction (1): Data Collection
All the data are in JSON format in S3 buckets.
We can validate and view the JSON data with this online tool:
http://www.jsoneditoronline.org/
I did the implementation in Zeppelin, which is a really useful tool.
The important pieces of code are as follows:
val date_pattern = "2015/08/{17,18,19,20}" //week1
//val date_pattern = "2015/08/{03,04,05,06,07,08,09}" //week2
//val date_pattern = "2015/{07/27,07/28,07/29,07/30,07/31,08/01,08/02}"
//val date_pattern = "2015/07/29"
val clicks = sqlContext.jsonFile(s"s3n://mybucket/click/${date_pattern}/*/*")
This code expands the glob pattern and loads every matching file into one DataFrame.
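Writing the list of days by hand gets tedious for longer ranges. A minimal sketch of generating the same kind of brace pattern from a start and end date follows. The `DatePattern` object and its `between` method are hypothetical helpers, not part of Spark, and the sketch assumes Java 8 or later for `java.time`. It is meant to be pasted into a Zeppelin paragraph or the Scala REPL.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical helper (not part of Spark or Hadoop).
object DatePattern {
  private val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  // Returns a path fragment covering every day from `start` to `end`
  // inclusive: a plain "yyyy/MM/dd" for one day, otherwise a brace
  // glob such as "{2015/07/31,2015/08/01}".
  def between(start: LocalDate, end: LocalDate): String = {
    require(!end.isBefore(start), "end must not be before start")
    val days = Iterator
      .iterate(start)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(end))
      .map(_.format(fmt))
      .toList
    if (days.size == 1) days.head else days.mkString("{", ",", "}")
  }
}

val pattern = DatePattern.between(LocalDate.of(2015, 7, 31), LocalDate.of(2015, 8, 2))
println(pattern) // prints {2015/07/31,2015/08/01,2015/08/02}
```

The result can be dropped into the path the same way as date_pattern above, for example s"s3n://mybucket/click/${pattern}/*/*".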
clicks.registerTempTable("clicks")
//applications.printSchema
This registers the data as a temporary table that SQL queries can use. Calling printSchema on a DataFrame prints the schema inferred from the JSON data.
val jobs = sc.textFile("s3n://mybucket/jobs/publishers/xxx.xml.gz")
import sqlContext.implicits._
val jobsDF = jobs.toDF()
This loads the gzip-compressed text file (Spark decompresses .gz input transparently) and converts it to a DataFrame.
%sql
select SUBSTR(timestamp,0,10), job_id, count(*) from applications group by SUBSTR(timestamp,0,10), job_id
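The %sql interpreter lets us write SQL directly in a Zeppelin paragraph and shows the result below it as a table or a graph. The query above assumes an applications table has been registered in the same way as clicks.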
val clickDF = sqlContext.sql("select SUBSTR(timestamp,0,10) as click_date, job_id, count(*) as count from clicks where SUBSTR(timestamp,0,10)='2015-08-20' group by SUBSTR(timestamp,0,10), job_id")
import org.apache.spark.sql.functions._
val clickFormattedDF = clickDF.orderBy(asc("click_date"),desc("count"))
These commands run the query and then sort the resulting DataFrame by date (ascending) and click count (descending).
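To make clear what the query and the sort compute, here is a small sketch of the same logic on an in-memory Scala collection instead of a Spark DataFrame. The `Click` case class, the sample records, and the timestamp format are all made up for illustration; the real data lives in S3 and its exact fields may differ.

```scala
// Hypothetical click records, not real data. The timestamp is assumed
// to start with a "yyyy-MM-dd" date, as the SUBSTR call above implies.
final case class Click(timestamp: String, jobId: Long)

val sample = Seq(
  Click("2015-08-20T10:15:00Z", 42L),
  Click("2015-08-20T11:00:00Z", 42L),
  Click("2015-08-20T12:30:00Z", 7L),
  Click("2015-08-19T09:00:00Z", 7L)
)

val counts: Seq[(String, Long, Int)] = sample
  .map(c => (c.timestamp.take(10), c.jobId))         // keep the date part and the job id
  .filter { case (date, _) => date == "2015-08-20" } // the where clause
  .groupBy(identity)                                 // group by click_date, job_id
  .map { case ((date, jobId), rows) => (date, jobId, rows.size) } // count(*)
  .toSeq
  .sortBy { case (date, _, count) => (date, -count) } // date ascending, count descending

counts.foreach(println)
// (2015-08-20,42,2)
// (2015-08-20,7,1)
```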
val appFile = "s3n://mybucket/date_2015_08_20"
clickFormattedDF.printSchema
sc.parallelize(clickFormattedDF.collect, 1).saveAsTextFile(appFile)
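This writes the data back to S3. Because collect pulls every row to the driver and parallelize(..., 1) puts them in a single partition, the output is one part file; this only suits results small enough to fit in driver memory.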
The status of the Hadoop cluster can be checked here:
http://localhost:9026/cluster
Once the Spark context has started, this URL shows the status of the Spark application:
http://localhost:4040/
References:
http://www.jsoneditoronline.org/
https://spark.apache.org/docs/1.3.0/sql-programming-guide.html
https://gist.github.com/bigsnarfdude/d9c0ceba1aa8c1cfa4e5
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.sql.DataFrame