Classification(3)Generate Features and Stem Adjust the Model System
1. Scala Operation
String Method - contains
scala> val longContent = "carl love to study python, scala"
longContent: String = carl love to study python, scala
scala> longContent.contains("python")
res0: Boolean = true
Map Merge Function
Run the console directly under the project that already has the scalaz jar dependencies.
> sbt console
scala> import scalaz.Scalaz._
import scalaz.Scalaz._
scala>
scala> val m1 = Map("0"->0, "1" ->1)
m1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1)
scala> val m2 = Map("2"->2)
m2: scala.collection.immutable.Map[String,Int] = Map(2 -> 2)
scala> val m3 = m1 |+| m2
m3: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
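The maps merged above share no keys. For a Map[String, Int], scalaz's |+| also combines the values of keys that appear in both maps by adding them. A minimal sketch of the same behavior in plain Scala, for cases where scalaz is not on the classpath (MapMerge and mergeSum are hypothetical names, not part of scalaz):

```scala
object MapMerge {
  // Merge two maps; when a key exists in both, add the two values together.
  // This mirrors the result of scalaz's |+| on Map[String, Int].
  def mergeSum(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0) + value)
    }
}
```

For example, MapMerge.mergeSum(Map("a" -> 1), Map("a" -> 2, "b" -> 3)) gives a map where "a" is 3 and "b" is 3.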
Map Operation
scala> m3
res1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
scala> m3 - "0"
res2: scala.collection.immutable.Map[String,Int] = Map(1 -> 1, 2 -> 2)
Magic scalaz
https://github.com/scalaz/scalaz
Sliding
scala> (1 to 5).iterator.sliding(3).toList
res3: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))
List Operation
scala> List(1,2,3).zip(List("one","two","three"))
res8: List[(Int, String)] = List((1,one), (2,two), (3,three))
Run with Assembly Jar
./spark-submit --num-executors 2 --driver-memory 2G --class com.sillycat.jobs.GenerateFeatureMap ${path_to_jar}
Nice Configuration in build.sbt
// There's a problem with jackson 2.5+ with Spark 1.4.1
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)
When we build the assembly jar, Spark Core and the related Spark libraries can be marked as provided, so they are not bundled into the jar
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.4.1" % "provided", // Apache v2
"org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided" // Apache v2
)
2. Detail Operations
GenerateFeatureMap
step1. Load the job info from S3 (title and description only) and cache() it
step2. Place the title and description in an object, using a regex to find the title and description again
step3. Normalize the String
For title: toLower -> filter all HTML -> stripChars, only keep [a-zA-Z\d\-]
For description: toLower -> filter URL -> filter HTML -> stripChar -> stripNumber
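The normalization in step3 can be sketched as below. This is an illustration under assumptions, not the original job code: the Normalize object, its regexes, and the choice to keep whitespace as a word separator are all hypothetical.

```scala
object Normalize {
  // Assumed patterns: anything between angle brackets is HTML,
  // anything starting with http(s):// or www. is a URL.
  private val HtmlTag = "<[^>]*>".r
  private val Url = "(https?://|www\\.)[^\\s<>\"]+".r

  // Collapse runs of whitespace left behind by the replacements.
  private def collapse(s: String): String = s.trim.replaceAll("\\s+", " ")

  // Title: lower-case -> drop HTML tags -> keep only [a-zA-Z\d\-] and spaces.
  def title(raw: String): String = {
    val noHtml = HtmlTag.replaceAllIn(raw.toLowerCase, " ")
    collapse(noHtml.replaceAll("[^a-zA-Z\\d\\-\\s]", " "))
  }

  // Description: lower-case -> drop URLs -> drop HTML tags -> strip chars -> strip numbers.
  def description(raw: String): String = {
    val noUrl = Url.replaceAllIn(raw.toLowerCase, " ")
    val noHtml = HtmlTag.replaceAllIn(noUrl, " ")
    val stripped = noHtml.replaceAll("[^a-zA-Z\\d\\-\\s]", " ")
    collapse(stripped.replaceAll("\\d+", " "))
  }
}
```

For example, Normalize.title("<b>Senior</b> Big-Data Engineer (C++)!") yields "senior big-data engineer c".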
step4. Tokenize the String
We predefined a list of 2-word and 3-word phrases and stored it in a text file.
For Title:
Find the phrases in the string which are contained in the pre-defined list.
Convert the string to a list of words and phrases
e.g. big data software engineer -> big, data, software, engineer, big data, software engineer
(big data and software engineer are pre-defined in the list)
For description:
Find the phrases in the string which are contained in the pre-defined list.
Remove stop words, using a pre-defined stop word list
Porter Stemming Algorithm (https://github.com/dlwh/epic, PorterStemmer.scala)
Convert the string to a list of words and phrases
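The tokenization in step4 can be sketched in plain Scala as below. The Tokenizer object and its method names are hypothetical. The stemmer is passed in as a function and defaults to identity, so the real Porter stemmer is an assumption left to the caller. Whether matched phrases are stemmed as well is not stated above; this sketch leaves them unstemmed.

```scala
object Tokenizer {
  private def split(text: String): List[String] =
    text.split("\\s+").filter(_.nonEmpty).toList

  // Every 2-word and 3-word window of the input that is in the predefined phrase list.
  private def findPhrases(words: List[String], phrases: Set[String]): List[String] =
    List(2, 3).flatMap { n =>
      words.sliding(n).filter(_.size == n).map(_.mkString(" ")).filter(phrases.contains)
    }

  // Title: all single words, followed by the predefined phrases found in it.
  def title(text: String, phrases: Set[String]): List[String] = {
    val words = split(text)
    words ++ findPhrases(words, phrases)
  }

  // Description: find phrases, drop stop words, stem the remaining words.
  def description(text: String,
                  phrases: Set[String],
                  stopWords: Set[String],
                  stem: String => String = identity): List[String] = {
    val words = split(text)
    val found = findPhrases(words, phrases)
    words.filterNot(stopWords.contains).map(stem) ++ found
  }
}
```

With the phrase list Set("big data", "software engineer"), Tokenizer.title("big data software engineer", ...) returns big, data, software, engineer, big data, software engineer, matching the example above.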
step5. Calculate IDF
TF-IDF http://sillycat.iteye.com/blog/2231432
The document frequency DF(t, D) is the number of documents that contain term t.
|D| is the total number of documents in the corpus.
IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
step6. Save File on S3
key, index, IDF
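Steps 5 and 6 can be sketched without Spark as below, which may help when checking the numbers on a small sample. The Idf object, the use of the natural logarithm, the alphabetical index assignment, and the comma-separated row layout are assumptions made for illustration.

```scala
object Idf {
  // docs holds one token list per document.
  // Returns term -> IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)).
  def compute(docs: Seq[Seq[String]]): Map[String, Double] = {
    val total = docs.size
    // Count each term at most once per document to get the document frequency.
    val df = docs.flatMap(_.distinct).groupBy(identity).map { case (term, occ) => term -> occ.size }
    df.map { case (term, n) => term -> math.log((total + 1.0) / (n + 1.0)) }
  }

  // One "key,index,IDF" row per term; indexes follow the alphabetical order of the keys.
  def rows(idf: Map[String, Double]): Seq[String] =
    idf.toSeq.sortBy(_._1).zipWithIndex.map { case ((term, value), index) =>
      s"$term,$index,$value"
    }
}
```

For three documents where "scala" appears in two of them, the formula gives log(4/3) for "scala", and a term that appears in only one document gets log(4/2).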
3. Classifier Model Training
step1. Load the featureMap which was pre-calculated in the previous operation
step2. Binary Feature Extractor
step3. Load List of Jobs
step4. Train Minor
step5. Train Arbitrator
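The binary feature extractor in step2 is not described in detail here. One plausible reading, sketched below under that assumption, is that a feature is 1 when its term occurs in the document and 0 otherwise, regardless of how often it occurs. BinaryFeatures and extract are hypothetical names.

```scala
object BinaryFeatures {
  // featureIndex maps term -> column index, as loaded from the saved feature map.
  // Returns the sorted indexes of the features present in the document;
  // each listed index stands for the value 1.0, every other index for 0.0.
  def extract(tokens: Seq[String], featureIndex: Map[String, Int]): Seq[Int] =
    tokens.flatMap(token => featureIndex.get(token)).distinct.sorted
}
```

Terms that are not in the feature map are ignored, and a repeated term still produces a single index.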
4. Classification System
MajorGroupClassificationSystem
MinorGroupClassificationSystem
References:
http://sillycat.iteye.com/blog/2230117
http://sillycat.iteye.com/blog/2231432
http://www.scalanlp.org/