Classification(3)Generate Features and Stem Adjust the Model System

sillycat


1. Scala Operation
String Method - contains
scala> val longContent = "carl love to study python, scala"
longContent: String = carl love to study python, scala

scala> longContent.contains("python")
res0: Boolean = true

Map Merge Function
Start the console from the project directory, which already has the scalaz jar among its dependencies.
> sbt console
scala> import scalaz.Scalaz._
import scalaz.Scalaz._

scala>

scala> val m1 = Map("0"->0, "1" ->1)
m1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1)

scala> val m2 = Map("2"->2)
m2: scala.collection.immutable.Map[String,Int] = Map(2 -> 2)

scala> val m3 = m1 |+| m2
m3: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)
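
The keys in the example above do not overlap, so it does not show what happens on a collision. For Map[String, Int], scalaz's |+| adds the values of keys that appear in both maps, whereas the standard ++ lets the right-hand map win. A minimal plain-Scala sketch of the same merge, without the scalaz dependency (the names MapMerge and merge are made up for illustration):

```scala
object MapMerge {
  // Merge two maps; when a key appears in both, add the values together.
  // This mirrors the behaviour of scalaz's |+| on Map[String, Int].
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0) + value)
    }
}
```

This kind of merge is handy for combining per-partition term counts, where the same term can show up on both sides.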

Map Operation
scala> m3
res1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)

scala> m3 - "0"
res2: scala.collection.immutable.Map[String,Int] = Map(1 -> 1, 2 -> 2)

Magic scalaz
https://github.com/scalaz/scalaz

Sliding
scala> (1 to 5).iterator.sliding(3).toList
res3: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))
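
One possible use of sliding in this kind of pipeline is finding the 2-word and 3-word phrases described in the tokenize step later in this post. The sketch below assumes the phrase list is already loaded into a Set. The object name, the method name and the two sample phrases are hypothetical, not the actual job code.

```scala
object PhraseTokenizer {
  // Hypothetical phrase list; the post keeps the real one in a text file.
  val phrases: Set[String] = Set("big data", "software engineer")

  // Return the single words, followed by every 2-word and 3-word window
  // that appears in the pre-defined phrase set.
  def tokenize(text: String, known: Set[String] = phrases): List[String] = {
    val words = text.trim.split("\\s+").filter(_.nonEmpty).toList
    val found = for {
      n <- List(2, 3)
      window <- words.sliding(n).toList
      if window.size == n // sliding yields one short window for short input
      phrase = window.mkString(" ")
      if known.contains(phrase)
    } yield phrase
    words ++ found
  }
}
```

With the sample set above, PhraseTokenizer.tokenize("big data software engineer") gives big, data, software, engineer, big data, software engineer.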

List Operation
scala> List(1,2,3).zip(List("one","two","three"))
res8: List[(Int, String)] = List((1,one), (2,two), (3,three))

Run with Assembly Jar
./spark-submit --num-executors 2 --driver-memory 2G --class com.sillycat.jobs.GenerateFeatureMap ${path_to_jar}

Nice Configuration in build.sbt
// There's a problem with jackson 2.5+ with Spark 1.4.1
dependencyOverrides ++= Set(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)

When we build the assembly jar, Spark Core and the related modules can be marked as provided, so they are not bundled into the jar:
"org.apache.spark" %% "spark-core" % "1.4.1" % "provided",               // Apache v2
"org.apache.spark" %% "spark-mllib" % "1.4.1" % "provided",              // Apache v2

2. Detail Operations
GenerateFeatureMap
step1. Load the job info from S3 (title and description only), then cache()
step2. Place the title and description in an object; use a regex to find the title and description again
step3. Normalize the String
     For title: toLower -> filter all HTML -> stripChars, keeping only [a-zA-Z\d\-]
     For description: toLower -> filter URL -> filter HTML -> stripChars -> stripNumber
step4. Tokenize the String
     We predefine a list of 2-word and 3-word phrases, stored in a text file.
     For title:
          Find the phrases in the string that are contained in the pre-defined list.
          Convert the string to a list of words and phrases
          eg:   big data software engineer -> big, data, software, engineer, big data, software engineer
                  ("big data" and "software engineer" are pre-defined in the list)

     For description:
          Find the phrases in the string that are contained in the pre-defined list.
          Remove stop words, using a pre-defined stop word list.
          Apply the Porter Stemming Algorithm (https://github.com/dlwh/epic, PorterStemmer.scala)
          Convert the string to a list of words and phrases
step5. Calculate IDF
      TF-IDF http://sillycat.iteye.com/blog/2231432
      The document frequency DF(t, D) is the number of documents that contain term t.
      |D| is the total number of documents in the corpus.
      IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
step6. Save File on S3
      key, index, IDF
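
The IDF formula in step5 and the key, index, IDF rows in step6 can be sketched in plain Scala as follows. This is only an illustration on in-memory collections, not the Spark job itself. The names Idf, Feature and featureMap are made up, the natural logarithm is assumed, and indices are assigned in alphabetical order purely to keep the output deterministic.

```scala
object Idf {
  // One row of the saved feature map: term, its index, and its IDF.
  final case class Feature(key: String, index: Int, idf: Double)

  // docs: each document as a list of tokens (words and phrases).
  def featureMap(docs: Seq[Seq[String]]): Seq[Feature] = {
    val total = docs.size
    // DF(t, D): count each term at most once per document.
    val df: Map[String, Int] =
      docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
    // Sort the terms so that the assigned indices are deterministic (an assumption).
    df.keys.toSeq.sorted.zipWithIndex.map { case (term, index) =>
      Feature(term, index, math.log((total + 1.0) / (df(term) + 1.0)))
    }
  }
}
```

A term that occurs in every document gets log(1) = 0, so it carries no weight; rarer terms get larger values.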

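The normalization chains in step3 above could look roughly like this. The regular expressions are assumptions, since the post does not show the real ones. Stripped characters are replaced by a space here so that words do not run together, which is also an assumption.

```scala
object Normalizer {
  // All patterns below are illustrative guesses, not the job's actual regexes.
  private val htmlTag = "<[^>]*>".r
  private val url = "(https?://|www\\.)\\S+".r
  private val notAllowed = "[^a-z\\d\\-]+".r // input is already lower-cased
  private val digits = "\\d+".r
  private val spaces = "\\s+".r

  private def collapse(s: String): String = spaces.replaceAllIn(s, " ").trim

  // toLower -> filter HTML -> keep only letters, digits and '-'
  def normalizeTitle(raw: String): String = {
    val noHtml = htmlTag.replaceAllIn(raw.toLowerCase, " ")
    collapse(notAllowed.replaceAllIn(noHtml, " "))
  }

  // toLower -> filter URL -> filter HTML -> strip chars -> strip numbers
  def normalizeDescription(raw: String): String = {
    val noUrl = url.replaceAllIn(raw.toLowerCase, " ")
    val noHtml = htmlTag.replaceAllIn(noUrl, " ")
    val stripped = notAllowed.replaceAllIn(noHtml, " ")
    collapse(digits.replaceAllIn(stripped, " "))
  }
}
```

For example, normalizeTitle("<b>Senior</b> Big-Data Engineer (Java 8)!") yields "senior big-data engineer java 8".
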
3. Classifier Model Training
step1. Load the featureMap that was pre-calculated in the previous operation
step2. Binary Feature Extractor
step3. Load List of Jobs
step4. Train Minor
step5. Train Arbitrator
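
The post does not show the binary feature extractor from step2. One plausible reading is that a feature is 1 when the term occurs in the job and 0 otherwise, regardless of how often it occurs. A minimal sketch under that assumption (the names BinaryFeatures and extract are hypothetical):

```scala
object BinaryFeatures {
  // featureIndex: term -> column index, as loaded from the saved feature map.
  // Returns the sorted indices of the features present in the document;
  // each listed index stands for the value 1.0, every other index for 0.0.
  def extract(tokens: Seq[String], featureIndex: Map[String, Int]): Seq[Int] =
    tokens.flatMap(t => featureIndex.get(t).toList).distinct.sorted
}
```

Tokens that are not in the feature map are simply ignored. With spark-mllib on the classpath, the resulting indices could be turned into a sparse vector with Vectors.sparse(size, indices, values), using 1.0 for every value.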

4. Classification System
MajorGroupClassificationSystem
MinorGroupClassificationSystem

References:
http://sillycat.iteye.com/blog/2230117
http://sillycat.iteye.com/blog/2231432

http://www.scalanlp.org/

2015-10-31 05:16