sillycat

Classification(3) Generate Features and Stem Adjust the Model System

 

1. Scala Operation
String Method - contains
scala> val longContent = "carl love to study python, scala"
longContent: String = carl love to study python, scala

scala> longContent.contains("python")
res0: Boolean = true

Map Merge Function
Run this from the root of a project that already has the scalaz jar among its dependencies.
> sbt console
scala> import scalaz.Scalaz._
import scalaz.Scalaz._

scala> val m1 = Map("0"->0, "1" ->1)
m1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1)

scala> val m2 = Map("2"->2)
m2: scala.collection.immutable.Map[String,Int] = Map(2 -> 2)

scala> val m3 = m1 |+| m2
m3: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)

Map Operation
scala> m3
res1: scala.collection.immutable.Map[String,Int] = Map(0 -> 0, 1 -> 1, 2 -> 2)

scala> m3 - "0"
res2: scala.collection.immutable.Map[String,Int] = Map(1 -> 1, 2 -> 2)
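
In the merge above, m1 and m2 share no keys, so m3 looks like a plain union. With scalaz, |+| on a Map[String, Int] adds the values of keys that appear in both maps. Here is a minimal sketch of that behavior in plain Scala, without scalaz; MapMergeSketch and mergeSum are hypothetical names used only for illustration.

```scala
object MapMergeSketch {
  // Merge two maps; when a key is present in both, add the two values.
  // This imitates scalaz's |+| for Map[String, Int] without the dependency.
  def mergeSum(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0) + value)
    }
}
```

For example, MapMergeSketch.mergeSum(Map("a" -> 1, "b" -> 2), Map("b" -> 3, "c" -> 4)) yields a map with a -> 1, b -> 5, c -> 4, which is handy for adding up per-document term counts.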

scalaz (source of the |+| operator)
https://github.com/scalaz/scalaz

Sliding
scala> (1 to 5).iterator.sliding(3).toList
res3: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))

List Operation
scala> List(1,2,3).zip(List("one","two","three"))
res8: List[(Int, String)] = List((1,one), (2,two), (3,three))

Run with Assembly Jar
./spark-submit --num-executors 2 --driver-memory 2G --class com.sillycat.jobs.GenerateFeatureMap ${path_to_jar}

Nice Configuration in build.sbt
// There's a problem with jackson 2.5+ with Spark 1.4.1
dependencyOverrides ++= Set(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.4.4"
)

When building the assembly jar, mark Spark core and the related Spark libraries as "provided" so they are left out of the jar; the cluster supplies them at runtime.
"org.apache.spark" %% "spark-core" % "1.4.1"  % "provided",               // Apache v2
"org.apache.spark" %% "spark-mllib" % "1.4.1"  % "provided",              // Apache v2

2. Detail Operations
GenerateFeatureMap
step1. Load job info from S3 (title and description only) and call cache()
step2. Put the title and description into an object, using a regex to find the title and description in each record
step3. Normalize the String
     For title: toLower -> filter all HTML -> stripChars, only keep [a-zA-Z\d\-]
     For description: toLower -> filter URL -> filter HTML -> stripChars -> stripNumber
step4. Tokenize the String
     A list of 2-word and 3-word phrases is predefined and stored in a text file.
     For Title:
          Find the phrases in the string which are contained in the pre-defined list.
          Convert the string to words and phrase List
          eg:   big data software engineer -> big, data, software, engineer, big data, software engineer
                  (big data and software engineer are pre-defined in the list)

     For description:
          Find the phrases in the string which are contained in the pre-defined list.
          Remove stop words, using a predefined stop word list.
          Stem with the Porter Stemming Algorithm (https://github.com/dlwh/epic, PorterStemmer.scala)
          Convert the string to words and phrase List
step5. Calculate IDF
      TF-IDF  http://sillycat.iteye.com/blog/2231432
      The document frequency DF(t, D) is the number of documents that contain term t.
      |D| is the total number of documents in the corpus.
      IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1))
step6. Save File on S3
      key, index, IDF
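
Steps 3 to 5 can be sketched for titles in plain Scala as below. This is only an illustration under stated assumptions, not the job's actual code: FeatureMapSketch, its method names, the tiny phrase and stop word sets, and the choice to keep spaces during normalization are all hypothetical. Spark, S3 loading, URL filtering and Porter stemming are left out.

```scala
object FeatureMapSketch {
  // Hypothetical lists; the real job reads its phrase and stop word lists from text files.
  val phrases: Set[String] = Set("big data", "software engineer")
  val stopWords: Set[String] = Set("a", "the", "and", "of")

  // step3 (title): lower-case, drop HTML tags, keep only [a-z\d\-].
  // Assumption: spaces are kept as well so the string can still be split into words.
  def normalizeTitle(raw: String): String =
    raw.toLowerCase
      .replaceAll("<[^>]*>", " ")
      .replaceAll("[^a-z\\d\\- ]", " ")
      .trim
      .replaceAll("\\s+", " ")

  // step4 (title): single words first, then every 2-word and 3-word window
  // that appears in the predefined phrase list.
  def tokenizeTitle(normalized: String): List[String] = {
    val words = normalized.split(" ").filter(_.nonEmpty).toList
    val found = for {
      n <- List(2, 3)
      window <- words.sliding(n).toList
      if window.size == n // sliding returns one short window when words.size < n
      phrase = window.mkString(" ")
      if phrases.contains(phrase)
    } yield phrase
    words ++ found
  }

  // step4 (description): stop word removal only; stemming is omitted in this sketch.
  def removeStopWords(words: List[String]): List[String] =
    words.filterNot(stopWords.contains)

  // step5: IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)) over tokenized documents.
  def idf(docs: Seq[List[String]]): Map[String, Double] = {
    val total = docs.size
    // Count each term at most once per document to get the document frequency.
    val df = docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => (t, occ.size) }
    df.map { case (t, n) => (t, math.log((total + 1.0) / (n + 1.0))) }
  }
}
```

With these definitions, tokenizeTitle(normalizeTitle("<b>Big Data</b> Software Engineer!")) gives big, data, software, engineer, big data, software engineer, matching the example in step4. A term that occurs in every document gets an IDF of 0.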

3. Classifier Model Training
step1. Load the featureMap precomputed in the previous operation
step2. Binary Feature Extractor
step3. Load List of Jobs
step4. Train Minor
step5. Train Arbitrator
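
The post does not show the binary feature extractor from step2, so the following is only a guess at its general shape: each token found in the feature map switches on the column at its index, regardless of how often the token occurs. BinaryFeatureSketch, featureIndex and both method names are hypothetical.

```scala
object BinaryFeatureSketch {
  // Hypothetical feature map: token -> column index
  // (the previous operation saves key, index, IDF per feature).
  val featureIndex: Map[String, Int] =
    Map("big" -> 0, "data" -> 1, "big data" -> 2, "engineer" -> 3)

  // Sparse form: sorted indices of the features present in the document.
  // Unknown tokens are ignored and repeated tokens count once.
  def activeIndices(tokens: Seq[String]): Seq[Int] =
    tokens.flatMap(t => featureIndex.get(t)).distinct.sorted

  // Dense form: a 0/1 vector with one slot per known feature.
  def toDense(tokens: Seq[String]): Array[Double] = {
    val vec = Array.fill(featureIndex.size)(0.0)
    activeIndices(tokens).foreach(i => vec(i) = 1.0)
    vec
  }
}
```

The sparse form (indices only) is usually the better fit when the feature map is large and each document contains few of its terms.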

4. Classification System
MajorGroupClassificationSystem
MinorGroupClassificationSystem

References:
http://sillycat.iteye.com/blog/2230117
http://sillycat.iteye.com/blog/2231432

http://www.scalanlp.org/