sillycat

Classification(1)Find Phrases from String

 

1. Find Important Phrases in All the Content
Start my Local Zeppelin
> bin/zeppelin-daemon.sh start

My local Zeppelin connects to the YARN cluster running in VirtualBox, so I first need to start VirtualBox and the ubuntu-master, ubuntu-dev1, and ubuntu-dev2 machines.

How to Load Jar
z.load("org.scalaz:scalaz-core_2.10:7.2.0-M2")

How to Connect to S3
val rdd = sc.textFile("s3n://sillycat/jobs.csv")

How to Add a Custom Jar to Zeppelin
In the file zeppelin-env.sh:
export ZEPPELIN_JAVA_OPTS="-Dspark.jars=/home/spark-seed-assembly-0.0.1.jar,/home/classifier-assembly-1.0.jar"

A README.md in This Format Helps a Lot
# Classification System #

### What is this repository for? ###

* NLP and classification

### How do I get set up? (TODO)###

* Summary of set up

Special Character in HTML
http://www.degraeve.com/reference/specialcharacters.php

Really Nice Code to Filter the Characters
IncludetextMunging.scala
IncludeTextMungingSpec.scala

Get Phrases from One String
/**
* Counts phrases using a sliding window.
*
* Example:
* In:  getPhrasesInTitle(Job("foo foo foo foo foo foo", ""), 2)
* Out: Map(foo foo -> 5)
*
* In:  getPhrasesInTitle(Job("foo foo foo foo foo foo bar foo", ""), 2)
* Out: Map(foo foo -> 5, foo bar -> 1, bar foo -> 1)
*/
def getPhrasesInTitle(job: Job, numWordsInPhrase: Int) = {
    val phrases = job.title.split(" ").sliding(numWordsInPhrase).foldLeft(Map("" -> 0)) {
        (phraseCounts: Map[String, Int], phrase: Array[String]) =>
            phrase.size == numWordsInPhrase match {
                case true =>
                    val str = phrase.mkString(" ")
                    val count = phraseCounts.getOrElse(str, 0) + 1
                    phraseCounts + (str -> count)
                case false =>
                    phraseCounts
            }
    }
    phrases - ""
}

One Map Operation
scala> val m1 = Map( ""->0, "s1" ->1)
scala> val m2 = m1 - ""
m2: scala.collection.immutable.Map[String,Int] = Map(s1 -> 1)
scala> val m3 = m2 - "s1"
m3: scala.collection.immutable.Map[String,Int] = Map()

Merge Map
http://stackoverflow.com/questions/20047080/scala-merge-map
http://www.nimrodstech.com/scala-map-merge/

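Then merge the maps with the scalaz append operator: map1 |+| map2. When a key appears in both maps, its values are combined (for Int values, they are added).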

https://github.com/scalaz/scalaz
How to Add scalaz-core to Your Classpath
https://keramida.wordpress.com/2013/12/02/using-sbt-to-experiment-with-new-scala-libraries/

Directly on the Command Line
> wget http://central.maven.org/maven2/org/scalaz/scalaz-core_2.10/7.1.3/scalaz-core_2.10-7.1.3.jar
> scala -cp scalaz-core_2.10-7.1.3.jar
scala> import scalaz.Scalaz._
scala> val k1 = Map( "key"->1, "key22"->3)
k1: scala.collection.immutable.Map[String,Int] = Map(key -> 1, key22 -> 3)
scala> val k2 = Map( "key1"->11, "key122"->13)
k2: scala.collection.immutable.Map[String,Int] = Map(key1 -> 11, key122 -> 13)
scala> val k3 = k1 |+| k2
k3: scala.collection.immutable.Map[String,Int] = Map(key1 -> 11, key122 -> 13, key -> 1, key22 -> 3)

Or put the jars in a lib directory, and this will work
> scala -cp lib/*

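The Whole Flow of Phrase Finding
Each item is turned into a map of phrase counts, for example with 2-word phrases:
item = "foo foo foo foo" -> Map("foo foo" -> 3)
The per-item maps are then merged into one:
items.map(item => getPhrasesInTitle(item, 2)).reduce(_ |+| _)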
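
The flow above can be sketched in plain Scala without the scalaz dependency. This is only an illustration: the names PhraseFlow, countPhrases, merge, and countAll are made up for this example, titles are plain strings rather than Job objects, and merge is a hand-written stand-in for |+| on Map[String, Int].

```scala
object PhraseFlow {
  type Counts = Map[String, Int]

  // Map step: phrase counts for one title, using a sliding window of n words.
  def countPhrases(title: String, n: Int): Counts =
    title.split(" ").filter(_.nonEmpty)
      .sliding(n)
      .filter(_.length == n)            // skip a window shorter than n words
      .map(_.mkString(" "))
      .foldLeft(Map.empty[String, Int]) { (acc, phrase) =>
        acc.updated(phrase, acc.getOrElse(phrase, 0) + 1)
      }

  // Reduce step: assumed stand-in for scalaz's |+|; shared keys are summed.
  def merge(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (phrase, n)) =>
      acc.updated(phrase, acc.getOrElse(phrase, 0) + n)
    }

  // foldLeft with an empty map is used instead of reduce so that an
  // empty list of titles yields an empty result rather than an exception.
  def countAll(titles: Seq[String], n: Int): Counts =
    titles.map(countPhrases(_, n)).foldLeft(Map.empty[String, Int])(merge)
}
```

For example, PhraseFlow.countAll(Seq("foo foo foo foo", "ok hello foo foo"), 2) counts "foo foo" three times in the first title and once in the second, so the merged map holds "foo foo" -> 4 alongside "ok hello" -> 1 and "hello foo" -> 1.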


Scala Skill Tip
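1. How to Use _
var className: ClassName = _
initializes the field to the default value of its type (null for reference types, 0 for numeric types, false for Boolean), so for a reference type it is equivalent to
var className: ClassName = null
This form is only allowed for fields, not for local variables.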
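
A small sketch of those default values, assuming the Scala 2 syntax used throughout this post. Holder is a made-up class for illustration only.

```scala
object DefaultInit {
  class Holder {
    var name: String = _    // reference type: starts as null
    var count: Int = _      // numeric type: starts as 0
    var ready: Boolean = _  // Boolean: starts as false
  }
}
```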

2. foldLeft (/:), foldRight (:\) and fold
val numbers = List(5,1,3,3)
numbers.fold(0) { (z, i) =>
     z+i
}
fold starts from the initial value 0 and adds the first element of the list, giving 5. That result is then added to the next element, and so on through the list, ending with 12.

Another Use Case
class Foo(val name: String, val age: Int, val sex: Symbol)
object Foo {
     def apply(name:String, age:Int, sex: Symbol) = new Foo(name, age, sex)
}

val fooList = Foo("Carl", 33, 'male) :: Foo("Kiko", 23, 'female) :: Nil
val stringList = fooList.foldLeft(List[String]()) { (z, f) =>
     val title = f.sex match {
          case 'male => "Mr."
          case 'female => "Ms."
     }
     z :+ s"$title ${f.name}, ${f.age}"
}      // stringList(0) is "Mr. Carl, 33"

foldLeft processes the elements from left to right and foldRight from right to left. fold makes no guarantee about the order, so its operation should be associative.
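
The difference in direction is easiest to see with an operation that records how the calls nest. The following is a minimal sketch; FoldOrder and its fields are names made up for this example.

```scala
object FoldOrder {
  val letters = List("a", "b", "c")

  // foldLeft starts at the left: the accumulator wraps "a" first, then "b", then "c".
  val left: String = letters.foldLeft("")((acc, s) => "(" + acc + s + ")")

  // foldRight starts at the right: the accumulator wraps "c" first, then "b", then "a".
  val right: String = letters.foldRight("")((s, acc) => "(" + s + acc + ")")

  // Addition is associative, so fold gives the same sum in any order.
  val sum: Int = List(5, 1, 3, 3).fold(0)(_ + _)
}
```

Here left is "(((a)b)c)", right is "(a(b(c)))", and sum is 12.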

3. Iterator.sliding
sliding[B >: A](size: Int, step: Int): size is the number of elements in each window, and step is how far the window advances each time (1 if omitted).
scala> (1 to 5).iterator.sliding(3).toList
res0: List[Seq[Int]] = List(List(1, 2, 3), List(2, 3, 4), List(3, 4, 5))

scala> (1 to 5).iterator.sliding(4, 3).toList
res1: List[Seq[Int]] = List(List(1, 2, 3, 4), List(4, 5))

scala> (1 to 5).iterator.sliding(4, 3).withPartial(false).toList
res2: List[Seq[Int]] = List(List(1, 2, 3, 4))
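
Besides withPartial, the grouped iterator returned by sliding also offers withPadding, which fills a short final window instead of dropping it. The sketch below puts the three variants side by side; SlidingDemo is a name made up for this example.

```scala
object SlidingDemo {
  // Default: the final window may be shorter than `size`.
  val partial = (1 to 5).iterator.sliding(4, 3).toList

  // withPartial(false): the short final window is dropped.
  val fullOnly = (1 to 5).iterator.sliding(4, 3).withPartial(false).toList

  // withPadding(0): the short final window is filled up to `size` with 0.
  val padded = (1 to 5).iterator.sliding(4, 3).withPadding(0).toList
}
```

With these inputs, padded is List(List(1, 2, 3, 4), List(4, 5, 0, 0)).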




References:
scala underscore
http://stackoverflow.com/questions/8000903/what-are-all-the-uses-of-an-underscore-in-scala
foldLeft
http://hongjiang.info/foldleft-and-foldright/
http://www.iteblog.com/archives/1228
sliding
http://daily-scala.blogspot.com/2009/11/iteratorsliding.html
http://hongjiang.info/scala-counting-reduplicated-character/