Lesson 2

Got Features?

Features in Tweet Dumps

tweets.json

  • favorites
  • userid, id number
  • city, state
  • lat, lon
  • user settings (geo on, protected account)
  • tweet info (promoted, urls)
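
A minimal sketch of pulling the features listed above out of one tweet record. The record below is hypothetical, and its key names (favorite_count, geo_enabled, and so on) are assumptions modeled on the Twitter REST API; the actual layout of tweets.json may differ.

```python
import json

# Hypothetical tweet record; key names and values are illustrative assumptions.
raw = '''
{"id": 1, "favorite_count": 3,
 "user": {"id": 42, "geo_enabled": true, "protected": false},
 "coordinates": {"coordinates": [-122.68, 45.52]},
 "entities": {"urls": [{"expanded_url": "https://example.com"}]}}
'''


def extract_features(tweet):
    """Flatten a nested tweet dict into a flat dict of features."""
    user = tweet.get("user", {})
    # Tweets without geo data have no coordinates; fall back to a pair of Nones.
    coords = (tweet.get("coordinates") or {}).get("coordinates") or [None, None]
    return {
        "id": tweet.get("id"),
        "user_id": user.get("id"),
        "favorites": tweet.get("favorite_count", 0),
        "lon": coords[0],
        "lat": coords[1],
        "geo_on": user.get("geo_enabled", False),
        "protected": user.get("protected", False),
        "num_urls": len(tweet.get("entities", {}).get("urls", [])),
    }


features = extract_features(json.loads(raw))
print(features)
```

Missing keys fall back to defaults, so a sparse record still yields a full feature row.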

Tokens

What's a token?

Meaning

Smallest meaning bit?

  • Statement? Paragraph?
  • Sentence?
  • Phrase?
  • Noun phrase ("python language")
  • Word?
  • Character?
  • Punctuation
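
To make the question concrete, here is a rough sketch that cuts the same text at several of the granularities listed above. The sample text and the regular expressions are illustrative assumptions, not a real parser.

```python
import re

text = "Python is a language. It is named after Monty Python."

# Naive splitters, one per candidate token size.
sentences = re.split(r"(?<=[.!?])\s+", text)   # break after ., ! or ?
words = re.findall(r"\w+", text)               # runs of letters and digits
punctuation = re.findall(r"[^\w\s]", text)     # anything not a word char or space
characters = list(text)

print(len(sentences), "sentences:", sentences)
print(len(words), "words:", words)
print(len(punctuation), "punctuation marks:", punctuation)
print(len(characters), "characters")
```

The smaller the token, the more tokens you get and the less meaning each one carries.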

Depends on Your Language

RNA and DNA

RNA tokens:

"A", "G", "C", "U"

DNA tokens:

"A", "G", "C", "T"

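With a four-letter alphabet, tokenizing is just splitting into characters and checking the alphabet. A small sketch; the function name tokenize_dna and the sample sequence are made up for illustration.

```python
DNA_TOKENS = set("AGCT")
RNA_TOKENS = set("AGCU")


def tokenize_dna(sequence):
    """Split a DNA string into single-letter tokens, rejecting anything else."""
    tokens = list(sequence.upper())
    unknown = set(tokens) - DNA_TOKENS
    if unknown:
        raise ValueError(f"not DNA tokens: {sorted(unknown)}")
    return tokens


dna = tokenize_dna("gattaca")
# Swapping T for U rewrites the same sequence in the RNA alphabet.
rna = ["U" if token == "T" else token for token in dna]
print(dna)
print(rna)
```
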
If You're a Computer

0's and 1's

00101010 00000100 00000010 00101010

Bit, Byte, or Word: Which Is the Token?
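
One way to see the choice: cut the same bit string at three different token sizes. The 32-bit value and the 16-bit word size below are assumptions for illustration only.

```python
# Illustrative 32-bit value written as four bytes.
bits = "00101010" "00000100" "00000010" "00101010"


def chunk(sequence, size):
    """Cut a string into consecutive pieces of the given size."""
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


bit_tokens = chunk(bits, 1)    # 32 tokens, vocabulary of 2
byte_tokens = chunk(bits, 8)   # 4 tokens, vocabulary of up to 256
word_tokens = chunk(bits, 16)  # 2 tokens, assuming a 16-bit word

print(len(bit_tokens), len(byte_tokens), len(word_tokens))
print([int(byte, 2) for byte in byte_tokens])
```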

Computer Language

OR DL,00101000b
XOR DL,00000010b
HCF

  • Line, statement
  • word, symbol

Python

ans = [c for c in "Hello"]

for x in "upword":
    print(x)
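
The two snippets above treat characters as tokens. Python itself sees larger tokens. A short sketch using the standard library's tokenize module shows how the interpreter's tokenizer splits the first line:

```python
import io
import tokenize

source = 'ans = [c for c in "Hello"]\n'

# Token types that only mark line structure, not content.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.type not in SKIP
]

for name, string in tokens:
    print(f"{name:8} {string}")
```

Here the string "Hello" is a single token, while the list comprehension that runs on it produces five one-character tokens.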

English Words

We'll mangle common multi-word tokens:

  • United States -> "States", "United"
  • Manchester United -> "Manchester", "United"
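
A sketch of how the mangling happens, assuming the simplest possible tokenizer (split on whitespace, keep the unique words, sort them). The helper name bag_of_words is made up here.

```python
def bag_of_words(text):
    """Split on whitespace and return the unique words in sorted order."""
    return sorted(set(text.split()))


country = bag_of_words("United States")
club = bag_of_words("Manchester United")
print(country)
print(club)

# Both bags now share the token "United", although the country
# and the football club have nothing to do with each other.
print(set(country) & set(club))
```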

Speaking of English Words

Have you heard of

"Lexicon"

  • Vocabulary
  • Dictionary
  • Set of words

Corpus

  • Set of Documents
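
The two ideas fit together: the lexicon is the set of words that occur anywhere in the corpus. A toy sketch, with three made-up documents standing in for a real corpus:

```python
# Toy corpus: a list of documents (made-up examples).
corpus = [
    "python is a snake",
    "python is a language",
    "the snake ate the mouse",
]

# Lexicon: every distinct word that appears in any document.
lexicon = sorted({word for doc in corpus for word in doc.split()})

print(len(corpus), "documents")
print(len(lexicon), "words in the lexicon:", lexicon)
```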

Our Corpus

>>> df.text
731122251278499841    RT @javacodegeeks: Top Performance Metrics for...
724281574129180672    World's Largest Python Discovered in Nepal: WA...
724281535587856384    🎷 💎 STOP! Could you be @ExpendTeam's Python / ...
724281501622345729    My Little Python:Changing the world: Artist of...
724281482357837825    Artist of the Day:Arder: This cute snake art c...
724280128860094469    Watching Boa vs. Python — https://t.co/iyNJ58EZCE
724280108807249920    RT @mumbrainstats: FSL task fMRI tutorials alm...
724280070467129344    #How to print the amount of times each word of...
731128258532540416    RT @randal_olson: Free video course on YouTube...
731127852318523393    RT @Petzoldt: Excellent hands-on tutorial abou...
731127380249444357    RT @nixcraft: Announcing Certbot: EFF's Client...
724275650043875328    Go boa wkwk💪😄 ★ Boa vs. Python — https://t.co/...
724275609858392066    RT @RealPython: List of Python API Wrappers &g...
724275578879111169    Watching Boa vs. Python — https://t.co/5THbrirfQO
724275568871673857    Чертова дюжина вакансий в IT и Digital /  / 1....

Sequences of Words

Before we throw tokens into a bag...

  • Bags of words have no "order"
  • Python set() vs. list()

Bags Jumble up Meaning

I saw a black Ferrari and stopped at red lights.

I stopped at a red Ferrari and saw black lights.
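
A quick check that the two sentences really do land in the same bag. This sketch uses collections.Counter as the bag (a multiset that keeps counts but no order) and a simple regex tokenizer, both chosen here for illustration.

```python
import re
from collections import Counter


def bag(text):
    """Lowercase the text and count its words, discarding their order."""
    return Counter(re.findall(r"\w+", text.lower()))


a = "I saw a black Ferrari and stopped at red lights."
b = "I stopped at a red Ferrari and saw black lights."

print(a == b)            # the sentences differ
print(bag(a) == bag(b))  # their bags do not
```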

Real Bag Jumbling Example

My Little Python: Changing the world: Artist

Little Python Changing the :My Artist: world

N-Grams

2-grams:

["I saw", "saw a", "a black", "black Ferrari", ...]
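
N-grams can be generated with a sliding window over the token list. A minimal sketch; the helper name ngrams is an assumption, not something from the lesson's code.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "I saw a black Ferrari".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # ['I saw', 'saw a', 'a black', 'black Ferrari']
print(trigrams)  # ['I saw a', 'saw a black', 'a black Ferrari']
```

Unlike a plain bag of words, the 2-grams keep "black Ferrari" together, so some word order survives.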

Sentences

  • How would you detect sentence "boundaries"?

How can twip.nlp.Tokenizer know when the period in "Mr. Chomsky" ends a sentence and when it doesn't? What about U.S.S.R., or smilies like :-) and :-P?

Even Parsey McParseface can't handle this one (it needs one sentence per line).
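
To see why this is hard, compare a naive splitter with one that knows a few abbreviations. This is only a sketch: it is not how twip.nlp.Tokenizer works, and the abbreviation list and sample text are made up.

```python
import re

# Hypothetical abbreviation list; a real one would be much longer.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "U.S.S.R."}


def naive_sentences(text):
    """Break after every ., ! or ? that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def better_sentences(text):
    """Same idea, but never break right after a known abbreviation."""
    sentences, current = [], []
    for piece in text.split():
        current.append(piece)
        if piece[-1] in ".!?" and piece not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


text = "I met Mr. Chomsky today. He talked about syntax."
print(naive_sentences(text))   # splits "Mr." off as its own sentence
print(better_sentences(text))
```

The second version still fails when an abbreviation such as U.S.S.R. really does end a sentence, and it ignores smilies entirely.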

Sentence Segmenters

Tweets

We don't need no stink'n sentences!

Tweets are stinky enough. Can't ramble in 140 characters.

Word2Vec Vectors

Each word is...

  • bag of words
  • defined by its neighbors
  • all words ever used with it
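
"Defined by its neighbors" can be sketched by counting which words appear within a small window of each word. This is a simple count-based illustration of the idea, not the Word2Vec training algorithm; the documents and the window size are made up.

```python
from collections import Counter, defaultdict


def context_counts(docs, window=2):
    """For each word, count the words seen within `window` positions of it."""
    contexts = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts


docs = ["the python ate the mouse", "the boa ate the rat"]
contexts = context_counts(docs, window=1)

print(contexts["python"])
print(contexts["boa"])
```

In this toy corpus "python" and "boa" end up with identical neighbor bags, which is the sense in which a word's neighbors hint at its meaning.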

Word Vectors

Each word is...

  • dictionary definition(s)?

Words

Wh

Workshop 2

  1. Tokenize Your Tweets
    • case normalization
    • ignore URLs and punctuation
    • transcode smilies?
  2. Compile a Vocabulary
    • Count word frequencies
    • Plot a Zipf plot of word frequencies
    • Plot a Zipf plot of document frequencies
  3. Compute the TFIDF
    • Term frequency in each tweet
    • Document frequency (which tweets contain the term)
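
A starting point for the workshop, as a rough sketch rather than a reference solution. The first two tweets are adapted from the corpus listing and the third is made up. The regexes and the TF-IDF weighting (relative term frequency times log(N / df)) are one common choice among several.

```python
import math
import re
from collections import Counter

URL = re.compile(r"https?://\S+")


def tokenize(tweet):
    """Lowercase, drop URLs, and keep words, #hashtags, and @mentions."""
    return re.findall(r"[#@]?\w+", URL.sub(" ", tweet.lower()))


tweets = [
    "Watching Boa vs. Python https://t.co/iyNJ58EZCE",
    "Watching Boa vs. Python https://t.co/5THbrirfQO",
    "Go boa! Boa beats python",  # made-up tweet
]
docs = [tokenize(tweet) for tweet in tweets]

# Vocabulary with total counts, and the number of tweets containing each term.
term_freq = Counter(token for doc in docs for token in doc)
doc_freq = Counter(token for doc in docs for token in set(doc))


def tfidf(doc, doc_freq, n_docs):
    """Score each term in one tweet: relative frequency times log(N / df)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


scores = [tfidf(doc, doc_freq, len(docs)) for doc in docs]

# Ranked frequencies are the raw material for a Zipf plot (rank vs. count).
for rank, (term, count) in enumerate(term_freq.most_common(), start=1):
    print(rank, term, count)
print(scores[2])
```

Terms that occur in every tweet, such as "boa" here, get a score of zero, while rare terms such as "go" score highest. Plotting rank against count on log-log axes is left for the workshop.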

Got Features?