>>> df.text
731122251278499841 RT @javacodegeeks: Top Performance Metrics for...
724281574129180672 World's Largest Python Discovered in Nepal: WA...
724281535587856384 🎷 💎 STOP! Could you be @ExpendTeam's Python / ...
724281501622345729 My Little Python:Changing the world: Artist of...
724281482357837825 Artist of the Day:Arder: This cute snake art c...
724280128860094469 Watching Boa vs. Python — https://t.co/iyNJ58EZCE
724280108807249920 RT @mumbrainstats: FSL task fMRI tutorials alm...
724280070467129344 #How to print the amount of times each word of...
731128258532540416 RT @randal_olson: Free video course on YouTube...
731127852318523393 RT @Petzoldt: Excellent hands-on tutorial abou...
731127380249444357 RT @nixcraft: Announcing Certbot: EFF's Client...
724275650043875328 Go boa wkwk💪😄 ★ Boa vs. Python — https://t.co/...
724275609858392066 RT @RealPython: List of Python API Wrappers &g...
724275578879111169 Watching Boa vs. Python — https://t.co/5THbrirfQO
724275568871673857 Чертова дюжина вакансий в IT и Digital / / 1....
Sequences of Words
Before we throw tokens into a bag...
Bags of words have no "order"
Python set() vs. list()
Bags Jumble up Meaning
I saw a black Ferrari and stopped at red lights.
I stopped at a red Ferrari and saw black lights.
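A minimal sketch of why the two sentences above are indistinguishable as bags. The `bag_of_words` helper is hypothetical (not part of twip); it assumes whitespace tokenization, lowercasing, and stripping of trailing punctuation.

```python
from collections import Counter

def bag_of_words(text):
    # Hypothetical helper: split on whitespace, strip edge punctuation, lowercase
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    return Counter(tokens)

s1 = "I saw a black Ferrari and stopped at red lights."
s2 = "I stopped at a red Ferrari and saw black lights."

bag1 = bag_of_words(s1)
bag2 = bag_of_words(s2)

same_bag = bag1 == bag2                   # word counts are identical
same_sequence = s1.split() == s2.split()  # word order is not
print(same_bag, same_sequence)
```

The bags compare equal even though the sentences mean different things, because a Counter (like a set) keeps no record of word order.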
Real Bag Jumbling Example
My Little Python: Changing the world: Artist
Little Python Changing the :My Artist: world
N-Grams
2-grams:
["I saw", "saw a", "a black", "black Ferrari", ...]
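One way to build n-grams like these, as a sketch. The `ngrams` function is an illustrative name, and tokens come from a plain whitespace split, so punctuation stays attached to words.

```python
def ngrams(tokens, n=2):
    # Slide a window of n tokens across the sequence and join each window
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

s1 = "I saw a black Ferrari and stopped at red lights."
s2 = "I stopped at a red Ferrari and saw black lights."

bigrams1 = ngrams(s1.split(), 2)
bigrams2 = ngrams(s2.split(), 2)

print(bigrams1[:4])
# Unlike bags of single words, the 2-gram sets differ between the sentences
bigrams_differ = set(bigrams1) != set(bigrams2)
print(bigrams_differ)
```

Because each 2-gram keeps a little local word order, "black Ferrari" and "red Ferrari" are no longer interchangeable.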
Sentences
How would you detect sentence "boundaries"?
How can twip.nlp.Tokenizer know when periods in Mr. Chomsky's name end a sentence and when they don't? What about U.S.S.R. or :smilies: :-) or :-P?
Even Parsey McParseface can't handle this one (needs one sentence per line)
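To see why this is hard, here is a deliberately naive splitter. It is a sketch, not how twip.nlp.Tokenizer works: `naive_sentences` is a hypothetical name and the example text is made up for illustration. It breaks after any ".", "!" or "?" that is followed by whitespace.

```python
import re

def naive_sentences(text):
    # Split wherever sentence-ending punctuation is followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "Mr. Chomsky visited the U.S.S.R. once. He smiled :-) and left."
sentences = naive_sentences(text)

for s in sentences:
    print(repr(s))
```

The rule wrongly cuts after "Mr." and after "U.S.S.R.", so two sentences come back as four pieces. Abbreviations need either a lookup list or a trained model to resolve.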