Hacking Oregon's Hidden Political Connections
Agenda:
For Hack Oregon we explored the data in unusual ways
- Pandas as a DB
- Find Connections (FKs, PKs, other DBs)
- TFIDF on a DB table
- TFIDF similarity
- Similarity Similarity
Intro: 1
Pandas as a relational DB
- Identify foreign keys automatically
- Use FKs to do SQL-like join queries
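A minimal sketch of how foreign keys might be guessed automatically, assuming each table is a pandas DataFrame: a column is treated as a candidate FK when all of its values appear in a unique column of another table. The table and column names (`committees`, `transactions`, `filer_id`) are hypothetical, and `find_foreign_keys` is an illustrative helper, not code from the project.

```python
import pandas as pd


def find_foreign_keys(tables):
    """Return (child_table, child_col, parent_table, parent_col) candidates.

    A column is a candidate FK if every non-null value it holds appears in a
    unique, non-null (primary-key-like) column of a different table.
    """
    candidates = []
    for parent_name, parent in tables.items():
        for pk in parent.columns:
            # Only unique, fully populated columns can act as primary keys
            if not parent[pk].is_unique or parent[pk].isna().any():
                continue
            pk_values = set(parent[pk])
            for child_name, child in tables.items():
                if child_name == parent_name:
                    continue
                for col in child.columns:
                    values = set(child[col].dropna())
                    if values and values <= pk_values:
                        candidates.append((child_name, col, parent_name, pk))
    return candidates


# Hypothetical toy tables standing in for the real database
committees = pd.DataFrame(
    {"committee_id": [1, 2, 3], "name": ["A PAC", "B PAC", "C PAC"]}
)
transactions = pd.DataFrame(
    {"tran_id": [10, 11, 12, 13],
     "filer_id": [1, 1, 3, 2],
     "amount": [50.0, 20.0, 75.0, 10.0]}
)
tables = {"committees": committees, "transactions": transactions}

fks = find_foreign_keys(tables)

# Use the detected FK for a SQL-like LEFT JOIN followed by GROUP BY
child, child_col, parent, parent_col = fks[0]
joined = tables[child].merge(
    tables[parent], left_on=child_col, right_on=parent_col, how="left"
)
totals = joined.groupby("name")["amount"].sum()
```

On real data this subset test will produce false positives (small integer columns overlap by accident), so the candidates are best treated as suggestions to review by hand.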
Intro: 2
Intersect large sets
- AM emails in BehindTheCurtain DB?
- 10 GB MySQL dump converted to dozens of CSVs
- Load 50M emails efficiently
- Intersect emails with public records
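One way the intersection could be done without loading everything at once, sketched under assumptions: the smaller public-records list is held in memory as a set, and the large dump is streamed row by row from its CSV files. The `email` column name and both helper functions are hypothetical.

```python
import csv
import io


def iter_emails(csv_file, column="email"):
    """Yield normalized email addresses from an open CSV file, one row at a time."""
    for row in csv.DictReader(csv_file):
        email = (row.get(column) or "").strip().lower()
        if email:
            yield email


def intersect_emails(public_emails, dump_files, column="email"):
    """Return the public-record emails that also appear in any dump file."""
    # The small side lives in memory; set membership tests are O(1) on average
    wanted = {e.strip().lower() for e in public_emails}
    found = set()
    for csv_file in dump_files:
        # The large side is streamed, so memory use stays flat
        for email in iter_emails(csv_file, column):
            if email in wanted:
                found.add(email)
    return found


# In-memory stand-ins for the CSV files (hypothetical data)
public = ["Alice@example.org", "bob@example.org"]
dump = [
    io.StringIO("id,email\n1,alice@example.org\n2,carol@example.org\n"),
    io.StringIO("id,email\n3, BOB@example.org \n"),
]
matches = intersect_emails(public, dump)
```

With real files, each `io.StringIO` would be replaced by `open(path, newline="")`.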
Intro: 3
Restructure a DB
- Why?
- How?
- Restructure (TFIDF)
Intro: 4
TFIDF to detect similarity between records
- cluster Oregon PACs by their "mission"
- d3 force-directed graph of PAC similarity
- compare to directed graph (DG) of financial transactions
Intro: 5
Similarity between similarity matrices
- SAY (TFIDF) vs. DO (Transactions)
3. Restructure DB
Why?
- Squish fields into a string?
- Vectorizing later anyway, right?
Because
- Dimensions are vaguely defined/understood
- Information "smears" across fields/dimensions
3. Restructure DB: How?
- Ignore numbers/dates
- Stringify each field
- Stem words
- Ignore words (are you sure?)
- Concatenate
- Split
- Vectorize/Count
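The steps above can be sketched for a single record as follows. This is an illustration under assumptions, not the talk's code: the stop-word list is a placeholder, `crude_stem` is a deliberately naive stand-in for a real stemmer, and the record fields are hypothetical.

```python
import re
from collections import Counter

# Placeholder stop-word list; which words to ignore is a judgment call
STOP_WORDS = {"the", "of", "and", "for", "a", "to", "in"}


def crude_stem(word):
    """Very rough suffix stripper; a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def record_to_counts(record):
    """Turn one DB record (a dict of fields) into a bag-of-stems Counter."""
    # Ignore numbers/dates: keep only fields that are already text
    text_fields = [v for v in record.values() if isinstance(v, str)]
    # Concatenate the text fields into one string
    text = " ".join(text_fields).lower()
    # Split into alphabetic tokens (digits and punctuation are dropped)
    tokens = re.findall(r"[a-z]+", text)
    # Ignore stop words, then stem what is left
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    # Count occurrences of each stem
    return Counter(stems)


record = {
    "committee_id": 42,
    "name": "Teachers for Oregon Schools",
    "purpose": "Supporting teachers and schools",
    "founded": 2010,
}
counts = record_to_counts(record)
```

Here "teachers" and "schools" appear in two different fields but end up in the same two counts, which is the point of squishing the fields together first.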
3. Restructure DB: TFIDF
- Must be sparse to fit in memory
- Explicit Python standard library (collections): Counter, defaultdict
- sklearn
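A sketch of the explicit route, assuming each document is already a list of tokens. Each vector is a plain dict holding only nonzero weights, which keeps the matrix sparse. The formula used is the basic unsmoothed variant, tf * log(N / df); sklearn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter, defaultdict


def tfidf(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per doc."""
    counts = [Counter(tokens) for tokens in docs]

    # Document frequency: in how many documents does each term occur?
    doc_freq = defaultdict(int)
    for c in counts:
        for term in c:
            doc_freq[term] += 1

    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vec = {}
        for term, n in c.items():
            weight = (n / total) * math.log(n_docs / doc_freq[term])
            if weight > 0:  # terms found in every document carry no weight
                vec[term] = weight
        vectors.append(vec)
    return vectors


# Hypothetical token lists for three records
docs = [
    ["school", "teach", "oregon"],
    ["timber", "oregon"],
    ["school", "fund", "oregon"],
]
vectors = tfidf(docs)
```

"oregon" occurs in every document, so its weight is zero and it is not stored at all.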
4. TFIDF Similarity
Large dimensions are scary
- Everything is far apart
- Euclidean distance loses meaning (nearest and farthest neighbors look alike)
- Our brains fail
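A small experiment can illustrate the effect, assuming points drawn uniformly at random from the unit cube. It measures how much farther the farthest point is than the nearest one, relative to the nearest; `relative_contrast` is a hypothetical helper written for this illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """(farthest - nearest) / nearest Euclidean distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # math.dist needs Python 3.8+
    return (max(dists) - min(dists)) / min(dists)


low = relative_contrast(200, 2)      # 2 dimensions
high = relative_contrast(200, 1000)  # 1000 dimensions
```

In 2 dimensions the contrast is large, so "nearest" means something; in 1000 dimensions all points sit at nearly the same distance and the contrast shrinks toward zero.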
4. TFIDF Similarity
Vector distances
4. TFIDF Similarity
Cosine Similarity
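(similarity is, loosely, the inverse of distance)
- Equivalent:
- Pearson correlation (for mean-centered vectors)
- v_1 dot v_2 / (|v_1| |v_2|) (normalized projection)
- cosine of the angle between v_1 and v_2
- Bounded: [-1, +1] ([0, 1] for non-negative TFIDF vectors)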
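A minimal sketch of cosine similarity on sparse vectors stored as `{term: weight}` dicts (the storage format is an assumption). The second function shows the link to Pearson correlation: it is the cosine similarity of the two series after subtracting their means.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {key: weight} vectors."""
    # Only keys present in both vectors contribute to the dot product
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: an empty vector is similar to nothing
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation computed as cosine similarity of centered data."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    centered_x = {i: x - mean_x for i, x in enumerate(xs)}
    centered_y = {i: y - mean_y for i, y in enumerate(ys)}
    return cosine_similarity(centered_x, centered_y)


a = {"school": 1.0, "teach": 2.0}
b = {"school": 2.0, "teach": 4.0}   # same direction as a, twice the length
c = {"timber": 3.0}                 # shares no terms with a

same_direction = cosine_similarity(a, b)
no_overlap = cosine_similarity(a, c)
anti_correlated = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Vector length does not matter, only direction, which is why a short and a long description using the same words score as nearly identical.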
5. Similarity Similarity
Cluster Oregon PACs by their "mission"
- d3 force-directed graph of PAC similarity
- compare to DG of financial transactions
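The slides do not say how the two similarity structures were compared. One plausible approach, offered here only as an assumption, is to correlate the off-diagonal entries of the two matrices: a score near 1 means PACs that sound alike also transact alike. The matrices below are made-up examples, and a directed transaction graph would first need to be turned into a symmetric matrix.

```python
import math


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant matrix carries no pattern to compare
    return cov / (sd_x * sd_y)


def matrix_similarity(a, b):
    """Correlate pairwise similarities from two symmetric matrices."""
    return pearson(upper_triangle(a), upper_triangle(b))


# Hypothetical 3-PAC example: what they SAY (TFIDF) vs. what they DO (transactions)
say = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
do = [[1.0, 0.8, 0.0],
      [0.8, 1.0, 0.3],
      [0.0, 0.3, 1.0]]
score = matrix_similarity(say, do)
```

The diagonal is excluded because every PAC is trivially identical to itself in both matrices, which would inflate the score.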