Sunday, May 31, 2015

Inverted Index Text Mining of Twitter Data

A year ago, I got tired of wading through my timeline to find tweets of interest.  It was beginning to feel like work, another chore to complete every day.  OnlyWorthy was born over a weekend and has served admirably ever since.  A little Python script wakes up every night, grabs all the day's tweets from a handful of accounts, and retweets those with the highest scores (a combination of favorites and retweets).  I follow @OnlyWorthy instead of those accounts, so only the 'good' tweets reach my timeline.
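The original scoring is nothing fancy. A minimal sketch, assuming the Twitter API's standard field names; weighting the two counts equally is my illustration here, not necessarily what the script uses:

```python
# Crude 'worthiness' score for a tweet, in the spirit of the nightly script.
# retweet_count and favorite_count are the Twitter API's field names;
# the equal weighting is an illustrative assumption.
def crude_score(tweet):
    return tweet.retweet_count + tweet.favorite_count
```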

This is obviously a crude way to solve the problem, so my Data Mining class project became an opportunity to iterate on OnlyWorthy with a more sophisticated mechanism.

The idea is as follows: put tweets into the Vector Space Model.  If we know a collection of tweets a user enjoys, each one can be indexed as an individual document.  New tweets can be indexed the same way.  At that point, similarity scores can be computed between new tweets and liked tweets, and the relevant ones identified.
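To make the intuition concrete, here's a toy version in plain Python, with no Lucene and no IDF weighting; the two example tweets are made up:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two tweets as bag-of-words term-frequency vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

liked = "Fed signals it may hold interest rates steady this year"
candidate = "Markets rally as the Fed holds interest rates steady"
print(cosine(liked, candidate))  # higher value -> more similar tweets
```

The real system replaces raw term counts with TF-IDF weights, which is where the indexing below comes in.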

This was the first time I'd worked with PyLucene, and I was very impressed.  I used common Lucene analyzers and the Similarity API.
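As a rough sketch of the indexing step, here's the PyLucene shape of it. The class names are standard Lucene, but constructor signatures vary across Lucene versions (older ones take a Version argument), so treat this as illustrative rather than a drop-in from the project:

```python
import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.store import RAMDirectory

lucene.initVM()  # PyLucene embeds a JVM; start it before touching any Lucene class

directory = RAMDirectory()                      # in-memory index of liked tweets
config = IndexWriterConfig(StandardAnalyzer())  # tokenizes, lowercases, drops stop words
writer = IndexWriter(directory, config)

liked_tweets = [  # made-up examples standing in for the user's liked tweets
    "Fed signals it may hold interest rates steady this year",
    "Hands-on with the newest crop of smartwatches",
]
for text in liked_tweets:
    doc = Document()
    doc.add(TextField("text", text, Field.Store.YES))  # stored so matches can be displayed
    writer.addDocument(doc)
writer.close()
```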

Here's an example workflow:

- A user identifies 100 tweets which they have enjoyed.  This can be done in a variety of ways (examination of Twitter activity, an app with swipe left/right feedback, etc.) but for this project their text was simply fed in or randomly selected.  All these tweets are placed in an index.
- A user identifies 50 Twitter usernames they like.  This limit keeps the project scoped; with bulk-level API access, more of the Twitter firehose could be ingested.
- At some interval, the application grabs recent tweets from those 50 accounts.  These tweets are placed in an index and a score is calculated for each new tweet against each liked tweet (see the sketch after this list).
- A handful of the top-scoring new tweets are considered worthy and displayed to the user.  This could be done by retweet or presentation in a special app; in my project, they were simply printed to stdout.
- I did not implement a continual feedback loop here, but that's what should be done to keep refining performance.
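Continuing the indexing sketch above, the scoring step can look like this: each new tweet's text is parsed into a query against the liked-tweet index, and the best hit's score stands in for relevance. QueryParser.escape guards against tweet text that contains Lucene query syntax; the variable names are mine for illustration:

```python
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import DirectoryReader
from org.apache.lucene.queryparser.classic import QueryParser
from org.apache.lucene.search import IndexSearcher

# `directory` is the liked-tweet index built in the earlier sketch.
reader = DirectoryReader.open(directory)
searcher = IndexSearcher(reader)
parser = QueryParser("text", StandardAnalyzer())  # older Lucene also wants a Version argument

def relevance(new_tweet_text):
    query = parser.parse(QueryParser.escape(new_tweet_text))
    hits = searcher.search(query, 1).scoreDocs  # best-matching liked tweet
    return hits[0].score if len(hits) > 0 else 0.0

new_tweets = ["Central bank leaves rates unchanged", "What I ate for lunch"]
for text in sorted(new_tweets, key=relevance, reverse=True)[:5]:
    print(text)  # the handful of 'worthy' tweets, highest score first
```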

For scoring, I went with an ensemble that combined TF-IDF cosine similarity and BM25 values.
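A minimal sketch of how such an ensemble can be wired up with Lucene's Similarity API: run each query once under BM25Similarity and once under DefaultSimilarity (Lucene's classic TF-IDF scorer, renamed ClassicSimilarity in later versions) and blend the scores. The equal weights are an assumption for illustration; raw BM25 and TF-IDF scores live on different scales, so real use needs normalization or tuning:

```python
from org.apache.lucene.search.similarities import BM25Similarity, DefaultSimilarity

def ensemble_score(query, searcher):
    """Blend BM25 and classic TF-IDF scores for the best-matching liked tweet."""
    def top_score(similarity):
        searcher.setSimilarity(similarity)
        hits = searcher.search(query, 1).scoreDocs
        return hits[0].score if len(hits) > 0 else 0.0

    bm25 = top_score(BM25Similarity())
    tfidf = top_score(DefaultSimilarity())
    return 0.5 * bm25 + 0.5 * tfidf  # illustrative weights, not tuned values
```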

Results were pretty good.  For example, given these liked tweets:

[example liked tweets embedded in the original post]

the program identified these new tweets, which seem like pretty good matches (one about monetary policy and one about new gadgets):

[example matched tweets embedded in the original post]

Difficulties:

- It sometimes told the user what they already knew.  If the user indicated they liked a tweet announcing a new version of Elasticsearch, for example, other tweets with similar content would be flagged.  I added a time-based penalty, figuring these tweets would be clustered in time, but it didn't work well enough (sketched after this list).
- Tuning is necessary to get the performance needed to evaluate millions of tweets.
- Lots of scaffolding was mocked out and still needs to be implemented.  Building a system that lets people flag tweets they enjoy isn't extremely difficult, but it can't be written quickly either.
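For what it's worth, the time penalty looked roughly like this; the exponential shape and one-day scale are illustrative assumptions, not values from the project:

```python
import math

def penalized(score, new_ts, liked_ts, tau=86400.0):
    """Discount a similarity score when the candidate tweet was posted close
    in time to the liked tweet it matched, on the assumption that tweets
    about the same announcement cluster together."""
    dt = abs(new_ts - liked_ts)                  # seconds between the two tweets
    return score * (1.0 - math.exp(-dt / tau))  # near zero when dt is small
```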

If you're interested, the source code is available here.  I haven't deployed this to OpenShift because there is still a lot of work required to make it run continuously.  Feel free to borrow whatever ideas are expressed here!