#DCtech Tweets Visualized in 60 Minutes

Tweets with the hashtag #dctech

Yesterday, Peter Corbett of iStrategyLabs posted a data set of 65,000 tweets and Facebook statuses with the hashtag #dctech and challenged the community to visualize it.

These tweets represent two years of social media messages for and by the DC tech and startup community. Naturally, we thought visualizing this was a task for Data Community DC. I built two word clouds to visualize who and what people tweet about. Here’s how I built them:

  1. Using the Python csv library, I imported the data from a csv file.
  2. I scrubbed each tweet/status using regular expressions and the NLTK stopword corpus (I added some of my own stopwords after examining the data).
  3. I used NLTK to create a frequency distribution of words and two-word phrases in the tweets/statuses.
  4. I fed this data to a Python word cloud library.
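As a toy illustration of steps 2 and 3 (not the full script, which appears below), here is roughly how the cleaning and counting fit together. The sample tweets and the tiny stopword set are made up for the example, and `collections.Counter` stands in for `nltk.FreqDist` so the snippet runs without NLTK installed:

```python
import re
from collections import Counter

# Hypothetical sample tweets, not the real data set
tweets = [
    "Great #dctech meetup tonight with @1776dc!",
    "Awesome demo night at the #dctech meetup",
]

# Tiny stand-in for the NLTK stopword corpus
stopwords = {"the", "with", "at", "a"}

def tokenize(update):
    # Strip punctuation, lowercase, drop stopwords/URLs/single chars
    update = re.sub(r'[.:;&/#!",?(-]', '', update)
    words = [w for w in update.lower().split()
             if not w.startswith("http") and len(w) > 1 and w not in stopwords]
    # Add two-word phrases (bigrams) alongside the single words
    bigrams = [' '.join(words[i:i + 2]) for i in range(len(words) - 1)]
    return words + bigrams

freq = Counter()
for t in tweets:
    freq.update(tokenize(t))

print(freq.most_common(3))
```

The real script does the same thing with the NLTK stopword list and `nltk.FreqDist`, which shares `most_common` with `Counter`.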

I also filtered out just the handles to see who gets mentioned the most.
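The handle filter is nothing fancy: a token is a handle if it starts with "@" and is a single word. A minimal sketch, using a made-up frequency table rather than the real counts:

```python
# Hypothetical word frequencies; the real script builds these from the CSV
freq = {
    "@1776dc": 120,
    "meetup": 95,
    "@corbett3000": 80,
    "@inthecapital": 60,
    "great awesome": 12,
    "@rare": 3,
}

MINCOUNT = 5

# Keep only single-token @handles above the count threshold
handles = {k: v for k, v in freq.items()
           if v > MINCOUNT and k.startswith("@") and len(k.split()) == 1}

print(sorted(handles, key=handles.get, reverse=True))
```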

Who's mentioned the most in #dctech

Some takeaways:

  1. #DCtech is “great” and “awesome”.
  2. We love our meetups.
  3. We love our journalists.
  4. @1776dc, @corbett3000, and @inthecapital are the cool kids in school.

Here’s the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import sys, csv, re, nltk
from nltk.corpus import stopwords

sys.path.append("./word_cloud")

from wordcloud import make_wordcloud
import numpy as np

# NLTK's English stopwords, plus extras found by inspecting the data
SW = stopwords.words('english')
SW.extend(["rt", "via", "amp", "cc", "get", "us", "got", "way", "mt", "10", "gt"])

words = []

def tokenize(update):
	"""Strip punctuation, drop stopwords and URLs, return unigrams plus bigrams."""
	update = re.sub(r'[.:;&/#!",?(▸-]', '', update)
	tokens = update.lower().split()
	tokens = [word for word in tokens if word[:4] != "http" and len(word) > 1 and word not in SW]
	bigrams = [' '.join(tokens[i:i + 2]) for i in range(len(tokens) - 1)]
	return tokens + bigrams

# Column 4 of the CSV holds the tweet/status text
with open("tweet_data.csv", "r") as h:
	csvreader = csv.reader(h)
	for r in csvreader:
		words.extend(tokenize(r[4]))

MINCOUNT = 5
freq = nltk.FreqDist(words)
handles = []
handlecounts = []
# Dump the frequency table, pulling out single-token @handles separately
with open("freq.txt", "w") as h1, open("handles.txt", "w") as h2:
	for k, i in freq.most_common():
		if i > MINCOUNT:
			h1.write(k + " " + str(i) + "\n")
			if k[0] == '@' and len(k.split()) == 1:
				h2.write(k + " " + str(i) + "\n")
				handles.append(k)
				handlecounts.append(i)

# Word clouds of the most frequent terms and the most-mentioned handles
NUM = 300
w = 1200
h = 600
top = freq.most_common(NUM)
make_wordcloud(np.array([t[0] for t in top]), np.array([t[1] for t in top]), "tweets.png", width=w, height=h)
make_wordcloud(np.array(handles[:NUM]), np.array(handlecounts[:NUM]), "handles.png", width=w, height=h)


Valerie Coffman

Founder and CEO at Feastie
I'm a physicist turned data scientist and entrepreneur. Founder of Feastie -- search and analytics for the foodie blogosphere. I also blog at valeriecoffman.com.

This entry was posted in Commentary, Infographics, Python.