Applications Now Open for District Data Labs Incubator and Research Lab

Applications are now open for the Spring 2015 session of the District Data Labs Incubator program and Research Lab.

The Incubator is a structured 3-month project development program in which small teams build a data product together. Each team is assigned one project and works on it over the course of the 3 months. Teams are small (3-4 people) and are carefully assembled to contain a mix of quantitative and technical skills.

The Research Lab is an applied research institute focusing on the development of novel data science solutions whose practical applications have the potential to make a significant impact across multiple industries. These projects aim to push beyond the limits of what is currently technologically possible, using data science as the means to do so. In addition to being more technologically advanced, Research Lab projects typically require a higher level of expertise than the projects in the Incubator program.

Both programs are free, and if you’d like more information about either of them, head over to the DDL Projects page.

Full Disclosure: District Data Labs is a partner organization that co-hosts weekend workshops with Data Community DC, and the two organizations have overlapping boards.
Posted in Uncategorized

DC NLP November 2014 Meetup Announcement: Introduction to Topic Modeling with LDA

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP November Meetup!


This month’s event features an overview of Latent Dirichlet Allocation and probabilistic topic modeling.

Topic models are a family of models to estimate the distribution of abstract concepts (topics) that make up a collection of documents. Over the last several years, the popularity of topic modeling has swelled. One model, Latent Dirichlet Allocation (LDA), is especially popular.
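To make the idea of "a distribution of topics per document" concrete, here is a minimal sketch of the generative story that LDA assumes, using only the Python standard library. The two topics, the toy vocabulary, and the `alpha` values are hypothetical and chosen purely for illustration; this is not the speaker's code, and it only generates data rather than fitting a model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document following LDA's assumed generative process."""
    # Each document gets its own mixture over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(n_words):
        # Pick a topic for this word position, then a word from that topic.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each is a probability distribution over words.
TOPICS = [
    {"spark": 0.5, "hadoop": 0.3, "cluster": 0.2},
    {"corpus": 0.4, "token": 0.4, "parser": 0.2},
]

rng = random.Random(0)
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], n_words=8, rng=rng)
print([round(p, 2) for p in theta], words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the per-document mixtures and the per-topic word distributions.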

Tommy Jones, Research Associate Statistician at the Institute for Defense Analyses – Science and Technology Policy Institute, will describe a range of topic modeling algorithms and how they fit into the topic modeling taxonomy. He will then focus on LDA, explaining how to tune its parameters and giving tips for building better LDA models.

Finally, Tommy will present several open statistical questions in topic modeling, particularly LDA. Examples include LDA’s inconsistency, how sample selection affects estimates, and how to best present results. Researchers have begun to tackle some of these issues, but others remain. Still, LDA and other topic models are becoming invaluable resources for researchers in many disciplines.

DC NLP November Meetup
Wednesday, November 12, 2014
6:30 PM to 8:30 PM
Stetsons Famous Bar & Grill
1610 U Street Northwest, Washington, DC

The DC NLP meetup group is for anyone in the Washington, D.C. area working in (or interested in) Natural Language Processing. Our meetings provide an opportunity for folks to network, give presentations about their work or research projects, learn about the latest advancements in our field, and exchange ideas or brainstorm. Topics include computational linguistics, machine learning, text analytics, data mining, information extraction, speech processing, sentiment analysis, and much more.

For more information and to RSVP, please visit:
Follow us on Twitter: @DCNLP

Posted in Events, GuestPost, Meetup

Notes on a Meetup

This is a guest post by Catherine Madden (@catmule), a lifelong doodler who realized a few years ago that doodling, sketching, and visual facilitation can be immensely useful in a professional environment. The post consists of her notes from the most recent Data Visualization DC Meetup. Catherine works as the lead designer for the Analytics Visualization Studio at Deloitte Consulting, designing user experiences and visual interfaces for visual analytics prototypes. She prefers Paper by FiftyThree and their Pencil stylus for digital note taking. (Click on the image to open full size.)


You can follow Catherine on Twitter to see more of her notes from data-related talks, like these and these. And don’t miss the next Data Visualization DC Meetup, Our Interests Define Our Networks, on October 27!

Posted in Data Visualization DC, Events, GuestPost, Infographics, Meetup, Reviews

October Starts Off Right: A Full Month of Events for Women in Tech

This is a guest post by Shannon Turner, a software developer and founder of Hear Me Code, offering free, beginner-friendly coding classes for women in the DC area. In her spare time she creates projects like Shut That Down and serves as a mentor with Code for Progress.

Over 200 women attended the DC Fem Tech Tour de Code Kickoff party held at Google Thursday night. DC Fem Tech, a collective of over 25 women-in-tech organizations, collaborates to run events and support the women in DC’s tech community.


The collective came together in early 2014 when Stephanie Nguyen and Shana Glenzer realized that the DC women-in-tech community had many different groups doing similar work to lower the barriers women face when entering the tech field. By bringing together groups that share similar goals, DC Fem Tech amplifies the voices of each group and builds collective power.

Posted in Announcements, Events, GuestPost

DC NLP October 2014 Meetup Announcement: Automated Query Parsing, and Fact Checking with Truth Teller

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP October Meetup!


This month features an introduction to the art of automated query parsing and a discussion of a Washington Post app that automates some of the tedium of fact checking.

Tony Maull is a Senior Director of Enterprise at DataRPM. He will discuss the differences between computational search and content search. His primary focus is how computation can be relied upon to return a consistently accurate answer when a natural language question can be asked in any number of ways.

Sara Carothers is a Mobile Project Manager at the Washington Post and the product owner for Truth Teller, an experimental news app that fact-checks political speech. If a politician repeats a talking point that has already been fact-checked, Truth Teller automatically links the reader to that reporting. To accomplish this, the app incorporates speech-to-text processing and preliminary aspects of natural language processing. Sara will introduce the app and speak about the possibilities of collaboration between the fields of journalism and NLP.

Posted in Announcements, Events, GuestPost

Announcing the Publication of Practical Data Science Cookbook

Four of DC2’s board members have published a new book! Tony Ojeda, Sean Murphy, Benjamin Bengfort, and Abhijit Dasgupta are proud to announce the arrival of Practical Data Science Cookbook.

Practical Data Science Cookbook is perfect for those who want to learn data science and numerical programming concepts through hands-on, real-world project examples. Whether you are brand new to data science or you are a seasoned expert, you will benefit from learning about the structure of data science projects, the steps in the data science pipeline, and the programming examples presented in this book. Since the book is formatted to walk you through the projects with examples and explanations along the way, no prior programming experience is required.

And while you’re at it, why not double your impact and check out another book by our board members? Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work by Harlan Harris, Sean Murphy, and Marck Vaisman can be downloaded for free from the O’Reilly website.

Posted in Announcements, Languages, Methods, Press Releases, Python, R, Resources

Social Network Analysis with Python Workshop on November 22nd

Data Community DC and District Data Labs are hosting a full-day Social Network Analysis with Python workshop on Saturday November 22nd.  For more info and to sign up, go to  Register before October 31st for an early bird discount!


Social networks are not new, even though websites like Facebook and Twitter might make you want to believe they are; and trust me, I’m not talking about Myspace! Social networks are extremely interesting models of human behavior, whose study dates back to the early twentieth century. However, because of those websites, data scientists have access to much more data than the anthropologists who studied the networks of tribes!

Because networks take a relationship-centered view of the world, the data structures that we will analyze model real-world behaviors and communities. Through a suite of algorithms derived from mathematical graph theory, we are able to compute and predict the behavior of individuals and communities. This has a number of practical applications, from recommendation to law enforcement to election prediction, and more.
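As a small illustration of this relationship-centered view, the sketch below stores a network as an adjacency map and computes degree centrality, one of the simplest graph-theoretic measures of how connected an individual is. The names and friendship ties are hypothetical, and the sketch uses only the Python standard library; the workshop itself may well use a dedicated graph library instead.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency map from (person, person) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}

# Hypothetical friendship ties, for illustration only.
EDGES = [("ana", "ben"), ("ana", "cat"), ("ana", "dev"), ("ben", "cat")]

graph = build_graph(EDGES)
centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

Here "ana" is tied to every other person, so she has the highest possible degree centrality, while "dev" has only one tie.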

Posted in Events, Python

Fast Data Applications with Spark & Python Workshop on November 8th

Data Community DC and District Data Labs are excited to be hosting a Fast Data Applications with Spark & Python workshop on November 8th. For more info and to sign up, go to  There’s even an early bird discount if you register before October 17th!


Hadoop has made the world of Big Data possible by providing a framework for distributed computing on economical, commercial off-the-shelf hardware. Hadoop 2.0 implements a distributed file system, HDFS, and a computing framework, YARN, that allows distributed applications to easily harness the power of clustered computing on extremely large data sets. Over the past decade, the primary application framework has been MapReduce – a functional programming paradigm that lends itself extremely well to designing distributed applications, but carries with it a lot of computational overhead.
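The MapReduce paradigm described above can be sketched on a single machine in a few lines of plain Python. This is only an illustration of the map, shuffle, and reduce steps applied to a word count; the function names and input lines are hypothetical, and nothing here uses Hadoop's actual API or runs on a cluster.

```python
from collections import defaultdict
from functools import reduce

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, reduce(lambda a, b: a + b, values)

# Hypothetical input records, one per line.
LINES = ["big data on big clusters", "data on commodity hardware"]

pairs = [pair for line in LINES for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster, the map and reduce calls run on many machines and the shuffle moves intermediate pairs between them, which is one source of the overhead mentioned above.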

Many excellent analytical applications and algorithms have been written in MapReduce, creating an ecosystem that has made Hadoop continue to grow as an effective tool. However, more complex algorithms, especially machine learning algorithms, often require extremely complex chains of jobs to conform to the MapReduce functional paradigm. Enter Spark, an open source Apache project that uses the cluster resource daemons of Hadoop (particularly HDFS and other Hadoop data stores) but allows developers to break out of the MapReduce paradigm and write distributed applications that are much faster.

Spark also distributes applications across a cluster by using distributed executor processes. Spark developers write applications that are intended to work on local data; however, unlike with MapReduce, these executors are in communication with each other and can share data via an external store. Spark is intended to work with Hadoop data stores, but it can also run in standalone mode or, if you already have a Hadoop 2.0 cluster, on YARN. This flexibility means that Spark can be used to implement more complex algorithms and applications that were previously impractical with MapReduce patterns.

Spark can run in memory, which the Spark project reports can make it up to 100 times faster than disk-based MapReduce on some workloads, and it provides a programming API in Scala, Java, and Python, making it more accessible to developers. Spark has an interactive command line interface for quickly exploring data on the cluster, tools for writing SQL-like queries, and a fairly complete machine learning library. Importantly, it can also execute graph algorithms that were previously difficult to port to MapReduce frameworks.

Posted in Announcements, Events, Python

DC NLP September 2014 Meetup Announcement: Natural Language Processing for Assistive Technologies

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP September Meetup!


This month, we’re joined by Kathy McCoy, Professor of Computer & Information Science and Linguistics at the University of Delaware. Kathy is also a consultant for the National Institute on Disability and Rehabilitation Research (NIDRR) at the U.S. Department of Education. Her research focuses on natural language generation and understanding, particularly for assistive technologies, and she’ll be giving a presentation on Replicating Semantic Connections Made by Visual Readers for a Scanning System for Nonvisual Readers.

Posted in Announcements, Community, Events, GuestPost, Meetup

Natural Language Analysis with NLTK on October 25th

Data Community DC and District Data Labs are excited to be hosting a Natural Language Analysis with NLTK workshop on October 25th. For more info and to sign up, go to  There’s even an early bird discount if you register before October 3rd!


Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world – unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that particularly with Python and with the Natural Language Toolkit (NLTK).

NLTK is an excellent library for machine-learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python and NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.

Posted in Announcements, Events, Python, Text Analytics