District Data Labs Project Incubator Program

A crucial part of learning data science is applying the skills you learn to real world projects. Working on interesting projects keeps you motivated to continue learning and helps you sharpen your skills. Working in teams helps you learn from the different experiences of others and generate new ideas about learning avenues you can pursue in the future.

That’s why District Data Labs is starting a virtual incubator program for data science projects!

The incubator program is:

  • Free (no cost to you)
  • Part-time (you can work on projects in your spare time)
  • Virtual (you don’t need to be located in the DC area)


The first class of the incubator is scheduled to run from May through October 2014.  This is a great way to learn by working on a project with other people and even potentially sharing in the rewards of a project that ends up being commercially viable.

For more info and to apply, you can go to http://bit.ly/1dqp11k.

Applications end soon, so get yours in today!

Posted in Announcements, Community, Projects

Political Tech: Predicting 2016 Headlines

Mark Stephenson is a Founding Partner at Cardinal Insights, a data analysis, modeling and strategy firm.  Cardinal Insights provides accessible and powerful data targeting tools to Republican campaigns and causes of all sizes.  Twitter:  @markjstephenson  http://www.CardinalInsights.com

The reliance on data in politics comes as no surprise to those who watch trends in technology.  Business and corporate entities have been making major investments in data analysis, warehousing and processing for decades, as have both major political parties.  As the strategic, tactical and demographic winds shift for political operatives, so too has the need to become more effective at building high quality datasets with robust analysis efforts.

Recent efforts by both Republican and Democrat organizations to outpace each other in the analytical race to the top have been well documented by the press[1].  With the 2016 Presidential election cycle already underway (yes…really), I decided to make some headline predictions for what we will see after our next President is elected, as it relates to data, technology and organizational shifts over the next three years.

Continue reading

Posted in Commentary, Community, Consulting

The Evolution of Big Data Platforms and People

This is a guest post by Paco Nathan. Paco is an O’Reilly author, Apache Spark open source evangelist with Databricks, and an advisor for Zettacap, Amplify Partners, and The Data Guild. Google lives in his family’s backyard. Paco spoke at Data Science DC in 2012.

A kind of “middleware” for Big Data has been evolving since the mid–2000s. Abstraction layers help make it simpler to write apps in frameworks such as Hadoop. Beyond the relatively simple issue of programming convenience, there are much more complex factors in play. Several open source frameworks have emerged that build on the notion of workflow, exemplifying highly sophisticated features. My recent talk Data Workflows for Machine Learning considers several OSS frameworks in that context, developing a kind of “scorecard” to help assess best-of-breed features. Hopefully it can help your decisions about which frameworks suit your use case needs.

By definition, a workflow encompasses both the automation that we’re leveraging (e.g., machine learning apps running on clusters) as well as people and process. In terms of automation, some larger players have departed from “conventional wisdom” for their clusters and ML apps. For example, while the rest of the industry embraced virtualization, Google avoided that path by using cgroups for isolation. Twitter sponsored a similar open source approach, Apache Mesos, which was credited with helping resolve their “Fail Whale” issues prior to their IPO. As other large firms adopt this strategy, the implication is that VMs may have run out of steam. Certainly, single-digit utilization rates at data centers (current industry norm) will not scale to handle IoT data rates: energy companies could not handle that surge, let alone the enormous cap-ex implied. I’ll be presenting on Datacenter Computing with Apache Mesos next Tuesday at the Big Data DC Meetup, held at AddThis. We’ll discuss the Mesos approach of mixed workloads for better elasticity, higher utilization rates, and lower latency.

Continue reading

Posted in Events, GuestPost, Management, Meetup

Calling all Coders! Code-a-Palooza Submissions Now Open

The Health Datapalooza 2014 Code-a-Palooza challenge is now open for submissions! Teams will use newly-released Centers for Medicare and Medicaid Services (CMS) data to create interactive data visualization tools to help consumers improve their health care decision-making.  Prizes totaling $35,000 will be awarded.

Continue reading

Posted in Announcements, Community, Competitions, Events

What If Wikipedia Could Update Itself? More at OpenGov WikiHack

James Hare, President, Wikimedia DC. Image credit: Wikimedia DC

Wikipedia is known throughout the world as a valuable source of information on almost any subject imaginable. Since its creation in 2001, Wikipedia has amassed over 30 million articles in 287 languages—over 4.4 million in English alone. This is made possible by the countless hours and efforts of volunteers, each contributing bits and pieces of his or her expertise. Unfortunately, despite the continued advance of technology, the act of writing paragraphs of prose has yet to be automated and still requires the efforts of humans.

But that does not mean that every part of Wikipedia is curated by hand. As we speak, automated software processes called “bots” are responsible for all manner of routine maintenance. These include removing vandalism from articles, sorting pages in and out of categories and checking for instances of copyright infringement on newly created articles. One of the earliest bots, Rambot, was created in 2002 to create articles on places throughout the United States, creating almost 37,000 Wikipedia articles in the process. This was made possible with data gathered from the U.S. Census Bureau and other agencies—more details are available here.

Continue reading

Posted in Announcements, Community, Competitions, Events, GuestPost

Saving Money Using Data Science for Dummies

Ram C Singh

I’d like to tell you a story about how I made “data science” work for me without writing a single line of code, launching a single data analysis or visualization app, or even looking at a single digit of data.

“No, way!”, you say? Way, my friends. Way.

Our company is developing VoteRaise.com, a new way for voters to fundraise for the candidates they’d like to have run for political office. We realized that, even here in DC, there is no active community of people exploring how innovations in technology and techniques impact political campaigning, so we decided to create it.

Continue reading

Posted in Events, GuestPost, Methods, Tutorials

Deep Learning Inspires Deep Thinking

This is a guest post by Mary Galvin, founder and managing principal at AIC. Mary provides technical consulting services to clients including LexisNexis’ HPCC Systems team. HPCC is an open source, massively parallel-processing computing platform that solves Big Data problems.

Deep Learning is about Cats and Dogs

Data Science DC hosted a packed house at the Artisphere on Monday evening, thanks to the efforts of organizers Harlan Harris, Sean Gonzalez, and several others who helped plan and coordinate the event. Michael Burke, Jr, Arlington County Business Development Manager, provided opening remarks and emphasized Arlington’s commitment to serving local innovators and entrepreneurs. Michael subsequently introduced Sanju Bansal, a former MicroStrategy founder and executive who presently serves as the CEO of an emerging, Arlington-based start-up, Hunch Analytics. Sanju energized the audience by providing concrete examples of data science’s applicability to business; this was no better illustrated than by the $930 million acquisition of the Climate Corporation roughly 6 months ago.

Michael, Sanju, and the rest of the Data Science DC team helped set the stage for a phenomenal presentation put on by John Kaufhold, Managing Partner and Data Scientist at Deep Learning Analytics. John started his presentation by asking the audience for a show of hands on two items: 1) whether anyone was familiar with deep learning, and 2) of those who said yes to #1, whether they could explain what deep learning meant to a fellow data scientist. Of the roughly 240 attendees present, most of the hands raised for question #1 dropped when John posed question #2.

I’ll be the first to admit that I was unable to raise my hand for either of John’s introductory questions. The fact that I was at least a bit knowledgeable in the broader machine learning topic helped put my mind somewhat at ease, thanks to prior experiences working with statistical machine translation, entity extraction, and entity resolution engines. That said, I still entered John’s talk braced for the ‘deep’ learning curve that lay ahead. Although I’m still trying to decompress from everything that was covered – it being less than a week since the event took place – I’d summarize key takeaways from the densely-packed, intellectually stimulating, 70+ minute session that ensued as follows:

Continue reading

Posted in Data Science DC, Events, GuestPost, Meetup, Reviews

Newsletter! Jobs!

Data Community DC is thrilled to announce three new things!

  1. We’ve got a newsletter! Or, rather, a new newsletter! For quite a while we’ve had an automated daily newsletter that you could subscribe to, but it just sent you any new blog content. Our new newsletter is weekly, will be edited/curated, and contains highlights of the last week on the blog, next week in events, and more. You should definitely subscribe!
  2. We’ve got job listings! Nothing makes us happier than hearing about people making career connections through DC2’s events. And now we’ve got a new way to actively help you find an amazing new job, or an amazing new employee. Our job ads will be published in the newsletter, and they’ll be short, targeted, and to the point. So subscribe to the newsletter now! If you’ve got a position you’re looking to fill, posting in DC2’s newsletter is your best option for reaching data/statistical/analytics professionals in the DC area. You should definitely submit an ad! Free for nonprofits, government, and sponsors; super-cheap for everyone else.
  3. We’re hiring! DC2 is a volunteer-run organization. But the amazing support from our sponsors, as well as revenue from workshops and advertising, means that we’ve decided to post two roles. First, we want to hire someone for five hours a week to help produce the newsletter, manage our job ads, etc. Second, we’ve got a few graphic/web design needs, and want to find a freelance designer who can help us on a project basis. Interested? More details are posted here.

As always, DC2’s mission is to promote, educate, and network data professionals in the region. Got an idea for something we should be doing? Get Involved!

Posted in Announcements, Jobs, Newsletter

Apply to be a TCamp 2014 Scholar – From the Sunlight Foundation

Interested in coming to TCamp? Need some financial assistance to get here? We can help!

Continue reading

Posted in Announcements, Community

Facility Location Analysis Resources Incorporating Travel Time

This is a guest blog post by Alan Briggs. Alan is an operations researcher and data scientist at Elder Research. Alan and Harlan Harris (DC2 President and Data Science DC co-organizer) have co-presented a project with location analysis and Meetup location optimization at the Statistical Programming DC Meetup and an INFORMS-MD chapter meeting. Recently, Harlan presented a version of this work at the New York Statistical Programming Meetup. There was some great feedback on the Meetup page asking for some additional resources. This post by Alan is in response to that question.

If you’re looking for a good text resource to learn some of the basics about facility location, I highly recommend grabbing a chapter of Dr. Michael Kay’s e-book (pdf) available for free from his logistics engineering website. He gives an excellent overview of some of the basics of facility location, including single facility location, multi-facility location, facility location-allocation, etc. At ~20 pages, it’s entirely approachable, but technical enough to pique the interest of the more technically-minded analyst. For a deeper dive into some of the more advanced research in this space, I’d recommend using some of the subject headings in his book as seeds for a simple search on Google Scholar. It’s nothing super fancy, but there are plenty of articles in the public domain that relate to minisum/minimax optimization and all of their narrowly tailored implementations.
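
To make the minisum idea concrete, here is a minimal Python sketch of single-facility location using Weiszfeld's iteration, a standard method for finding the point that minimizes the total straight-line distance to a set of demand points. It is not taken from Dr. Kay's book or from the Meetup project: the function names, the default parameters, and the use of Euclidean distance in place of travel time are all assumptions made for illustration.

```python
import math


def total_distance(location, points, weights=None):
    """Minisum objective: weighted sum of straight-line distances."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * math.hypot(location[0] - px, location[1] - py)
               for w, (px, py) in zip(weights, points))


def max_distance(location, points):
    """Minimax objective: distance to the farthest demand point."""
    return max(math.hypot(location[0] - px, location[1] - py)
               for px, py in points)


def minisum_location(points, weights=None, iterations=200, tol=1e-9):
    """Approximate the minisum location with Weiszfeld's iteration.

    Hypothetical helper for illustration; uses Euclidean distance,
    not travel time.
    """
    if weights is None:
        weights = [1.0] * len(points)
    # Start from the weighted centroid of the demand points.
    total_w = sum(weights)
    x = sum(w * px for w, (px, _) in zip(weights, points)) / total_w
    y = sum(w * py for w, (_, py) in zip(weights, points)) / total_w
    for _ in range(iterations):
        num_x = num_y = denom = 0.0
        for w, (px, py) in zip(weights, points):
            d = math.hypot(x - px, y - py)
            if d < tol:
                # Simplification: skip a demand point the estimate sits on,
                # which avoids dividing by zero.
                continue
            num_x += w * px / d
            num_y += w * py / d
            denom += w / d
        if denom == 0.0:
            break
        new_x, new_y = num_x / denom, num_y / denom
        moved = math.hypot(new_x - x, new_y - y)
        x, y = new_x, new_y
        if moved < tol:
            break
    return x, y
```

As a usage example, calling minisum_location([(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (10.0, 10.0)]) returns a candidate location, and total_distance and max_distance can then be used to compare it against alternatives such as the centroid. Swapping in travel times would mean replacing math.hypot with a lookup into a travel-time source, at which point this closed-form update no longer applies directly and a search over candidate sites is the more usual approach.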

Continue reading

Posted in GuestPost, Meetup, Methods, Resources, Statistical Programming DC