Harlan Harris is the President and a co-founder of Data Community DC, and is a long-time fan of DataKind.
Last week, DataKind, the nonprofit that connects pro-bono data and tech folks with nonprofits in need of data help, announced its first regional chapters: in the UK, Bangalore, Dublin, Singapore, San Francisco, and best of all (we think!), Washington, DC!
As they say in their intro blog post:
We bring together high-impact organizations dedicated to solving the world’s biggest challenges with leading data scientists to improve the quality of, access to and understanding of data in the social sector.
Easy, right? Well, it can be when we work with some of the top talent in data science and the world’s most incredible organizations. Enter the Washington, DC metropolitan area and our beltway buddies, Maryland and Virginia, with collectively perhaps the nation’s most nonprofits and statisticians per capita!
Our (lucky number) 13th DIDC meetup took place at the spacious offices of Endgame in Clarendon, VA. Endgame very graciously provided incredible gourmet pizza (and beer) for all those who attended.
Beyond such excellent food and beverages, attendees were treated to four separate and compelling talks. For those of you who could not attend, information about the talks and speakers, contact details, and the slides are all below!
This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary.
A front-page article over the weekend in the Wall Street Journal indicated that the number one profession of interest to tech firms is the data scientist: someone who combines analytic, computing, and domain skills to detect signals in data and use them to advantage. Although the terms are squishy, the push today is for “big data” and “predictive analytics” skills, which allow firms to leverage the deluge of data that is now accessible.
I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.
The two traditional aspects of the scientific method, theory and experimentation, have been joined by a third leg: computation. At the center of computation sits simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.
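As a toy illustration of the point, a Monte Carlo simulation can estimate a quantity by random sampling when a closed-form answer is out of reach. Here is a minimal sketch (the function name, sample size, and seed are our own choices, not from the post), estimating π from the fraction of random points in the unit square that land inside the quarter circle:

```python
import random

def estimate_pi(n=100_000, seed=42):
    """Monte Carlo estimate of pi: the fraction of uniform random
    points in the unit square that fall inside the quarter circle
    approximates pi/4."""
    rng = random.Random(seed)
    hits = sum(
        1 for _ in range(n)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4 * hits / n

print(estimate_pi())  # near 3.14, with error shrinking as n grows
```

The same pattern, sample randomly, compute a statistic, repeat, underlies the simulation studies seen in so many of those JSM talks.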
Data Community DC is pleased to announce a new service to the area data community: topic-specific discussion lists! In this way we hope to extend the successes of our Meetups and workshops by providing a way for groups of local people with similar interests to maintain contact and have ongoing discussions.
In a previous post, we announced the formation of the Deep Learning Discussion List for those interested in Deep Learning topics. The second topic-specific discussion group has just been created, a collaboration between Charlie Greenbacker (@greenbacker) of the DC-NLP Meetup Group and Ben Bengfort (@bbengfort) of DIDC, both specialists in Natural Language Processing and Computational Linguistics.
If you’re interested in Natural Language Processing and want to be part of the discussion, sign up here:
This post, from DC2 President Harlan Harris, was originally published on his blog. Harlan was on the board of WINFORMS, the local chapter of the Operations Research professional society, from 2012 until this summer.
Earlier this year, I attended the INFORMS Conference on Business Analytics & Operations Research, in Boston. I was asked beforehand if I wanted to be a conference blogger, and for some reason I said I would. This meant I was able to publish posts on the conference’s WordPress web site, and was also obliged to do so!
Here are the five posts that I wrote, along with an excerpt from each. Please click through to read the full pieces:
Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP August Meetup!
This August, we’re joined by Tony Davis, technical manager in the NLP and machine learning group at 3M Health Information Systems and adjunct professor in the Georgetown University Linguistics Department, where he’s taught courses including information retrieval and extraction, and lexical semantics.
Tony will be introducing us to automatic segmentation, which deals with breaking unstructured documents up into units: words, sentences, topics, and so on. Search and retrieval, document categorization, and analysis of dialog and discourse all benefit from segmentation.
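To see why segmentation is harder than it looks, here is a deliberately naive sentence segmenter (our own sketch, not from the talk): it splits on sentence-final punctuation followed by whitespace and a capital letter, and promptly stumbles on abbreviations.

```python
import re

def naive_sentences(text):
    """Naive sentence segmentation: split after ., !, or ? when
    followed by whitespace and a capital letter. Real segmenters
    must also handle abbreviations, quotes, numbers, etc."""
    pieces = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [p for p in pieces if p]

text = "Segmentation is hard. Mr. Smith knows it! Does he?"
print(naive_sentences(text))
# ['Segmentation is hard.', 'Mr.', 'Smith knows it!', 'Does he?']
```

Note the failure mode: the period after “Mr.” triggers a spurious split, which is exactly the kind of ambiguity serious segmentation methods are built to resolve.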
On February 27, President Obama announced the My Brother’s Keeper initiative, a program that combines the efforts of the government, philanthropic organizations, and the private sector to work with boys and young men of color to close the lingering achievement gap.
If you’re as passionate about social justice as you are about data, MBK is worth paying attention to. The Department of Education has made effective use of data one of the core tenets of MBK.
The first MBK Data Jam will be held on the Georgetown campus on August 2. Register as a participant and spend the whole day jamming on teams of designers, data viz experts, developers, educators, and practitioners, creating data visualizations of current challenges and building new tools to create ladders of opportunity for all youth, including boys and young men of color. Thought leaders and subject matter experts in the MBK focus areas can register as a coach/mentor and spend the afternoon providing feedback to the teams formed in the morning.
Let’s show the world what data can do!
Data Community DC and District Data Labs are excited to be hosting another Building Data Apps with Python workshop on August 23rd. For more info and to sign up, go to http://bit.ly/V4used. There’s even an early bird discount if you register before the end of this month!
Data products are usually software applications that derive their value from data by leveraging the data science pipeline, and that generate data in turn through their operation. They aren’t apps with data, nor are they one-time analyses that produce insights; they are operational and interactive. The rise of these types of applications has directly contributed to the rise of the data scientist and the idea that data scientists are professionals “who are better at statistics than any software engineer and better at software engineering than any statistician.”
These applications have been largely built with Python. Python is flexible enough to develop extremely quickly on many different types of servers and has a rich tradition in web applications. Python contributes to every stage of the data science pipeline including real time ingestion and the production of APIs, and it is powerful enough to perform machine learning computations. In this class we’ll produce a data product with Python, leveraging every stage of the data science pipeline to produce a book recommender.
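As a toy illustration of the recommender idea (our own sketch over invented ratings, not the workshop’s implementation), here is a minimal user-similarity book recommender: score each unseen book by the Jaccard similarity of the users who liked it.

```python
from collections import defaultdict

# Toy ratings: user -> set of books they liked (hypothetical data).
likes = {
    "ana":  {"Dune", "Neuromancer", "Foundation"},
    "ben":  {"Dune", "Foundation", "Hyperion"},
    "carl": {"Neuromancer", "Snow Crash"},
}

def jaccard(a, b):
    """Set similarity: size of intersection over size of union."""
    return len(a & b) / len(a | b)

def recommend(user, likes, top=3):
    """Rank books the user hasn't read, weighted by how similar
    each other user's taste is to this user's."""
    mine = likes[user]
    scores = defaultdict(float)
    for other, theirs in likes.items():
        if other == user:
            continue
        sim = jaccard(mine, theirs)
        for book in theirs - mine:
            scores[book] += sim
    return sorted(scores, key=scores.get, reverse=True)[:top]

print(recommend("ana", likes))  # ['Hyperion', 'Snow Crash']
```

A production recommender would swap the toy dict for real ratings data and a more robust similarity model, but every stage of the pipeline (ingest, model, serve) is visible even at this scale.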
Announcing the release of a new open source library: Confire is a simple but powerful configuration scheme that builds on the configuration parsers of Scapy, elasticsearch, Django and others. The basic scheme is to have a configuration search path that looks for YAML files in standard locations. The search path is hierarchical (meaning that system configurations are overridden by user configurations, etc). These YAML files are then added to a default, class-based configuration management scheme that allows for easy development.
Full documentation can be found here: http://confire.readthedocs.org/
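To illustrate the hierarchical-override idea (a generic sketch of the concept, not Confire’s actual API; see its documentation for that), the parsed YAML files are stubbed below as plain dicts, with each later layer in the search path overlaying the one before it:

```python
# Class-based defaults come first; then system, user, and local
# configurations (normally parsed from YAML files on the search
# path) each override the previous layer.

DEFAULTS = {"debug": False, "database": {"host": "localhost", "port": 5432}}

system_conf = {"database": {"host": "db.internal"}}  # e.g. /etc/...
user_conf   = {"debug": True}                        # e.g. ~/...
local_conf  = {"database": {"port": 5433}}           # e.g. ./conf/...

def deep_update(base, overrides):
    """Recursively overlay one configuration dict onto another,
    merging nested sections instead of replacing them wholesale."""
    merged = dict(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_update(merged[key], value)
        else:
            merged[key] = value
    return merged

settings = DEFAULTS
for conf in (system_conf, user_conf, local_conf):
    settings = deep_update(settings, conf)

print(settings)
# {'debug': True, 'database': {'host': 'db.internal', 'port': 5433}}
```

The recursive merge is what lets a local file change just `database.port` without clobbering the `database.host` set at the system level.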
In a fit of procrastination, I put my first project on PyPI (the Python Package Index): Confire, a simple app configuration scheme using YAML and class-based defaults. It was an incredible lesson in just how much work goes into letting Python developers simply pip install something! I wanted to go the whole nine yards, so I set up documentation on Read the Docs and an open source home on GitHub, and even though it took a while, it was well worth the effort!
Data Community DC is pleased to announce a new service to the area data community: topic-specific discussion lists! In this way we hope to extend the successes of our Meetups and workshops by providing a way for groups of local people with similar interests to maintain contact and have ongoing discussions. Our first discussion list will be on the topic of Deep Learning. Below is a guest post from John Kaufhold. Dr. Kaufhold is a data scientist and managing partner of Deep Learning Analytics, a data science company based in Arlington, VA. He presented an introduction to Deep Learning at the March Data Science DC Meetup.
A while back, there was this blog post about Deep Learning. At the end, we asked readers about their interest in hands-on Deep Learning tutorials.
The results are in, and the survey went to 11. As in all data science, context matters, and this eleven is decidedly less inspiring than Nigel Tufnel’s eleven. That said, ten of the eleven respondents wanted a hands-on Deep Learning tutorial, and eight said they would register even if it required hardware approval or enrollment in a hardware tutorial. But interest in practical hands-on Deep Learning workshops appears to be highly nonuniform: one respondent said they’d drive hundreds of miles to attend, yet of the 3,000+ data scientists in DC’s data and analytics community, presumably mostly local, only eleven responded at all.
In short, the survey was a bust.