Social Network Analysis with Python Workshop on November 22nd


Data Community DC and District Data Labs are hosting a full-day Social Network Analysis with Python workshop on Saturday, November 22nd. For more info and to sign up, go to http://bit.ly/1lWFlLx. Register before October 31st for an early bird discount!

Overview

Social networks are not new, even though websites like Facebook and Twitter might make you want to believe they are (and trust me, I’m not talking about Myspace!). Social networks are extremely interesting models of human behavior, and their study dates back to the early twentieth century. However, because of those websites, data scientists have access to much more data than the anthropologists who studied the networks of tribes!

Because networks take a relationship-centered view of the world, the data structures that we will analyze model real-world behaviors and communities. Using a suite of algorithms derived from mathematical graph theory, we can compute properties of these networks and predict the behavior of individuals and communities. This has a number of practical applications, from recommendation to law enforcement to election prediction.
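To make the relationship-centered view concrete, here is a minimal sketch using only the Python standard library. The names and friendships are invented for illustration, and the two measures shown (degree centrality and a mutual-friend recommendation) are simple examples of graph-derived analyses, not necessarily the ones covered in the workshop.

```python
from collections import defaultdict

# Hypothetical friendship edges, invented for illustration.
edges = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("cat", "dan"), ("dan", "eve"),
]

def build_graph(edge_list):
    """Build an undirected graph as a dict of node -> set of neighbors."""
    graph = defaultdict(set)
    for a, b in edge_list:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

def recommend(graph, person):
    """Rank non-friends by how many mutual friends they share with person."""
    scores = {}
    for other in graph:
        if other == person or other in graph[person]:
            continue
        mutual = len(graph[person] & graph[other])
        if mutual:
            scores[other] = mutual
    return sorted(scores, key=lambda name: (-scores[name], name))

graph = build_graph(edges)
centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
print(recommend(graph, "ann"))
```

In this toy network, the most connected person scores highest, and recommendations come from friends of friends. A dedicated graph library would normally replace the hand-rolled functions.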


Fast Data Applications with Spark & Python Workshop on November 8th


Data Community DC and District Data Labs are excited to be hosting a Fast Data Applications with Spark & Python workshop on November 8th. For more info and to sign up, go to http://bit.ly/Zhj0y1. There’s even an early bird discount if you register before October 17th!

Overview

Hadoop has made the world of Big Data possible by providing a framework for distributed computing on economical, commercial off-the-shelf hardware. Hadoop 2.0 implements a distributed file system, HDFS, and a computing framework, YARN, that allows distributed applications to easily harness the power of clustered computing on extremely large data sets. Over the past decade, the primary application framework has been MapReduce – a functional programming paradigm that lends itself extremely well to designing distributed applications, but carries with it a lot of computational overhead.
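The MapReduce paradigm can be sketched on a single machine with plain Python. The example below is only an illustration of the map, shuffle, and reduce phases on an invented two-document input. It does not use Hadoop, and the function names are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values for each key into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in grouped.items()
    }

# Invented input; on a cluster each document could be mapped on a different node.
documents = ["big data big clusters", "data on clusters"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

On a real cluster the framework writes intermediate results to disk between phases, which is one source of the overhead mentioned above.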

Many excellent analytical applications and algorithms have been written in MapReduce, creating an ecosystem that has helped Hadoop continue to grow as an effective tool. However, more complex algorithms, especially machine learning algorithms, often require extremely complex chains of jobs to conform to the MapReduce functional paradigm. Enter Spark, an open source Apache project that uses the cluster resource daemons of Hadoop (particularly HDFS and other Hadoop data stores) but allows developers to break out of the MapReduce paradigm and write distributed applications that are much faster.

Spark also distributes applications to a cluster by using distributed executor processes. Spark developers write applications that are intended to work on local data; however, unlike with MapReduce, these executors are in communication with each other and can share data via an external store. Spark is intended to work with Hadoop data stores, but it can also run in standalone mode or, if you already have a Hadoop 2.0 cluster, on YARN. The flexibility that Spark provides means that it can be used to implement more complex algorithms and applications that were previously out of reach for MapReduce patterns.

Spark can run in memory, which can make it far faster than disk-based MapReduce for workloads that reuse data, and it provides a programming API in Scala, Java, and Python, making it more accessible to developers. Spark has an interactive command line interface for quickly working with data on the cluster, along with components for writing SQL-like queries and a fairly complete machine learning library. Importantly, it can also execute graph algorithms that were previously difficult to port to MapReduce frameworks.


DC NLP September 2014 Meetup Announcement: Natural Language Processing for Assistive Technologies

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP September Meetup!


This month, we’re joined by Kathy McCoy, Professor of Computer & Information Science and Linguistics at the University of Delaware. Kathy is also a consultant for the National Institute on Disability and Rehabilitation Research (NIDRR) at the U.S. Department of Education. Her research focuses on natural language generation and understanding, particularly for assistive technologies, and she’ll be giving a presentation on Replicating Semantic Connections Made by Visual Readers for a Scanning System for Nonvisual Readers.


Natural Language Analysis with NLTK on October 25th


Data Community DC and District Data Labs are excited to be hosting a Natural Language Analysis with NLTK workshop on October 25th. For more info and to sign up, go to http://bit.ly/1pK0pFN. There’s even an early bird discount if you register before October 3rd!

Overview

Natural Language Processing (NLP) is often taught at the academic level from the perspective of computational linguists. However, as data scientists, we have a richer view of the natural language world: unstructured data that by its very nature has latent information that is important to humans. NLP practitioners have benefited from machine learning techniques to unlock meaning from large corpora, and in this class we’ll explore how to do that with Python and the Natural Language Toolkit (NLTK).

NLTK is an excellent library for machine-learning-based NLP, written in Python by experts from both academia and industry. Python allows you to create rich data applications rapidly, iterating on hypotheses. The combination of Python and NLTK means that you can easily add language-aware data products to your larger analytical workflows and applications.
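As a rough illustration of the kind of first step such a workflow involves, the sketch below tokenizes a sentence and counts term frequencies using only the standard library. It is a stand-in, not NLTK itself: NLTK offers richer counterparts such as `nltk.word_tokenize` and `nltk.FreqDist`. The sample text, stopword set, and function names here are invented for illustration.

```python
import re
from collections import Counter

# A deliberately naive token pattern: runs of lowercase letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def top_terms(text, stopwords, n=3):
    """Return the n most frequent tokens that are not stopwords."""
    tokens = [t for t in tokenize(text) if t not in stopwords]
    return Counter(tokens).most_common(n)

# Invented sample text and stopword list.
sample = "The cat sat on the mat. The cat saw the dog."
stopwords = {"the", "on"}
terms = top_terms(sample, stopwords)
print(terms)
```

A real pipeline would swap the regular expression for a proper tokenizer and a curated stopword list, but the shape of the computation stays the same.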


Welcome to DataKind DC!

Harlan Harris is the President and a co-founder of Data Community DC, and is a long-time fan of DataKind.

Last week, DataKind, the nonprofit that connects pro-bono data and tech folks with nonprofits in need of data help, announced the first regional chapters, in the UK, Bangalore, Dublin, Singapore, San Francisco, and best of all (we think!), Washington, DC!

As they say in their intro blog post:

We bring together high-impact organizations dedicated to solving the world’s biggest challenges with leading data scientists to improve the quality of, access to and understanding of data in the social sector.

Easy right? Well, it can be when we work with some of the top talent in data science and the world’s most incredible organizations. Enter the Washington, DC metropolitan area and our beltway buddies, Maryland and Virginia, with collectively perhaps the nation’s highest density of nonprofits and statisticians per capita!


Endgame hosts DIDC’s Data and Cyber Security August Event

Our (lucky number) 13th DIDC meetup took place at the spacious offices of Endgame in Clarendon, VA. Endgame very graciously provided incredible gourmet pizza (and beer) for all those who attended.


Beyond the excellent food and beverages, attendees were treated to four separate and compelling talks. For those of you who could not attend, a little information about the talks and speakers is below, along with contact information and the slides!


Simulation and Predictive Analytics

This is a guest post by Lawrence Leemis, a professor in the Department of Mathematics at The College of William & Mary. 

A front-page article over the weekend in the Wall Street Journal indicated that the number one profession of interest to tech firms is the data scientist, someone whose analytic skills, computing skills, and domain skills are able to detect signals from data and use them to advantage. Although the terms are squishy, the push today is for “big data” skills and “predictive analytics” skills that allow firms to leverage the deluge of data that is now accessible.

I attended the Joint Statistical Meetings last week in Boston and I was impressed by the number of talks that referred to big data sets and also the number that used the R language. Over half of the technical talks that I attended included a simulation study of one type or another.

The two traditional aspects of the scientific method, namely theory and experimentation, have been enhanced with computation being added as a third leg. Sitting at the center of computation is simulation, which is the topic of this post. Simulation is a useful tool when analytic methods fail because of mathematical intractability.
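A small Monte Carlo sketch shows the idea. It is written in Python rather than R, and the problem is chosen for illustration only: it estimates the probability that at least two people in a group of 23 share a birthday, assuming 365 equally likely days. This problem does have a known analytic answer (about 0.507), which makes it easy to check that the simulation behaves sensibly before applying the same approach to an intractable one.

```python
import random

def shared_birthday(group_size, rng):
    """Simulate one group and report whether any two birthdays collide."""
    seen = set()
    for _ in range(group_size):
        day = rng.randrange(365)  # assumes 365 equally likely birthdays
        if day in seen:
            return True
        seen.add(day)
    return False

def estimate_probability(group_size, trials, seed=0):
    """Estimate the collision probability as the fraction of simulated hits."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(shared_birthday(group_size, rng) for _ in range(trials))
    return hits / trials

estimate = estimate_probability(group_size=23, trials=20000)
print(round(estimate, 3))
```

Increasing `trials` tightens the estimate, at the cost of run time. The standard error shrinks with the square root of the number of trials.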


Natural Language Processing DC Discussion List

Data Community DC is pleased to announce a new service to the area data community: topic-specific discussion lists! In this way we hope to extend the successes of our Meetups and workshops by providing a way for groups of local people with similar interests to maintain contact and have ongoing discussions.

In a previous post, we announced the formation of the Deep Learning Discussion List for those interested in Deep Learning topics. The second topic-specific discussion group has just been created as a collaboration between Charlie Greenbacker (@greenbacker) and the DC-NLP Meetup Group, and Ben Bengfort (@bbengfort) and DIDC. Charlie and Ben are both specialists in Natural Language Processing and Computational Linguistics.

If you’re interested in Natural Language Processing and want to be part of the discussion, sign up here:

https://groups.google.com/a/datacommunitydc.org/d/forum/nlp


Thoughts on the INFORMS Business Analytics Conference

This post, from DC2 President Harlan Harris, was originally published on his blog. Harlan was on the board of WINFORMS, the local chapter of the Operations Research professional society, from 2012 until this summer.

Earlier this year, I attended the INFORMS Conference on Business Analytics & Operations Research, in Boston. I was asked beforehand if I wanted to be a conference blogger, and for some reason I said I would. This meant I was able to publish posts on the conference’s WordPress web site, and was also obliged to do so!

Here are the five posts that I wrote, along with an excerpt from each. Please click through to read the full pieces.


DC NLP August 2014 Meetup Announcement: Automatic Segmentation

Curious about techniques and methods for applying data science to unstructured text? Join us at the DC NLP August Meetup!


This August, we’re joined by Tony Davis, technical manager in the NLP and machine learning group at 3M Health Information Systems and adjunct professor in the Georgetown University Linguistics Department, where he’s taught courses including information retrieval and extraction, and lexical semantics.

Tony will be introducing us to automatic segmentation. Automatic segmentation deals with breaking up unstructured documents into units such as words, sentences, and topics. Search and retrieval, document categorization, and analysis of dialogue and discourse all benefit from segmentation.
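As a minimal sketch of what segmentation means at the sentence and word level, the example below uses two naive regular expressions from the Python standard library. This is an assumption-laden illustration, not the method from the talk: the patterns will mis-split abbreviations such as "Dr." and do not attempt topic segmentation at all.

```python
import re

# Naive rules: a sentence ends at ., ! or ? followed by whitespace;
# a word is a run of letters, digits, or underscores.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
WORD_PATTERN = re.compile(r"\w+")

def split_sentences(text):
    """Break text into sentences at the naive boundary pattern."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

def split_words(sentence):
    """Break one sentence into word tokens, dropping punctuation."""
    return WORD_PATTERN.findall(sentence)

# Invented sample text.
text = "Segmentation helps search. Does it help retrieval? Yes!"
sentences = split_sentences(text)
words = [split_words(s) for s in sentences]
print(sentences)
print(words)
```

Production systems typically replace rules like these with trained models, since real text is full of cases that simple punctuation rules get wrong.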
