Perhaps you’ve heard the phrase lately “software is eating the world”. Well, to be successful at that, it’s going to have to do as least as good a job of eating the world’s data as do the systems of Kalev Leetaru, Georgetown/Yahoo! fellow.
Kalev Leetaru, lead investigator on GDELT and other tools, defines “world class” work — certainly in the sense of size and scope of data. The goal of GDELT and related systems is to stream global news and social media in as near realtime as possible through multiple steps. The overall goal is to arrive at reliable tone (sentiment) mining and differential conflict detection and to do so …. globally. It is a grand goal.
Kalev Leetaru’s talk covered several broad areas. History of data and communication, data quality and “gotcha” issues in data sourcing and curation, geography of Twitter, processing architecture, toolkits and considerations, and data formatting observations. In each he had a fresh perspective or a novel idea, born of the requirement to handle enormous quantities of ‘noisy’ or ‘dirty’ data.
Keetaru observed that “the map is not the territory” in the sense that actual voting, resource or policy boundaries as measured by various data sources may not match assigned national boundaries. He flagged this as a question of “spatial error bars” for maps.
Distinguishing Global data science from hard established HPC-like pursuits (such as computational chemistry) Kalev Leetaru observed that we make our own bespoke toolkits, and that there is no single ‘magic toolkit” for Big Data, so we should be prepared and willing to spend time putting our toolchain together.
After talking a bit about the historical evolution and growth of data, Kalev Leetaru asked a few perspective-changing questions (some clearly relevant to intelligence agency needs) How to find all protests? How to locate all law books? Some of the more interesting data curation tools and resources Kalev Leetaru mentioned — and a lot more — might be found by the interested reader in The Oxford Guide to Library Research by Thomas Mann.
GDELT (covered further below), labels parse trees with error rates, and reaches beyond the “WHAT” of simple news media to tell us WHY, and ‘how reliable’. One GDELT output product among many is the Daily Global Conflict Report, which covers world leader emotional state and differential change in conflict, not absolute markers.
One recurring theme was to find ways to define and support “truth.” Kalev Leetaru decried one current trend in Big Data, the so-called “Apple Effect”: making luscious pictures from data; with more focus on appearance than actual ground truth. One example he cited was a conclusion from a recent report on Syria, which — blithely based on geotagged English-language tweets and Facebook postings — cast a skewed light on Syria’s rebels (Bzzzzzt!)
Leetaru provided one answer on “how to ‘ground truth’ data” by asking “how accurate are geotagged tweets?” Such tweets are after all only 3% of the total. But he reliably used those tweets. How? By correlating location to electric power availability. (r = .89) He talked also about how to handle emoticons, irony, sarcasm, and other affective language, cautioning analysts to think beyond blindly plugging data into pictures.
Kalev Leetaru talked engagingly about Geography of Twitter, encouraging us to to more RTFD (D=data) than RTFM. Cut your own way through the forest. The valid maps have not been made yet, so be prepared to make your own. Some of the challenges he cited were how to break up typical #hashtagswithnowhitespace and put them back into sentences, how to build — and maintain — sentiment/tone dictionaries and to expect, therefore, to spend the vast majority of time in innovative projects in human tuning the algorithms and understanding the data, and then iterating the machine. Refreshingly “hands on.”
Scale and Tech Architecture
Kalev Leetaru turned to discuss the scale of data, which is now generating easily in the petabytes per day range. There is no longer any question that automation must be used and that serious machinery will be involved. Our job is to get that automation machinery doing the right thing, and if we do so, we can measure the ‘heartbeat of society.’
For a book images project (60 Million images across hundreds of years) he mentioned a number of tools and file systems (but neither Gluster nor CEPH, disappointingly to this reviewer!) and delved deeply and masterfully into the question of how to clean and manage the very dirty data of “closed captioning” found in news reports. To full-text geocode and analyze half a million hours of news (from the Internet Archives), we need fast language detection and captioning error assessment. What makes this task horrifically difficult is that POS tagging “fails catastrophically on closed captioning” and that CC is worse, far worse in terms of quality than is Optical Character Recognition. The standard Stanford NL Understanding toolkit is very “fragile” in this domain: one reason being that news media has an extremely high density of location references, forcing the analyst into using context to disambiguate.
He covered his GDELT (Global Database of Event, Language and Tone), covering human/societal behavior and beliefs at scale around the world. A system of half a billion plus georeferenced rows, 58 columns wide, comprising 100,000 sources such as broadcast, print, online media back to 1979, it relies on both human translation and Google translate, and will soon be extended across languages and back to the 1800s. Further, he’s incorporating 21 billion words of academic literature into this model (a first!) and expects availability in Summer 2014, (Sources include JSTOR, DTIC, CIA, CVORE CiteSeerX, IA.)
GDELT’s architecture, which relies heavily on the Google Cloud and BigQuery, can stream at 100,000 input observations/second. This reviewer wanted to ask him about update and delete needs and speeds, but the stream is designed to optimize ingest and query. GDELT tools were myriad, but Perl was frequently mentioned (for text processing).
Kalev Leetaru shared some post GDELT construction takeaways — “it’s not all English” and “watch out for full Unicode compliance” in your toolset, lest your lovely data processing stack SEGFAULT halfway through a load. Store data in whatever is easy to maintain and fast. Modularity is good but performance can be an issue; watch out for XML which bogs down processing on highly nested data. Use for interchange more than anything; sharing seems “nice” but “you can’t shared a graph” and “RAM disk is your friend” more so even than SSD, FusionIO, or fast SANs.
The talk, like this blog post, ran over allotted space and time, but the talk was well worth the effort spent understanding it.
Latest posts by Guest Author (see all)
- DC NLP September 2014 Meetup Announcement: Natural Language Processing for Assistive Technologies - September 8, 2014
- Simulation and Predictive Analytics - August 13, 2014
- DC NLP August 2014 Meetup Announcement: Automatic Segmentation - August 11, 2014