Photo credit Georgetown University.
The entire GDELT database is 100% free and open. You can download the raw data files, visualize them using the GDELT Analysis Service, or analyze them at limitless scale with Google BigQuery.
The GDELT Project is the largest, most comprehensive, and highest resolution open database of human society ever created. Its Event Database archives contain nearly 400M latitude/longitude geographic coordinates spanning over 12,900 days, additionally making it one of the largest open-access spatio-temporal datasets in existence. It truly pushes the boundaries of "big data," weighing in at over a quarter-billion rows with 59 fields for each record, spanning the geography of the entire planet, and covering a time horizon of more than 35 years. Its Global Knowledge Graph connects the world's people, organizations, locations, themes, counts, and emotions into a single holistic network over the entire planet. How can you query, explore, model, visualize, interact with, and even forecast from this vast archive of human society?
Visualization credit GDELT Project.
Visualize, analyze, explore, and export GDELT right from your browser.
The GDELT Analysis Service is a free cloud-based service that offers a variety of tools and services to allow you to visualize, explore, and export both the GDELT Event Database and the GDELT Global Knowledge Graph. This is a great way to get started exploring GDELT and what it can do for you, even if you don't have a technical background.
Fourteen different tools are available for geographic, temporal, network, and contextual visualizations of both the Event Database and Global Knowledge Graph. No technical expertise is required - you just select the visualization you want, enter your query, and a few minutes later it is delivered right to your email inbox!
A common thread we've heard from all of you is the need for a central set of tools that make it easier to work with GDELT and that translate its rich multidimensional knowledge base into file formats and visualizations that analysts and scholars can readily interpret and that are compatible with the toolkits and software you use each day. To this end, each of the tools can output a wide array of relevant file formats, from CSV to Google Earth to Gephi. For example, you can construct a network of influencers around an industry and output it as a Gephi file for further analysis.
Often the GDELT Analysis Service is the best place to start when testing out a hypothesis or checking for an emergent trend. You can instantly test out a new query, getting results back in just a few moments, iteratively adjusting your search to see if there is anything worth exploring further. When you find something of interest that you want to explore, visualize, or analyze in more detail, the Analysis Service has a wide range of export options, from speciality file formats to raw export of the underlying CSV records.
If you are trying to locate all attacks on civilians in a certain country over a four-month period, use the TimeMapper tool to determine if there are enough matching events or if you need to adjust your query further, and then use the Exporter tool to download a CSV file containing just the matching records.
Photo credit Google.
Leverage the world's most powerful database platform for realtime querying and analysis.
The entire quarter-billion-record GDELT Event Database is available in Google BigQuery, updated daily. You can query, export, and even conduct sophisticated analyses and modeling of the entire dataset using standard SQL, with even the most complex queries returning in near-realtime.
From the very beginning, one of the greatest challenges in working with GDELT has been in how to interact with a dataset of this magnitude. Few database platforms can handle a dataset this complex with the sheer variety of access patterns and the number of permutations of fields that are collected together into queries each day.
Google's BigQuery database was custom-designed for datasets like GDELT, enabling near-realtime ad hoc querying over the entire dataset. This means that no matter how you access GDELT, which columns you look across, what kinds of operators you use, or how complex your query is, you will still see results in near-realtime.
For us, the most groundbreaking part of having GDELT in BigQuery is that it opens the door not only to fast complex querying and extracting of data, but also allows for the first time real-world analyses to be run entirely in the database.
Imagine computing the most significant conflict interaction in the world by month over the past 35 years, or performing cross-tabbed correlation over different classes of relationships between a set of countries. Such queries can be run entirely inside of BigQuery and return in just a handful of seconds. This enables you to try out "what if" hypotheses on global-scale trends in near-real time.
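As a rough illustration of what such an in-database analysis might look like, the sketch below builds a standard SQL query that counts conflict events per month. The table name `gdelt-bq.full.events` and the column names `MonthYear` and `QuadClass` (with 4 taken to mean material conflict) are assumptions not stated above, so check them against the current schema before running the query.

```python
def monthly_conflict_query(table="gdelt-bq.full.events", quad_class=4):
    """Build a SQL string that counts events of one class per month.

    The default table and the column names are assumptions, not
    confirmed by this page.
    """
    return (
        "SELECT MonthYear, COUNT(*) AS events\n"
        f"FROM `{table}`\n"
        f"WHERE QuadClass = {int(quad_class)}\n"
        "GROUP BY MonthYear\n"
        "ORDER BY MonthYear"
    )


query = monthly_conflict_query()
print(query)
```

The resulting string can be pasted into the BigQuery console. Because the grouping and counting happen inside the database, only one small row per month is returned.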
Download all of GDELT to your own computer.
Advanced users and those with unique use cases can download the entire underlying event and graph datasets in CSV format. Deep technical knowledge and extensive experience working with large datasets are required to make use of these datasets, with the event database alone requiring over 100GB of disk space.
The GDELT Event Database contains over a quarter-billion records organized into a set of tab-delimited files by date. Through March 31, 2013 records are stored in monthly and yearly files by the date the event took place. Beginning with April 1, 2013, files are created daily and records are stored by the date the event was found in the world's news media rather than the date it occurred (97%+ of events are reported within 24 hours of happening, but a small number of events each day are past events being mentioned for the first time - if an event has been seen before it will not be included again). Files are ZIP compressed in tab delimited format, but named with a ".CSV" extension to address some software packages that will not accept .TXT or .TSV files.
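Because the files are tab-delimited despite their ".CSV" extension, a reader has to be told explicitly to split on tabs. The following is a minimal sketch using only the Python standard library. The three-field sample records are hypothetical stand-ins for the real record layout, and the archive is built in memory so the example runs without a download.

```python
import csv
import io
import zipfile


def read_gdelt_rows(zip_bytes):
    """Yield each record in a zipped export as a list of field strings."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            with archive.open(member) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                # Despite the ".CSV" name, fields are separated by tabs.
                # QUOTE_NONE treats any quote characters as literal data.
                reader = csv.reader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
                for row in reader:
                    yield row


def make_sample_zip():
    """Build a tiny in-memory archive standing in for a real daily file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical three-field records, not the real layout.
        archive.writestr(
            "20130523.export.CSV",
            "1\t20130523\tUSA\n2\t20130523\tRUS\n",
        )
    return buffer.getvalue()


rows = list(read_gdelt_rows(make_sample_zip()))
print(rows)
```

For a real file, pass the bytes of the downloaded archive to `read_gdelt_rows`. Streaming row by row avoids loading an entire file into memory at once.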
Each morning, seven days a week, the latest daily update is posted by 6AM EST. This file is named with the previous day's date in the format "YYYYMMDD.export.CSV.zip" (i.e., on the morning of May 24, 2013 a new file called "20130523.export.CSV.zip" is added). UNIX or Linux users can easily set up a cron job or other automatic scheduling process to download the latest daily update each morning and process it for watchboarding, forecasting, early warning, alert services, and other applications.
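The naming rule above is easy to get wrong by one day in a scheduled job, so here is a small sketch of it. The helper name `daily_export_name` is hypothetical. It returns only the file name, since the download location is not given here.

```python
from datetime import date, timedelta


def daily_export_name(posted_on):
    """Return the name of the daily file posted on the morning of `posted_on`."""
    # The file carries the previous day's date, not the posting date.
    covered_day = posted_on - timedelta(days=1)
    return covered_day.strftime("%Y%m%d") + ".export.CSV.zip"


print(daily_export_name(date(2013, 5, 24)))  # 20130523.export.CSV.zip
```

A scheduled job could call `daily_export_name(date.today())` after the 6AM EST posting time to find the file it should fetch.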
There is also a special "reduced" event dataset (1.1GB) that uses the "one a day" country-level filtering commonly used in older academic event databases. This version of the data most closely matches the aggregation level that users with previous event analysis experience are familiar with. It collapses the database on "DATE+ACTOR1+ACTOR2+EVENTCODE" (i.e., every protest held anywhere in Russia on a given day is collapsed to a single entry). This version is recommended only for those needing compatibility with analyses based on previous generations of academic event databases and covers the period January 1, 1979 to February 17, 2014. It is not updated.
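To make the "one a day" collapse concrete, the sketch below keeps the first record seen for each DATE+ACTOR1+ACTOR2+EVENTCODE combination and drops the rest. The dictionary keys and sample records are hypothetical. This illustrates the idea and is not the procedure used to build the reduced dataset.

```python
def collapse_one_a_day(events):
    """Keep one record per (date, actor1, actor2, event_code) combination."""
    seen = set()
    collapsed = []
    for event in events:
        key = (event["date"], event["actor1"], event["actor2"], event["event_code"])
        if key not in seen:
            seen.add(key)
            collapsed.append(event)
    return collapsed


# Hypothetical records: two protests on the same day in different cities
# share the same country-level actors and event code.
sample_events = [
    {"date": "20130401", "actor1": "RUS", "actor2": "RUS", "event_code": "14", "city": "Moscow"},
    {"date": "20130401", "actor1": "RUS", "actor2": "RUS", "event_code": "14", "city": "Kazan"},
    {"date": "20130402", "actor1": "RUS", "actor2": "RUS", "event_code": "14", "city": "Moscow"},
]

collapsed = collapse_one_a_day(sample_events)
print(len(collapsed))  # 2
```

The two records from the same day merge into one, while the record from the following day survives as a separate entry.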
The GDELT Global Knowledge Graph begins April 1, 2013 and consists of two parallel data streams, one encoding the entire knowledge graph with all of its fields, and the other encoding only the subset of the graph that records "counts" of a set of predefined categories like number of protesters, number killed, or number displaced or sickened. Such counts may occur independently of the CAMEO events in the primary GDELT event stream, such as mentions of those killed in industrial accidents (which are not captured in CAMEO) or those displaced by a natural disaster or sickened by a disease epidemic. In this way, the GKG Counts File can be used to produce a daily "Death Tracker" to map all mentions of death across the world each day, or an "Affected Tracker" to indicate how many persons were sickened/displaced/stranded each day (at least as recorded in the global news media). These files are named as "YYYYMMDD.gkgcounts.csv.zip" and posted by 6AM EST each morning seven days a week.
The second file is the full graph file, which contains the actual graph connecting all persons, organizations, locations, emotions, themes, counts, events, and sources together each day. It also contains a list of the EventIDs of each event found in the same article as the extracted information, allowing rich contextualization of events. These files are named as "YYYYMMDD.gkg.csv.zip" and posted by 6AM EST each morning seven days a week.
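A "Death Tracker" of the kind described for the counts file could be sketched as a simple tally by location. The field names, the "KILL" and "AFFECT" type labels, and the sample values below are all assumptions for illustration. Consult the GKG documentation for the actual column layout.

```python
from collections import defaultdict


def tally_counts(rows, count_type):
    """Sum the reported numbers of one count type for each location."""
    totals = defaultdict(int)
    for row in rows:
        if row["count_type"] == count_type:
            totals[row["location"]] += row["number"]
    return dict(totals)


# Hypothetical rows standing in for one day of the counts file.
sample_rows = [
    {"count_type": "KILL", "number": 12, "location": "Country A"},
    {"count_type": "KILL", "number": 3, "location": "Country A"},
    {"count_type": "KILL", "number": 40, "location": "Country B"},
    {"count_type": "AFFECT", "number": 500, "location": "Country B"},
]

death_tracker = tally_counts(sample_rows, "KILL")
print(death_tracker)  # {'Country A': 15, 'Country B': 40}
```

Because the counts reflect mentions in news media, the same incident may be reported more than once, so totals like these are best read as mention volume and not as verified figures.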
The Global Knowledge Graph is currently in "alpha" release and may change over time as we introduce new capabilities and expand its underlying algorithms.
All there is to know about using GDELT.
Getting Started with GDELT Guide
You'll find all of GDELT's documentation in this section, from user manuals to codebooks, lookup files to normalization spreadsheets. Make sure to start off with the Getting Started with GDELT Guide first.
The following documentation describes the GDELT Event Database, its major data fields and their descriptions and formats, and the codebook for the CAMEO event taxonomy.
The following describes the GDELT Global Knowledge Graph, its major data fields and their descriptions and formats, and the list of themes and counts currently recognized by the system.
The GDELT Event Database uses the CAMEO event taxonomy, which records the actors involved in an event as a series of 3-character codes. These tab-delimited lookup files contain the human-friendly textual labels for each of those codes to make it easier to work with the data for those who have not previously worked with CAMEO.
The GDELT Event Database uses the CAMEO event taxonomy, which is a collection of more than 300 types of events organized into a hierarchical taxonomy and recorded in the files as a numeric code. These tab-delimited lookup files contain the human-friendly textual labels for each of those codes to make it easier to work with the data for those who have not previously worked with CAMEO.
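Loading one of these tab-delimited lookup files into a dictionary makes it straightforward to attach labels to coded records. The sketch below assumes a two-column layout with a header row, which is not specified above, and uses a small in-memory sample in place of a real file.

```python
import csv
import io


def load_lookup(handle):
    """Read a tab-delimited code/label file into a dict keyed by code."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # assumed header row; remove if the file has none
    # Codes stay as strings so that leading zeros such as "01" survive.
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Hypothetical sample contents of an event-code lookup file.
sample_file = io.StringIO("CODE\tLABEL\n01\tMAKE PUBLIC STATEMENT\n14\tPROTEST\n")

labels = load_lookup(sample_file)
print(labels["14"])  # PROTEST
```

Keeping the codes as text matters here: converting "01" to the integer 1 would break both the lookup and the hierarchical structure encoded in the digits.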
The comma-delimited (CSV) files below are updated daily and record the total number of events in the GDELT Event Database across all event types broken down by time and country. This is important for normalization tasks, to compensate for the exponential increase in the availability of global news material over time.
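A minimal sketch of this kind of normalization divides the number of matching events in each period by the total number of events recorded for that period. The figures below are invented purely to show the effect and are not taken from GDELT.

```python
def normalize(matching_counts, total_counts):
    """Express matching events as a share of all events in each period."""
    return {
        period: matching_counts.get(period, 0) / total
        for period, total in total_counts.items()
        if total > 0  # skip periods with no recorded events
    }


# Invented figures: raw matches grow eightfold, but total coverage
# grows tenfold over the same span.
matching = {"2000": 50, "2010": 400}
totals = {"2000": 1000, "2010": 10000}

shares = normalize(matching, totals)
print(shares)  # {'2000': 0.05, '2010': 0.04}
```

In this made-up case the raw count rises sharply while the normalized share falls slightly, which is the kind of distortion the totals files are meant to help correct.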
GDELT is the largest, highest resolution, and most detailed open dataset of global human society ever created. This means that working with it can require a lot of careful attention to things like normalization that are often unfamiliar to many disciplines. The "Getting Started with GDELT" guide is a great overview of how to best work with GDELT.