# GDELT 2.0 Doc API Client

A Python client to fetch data from the [GDELT 2.0 Doc API](https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/).

This allows for simpler, small-scale analysis of news coverage without having to deal with the complexities of downloading and managing the raw files from S3, or working with the BigQuery export.

## Installation

`gdeltdoc` is on PyPI and can be installed with pip:

```bash
pip install gdeltdoc
```

## Use

The `ArtList` and `Timeline*` query modes are supported.

```python
from gdeltdoc import GdeltDoc, Filters

f = Filters(
    keyword = "climate change",
    start_date = "2020-05-10",
    end_date = "2020-05-11"
)

gd = GdeltDoc()

# Search for articles matching the filters
articles = gd.article_search(f)

# Get a timeline of the number of articles matching the filters
timeline = gd.timeline_search("timelinevol", f)
```

## Integration in `llm_quant`

This repository wires `gdeltdoc` into the TuShare ingestion workflow so GDELT headlines arrive alongside the usual market data.

- Configuration lives under `gdelt_sources` in `app/data/config.json` (managed via `AppConfig.gdelt_sources`).
- `app/ingest/gdelt.py` wraps the Doc API, materialising results as `RssItem` objects so they share the same dedupe/heat scoring pipeline as RSS feeds.
- `app/ingest/coverage.ensure_data_coverage` now calls `ingest_configured_gdelt(...)` after the core TuShare tables, supporting incremental fetches via `ingest_state`.

Enable a source by setting `enabled: true` in the config, optionally providing a `start_date`/`end_date` window or a rolling `timespan`. Subsequent runs only request data published after the last persisted publish time.

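
As a rough illustration of the incremental behaviour described above, the sketch below picks the query window for one configured source. The config keys `enabled`, `start_date`, `end_date` and `timespan` come from the description above. The function name `resolve_window`, the `last_published` argument and the precedence between the options are assumptions made for this example, and may differ from what `app/ingest/gdelt.py` actually does.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve_window(source: dict, last_published: Optional[datetime], now: datetime) -> Optional[dict]:
    """Return the window to request for one source, or None to skip it.

    Hypothetical helper for illustration; not the actual llm_quant code.
    """
    if not source.get("enabled", False):
        return None  # disabled sources are never fetched

    if last_published is not None:
        # Incremental run: only ask for articles newer than what is stored.
        return {"start_date": last_published, "end_date": now}

    if source.get("start_date") and source.get("end_date"):
        # First run with an explicit window from the config.
        return {"start_date": source["start_date"], "end_date": source["end_date"]}

    if source.get("timespan"):
        # First run with a rolling window relative to the request time.
        return {"timespan": source["timespan"]}

    return None  # nothing usable configured


# Example source entry; the "1d" value is an assumed timespan string.
source = {"enabled": True, "timespan": "1d"}
now = datetime(2020, 5, 11, tzinfo=timezone.utc)

first_run = resolve_window(source, None, now)
later_run = resolve_window(source, datetime(2020, 5, 10, 18, 0, tzinfo=timezone.utc), now)
```

Here `first_run` falls back to the rolling `timespan`, while `later_run` starts from the stored publish time.
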
### Article List

The article list mode of the API generates a list of news articles that match the filters. The client returns this as a pandas DataFrame with columns `url`, `url_mobile`, `title`, `seendate`, `socialimage`, `domain`, `language`, `sourcecountry`.

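
A common first step with these results is to parse `seendate` and keep one article per domain. The sketch below works on plain dictionaries (the shape produced by `articles.to_dict("records")`) so that it runs without network access. The sample rows are made up, the helper names are hypothetical, and the `seendate` layout `YYYYMMDDTHHMMSSZ` is an assumption about what the API returns, so check a real response before relying on it.

```python
from datetime import datetime, timezone

# Made-up rows using a subset of the columns listed above.
records = [
    {"url": "https://example.com/a", "title": "A", "seendate": "20200510T120000Z", "domain": "example.com"},
    {"url": "https://example.com/b", "title": "B", "seendate": "20200510T150000Z", "domain": "example.com"},
    {"url": "https://example.org/c", "title": "C", "seendate": "20200510T090000Z", "domain": "example.org"},
]


def parse_seendate(value: str) -> datetime:
    """Parse a timestamp such as 20200510T120000Z into an aware UTC datetime."""
    return datetime.strptime(value, "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


def latest_per_domain(rows: list) -> dict:
    """Keep only the most recently seen article for each domain."""
    latest = {}
    for row in rows:
        current = latest.get(row["domain"])
        if current is None or parse_seendate(row["seendate"]) > parse_seendate(current["seendate"]):
            latest[row["domain"]] = row
    return latest


newest = latest_per_domain(records)
```
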
### Timeline Search

Five modes are available when making a timeline search:

- `timelinevol` - a timeline of the volume of news coverage matching the filters, represented as a percentage of the total news articles monitored by GDELT.
- `timelinevolraw` - similar to `timelinevol`, but returns the actual number of matching articles and the total number of articles rather than a percentage.
- `timelinelang` - similar to `timelinevol`, but breaks the total articles down by published language. Each language is returned as a separate column in the DataFrame.
- `timelinesourcecountry` - similar to `timelinevol`, but breaks the total articles down by the country they were published in. Each country is returned as a separate column in the DataFrame.
- `timelinetone` - a timeline of the average tone of the news coverage matching the filters. See [GDELT's documentation](https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/) for more information about the tone metric.

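
To compare several modes for the same filters, one option is a small wrapper that validates the mode names and collects the results. `fetch_timelines` is a hypothetical helper written for this example, not part of `gdeltdoc`. It relies only on the `timeline_search(mode, filters)` call shown earlier and accepts any object that provides it, which is why a stand-in client is enough to run it offline.

```python
TIMELINE_MODES = (
    "timelinevol",
    "timelinevolraw",
    "timelinelang",
    "timelinesourcecountry",
    "timelinetone",
)


def fetch_timelines(client, filters, modes=TIMELINE_MODES) -> dict:
    """Run one timeline search per mode and return the results keyed by mode."""
    unknown = [mode for mode in modes if mode not in TIMELINE_MODES]
    if unknown:
        raise ValueError(f"Unsupported timeline mode(s): {unknown}")
    return {mode: client.timeline_search(mode, filters) for mode in modes}


class FakeClient:
    """Stand-in for GdeltDoc so the sketch runs without network access."""

    def timeline_search(self, mode, filters):
        return f"{mode} result"


results = fetch_timelines(FakeClient(), filters=None, modes=["timelinevol", "timelinetone"])
```

With the real client, the call would look like `fetch_timelines(GdeltDoc(), f)`, giving one DataFrame per mode.
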
### Filters

The search query passed to the API is constructed from a `gdeltdoc.Filters` object.

```python
from gdeltdoc import Filters, near, repeat

f = Filters(
    start_date = "2020-05-01",
    end_date = "2020-05-02",
    num_records = 250,
    keyword = "climate change",
    domain = ["bbc.co.uk", "nytimes.com"],
    country = ["UK", "US"],
    theme = "GENERAL_HEALTH",
    near = near(10, "airline", "carbon"),
    repeat = repeat(5, "planet")
)
```

Filters for `keyword`, `domain`, `domain_exact`, `country`, `language` and `theme` can be passed either as a single string or as a list of strings. If a list is passed, the values in the list are combined with a boolean OR.

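
To make the OR behaviour concrete, here is a minimal sketch of how a list of values could be turned into a query fragment. `or_clause` is a hypothetical helper, and the `domain:` operator with a parenthesised `OR` group is an assumption based on the Doc API's query syntax. The strings `gdeltdoc` builds internally may differ in detail.

```python
from typing import List, Union


def or_clause(operator: str, values: Union[str, List[str]]) -> str:
    """Build a query fragment for one filter from a string or a list of strings."""
    if isinstance(values, str):
        values = [values]
    if not values:
        raise ValueError("At least one value is required")

    terms = [f"{operator}:{value}" for value in values]
    if len(terms) == 1:
        return terms[0]  # a single value needs no grouping
    # Several values are alternatives, so they are OR'd inside parentheses.
    return "(" + " OR ".join(terms) + ")"


single = or_clause("domain", "bbc.co.uk")
multiple = or_clause("domain", ["bbc.co.uk", "nytimes.com"])
```
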
You must pass either `start_date` and `end_date`, or `timespan`.

- `start_date` - The start date for the filter in YYYY-MM-DD format or as a datetime object in UTC time.
  Passing a datetime allows you to specify a time down to seconds granularity. The API officially only supports the most recent 3 months of articles. Making a request for an earlier date range may still return data, but it's not guaranteed.
- `end_date` - The end date for the filter in YYYY-MM-DD format or as a datetime object in UTC time.
- `timespan` - A timespan to search for, relative to the time of the request. Must match one of the API's timespan formats; see https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/
- `num_records` - The number of records to return. Only used in article list mode and can be up to 250.
- `keyword` - Return articles containing the exact phrase `keyword` within the article text.
- `domain` - Return articles from the specified domain. Does not require an exact match, so passing "cnn.com" will match articles from `cnn.com`, `subdomain.cnn.com` and `notactuallycnn.com`.
- `domain_exact` - Similar to `domain`, but requires an exact match.
- `country` - Return articles published in a country or list of countries, formatted as the two-letter FIPS country code.
- `language` - Return articles published in the given language, formatted as the ISO 639 language code.
- `theme` - Return articles that cover one of GDELT's GKG Themes. A full list of themes can be found [here](http://data.gdeltproject.org/api/v2/guides/LOOKUP-GKGTHEMES.TXT).
- `near` - Return articles containing words close to each other in the text. Use `near()` to construct, e.g. `near = near(5, "airline", "climate")`, or `multi_near()` if you want to use multiple restrictions, e.g. `multi_near([(5, "airline", "crisis"), (10, "airline", "climate", "change")], method="AND")` finds "airline" and "crisis" within 5 words, and "airline", "climate" and "change" within 10 words.
- `repeat` - Return articles containing a single word repeated at least a number of times. Use `repeat()` to construct, e.g. `repeat = repeat(3, "environment")`, or `multi_repeat()` if you want to use multiple restrictions, e.g. `repeat = multi_repeat([(2, "airline"), (3, "airport")], "AND")`.
- `tone` - Return articles above or below a particular tone score (i.e. more positive or more negative than a certain threshold). To use, specify either a greater than or less than sign and a positive or negative number (either an integer or floating point number). To find fairly positive articles, use `tone=">5"`; to find fairly negative articles, use `tone="<-5"`.
- `tone_absolute` - The same as `tone`, but ignores the positive/negative sign and lets you search for high emotion or low emotion articles, regardless of whether they were happy or sad in tone.

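
Some of the rules above can be checked before a request is sent. The sketch below is illustrative only: `to_api_datetime`, `check_window` and `check_tone` are hypothetical helpers, and the `YYYYMMDDHHMMSS` target format is an assumption about what the Doc API expects for its start and end parameters. `gdeltdoc` performs its own validation, which may be stricter or looser.

```python
import re
from datetime import datetime
from typing import Optional, Union

DateLike = Union[str, datetime]

# A comparison sign, an optional minus, then an integer or decimal number.
TONE_PATTERN = re.compile(r"[<>]-?\d+(\.\d+)?")


def to_api_datetime(value: DateLike) -> str:
    """Format a YYYY-MM-DD string or a datetime as YYYYMMDDHHMMSS."""
    if isinstance(value, str):
        value = datetime.strptime(value, "%Y-%m-%d")  # midnight on that day
    return value.strftime("%Y%m%d%H%M%S")


def check_window(start_date: Optional[DateLike] = None,
                 end_date: Optional[DateLike] = None,
                 timespan: Optional[str] = None) -> None:
    """Require either both dates or a timespan."""
    has_dates = start_date is not None and end_date is not None
    if not has_dates and timespan is None:
        raise ValueError("Pass either start_date and end_date, or timespan")


def check_tone(tone: str) -> bool:
    """Return True for strings such as '>5' or '<-5'."""
    return TONE_PATTERN.fullmatch(tone) is not None


start = to_api_datetime("2020-05-01")
end = to_api_datetime(datetime(2020, 5, 2, 13, 30, 15))
```

Passing a `datetime` keeps the time of day, while a plain date string is treated as midnight at the start of that day.
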
## Developing gdelt-doc-api

PRs & issues are very welcome!

### Setup

It's recommended to use a virtual environment for development. Set one up with

```
python -m venv venv
```

and activate it (on Mac or Linux):

```
source venv/bin/activate
```

Then install the requirements:

```
pip install -r requirements.txt
```

Tests for this package use `unittest`. Run them with

```
python -m unittest
```

If your PR adds a new feature or helper, please also add some tests.

### Publishing

There's a bit of automation set up to help publish a new version of the package to PyPI:

1. Make sure the version string has been updated since the last release. This package follows semantic versioning.
2. Create a new release in the GitHub UI, using the new version as the release name.
3. Watch as the `publish.yml` GitHub action builds the package and pushes it to PyPI.