# GDELT 2.0 Doc API Client
A Python client to fetch data from the [GDELT 2.0 Doc API](https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/).
This allows for simpler, small-scale analysis of news coverage without having to deal with the complexities of downloading and managing the raw files from S3, or working with the BigQuery export.
## Installation
`gdeltdoc` is on PyPI and can be installed with pip:
```bash
pip install gdeltdoc
```
## Use
The `ArtList` and `Timeline*` query modes are supported.
```python
from gdeltdoc import GdeltDoc, Filters
f = Filters(
    keyword="climate change",
    start_date="2020-05-10",
    end_date="2020-05-11",
)
gd = GdeltDoc()
# Search for articles matching the filters
articles = gd.article_search(f)
# Get a timeline of the number of articles matching the filters
timeline = gd.timeline_search("timelinevol", f)
```
## Integration in `llm_quant`
This repository wires `gdeltdoc` into the TuShare ingestion workflow so GDELT headlines arrive alongside the usual market data.
- Configuration lives under `gdelt_sources` in `app/data/config.json` (managed via `AppConfig.gdelt_sources`).
- `app/ingest/gdelt.py` wraps the Doc API, materialising results as `RssItem` objects so they share the same dedupe/heat scoring pipeline as RSS feeds.
- `app/ingest/coverage.ensure_data_coverage` now calls `ingest_configured_gdelt(...)` after the core TuShare tables, supporting incremental fetches via `ingest_state`.
Enable a source by flipping `enabled: true` in the config, optionally providing `start_date`/`end_date` windows or a rolling `timespan`. Subsequent runs only request data beyond the last persisted publish time.
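
The incremental behaviour described above can be sketched as follows. This is a minimal illustration, not the code in `app/ingest/gdelt.py`: the function name `next_fetch_window` and the one-second offset are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def next_fetch_window(
    last_published: Optional[datetime],
    configured_start: datetime,
    now: datetime,
) -> Optional[Tuple[datetime, datetime]]:
    """Return the (start, end) window for the next request, or None if up to date.

    Hypothetical helper: it only illustrates "request data beyond the last
    persisted publish time"; the real ingestion code may differ.
    """
    if last_published is None:
        # First run: nothing persisted yet, so use the configured start.
        start = configured_start
    else:
        # Later runs: resume just after the newest article already stored.
        start = max(configured_start, last_published + timedelta(seconds=1))
    if start >= now:
        return None
    return start, now


# Example call with made-up timestamps.
window = next_fetch_window(
    last_published=datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc),
    configured_start=datetime(2024, 1, 1, tzinfo=timezone.utc),
    now=datetime(2024, 1, 3, tzinfo=timezone.utc),
)
```

The returned pair could then be passed as `start_date` and `end_date` to `Filters`, which accepts datetime objects in UTC.
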
### Article List
The article list mode of the API generates a list of news articles that match the filters. The client returns this as a pandas DataFrame with columns `url`, `url_mobile`, `title`, `seendate`, `socialimage`, `domain`, `language`, `sourcecountry`.
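
A common first step with the article list is to turn `seendate` into a real timestamp. The sketch below uses only the standard library and assumes `seendate` arrives as a compact UTC string such as `20200510T121500Z`; that format is an assumption, so check a real response before relying on it. `parse_seendate` is a hypothetical helper, not part of `gdeltdoc`.

```python
from datetime import datetime, timezone


def parse_seendate(value: str) -> datetime:
    """Parse a compact UTC timestamp (assumed format: YYYYMMDDTHHMMSSZ)."""
    parsed = datetime.strptime(value, "%Y%m%dT%H%M%SZ")
    # The trailing "Z" marks UTC, so attach the timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


seen = parse_seendate("20200510T121500Z")
```

With a DataFrame of articles, the same helper could be applied column-wise, for example `articles["seendate"].map(parse_seendate)`.
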
### Timeline Search
There are 5 available modes when making a timeline search:
- `timelinevol` - a timeline of the volume of news coverage matching the filters, represented as a percentage of the total news articles monitored by GDELT.
- `timelinevolraw` - similar to `timelinevol`, but returns the actual number of matching articles and the total number of articles rather than a percentage.
- `timelinelang` - similar to `timelinevol` but breaks the total articles down by published language. Each language is returned as a separate column in the DataFrame.
- `timelinesourcecountry` - similar to `timelinevol` but breaks the total articles down by the country they were published in. Each country is returned as a separate column in the DataFrame.
- `timelinetone` - a timeline of the average tone of the news coverage matching the filters. See [GDELT's documentation](https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/) for more information about the tone metric.
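
The relationship between `timelinevol` and `timelinevolraw` can be sketched with plain arithmetic. The counts below are made up for illustration, and `volume_percent` is a hypothetical helper rather than anything the client exposes.

```python
def volume_percent(matching: int, total: int) -> float:
    """Share of monitored articles that match the filters, as a percentage."""
    if total <= 0:
        # Avoid dividing by zero when no articles were monitored.
        return 0.0
    return 100.0 * matching / total


# Illustrative numbers only, not real GDELT data.
raw_rows = [
    {"matching": 50, "total": 10_000},
    {"matching": 120, "total": 12_000},
]
percentages = [volume_percent(r["matching"], r["total"]) for r in raw_rows]
```
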
### Filters
The search query passed to the API is constructed from a `gdeltdoc.Filters` object.
```python
from gdeltdoc import Filters, near, repeat
f = Filters(
    start_date="2020-05-01",
    end_date="2020-05-02",
    num_records=250,
    keyword="climate change",
    domain=["bbc.co.uk", "nytimes.com"],
    country=["UK", "US"],
    theme="GENERAL_HEALTH",
    near=near(10, "airline", "carbon"),
    repeat=repeat(5, "planet"),
)
```
Filters for `keyword`, `domain`, `domain_exact`, `country`, `language` and `theme` can be passed either as a single string or as a list of strings. If a list is passed, the values in the list are wrapped in a boolean OR.
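
To make the OR behaviour concrete, the sketch below builds the kind of query fragment the Doc API accepts for a list of values. `or_group` is a hypothetical helper written for this example; it is not how `gdeltdoc` is implemented internally, and the exact string the library produces may differ.

```python
from typing import List, Union


def or_group(field: str, values: Union[str, List[str]]) -> str:
    """Build a `field:value` term, or an OR group for several values."""
    if isinstance(values, str):
        return f"{field}:{values}"
    if not values:
        raise ValueError("at least one value is required")
    if len(values) == 1:
        return f"{field}:{values[0]}"
    # Several values: join them with OR and wrap the group in parentheses.
    return "(" + " OR ".join(f"{field}:{v}" for v in values) + ")"


fragment = or_group("domain", ["bbc.co.uk", "nytimes.com"])
```

Here `fragment` is the single term `(domain:bbc.co.uk OR domain:nytimes.com)`, so an article from either domain satisfies the filter.
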
You must pass either `start_date` and `end_date`, or `timespan`.
- `start_date` - The start date for the filter in YYYY-MM-DD format or as a datetime object in UTC time.
Passing a datetime allows you to specify a time down to seconds granularity. The API officially only supports the most recent 3 months of articles. Making a request for an earlier date range may still return data, but it's not guaranteed.
- `end_date` - The end date for the filter in YYYY-MM-DD format or as a datetime object in UTC time.
- `timespan` - A timespan to search for, relative to the time of the request. Must match one of the timespan formats listed in the [API documentation](https://blog.gdeltproject.org/gdelt-doc-2-0-api-debuts/).
- `num_records` - The number of records to return. Only used in article list mode and can be up to 250.
- `keyword` - Return articles containing the exact phrase `keyword` within the article text.
- `domain` - Return articles from the specified domain. Does not require an exact match so passing "cnn.com" will match articles from `cnn.com`, `subdomain.cnn.com` and `notactuallycnn.com`.
- `domain_exact` - Similar to `domain`, but requires an exact match.
- `country` - Return articles published in a country or list of countries, formatted as the FIPS 2 letter country code.
- `language` - Return articles published in the given language, formatted as the ISO 639 language code.
- `theme` - Return articles that cover one of GDELT's GKG Themes. A full list of themes can be found [here](http://data.gdeltproject.org/api/v2/guides/LOOKUP-GKGTHEMES.TXT)
- `near` - Return articles containing words close to each other in the text. Use `near()` to construct, e.g. `near = near(5, "airline", "climate")`, or `multi_near()` for multiple restrictions, e.g. `multi_near([(5, "airline", "crisis"), (10, "airline", "climate", "change")], method="AND")`, which finds "airline" and "crisis" within 5 words, and "airline", "climate", and "change" within 10 words.
- `repeat` - Return articles containing a single word repeated at least a given number of times. Use `repeat()` to construct, e.g. `repeat = repeat(3, "environment")`, or `multi_repeat()` for multiple restrictions, e.g. `repeat = multi_repeat([(2, "airline"), (3, "airport")], "AND")`.
- `tone` - Return articles above or below a particular tone score (i.e. more positive or more negative than a certain threshold). To use, specify either a greater-than or less-than sign followed by a positive or negative number (integer or floating point). To find fairly positive articles, use `tone=">5"`; to find fairly negative articles, use `tone="<-5"`.
- `tone_absolute` - The same as `tone`, but ignores the positive/negative sign, so you can search for high-emotion or low-emotion articles regardless of whether they were happy or sad in tone.
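
The difference between `domain` and `domain_exact` described above can be illustrated locally. The sketch treats `domain` as a substring test, which is consistent with the three `cnn.com` examples given for that filter. Both functions are hypothetical and only mimic the documented behaviour, since the real matching happens on GDELT's servers.

```python
def matches_domain(article_domain: str, wanted: str) -> bool:
    """Loose match: `wanted` may appear anywhere in the article's domain."""
    return wanted.lower() in article_domain.lower()


def matches_domain_exact(article_domain: str, wanted: str) -> bool:
    """Strict match: the article's domain must equal `wanted`."""
    return article_domain.lower() == wanted.lower()


candidates = ["cnn.com", "subdomain.cnn.com", "notactuallycnn.com", "bbc.co.uk"]
loose = [d for d in candidates if matches_domain(d, "cnn.com")]
exact = [d for d in candidates if matches_domain_exact(d, "cnn.com")]
```

The loose list keeps the three `cnn.com` lookalikes, while the exact list keeps only `cnn.com`, which is why `domain_exact` is the safer choice when the source must be a specific site.
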
## Developing gdelt-doc-api
PRs & issues are very welcome!
### Setup
It's recommended to use a virtual environment for development. Set one up with:
```bash
python -m venv venv
```
Activate it (on Mac or Linux):
```bash
source venv/bin/activate
```
Then install the requirements:
```bash
pip install -r requirements.txt
```
Tests for this package use `unittest`. Run them with:
```bash
python -m unittest
```
If your PR adds a new feature or helper, please also add some tests.
### Publishing
There's a bit of automation set up to help publish a new version of the package to PyPI:
1. Make sure the version string has been updated since the last release. This package follows semantic versioning.
2. Create a new release in the GitHub UI, using the new version as the release name.
3. Watch as the `publish.yml` GitHub Action builds the package and pushes it to PyPI.