Whistlerlib
Whistlerlib is a Python library for distributed processing of large social-media datasets, developed at the CentroGeo Metropolitan Observatory. It combines social-network-analysis (SNA) and natural-language-processing (NLP) primitives with a Dask-backed execution model, so a single analytical query, top-k hashtags, weighted co-occurrence networks, Spanish sentiment ranges, etc., fans out across a cluster of workers and comes back as a pandas DataFrame or an igraph.Graph.
What it does
| Family | Examples |
|---|---|
| Frequency analytics | Top-k hashtag / mention / n-gram histograms |
| Sentiment | Spanish sentiment scores via sentiment-analysis-spanish; multilingual emotion vectors via the syuzhet R package |
| Networks | Weighted co-occurrence networks of hashtags and mentions, returned as igraph.Graph |
Each analytic comes in two flavours:
*_alt_python, pure Python implementation (usesadvertools,nltk,sentiment-analysis-spanish, …).*_r, runs anRscriptsubprocess on each worker, wrapping a third-party R library (tm,syuzhet,radvertools, …).
Both flavours produce identically-shaped results; you pick based on which third-party tooling you trust for the domain at hand. See Algorithm families for the dispatch story.
When to use it
Whistlerlib is built for the case where your dataset is too large for a single-process pandas workflow but doesn't need a Spark cluster. Typical use:
- Tweet / post corpora with millions to hundreds-of-millions of rows.
- A small Dask cluster (one scheduler, a handful of workers) running on a researcher's lab machines or a few cloud VMs.
- Pipeline output you want to slot into downstream pandas / Jupyter analysis.
If you only have a few thousand rows, pandas + the underlying libraries (advertools, nltk, igraph) are simpler. If you have petabyte-scale data, look at Spark or Ray.
30-second tour
from whistlerlib import Context
# Connect a client to a running Dask scheduler.
ctx = Context('processes', '127.0.0.1', 8786)
# Wrap a CSV in a 8-partition Dask DataFrame.
ds = ctx.load_csv(
filen='posts.csv',
meta={
'column_mapping': {'date_column': 'Date', 'text_column': 'text'},
'file_encoding': 'utf-8',
},
num_partitions=8,
)
# Top-5 hashtags, distributed.
top5 = ds.hashtag_histogram_alt_python(k=5)
print(top5)
The full quickstart, including how to spin up a local cluster with Docker Compose, lives in Tutorial 01.
Next steps
- Install: pip (client) or Docker (cluster).
- Learn the model: Architecture, Context & datasets, Algorithm families.
- Run something: Tutorials, seven runnable end-to-end examples.
- Upgrade from 0.1.0: Migration guide.
License
GPL-3.0-or-later. See the LICENSE file.