
Whistlerlib

Whistlerlib is a Python library for distributed processing of large social-media datasets, developed at the CentroGeo Metropolitan Observatory. It combines social-network-analysis (SNA) and natural-language-processing (NLP) primitives with a Dask-backed execution model, so that a single analytical query (top-k hashtags, weighted co-occurrence networks, Spanish sentiment ranges, and so on) fans out across a cluster of workers and comes back as a pandas DataFrame or an igraph.Graph.
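
To make the fan-out idea concrete, here is a minimal standard-library sketch of a partitioned top-k hashtag count: each partition is counted independently and the partial counts are merged at the end. It is not Whistlerlib's implementation; the function names, the hashtag regex, and the sample posts are assumptions made for illustration, and the plain `map` stands in for work that a Dask cluster would spread across workers.

```python
import re
from collections import Counter

# Assumed hashtag pattern for illustration; the library may tokenize differently.
HASHTAG_RE = re.compile(r"#\w+")


def count_hashtags(partition):
    """Count lower-cased hashtags in one partition (a list of post texts)."""
    counts = Counter()
    for text in partition:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


def top_k_hashtags(partitions, k):
    """Merge per-partition counts and return the k most common hashtags."""
    total = Counter()
    # On a cluster, each count_hashtags call would run on a worker;
    # here the built-in map runs them one after another.
    for partial in map(count_hashtags, partitions):
        total.update(partial)
    return total.most_common(k)


partitions = [
    ["#dask is fast", "love #python and #dask"],
    ["#python everywhere", "#Dask again"],
]
print(top_k_hashtags(partitions, 2))  # [('#dask', 3), ('#python', 2)]
```

Because the per-partition counters are simply added together, the merge step stays cheap however the rows are split.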

What it does

Family | Examples
Frequency analytics | Top-k hashtag / mention / n-gram histograms
Sentiment | Spanish sentiment scores via sentiment-analysis-spanish; multilingual emotion vectors via the syuzhet R package
Networks | Weighted co-occurrence networks of hashtags and mentions, returned as igraph.Graph
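
As an illustration of what a weighted co-occurrence network contains, the sketch below builds an edge list in which the weight of a hashtag pair is the number of posts where both appear. This is one common definition and is assumed here rather than taken from the library; the function name, regex, and sample posts are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed hashtag pattern for illustration.
HASHTAG_RE = re.compile(r"#\w+")


def cooccurrence_edges(texts):
    """Map each unordered hashtag pair to the number of posts containing both."""
    edges = Counter()
    for text in texts:
        # De-duplicate within a post and sort so each pair has one canonical order.
        tags = sorted({tag.lower() for tag in HASHTAG_RE.findall(text)})
        edges.update(combinations(tags, 2))
    return edges


posts = ["#a #b #c", "#b #a", "#c only"]
edges = cooccurrence_edges(posts)
for (source, target), weight in sorted(edges.items()):
    print(source, target, weight)
```

An edge list of this shape could then be handed to a graph library; with python-igraph, for example, `Graph.TupleList` accepts `(source, target, weight)` tuples when called with `weights=True`.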

Each analytic comes in two flavours:

  • *_alt_python: a pure-Python implementation (uses advertools, nltk, sentiment-analysis-spanish, …).
  • *_r: runs an Rscript subprocess on each worker, wrapping a third-party R library (tm, syuzhet, radvertools, …).

Both flavours produce identically-shaped results; you pick based on which third-party tooling you trust for the domain at hand. See Algorithm families for the dispatch story.
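
Since the two flavours differ only in their method-name suffix, calling code can choose between them at run time. The sketch below shows one possible way to do that with `getattr`. Everything in it is hypothetical: `FakeDataset` is a stand-in rather than Whistlerlib's dataset class, `run_analytic` is an invented helper, and `hashtag_histogram_r` is assumed by analogy with the `*_r` naming pattern.

```python
class FakeDataset:
    """Hypothetical stand-in for a dataset object; not part of Whistlerlib."""

    def hashtag_histogram_alt_python(self, k):
        # Stub result in an assumed (hashtag, count) shape.
        return [("#example", 3), ("#demo", 1)][:k]

    def hashtag_histogram_r(self, k):
        # A real *_r method would run Rscript on each worker;
        # this stub only mimics the identically-shaped result.
        return [("#example", 3), ("#demo", 1)][:k]


def run_analytic(dataset, name, flavour, **kwargs):
    """Call `<name>_alt_python` or `<name>_r` on the dataset, depending on flavour."""
    suffix = {"python": "alt_python", "r": "r"}[flavour]
    method = getattr(dataset, f"{name}_{suffix}")
    return method(**kwargs)


ds = FakeDataset()
print(run_analytic(ds, "hashtag_histogram", "python", k=1))
print(run_analytic(ds, "hashtag_histogram", "r", k=1))
```

Keeping the choice in one helper means downstream code does not need to know which backend produced a result.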

When to use it

Whistlerlib is built for the case where your dataset is too large for a single-process pandas workflow but doesn't need a Spark cluster. Typical use:

  • Tweet / post corpora with millions to hundreds-of-millions of rows.
  • A small Dask cluster (one scheduler, a handful of workers) running on a researcher's lab machines or a few cloud VMs.
  • Pipeline output you want to slot into downstream pandas / Jupyter analysis.

If you only have a few thousand rows, pandas + the underlying libraries (advertools, nltk, igraph) are simpler. If you have petabyte-scale data, look at Spark or Ray.

30-second tour

from whistlerlib import Context

# Connect a client to a running Dask scheduler.
ctx = Context('processes', '127.0.0.1', 8786)

# Wrap a CSV in an 8-partition Dask DataFrame.
ds = ctx.load_csv(
    filen='posts.csv',
    meta={
        'column_mapping': {'date_column': 'Date', 'text_column': 'text'},
        'file_encoding': 'utf-8',
    },
    num_partitions=8,
)

# Top-5 hashtags, distributed.
top5 = ds.hashtag_histogram_alt_python(k=5)
print(top5)

The full quickstart, including how to spin up a local cluster with Docker Compose, lives in Tutorial 01.

Next steps

License

GPL-3.0-or-later. See the LICENSE file.