Distributed by default
Built on Dask. Hashtag, mention, and n-gram histograms fan out across your cluster and return as pandas DataFrames.
Python or R, your choice
Every analytic ships in two flavors. *_alt_python wraps advertools / nltk / sklearn; *_r shells out to Rscript inside the worker image.
Docker-deployable cluster
One published image (albertogarob/whistlerlib) for both scheduler and workers. Compose for dev, Swarm for prod.
Hashtags, mentions, n-grams at scale
Wrap a CSV in a Dask DataFrame, then call a single method on the resulting TweetDataset. The top-k histogram fans out across the cluster as a map_partitions + distributed groupby, and lands back as a pandas DataFrame.
from whistlerlib import Context
ctx = Context('processes', '127.0.0.1', 8786)
ds = ctx.load_csv(
    filen='posts.csv',
    meta={
        'column_mapping': {'date_column': 'Date', 'text_column': 'text'},
        'file_encoding': 'utf-8',
    },
    num_partitions=8,
)
top5 = ds.hashtag_histogram_alt_python(k=5)
print(top5)
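To make the fan-out concrete, here is a minimal pure-Python sketch of the same pattern: count hashtags inside each partition, merge the partial counts, then keep the top k. It is an illustration only, not whistlerlib's implementation. The helper names (count_partition, top_k_hashtags), the regex, and the lowercasing of tags are all assumptions, and plain lists stand in for Dask partitions.

```python
import re
from collections import Counter

# Assumed hashtag pattern; the library's extractor may differ.
HASHTAG_RE = re.compile(r"#\w+")

def count_partition(texts):
    """Per-partition step: count hashtags in one chunk of posts."""
    counts = Counter()
    for text in texts:
        # Lowercasing is an assumption made for this sketch.
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

def top_k_hashtags(partitions, k):
    """Merge the per-partition counts, then keep the k most frequent tags."""
    total = Counter()
    for partial in map(count_partition, partitions):
        total.update(partial)
    # Sort by count (descending), then by tag, so ties are deterministic.
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Two lists stand in for two partitions of a posts table.
partitions = [
    ["Loving #dask and #python", "#dask scales"],
    ["#Python for text", "no tags here", "#dask again"],
]
print(top_k_hashtags(partitions, k=2))
```

Because the per-partition step only needs its own chunk, it can run anywhere. Only the small count tables have to be combined, which is why this shape suits a cluster.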
Two-layer dispatch: Python or R, same surface
Every analytic ships in matched *_alt_python / *_r pairs that compute the same shape with different per-partition extractors. The alt-Python layer covers most cases with zero R install; the R layer wraps tm, syuzhet, RWeka, and friends when you want their specific behavior.
# Pure Python (advertools-backed). No R install required on the worker.
ds.hashtag_histogram_alt_python(k=5)
# Same shape, different engine: R subprocess via the worker image.
# Wraps the 'tm' R package via Rscript per partition.
ds.hashtag_histogram_r(k=5)
Co-occurrence networks as igraph.Graph
hashtag_weighted_coonet and mention_weighted_coonet return both an edge DataFrame and a ready-to-analyze igraph.Graph. Edges are sorted and deduplicated for deterministic output regardless of partition count.
# Returns (edges_df, igraph.Graph).
edges, g = ds.hashtag_weighted_coonet()
print(f"{g.vcount()} nodes, {g.ecount()} edges")
# Hand g to igraph for community detection, centrality, layouts:
communities = g.community_multilevel()
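The sorting and deduplication idea can be sketched in a few lines of standard-library Python. This is a hedged illustration of one way to get order-independent edges, not the code inside hashtag_weighted_coonet. The function name cooccurrence_edges, the regex, and the lowercasing are assumptions.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed hashtag pattern; the library's extractor may differ.
HASHTAG_RE = re.compile(r"#\w+")

def cooccurrence_edges(texts):
    """Return sorted (source, target, weight) rows for hashtags sharing a post."""
    weights = Counter()
    for text in texts:
        # Deduplicate tags within a post and sort them, so each undirected
        # pair is always stored in the same (smaller, larger) orientation.
        tags = sorted({t.lower() for t in HASHTAG_RE.findall(text)})
        weights.update(combinations(tags, 2))
    # Sorting the final rows makes the output independent of input order.
    return sorted((a, b, w) for (a, b), w in weights.items())

posts = ["#a #b #c", "#b #a", "#c only", "#B #c"]
edges = cooccurrence_edges(posts)
reordered = cooccurrence_edges(list(reversed(posts)))
print(edges)
```

Feeding the posts in a different order yields the same rows, which is the property that keeps results stable when the partition count changes. If python-igraph is installed, rows of this shape could be passed to igraph.Graph.TupleList(edges, weights=True) to build a weighted graph.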
Spanish sentiment ranges out of the box
sentiment_range_spanish_alt_python scores every row against the sentiment-analysis-spanish Keras model and returns only the rows whose score lies in a chosen interval. The model is loaded per worker, and the filtering happens on the cluster.
# Keep only rows whose Spanish sentiment score falls in [0.0, 0.5].
# Score is computed per-row via sentiment-analysis-spanish (TF/Keras),
# distributed across workers, then filtered with a boolean mask.
neutral_or_negative = ds.sentiment_range_spanish_alt_python(
    left_end=0.0,
    right_end=0.5,
)
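The score-then-filter step can be pictured with a toy example. The sketch below swaps the Keras model for a hypothetical word-list scorer (toy_score) so that it runs anywhere. It assumes the interval is closed at both ends, which the source does not state explicitly. rows_in_range is likewise an illustrative name, not a whistlerlib function.

```python
def toy_score(text):
    """Hypothetical stand-in scorer: share of words in a tiny positive lexicon."""
    positive = {"bueno", "genial", "excelente"}
    words = text.lower().split()
    return sum(w in positive for w in words) / len(words) if words else 0.0

def rows_in_range(rows, score_fn, left_end, right_end):
    """Score each row, then keep those with left_end <= score <= right_end."""
    scored = [(row, score_fn(row)) for row in rows]
    # Closed interval on both ends: an assumption made for this sketch.
    return [(row, s) for row, s in scored if left_end <= s <= right_end]

rows = ["excelente servicio", "genial excelente", "muy malo"]
kept = rows_in_range(rows, toy_score, left_end=0.0, right_end=0.5)
print(kept)
```

Scoring is independent per row, so it parallelizes across partitions. The filter is a cheap comparison applied afterwards.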
R bridge inside a Docker worker
The albertogarob/whistlerlib image bakes in R plus tm, syuzhet, RWeka, radvertools, and the system libraries they need. Both the scheduler and the workers use the same image; the scheduler just overrides the entrypoint to dask-scheduler. Your host never installs R.
# Bring up scheduler + 2 workers (Compose) on the local host.
docker compose -f docker/docker-compose.yml up -d
# Or pin to a published image tag:
WHISTLERLIB_TAG=0.2.0 docker compose -f docker/docker-compose.yml up -d
# Connect from your Python client:
# from whistlerlib import Context
# ctx = Context('processes', 'localhost', 8786)
Time profiling baked into every primitive
Pass return_time_profile=True to any analytic to get a TimeProfile alongside the result. Each stage is labeled with a _dist or _local suffix so you can see which steps ran on workers and which ran on the client.
# Every analytic accepts return_time_profile=True to get a per-stage
# timing breakdown alongside the result. Useful when tuning partition
# counts or comparing alt-python vs R implementations.
top5, profile = ds.hashtag_histogram_alt_python(
    k=5,
    return_time_profile=True,
)
print(profile)
# Stages are labeled <NN>_<name>_dist or <NN>_<name>_local so you can see
# which steps ran on the workers and which ran client-side.
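For a feel of how stage labels of this kind can be produced, here is a small standard-library sketch. StageTimer is a hypothetical class written for this example and is not whistlerlib's TimeProfile. The two-digit counter starting at 01 and the stage names are assumptions.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Collects (label, seconds) pairs for consecutive pipeline stages."""

    def __init__(self):
        self.stages = []

    @contextmanager
    def stage(self, name, where):
        # where is 'dist' (cluster side) or 'local' (client side).
        label = f"{len(self.stages) + 1:02d}_{name}_{where}"
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.stages.append((label, time.perf_counter() - start))

timer = StageTimer()
with timer.stage("extract_hashtags", "dist"):
    counts = {"#dask": 3, "#python": 2}  # placeholder for cluster work
with timer.stage("sort_top_k", "local"):
    top = sorted(counts.items(), key=lambda kv: -kv[1])[:1]

for label, seconds in timer.stages:
    print(f"{label}: {seconds:.6f}s")
```

Reading the suffixes side by side shows at a glance whether time is going to the workers or to client-side post-processing, which helps when choosing a partition count.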
Get started
pip install whistlerlib
