Whistlerlib

Distributed by default

Built on Dask. Hashtag, mention, and n-gram histograms fan out across your cluster and return as pandas DataFrames.

Python or R, your choice

Every analytic ships in two flavors. *_alt_python wraps advertools / nltk / sklearn; *_r shells out to Rscript inside the worker image.

Docker-deployable cluster

One published image (albertogarob/whistlerlib) for both scheduler and workers. Compose for dev, Swarm for prod.

Hashtags, mentions, n-grams at scale

Load a CSV into a Dask-backed TweetDataset with Context.load_csv, then call a single method on it. The top-k histogram runs as a map_partitions step followed by a distributed groupby, and the result comes back as a pandas DataFrame.

from whistlerlib import Context

ctx = Context('processes', '127.0.0.1', 8786)
ds = ctx.load_csv(
    filen='posts.csv',
    meta={
        'column_mapping': {'date_column': 'Date', 'text_column': 'text'},
        'file_encoding': 'utf-8',
    },
    num_partitions=8,
)

top5 = ds.hashtag_histogram_alt_python(k=5)
print(top5)
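
To make the map-then-merge pattern concrete, here is a minimal standard-library sketch of a partitioned top-k hashtag count. It is not Whistlerlib's implementation: the regex, the lowercasing of tags, the tie-breaking order, and the function names count_partition and top_k_hashtags are all assumptions made for illustration, and plain lists stand in for Dask partitions.

```python
import re
from collections import Counter

# Assumed hashtag pattern; the real extractor may tokenize differently.
HASHTAG_RE = re.compile(r"#\w+")

def count_partition(texts):
    """Map step: count hashtags within one partition of posts."""
    counts = Counter()
    for text in texts:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

def top_k_hashtags(partitions, k):
    """Reduce step: merge per-partition counts and keep the k most frequent."""
    total = Counter()
    for part in partitions:
        total.update(count_partition(part))
    # Sort by count (descending), then by tag, so ties come out in a stable order.
    return sorted(total.items(), key=lambda item: (-item[1], item[0]))[:k]

# Two lists stand in for two partitions of the dataset.
partitions = [
    ["Loving #dask and #python", "#Python all day"],
    ["#dask scales", "no tags here", "#python #pandas"],
]
top2 = top_k_hashtags(partitions, k=2)
print(top2)
```

Because each partition is counted independently and the counts are only merged at the end, the result does not depend on how the posts are split across partitions.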

Two-layer dispatch: Python or R, same surface

Every analytic ships as a matched *_alt_python / *_r pair. Both return results of the same shape and differ only in the per-partition extractor. The alt-Python layer covers most cases without an R installation; the R layer wraps tm, syuzhet, RWeka, and related packages for when you want their specific behavior.

# Pure Python (advertools-backed). No R install required on the worker.
ds.hashtag_histogram_alt_python(k=5)

# Same shape, different engine: R subprocess via the worker image.
# Wraps the 'tm' R package via Rscript per partition.
ds.hashtag_histogram_r(k=5)
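
Since the two methods share a surface, the engine can be chosen at runtime. The sketch below shows one way to do that. The helper hashtag_histogram and the FakeDataset stand-in are hypothetical and exist only so the example runs without a cluster; only the two method names come from the library.

```python
def hashtag_histogram(ds, engine="alt_python", **kwargs):
    """Hypothetical helper: route to ds.hashtag_histogram_<engine>."""
    if engine not in ("alt_python", "r"):
        raise ValueError(f"unknown engine: {engine!r}")
    return getattr(ds, f"hashtag_histogram_{engine}")(**kwargs)

class FakeDataset:
    """Stand-in for a real dataset object; not part of Whistlerlib."""

    def hashtag_histogram_alt_python(self, k):
        return ("alt_python", k)

    def hashtag_histogram_r(self, k):
        return ("r", k)

# With a real dataset, pass ds instead of FakeDataset().
result = hashtag_histogram(FakeDataset(), engine="r", k=5)
print(result)
```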

Co-occurrence networks as igraph.Graph

hashtag_weighted_coonet and mention_weighted_coonet return both an edge DataFrame and a ready-to-analyze igraph.Graph. Edges are sorted and deduplicated for deterministic output regardless of partition count.

# Returns (edges_df, igraph.Graph).
edges, g = ds.hashtag_weighted_coonet()

print(f"{g.vcount()} nodes, {g.ecount()} edges")
# Hand g to igraph for community detection, centrality, layouts:
communities = g.community_multilevel()
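
The following standard-library sketch illustrates why sorting and deduplicating gives deterministic edges. It is an illustration under assumptions, not the library's code: the regex, the lowercasing, the (source, target, weight) tuple layout, and the name cooccurrence_edges are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Assumed hashtag pattern; the real extractor may tokenize differently.
HASHTAG_RE = re.compile(r"#\w+")

def cooccurrence_edges(texts):
    """Return sorted (source, target, weight) edges, one per unordered hashtag pair."""
    weights = Counter()
    for text in texts:
        # A set drops repeated tags within a post; sorting gives each pair
        # a single canonical orientation, so (a, b) and (b, a) never both appear.
        tags = sorted({tag.lower() for tag in HASHTAG_RE.findall(text)})
        weights.update(combinations(tags, 2))
    # Sorting the final edge list removes any dependence on processing order.
    return sorted((a, b, w) for (a, b), w in weights.items())

posts = ["#dask #python", "#python #dask #pandas", "#pandas only"]
edges = cooccurrence_edges(posts)
print(edges)
```

Tuples in this layout can be turned into a weighted graph with python-igraph's Graph.TupleList(edges, weights=True), if that package is installed.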

Spanish sentiment ranges out of the box

sentiment_range_spanish_alt_python scores every row with the sentiment-analysis-spanish Keras model and returns only the rows whose score lies in a chosen interval. The model is loaded per worker, and the filtering happens on the cluster.

# Keep only rows whose Spanish sentiment score falls in [0.0, 0.5].
# Score is computed per-row via sentiment-analysis-spanish (TF/Keras),
# distributed across workers, then filtered with a boolean mask.
neutral_or_negative = ds.sentiment_range_spanish_alt_python(
    left_end=0.0,
    right_end=0.5,
)
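
The score-then-mask step can be sketched in a few lines of plain Python. This is a simplified illustration: the scores are made up, FAKE_SCORES stands in for the real model, filter_by_score_range is a hypothetical name, and treating both endpoints as inclusive is an assumption rather than documented behavior.

```python
def filter_by_score_range(rows, score_fn, left_end, right_end):
    """Keep rows whose score lies in the closed interval [left_end, right_end]."""
    scores = [score_fn(row) for row in rows]
    # Boolean mask: True where the score falls inside the interval.
    mask = [left_end <= score <= right_end for score in scores]
    return [row for row, keep in zip(rows, mask) if keep]

# Hypothetical stand-in for a sentiment model: fixed, made-up scores in [0, 1].
FAKE_SCORES = {"me encanta": 0.93, "regular": 0.48, "horrible": 0.04}

kept = filter_by_score_range(list(FAKE_SCORES), FAKE_SCORES.get, 0.0, 0.5)
print(kept)
```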

R bridge inside a Docker worker

The albertogarob/whistlerlib image bakes in R plus tm, syuzhet, RWeka, radvertools, and the system libraries they need. Both the scheduler and the workers use the same image; the scheduler just overrides the entrypoint to dask-scheduler. Your host never installs R.

# Bring up scheduler + 2 workers (Compose) on the local host.
docker compose -f docker/docker-compose.yml up -d

# Or pin to a published image tag:
WHISTLERLIB_TAG=0.2.0 docker compose -f docker/docker-compose.yml up -d

# Connect from your Python client:
#   from whistlerlib import Context
#   ctx = Context('processes', 'localhost', 8786)

Time profiling baked into every primitive

Pass return_time_profile=True to any analytic to get a TimeProfile alongside the result. Each stage is labeled with a _dist or _local suffix so you can see which steps ran on workers and which ran on the client.

# Every analytic accepts return_time_profile=True to get a per-stage
# timing breakdown alongside the result. Useful when tuning partition
# counts or comparing alt-python vs R implementations.
top5, profile = ds.hashtag_histogram_alt_python(
    k=5,
    return_time_profile=True,
)

print(profile)
# Stages are labeled <NN>_<name>_dist or <NN>_<name>_local, so you can see
# which steps ran on the workers and which ran client-side.
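
For readers curious how stage-labeled timing of this kind can work, here is a small self-contained sketch. StageTimer is a hypothetical class, not Whistlerlib's TimeProfile, and numbering stages from 00 is an assumption about the <NN> prefix.

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Hypothetical helper: records elapsed seconds per labeled stage."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name, where):
        # where is "dist" (ran on workers) or "local" (ran on the client).
        label = f"{len(self.stages):02d}_{name}_{where}"
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[label] = time.perf_counter() - start

timer = StageTimer()
with timer.stage("extract_hashtags", "dist"):
    counts = sum(range(100_000))  # placeholder for distributed work
with timer.stage("sort_top_k", "local"):
    ordered = sorted([3, 1, 2])  # placeholder for client-side work

for label, seconds in timer.stages.items():
    print(f"{label}: {seconds:.6f}s")
```

Comparing the _dist and _local totals is a quick way to tell whether more partitions are likely to help or whether the client-side steps dominate.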

Get started

pip install whistlerlib
