API reference
The user-facing surface is two classes:
- Context: the entry point. Connects to a Dask cluster and produces datasets.
- TweetDataset: the analytic surface. Every method returns a pandas.DataFrame or an igraph.Graph.
For the architecture behind these methods, see Algorithm families and Context and datasets.
At a glance
| Method | Layer | Underlying library | Returns |
|---|---|---|---|
| hashtag_histogram_alt_python | Python | advertools | top-k (tag, freq) DataFrame |
| hashtag_histogram_r | R-bridge | R tm + radvertools | top-k (tag, freq) DataFrame |
| mention_histogram_alt_python | Python | advertools | top-k (Mentions, Frequency) DataFrame |
| mention_histogram_r | R-bridge | R tm + radvertools | top-k (mentions, Freq) DataFrame |
| ngram_histogram_alt_python | Python | sklearn.CountVectorizer + NLTK stopwords | top-k (N_Tokens, Freq) DataFrame |
| ngram_histogram_r | R-bridge | R RWeka::NGramTokenizer | top-k (N_Tokens, Freq) DataFrame |
| sentiment_range_spanish_alt_python | Python | sentiment-analysis-spanish + NLTK Spanish stopwords | filtered (text, score) DataFrame |
| sentiment_histogram_and_sum_r | R-bridge | R syuzhet | per-emotion (emotion, count, sum) DataFrame |
| hashtag_weighted_coonet | Python | advertools (extraction) + igraph (graph) | (edge DataFrame, igraph.Graph) |
| mention_weighted_coonet | Python | advertools (extraction) + igraph (graph) | (edge DataFrame, igraph.Graph) |
Utility methods on TweetDataset (no analytics, just dataset shaping): tweet_count, group_by_date, range_by_dates, repartition, get_num_partitions, create_index.
Common parameters
Several analytic methods share the same flags. They behave identically across methods:
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| k | int | (required) | Top-k cutoff. The result is sorted by frequency, descending, and truncated to the top k rows. |
| distributed_sorting | bool | False | When False, the top-k selection runs locally on the client after distributed aggregation (deterministic). When True, nlargest runs in the Dask graph; ties may be ordered non-deterministically (depends on partition count). |
| return_time_profile | bool | False | When True, returns a tuple (result, time_profile_df) where the profile DataFrame breaks down per-stage wall time with _dist / _local suffixes marking distributed vs client-local work. See Architecture. |
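The tie-ordering caveat for distributed_sorting can be illustrated without a cluster. The sketch below is not Whistlerlib's implementation; it is a plain-Python stand-in (the helper name top_k_deterministic is hypothetical) showing how a secondary sort key keeps a top-k cut reproducible when several items share the same frequency, regardless of the order in which rows arrive.

```python
from collections import Counter


def top_k_deterministic(items, k):
    """Return the k most frequent items as (item, freq) pairs.

    Ties on frequency are broken alphabetically, so the output does not
    depend on the order in which items were counted.
    """
    counts = Counter(items)
    # Sort by frequency descending, then by item ascending as a tie-breaker.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k]


# The same tags delivered in two different orders, as if partitioned differently.
order_a = ["#b", "#a", "#c", "#a", "#b"]
order_b = ["#a", "#b", "#a", "#c", "#b"]

result_a = top_k_deterministic(order_a, k=2)
result_b = top_k_deterministic(order_b, k=2)
```

Both calls yield the same two rows because the tie between the two most frequent tags is resolved by name rather than by arrival order.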
Context
The entry point. Connects to a Dask cluster and serves as a factory for datasets.
Constructor
Context(dask_scheduler, dask_scheduler_host, dask_scheduler_port)
| Parameter | Type | Description |
|---|---|---|
| dask_scheduler | str | The Dask scheduling mode; typically 'processes' for a remote cluster, 'threads' for a local thread pool. Forwarded to dask.config.set(scheduler=...). |
| dask_scheduler_host | str | Hostname or IP of the scheduler, e.g. 'localhost' for a local Compose cluster. |
| dask_scheduler_port | int | TCP port of the scheduler, typically 8786. |
from whistlerlib import Context
ctx = Context('processes', 'localhost', 8786)
Context.load_csv
ctx.load_csv(filen, meta, num_partitions=1) -> TweetDataset
Reads a CSV from the dataset repository and returns a partitioned TweetDataset.
| Parameter | Type | Description |
|---|---|---|
| filen | str | Path to the CSV (relative to the dataset repository root). |
| meta | dict | Must contain 'column_mapping' with keys 'date_column' and 'text_column'. Optional 'file_encoding' (defaults to 'utf-8'). |
| num_partitions | int | Dask partition count. Default 1. |
The loader reads only the two named columns and forces the date column to tz-naive (Dask cannot compare tz-naive and tz-aware datetimes across partitions). See Context and datasets for the full contract.
ds = ctx.load_csv(
    filen='posts.csv',
    meta={
        'column_mapping': {'date_column': 'Date', 'text_column': 'text'},
        'file_encoding': 'utf-8',
    },
    num_partitions=8,
)
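The meta contract described above can be checked on the client before any data is read. The following is a minimal sketch under that assumption; validate_meta is a hypothetical helper and not part of Whistlerlib, and the error messages are illustrative.

```python
def validate_meta(meta):
    """Check a load_csv-style meta dict and return (date_col, text_col, encoding)."""
    mapping = meta.get("column_mapping")
    if not isinstance(mapping, dict):
        raise ValueError("meta must contain a 'column_mapping' dict")
    missing = [key for key in ("date_column", "text_column") if key not in mapping]
    if missing:
        raise ValueError(f"column_mapping is missing keys: {missing}")
    # 'file_encoding' is optional and falls back to UTF-8, as documented.
    encoding = meta.get("file_encoding", "utf-8")
    return mapping["date_column"], mapping["text_column"], encoding


columns = validate_meta(
    {"column_mapping": {"date_column": "Date", "text_column": "text"}}
)
```

Failing early on a malformed dict gives a clearer error than one raised later inside a worker.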
TweetDataset analytics
All analytic methods are called on a TweetDataset instance produced by Context.load_csv(...) (or returned by TweetDataset.range_by_dates(...)). Every method honors return_time_profile.
Hashtag histograms
hashtag_histogram_alt_python
ds.hashtag_histogram_alt_python(k, distributed_sorting=False, return_time_profile=False)
Top-k hashtags in the dataset, pure-Python implementation.
- Layer: pure Python (alt_python_algs).
- Underlying library: advertools extract_hashtags.
- Base primitive: compute_vector_histogram.
- Returns: pandas.DataFrame with columns ['tag', 'freq'], sorted by freq descending, length ≤ k.
Tutorial: 01 — Quickstart hashtag histogram.
hashtag_histogram_r
ds.hashtag_histogram_r(k, distributed_sorting=False, return_time_profile=False)
Top-k hashtags, R-bridge implementation. Requires the albertogarob/whistlerlib worker image.
- Layer: R-bridge (r_algs); subprocess to Rscript.
- Underlying library: R tm + radvertools via the getMFHashtags.R script.
- Base primitive: compute_vector_histogram.
- Returns: pandas.DataFrame with columns ['tag', 'freq'], sorted by freq descending, length ≤ k. Same shape as _alt_python.
Mention histograms
mention_histogram_alt_python
ds.mention_histogram_alt_python(k, distributed_sorting=False, return_time_profile=False)
Top-k user mentions (@handle), pure-Python implementation.
- Layer: pure Python.
- Underlying library: advertools extract_mentions.
- Base primitive: compute_vector_histogram.
- Returns: pandas.DataFrame with columns ['Mentions', 'Frequency'], sorted by Frequency descending, length ≤ k.
Tutorial: 02 — Mention histogram.
mention_histogram_r
ds.mention_histogram_r(k, distributed_sorting=False, return_time_profile=False)
Top-k user mentions, R-bridge implementation.
- Layer: R-bridge.
- Underlying library: R tm + radvertools via the getMentions.R script.
- Base primitive: compute_vector_histogram.
- Returns: pandas.DataFrame with columns ['mentions', 'Freq']. Same row shape as _alt_python, but the column names are lowercased and abbreviated.
:::note Column-name divergence
mention_histogram_alt_python returns ['Mentions', 'Frequency'] while mention_histogram_r returns ['mentions', 'Freq']. Downstream code that needs to swap implementations should .rename(columns=...) after the call.
:::
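The rename suggested in the note can be sketched with pandas' DataFrame.rename. The column names come from the two method descriptions above; the helper name normalize_mention_columns and the sample rows are made up for illustration.

```python
import pandas as pd

# Column names as documented for the two mention-histogram implementations.
R_TO_PYTHON_COLUMNS = {"mentions": "Mentions", "Freq": "Frequency"}


def normalize_mention_columns(df):
    """Return a copy of an R-bridge result using the pure-Python column names."""
    return df.rename(columns=R_TO_PYTHON_COLUMNS)


# Made-up rows standing in for a mention_histogram_r result.
r_result = pd.DataFrame({"mentions": ["@alice", "@bob"], "Freq": [5, 3]})
unified = normalize_mention_columns(r_result)
```

rename returns a new DataFrame by default, so the original R-bridge result keeps its own column names.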
N-gram histograms
ngram_histogram_alt_python
ds.ngram_histogram_alt_python(n, k, lan, w, distributed_sorting=False, return_time_profile=False)
Top-k word or character n-grams, pure-Python implementation.
- Layer: pure Python.
- Underlying library: sklearn.feature_extraction.text.CountVectorizer + NLTK stopwords.
- Base primitive: compute_vector_histogram.
- Returns: pandas.DataFrame with columns ['N_Tokens', 'Freq'], sorted by Freq descending, length ≤ k.
| Parameter | Type | Description |
|---|---|---|
| n | int | n-gram order (e.g. 1 for unigrams, 2 for bigrams). |
| k | int | Top-k cutoff. |
| lan | str | NLTK language code for the stopword list (e.g. 'english', 'spanish'). On first use, the stopwords corpus is downloaded automatically. |
| w | str | CountVectorizer's analyzer: 'word' for word n-grams, 'char' for character n-grams. |
Tutorial: 03 — N-gram histogram (bilingual).
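To make the roles of n, lan, and w concrete, here is a simplified plain-Python sketch of n-gram counting. It is not the library's code and does not reproduce CountVectorizer's tokenization rules (it splits on whitespace only); the function name ngram_counts and the tiny stopword set are assumptions for illustration.

```python
from collections import Counter


def ngram_counts(texts, n, stopwords=(), analyzer="word"):
    """Count word or character n-grams across texts (simplified sketch)."""
    counts = Counter()
    stop = set(stopwords)
    for text in texts:
        lowered = text.lower()
        if analyzer == "word":
            # Drop stopwords first, then slide a window of n tokens.
            units = [tok for tok in lowered.split() if tok not in stop]
            joiner = " "
        else:
            # Character n-grams slide over the raw lowercased string.
            units = list(lowered)
            joiner = ""
        for i in range(len(units) - n + 1):
            counts[joiner.join(units[i:i + n])] += 1
    return counts


texts = ["the quick brown fox", "a quick brown dog"]
bigrams = ngram_counts(texts, n=2, stopwords={"the", "a"})
# Top-k cut with a name tie-breaker so the order is reproducible.
top = sorted(bigrams.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
```

Here n plays the role of the window size, the stopword set stands in for the list selected by lan, and analyzer mirrors the 'word' / 'char' choice made through w.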
ngram_histogram_r
ds.ngram_histogram_r(n, k, distributed_sorting=False, return_time_profile=False)
Top-k word n-grams, R-bridge implementation.
- Layer: R-bridge.
- Underlying library: R RWeka NGramTokenizer + tm.
- Base primitive: compute_vector_histogram.
- Returns: pandas.DataFrame with columns ['N_Tokens', 'Freq'], same shape as _alt_python.
Note: the R version does not expose lan or w parameters; it tokenizes words using RWeka's defaults.
Spanish sentiment range
sentiment_range_spanish_alt_python
ds.sentiment_range_spanish_alt_python(left_end, right_end, return_time_profile=False)
Returns all rows whose Spanish-sentiment score falls in [left_end, right_end]. Unlike the histograms, this is a per-row filter, not an aggregation.
- Layer: pure Python.
- Underlying library: sentiment-analysis-spanish (deep-learning Spanish polarity model, score in [0, 1]) + NLTK Spanish stopwords for text cleaning.
- Base primitive: compute_vector_range.
- Returns: pandas.DataFrame with columns ['text', 'score']. Order is not guaranteed.
| Parameter | Type | Description |
|---|---|---|
| left_end | float | Inclusive lower bound (typically in [0, 1]). |
| right_end | float | Inclusive upper bound. |
Tutorial: 04 — Spanish sentiment.
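The per-row filter semantics (both bounds inclusive) can be sketched in a few lines. The scores below are made up; in a real run they would come from the sentiment model, and filter_by_score is a hypothetical name rather than a library function.

```python
def filter_by_score(scored_rows, left_end, right_end):
    """Keep (text, score) rows whose score lies in [left_end, right_end]."""
    return [
        (text, score)
        for text, score in scored_rows
        if left_end <= score <= right_end
    ]


# Made-up (text, score) pairs for illustration only.
rows = [("me encanta", 0.9), ("regular", 0.5), ("terrible", 0.1)]
positive = filter_by_score(rows, 0.5, 1.0)
```

A row scoring exactly 0.5 is kept because the lower bound is inclusive.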
Emotion vectors (Syuzhet)
sentiment_histogram_and_sum_r
ds.sentiment_histogram_and_sum_r(language, method, return_time_profile=False)
Per-emotion counts and score sums, using the syuzhet R package's emotion lexicon (NRC). Requires the R-bridge worker image.
- Layer: R-bridge.
- Underlying library: R syuzhet (NRC emotion lexicon).
- Base primitive: compute_matrix_nz_histogram_and_sum.
- Returns: pandas.DataFrame with columns roughly ['emotion', 'count', 'sum'], one row per emotion (Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, Trust, Negative, Positive), sorted by count descending. Here count is the number of tweets that scored non-zero on the emotion, and sum is the total score across the corpus.
| Parameter | Type | Description |
|---|---|---|
| language | str | Language code for syuzhet, e.g. 'english', 'spanish'. Must match a syuzhet-supported lexicon. |
| method | str | syuzhet sentiment method: 'syuzhet', 'bing', 'afinn', 'nrc'. For per-emotion output, use 'nrc'. |
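The count and sum columns described above can be illustrated with a small aggregation over a per-tweet score matrix. This is a plain-Python sketch of the documented semantics, not the compute_matrix_nz_histogram_and_sum primitive itself; the function name, the three-emotion subset, and the scores are assumptions.

```python
def nz_histogram_and_sum(score_rows, emotions):
    """Aggregate per-tweet emotion scores into (emotion, count, sum) rows."""
    totals = []
    for column, emotion in enumerate(emotions):
        values = [row[column] for row in score_rows]
        count = sum(1 for v in values if v != 0)  # tweets with a non-zero score
        totals.append((emotion, count, sum(values)))
    # Sort by count descending; ties fall back to the emotion name.
    return sorted(totals, key=lambda row: (-row[1], row[0]))


emotions = ["Anger", "Joy", "Trust"]
# One row per tweet, one column per emotion (made-up scores).
scores = [
    [0, 2, 1],
    [1, 1, 0],
    [0, 3, 0],
]
summary = nz_histogram_and_sum(scores, emotions)
```

Joy ranks first because all three tweets scored non-zero on it, even though the ordering is by count rather than by sum.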
Co-occurrence networks
Both methods return a tuple (edge_df, graph), where edge_df is a pandas.DataFrame of weighted edges and graph is an igraph.Graph with the edges and nodes already loaded. With return_time_profile=True, the return is (edge_df, graph, time_profile_df).
The output is deterministic across partition counts (edges are sorted by (source, target), nodes alphabetically inside compute_weighted_coonet).
hashtag_weighted_coonet
ds.hashtag_weighted_coonet(return_time_profile=False)
Hashtag co-occurrence: edges connect two hashtags that appeared together in the same tweet; the edge weight is the count of such tweets.
- Layer: pure Python (via coonet_algs).
- Underlying library: advertools (extraction) + igraph (graph object).
- Base primitive: compute_weighted_coonet.
- Returns: (pandas.DataFrame, igraph.Graph). Edge DataFrame columns: ['source', 'target', 'weight'].
Tutorial: 05 — Hashtag co-occurrence network.
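The edge-weight definition (number of tweets in which two hashtags appear together) and the (source, target) ordering can be sketched as follows. This is an illustrative plain-Python version, not compute_weighted_coonet; the function name and sample tweets are made up, and graph construction with igraph is left out.

```python
from collections import Counter
from itertools import combinations


def weighted_cooccurrence(tag_lists):
    """Build sorted (source, target, weight) edges from per-tweet tag lists."""
    weights = Counter()
    for tags in tag_lists:
        # Deduplicate and sort so each unordered pair has one canonical form
        # and a tweet contributes at most 1 to any edge.
        unique = sorted(set(tags))
        for source, target in combinations(unique, 2):
            weights[(source, target)] += 1
    # Sorting by (source, target) makes the edge order reproducible.
    return [(s, t, w) for (s, t), w in sorted(weights.items())]


tweets = [["#dask", "#python"], ["#python", "#dask", "#r"], ["#r"]]
edges = weighted_cooccurrence(tweets)
```

Because pairs are canonicalized and the final list is sorted, the result does not depend on how the tweets were split or ordered, which mirrors the determinism property stated for the real methods.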
mention_weighted_coonet
ds.mention_weighted_coonet(return_time_profile=False)
Mention co-occurrence: edges connect two @handles mentioned together in the same tweet.
- Layer: pure Python (via coonet_algs).
- Underlying library: advertools + igraph.
- Base primitive: compute_weighted_coonet.
- Returns: (pandas.DataFrame, igraph.Graph). Edge DataFrame columns: ['source', 'target', 'weight'].
Tutorial: 06 — Mention co-occurrence network.
TweetDataset utilities
These methods reshape or summarize the dataset; they don't run analytics.
tweet_count
ds.tweet_count(return_time_profile=False) -> int
Number of rows in the dataset. Triggers a Dask len(...) (which materializes partition lengths). With return_time_profile=True, returns (count, time_profile_df).
group_by_date
ds.group_by_date() -> pandas.DataFrame
Returns a pandas.DataFrame with columns ['Date', 'Count']: posts per calendar day, computed via a Dask groupby(date_column.dt.date).size().
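The per-day count amounts to grouping timestamps by their calendar date. A minimal standard-library sketch of that idea follows; it is not the Dask groupby used by the library, the helper name posts_per_day is hypothetical, and the result is sorted here only for readability (the documented method does not state an order).

```python
from collections import Counter
from datetime import datetime


def posts_per_day(timestamps):
    """Return sorted (date, count) pairs, one per calendar day."""
    # datetime.date() drops the time of day, so posts on the same day collapse.
    counts = Counter(ts.date() for ts in timestamps)
    return sorted(counts.items())


stamps = [
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 1, 18, 5),
    datetime(2024, 1, 2, 7, 0),
]
daily = posts_per_day(stamps)
```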
range_by_dates
ds.range_by_dates(start_date, end_date) -> TweetDataset
Returns a new TweetDataset filtered to rows whose date column lies in [start_date, end_date], repartitioned and persisted. The original ds is unchanged.
repartition
ds.repartition(num_partitions) -> None
Re-partitions the underlying Dask DataFrame in place. Mutates ds.
get_num_partitions
ds.get_num_partitions() -> int
Returns the current partition count.
create_index
ds.create_index() -> None
Mutates ds to add an integer index across all partitions (used for paging). Persists the result, so the cluster keeps the indexed dataset in memory.
Internal modules
These are stable internal modules that power the public surface. They are not the user-facing API; they're listed here for orientation if you're contributing or reading source.
| Module | What lives here |
|---|---|
| whistlerlib | Context (re-exported from whistlerlib.context). |
| whistlerlib.context | Context class. |
| whistlerlib.dataset | TweetDataset class, all analytic methods. |
| whistlerlib.dask.alt_python_algs | compute_* wrappers for the pure-Python algorithm family. |
| whistlerlib.dask.r_algs | compute_* wrappers for the R-bridged algorithm family. |
| whistlerlib.dask.coonet_algs | to_graph, compute_*_weighted_coonet. |
| whistlerlib.dask.base_algs | The four base Dask primitives (compute_vector_histogram, compute_vector_range, compute_matrix_nz_histogram_and_sum, compute_weighted_coonet). |
whistlerlib.clients, whistlerlib.config, whistlerlib.logger, whistlerlib.time_profile, and every funcs/ subpackage are private.
Roadmap
This page is a curated, hand-maintained catalog. Per-symbol generated pages (via pdoc) are planned: a CI workflow on tag pushes will run uvx pdoc -o website/docs/api/generated whistlerlib and commit the per-module reference alongside this catalog. Until then, the source is the ground truth for signatures.