API reference

The user-facing surface is two classes:

  • Context: the entry point. Connects to a Dask cluster and produces datasets.
  • TweetDataset: the analytic surface. Analytic methods return a pandas.DataFrame; the co-occurrence network methods return an (edge DataFrame, igraph.Graph) tuple.

For the architecture behind these methods, see the "Algorithm families" and "Context and datasets" pages.

At a glance

| Method | Layer | Underlying library | Returns |
| --- | --- | --- | --- |
| hashtag_histogram_alt_python | Python | advertools | top-k (tag, freq) DataFrame |
| hashtag_histogram_r | R-bridge | R tm + radvertools | top-k (tag, freq) DataFrame |
| mention_histogram_alt_python | Python | advertools | top-k (Mentions, Frequency) DataFrame |
| mention_histogram_r | R-bridge | R tm + radvertools | top-k (mentions, Freq) DataFrame |
| ngram_histogram_alt_python | Python | sklearn.CountVectorizer + NLTK stopwords | top-k (N_Tokens, Freq) DataFrame |
| ngram_histogram_r | R-bridge | R RWeka::NGramTokenizer | top-k (N_Tokens, Freq) DataFrame |
| sentiment_range_spanish_alt_python | Python | sentiment-analysis-spanish + NLTK Spanish stopwords | filtered (text, score) DataFrame |
| sentiment_histogram_and_sum_r | R-bridge | R syuzhet | per-emotion (emotion, count, sum) DataFrame |
| hashtag_weighted_coonet | Python | advertools (extraction) + igraph (graph) | (edge DataFrame, igraph.Graph) |
| mention_weighted_coonet | Python | advertools (extraction) + igraph (graph) | (edge DataFrame, igraph.Graph) |

Utility methods on TweetDataset (no analytics, just dataset shaping): tweet_count, group_by_date, range_by_dates, repartition, get_num_partitions, create_index.

Common parameters

Several analytic methods share the same flags. They behave identically across methods:

| Parameter | Type | Default | Meaning |
| --- | --- | --- | --- |
| k | int | (required) | Top-k cutoff. The result is sorted by frequency, descending, and truncated to the top k rows. |
| distributed_sorting | bool | False | When False, the top-k selection runs locally on the client after distributed aggregation (deterministic). When True, nlargest runs in the Dask graph; ties may be ordered non-deterministically (depends on partition count). |
| return_time_profile | bool | False | When True, returns a tuple (result, time_profile_df) where the profile DataFrame breaks down per-stage wall time with _dist / _local suffixes marking distributed vs client-local work. See Architecture. |
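To make the tie-ordering caveat concrete, here is a minimal pandas sketch of a client-local top-k selection with an explicit tie-break. It is an illustration under stated assumptions, not whistlerlib's implementation: the helper name top_k and the tie-break on the label column are hypothetical.

```python
import pandas as pd

def top_k(hist, k, value_col="freq", key_col="tag"):
    """Return the k most frequent rows; ties are broken by key_col ascending."""
    ordered = hist.sort_values([value_col, key_col], ascending=[False, True])
    return ordered.head(k).reset_index(drop=True)

# Two tags tie at 5 (made-up data); the explicit tie-break makes the
# result independent of the order in which rows arrive.
hist = pd.DataFrame({"tag": ["#b", "#a", "#c", "#d"], "freq": [5, 5, 9, 1]})
shuffled = hist.iloc[::-1]

top2 = top_k(hist, 2)
top2_shuffled = top_k(shuffled, 2)
```

Without a secondary sort key, which of the tied rows survives the cutoff can depend on row order, which is the kind of variation the distributed_sorting=True description warns about.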

Context

The entry point. Connects to a Dask cluster and serves as a factory for datasets.

Constructor

Context(dask_scheduler, dask_scheduler_host, dask_scheduler_port)

| Parameter | Type | Description |
| --- | --- | --- |
| dask_scheduler | str | The Dask scheduling mode; typically 'processes' for a remote cluster, 'threads' for a local thread pool. Forwarded to dask.config.set(scheduler=...). |
| dask_scheduler_host | str | Hostname or IP of the scheduler, e.g. 'localhost' for a local Compose cluster. |
| dask_scheduler_port | int | TCP port of the scheduler, typically 8786. |

from whistlerlib import Context

ctx = Context('processes', 'localhost', 8786)

Context.load_csv

ctx.load_csv(filen, meta, num_partitions=1) -> TweetDataset

Reads a CSV from the dataset repository and returns a partitioned TweetDataset.

| Parameter | Type | Description |
| --- | --- | --- |
| filen | str | Path to the CSV (relative to the dataset repository root). |
| meta | dict | Must contain 'column_mapping' with keys 'date_column' and 'text_column'. Optional 'file_encoding' (defaults to 'utf-8'). |
| num_partitions | int | Dask partition count. Default 1. |

The loader reads only the two named columns and forces the date column to tz-naive (Dask cannot compare tz-naive and tz-aware datetimes across partitions). See Context and datasets for the full contract.

ds = ctx.load_csv(
    filen='posts.csv',
    meta={
        'column_mapping': {'date_column': 'Date', 'text_column': 'text'},
        'file_encoding': 'utf-8',
    },
    num_partitions=8,
)
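The two-column read and the tz-naive normalization described above can be sketched with plain pandas. This is only an approximation of the loader's contract: it uses pandas rather than Dask, the CSV content is made up, and converting to UTC before dropping the timezone is an assumption about how the normalization is done.

```python
import io
import pandas as pd

# Toy CSV with an extra column and mixed UTC offsets (made-up data).
csv_text = (
    "Date,text,lang\n"
    "2024-01-01T10:00:00+00:00,hola #dask,es\n"
    "2024-01-02T12:30:00+02:00,hello @bob,en\n"
)
mapping = {"date_column": "Date", "text_column": "text"}

# Read only the two mapped columns; 'lang' is never loaded.
df = pd.read_csv(io.StringIO(csv_text), usecols=list(mapping.values()))

# Normalize to UTC, then drop the timezone so every value is tz-naive.
dates = pd.to_datetime(df[mapping["date_column"]], utc=True)
df[mapping["date_column"]] = dates.dt.tz_localize(None)
```

After this step every timestamp is comparable with every other one, which is the property the loader needs before filtering or grouping by date.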

TweetDataset analytics

All analytic methods are called on a TweetDataset instance produced by Context.load_csv(...) (or returned by TweetDataset.range_by_dates(...)). Every method honors return_time_profile.

Hashtag histograms

hashtag_histogram_alt_python

ds.hashtag_histogram_alt_python(k, distributed_sorting=False, return_time_profile=False)

Top-k hashtags in the dataset, pure-Python implementation.

  • Layer: pure Python (alt_python_algs).
  • Underlying library: advertools extract_hashtags.
  • Base primitive: compute_vector_histogram.
  • Returns: pandas.DataFrame with columns ['tag', 'freq'], sorted by freq descending, length ≤ k.

Tutorial: 01 — Quickstart hashtag histogram.
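For orientation, the shape of the computation (extract tags per post, count them, keep the top k) can be sketched with the standard library alone. The regular expression is a simplified stand-in for advertools' extractor, and the case folding and alphabetical tie-break are assumptions, so results may differ from the library's on real data.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")  # simplified pattern, not advertools' extractor

def hashtag_histogram(texts, k):
    """Count hashtags (case-insensitively) and return the top-k (tag, freq) pairs."""
    counts = Counter(
        tag.lower() for text in texts for tag in HASHTAG.findall(text)
    )
    # Sort by freq descending, then tag ascending, so ties are ordered predictably.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]

# Made-up posts for illustration.
posts = ["Loving #Dask and #pandas", "#dask scales", "no tags here", "#pandas #dask"]
top = hashtag_histogram(posts, 2)
```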

hashtag_histogram_r

ds.hashtag_histogram_r(k, distributed_sorting=False, return_time_profile=False)

Top-k hashtags, R-bridge implementation. Requires the albertogarob/whistlerlib worker image.

  • Layer: R-bridge (r_algs); subprocess to Rscript.
  • Underlying library: R tm + radvertools via the getMFHashtags.R script.
  • Base primitive: compute_vector_histogram.
  • Returns: pandas.DataFrame with columns ['tag', 'freq'], sorted by freq descending, length ≤ k. Same shape as _alt_python.

Mention histograms

mention_histogram_alt_python

ds.mention_histogram_alt_python(k, distributed_sorting=False, return_time_profile=False)

Top-k user mentions (@handle), pure-Python implementation.

  • Layer: pure Python.
  • Underlying library: advertools extract_mentions.
  • Base primitive: compute_vector_histogram.
  • Returns: pandas.DataFrame with columns ['Mentions', 'Frequency'], sorted by Frequency descending, length ≤ k.

Tutorial: 02 — Mention histogram.

mention_histogram_r

ds.mention_histogram_r(k, distributed_sorting=False, return_time_profile=False)

Top-k user mentions, R-bridge implementation.

  • Layer: R-bridge.
  • Underlying library: R tm + radvertools via the getMentions.R script.
  • Base primitive: compute_vector_histogram.
  • Returns: pandas.DataFrame with columns ['mentions', 'Freq']. Same row-shape as _alt_python but the column names are lowercased and abbreviated.

:::note Column-name divergence

mention_histogram_alt_python returns ['Mentions', 'Frequency'] while mention_histogram_r returns ['mentions', 'Freq']. Downstream code that needs to swap implementations should .rename(columns=...) after the call.

:::
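One way to apply the rename suggested in the note is a small normalizing helper. The helper name normalize_mention_columns is hypothetical and the rows are made up; only the column names come from this page.

```python
import pandas as pd

# Column names documented for each implementation.
R_TO_PYTHON = {"mentions": "Mentions", "Freq": "Frequency"}

def normalize_mention_columns(df):
    """Rename R-bridge columns to the _alt_python names; other frames pass through."""
    return df.rename(columns=R_TO_PYTHON)

# Stand-ins for the two result shapes (made-up rows).
r_result = pd.DataFrame({"mentions": ["@ana", "@bob"], "Freq": [7, 3]})
py_result = pd.DataFrame({"Mentions": ["@ana", "@bob"], "Frequency": [7, 3]})

normalized = normalize_mention_columns(r_result)
unchanged = normalize_mention_columns(py_result)
```

Because DataFrame.rename ignores keys that are not present, the same helper can be applied regardless of which implementation produced the frame.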

N-gram histograms

ngram_histogram_alt_python

ds.ngram_histogram_alt_python(n, k, lan, w, distributed_sorting=False, return_time_profile=False)

Top-k word or character n-grams, pure-Python implementation.

  • Layer: pure Python.
  • Underlying library: sklearn.feature_extraction.text.CountVectorizer + NLTK stopwords.
  • Base primitive: compute_vector_histogram.
  • Returns: pandas.DataFrame with columns ['N_Tokens', 'Freq'], sorted by Freq descending, length ≤ k.

| Parameter | Type | Description |
| --- | --- | --- |
| n | int | n-gram order (e.g. 1 for unigrams, 2 for bigrams). |
| k | int | Top-k cutoff. |
| lan | str | NLTK language code for the stopword list (e.g. 'english', 'spanish'). On first use, the stopwords corpus is downloaded automatically. |
| w | str | CountVectorizer's analyzer: 'word' for word n-grams, 'char' for character n-grams. |

Tutorial: 03 — N-gram histogram (bilingual).
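The roles of n, k, the stopword list, and the analyzer can be illustrated with a standard-library sketch. It is a stand-in, not the CountVectorizer pipeline: tokenization here is a plain whitespace split, the stopword set is passed in by hand instead of coming from NLTK, and the function name is hypothetical.

```python
from collections import Counter

def ngram_histogram(texts, n, k, stopwords=frozenset(), analyzer="word"):
    """Top-k n-grams over words (stopwords removed first) or raw characters."""
    counts = Counter()
    for text in texts:
        if analyzer == "word":
            units = [t for t in text.lower().split() if t not in stopwords]
            joiner = " "
        else:  # "char": stopwords do not apply to character n-grams
            units = list(text.lower())
            joiner = ""
        # Slide a window of length n over the units.
        for i in range(len(units) - n + 1):
            counts[joiner.join(units[i:i + n])] += 1
    # Sort by frequency descending, then n-gram ascending for stable ties.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]

# Made-up posts for illustration.
posts = ["the quick brown fox", "the quick red fox", "a quick brown dog"]
bigrams = ngram_histogram(posts, n=2, k=2, stopwords={"the", "a"})
char_bigrams = ngram_histogram(["aaa"], n=2, k=1, analyzer="char")
```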

ngram_histogram_r

ds.ngram_histogram_r(n, k, distributed_sorting=False, return_time_profile=False)

Top-k word n-grams, R-bridge implementation.

  • Layer: R-bridge.
  • Underlying library: R RWeka NGramTokenizer + tm.
  • Base primitive: compute_vector_histogram.
  • Returns: pandas.DataFrame with columns ['N_Tokens', 'Freq'], same shape as _alt_python.

Note: the R version does not expose lan or w parameters; it tokenizes words using RWeka's defaults.

Spanish sentiment range

sentiment_range_spanish_alt_python

ds.sentiment_range_spanish_alt_python(left_end, right_end, return_time_profile=False)

Returns all rows whose Spanish-sentiment score falls in [left_end, right_end]. Unlike the histograms, this is a per-row filter, not an aggregation.

  • Layer: pure Python.
  • Underlying library: sentiment-analysis-spanish (deep-learning Spanish polarity model, score in [0, 1]) + NLTK Spanish stopwords for text cleaning.
  • Base primitive: compute_vector_range.
  • Returns: pandas.DataFrame with columns ['text', 'score']. Order is not guaranteed.

| Parameter | Type | Description |
| --- | --- | --- |
| left_end | float | Inclusive lower bound (typically in [0, 1]). |
| right_end | float | Inclusive upper bound. |

Tutorial: 04 — Spanish sentiment.
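The per-row filter semantics (score each row, keep it when the score falls inside the closed interval) can be sketched as follows. The scores are hypothetical values standing in for the polarity model, and filter_by_score is an illustrative name, not part of whistlerlib.

```python
def filter_by_score(texts, score_fn, left_end, right_end):
    """Keep (text, score) pairs whose score lies in [left_end, right_end], both inclusive."""
    scored = ((text, score_fn(text)) for text in texts)
    return [(text, score) for text, score in scored if left_end <= score <= right_end]

# Hypothetical scores standing in for a polarity model with output in [0, 1].
toy_scores = {"me encanta": 0.9, "esta bien": 0.5, "horrible": 0.1}

positive = filter_by_score(toy_scores, toy_scores.get, 0.5, 1.0)
negative = filter_by_score(toy_scores, toy_scores.get, 0.0, 0.2)
```

Note that a row scoring exactly 0.5 is kept by the first call, since both bounds are inclusive.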

Emotion vectors (Syuzhet)

sentiment_histogram_and_sum_r

ds.sentiment_histogram_and_sum_r(language, method, return_time_profile=False)

Per-emotion counts and score sums, using the syuzhet R package's emotion lexicon (NRC). Requires the R-bridge worker image.

  • Layer: R-bridge.
  • Underlying library: R syuzhet (NRC emotion lexicon).
  • Base primitive: compute_matrix_nz_histogram_and_sum.
  • Returns: pandas.DataFrame with columns roughly ['emotion', 'count', 'sum'], one row per emotion (Anger, Anticipation, Disgust, Fear, Joy, Sadness, Surprise, Trust, Negative, Positive), sorted by count descending. count is the number of tweets that scored non-zero on the emotion; sum is the total score across the corpus.

| Parameter | Type | Description |
| --- | --- | --- |
| language | str | Language code for syuzhet, e.g. 'english', 'spanish'. Must match a syuzhet-supported lexicon. |
| method | str | syuzhet sentiment method: 'syuzhet', 'bing', 'afinn', 'nrc'. For per-emotion output, use 'nrc'. |
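The difference between count and sum is easiest to see on a small score matrix with one row per tweet and one column per emotion. The sketch below is plain Python with made-up scores and only three emotions; the function name is illustrative and is not the base primitive itself.

```python
def nz_histogram_and_sum(rows, columns):
    """For each column, count rows with a non-zero score and sum all scores."""
    stats = []
    for j, name in enumerate(columns):
        values = [row[j] for row in rows]
        stats.append((name, sum(1 for v in values if v != 0), sum(values)))
    # Sort by count descending; ties keep the original column order (sort is stable).
    return sorted(stats, key=lambda s: -s[1])

emotions = ["anger", "joy", "trust"]
scores = [  # one row per tweet (made-up values)
    [0, 2, 1],
    [1, 3, 0],
    [0, 1, 0],
]
result = nz_histogram_and_sum(scores, emotions)
```

Here joy appears in all three tweets (count 3) with a total score of 6, while anger and trust each appear in one tweet.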

Co-occurrence networks

Both methods return a tuple (edge_df, graph), where edge_df is a pandas.DataFrame of weighted edges and graph is an igraph.Graph with the edges and nodes already loaded. With return_time_profile=True, the return is (edge_df, graph, time_profile_df).
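Since the return shape changes with return_time_profile, calling code that toggles the flag may find a small unpacking helper convenient. The helper below is a hypothetical convenience, not part of whistlerlib; placeholder strings stand in for the DataFrames and the graph.

```python
def split_profile(returned, return_time_profile):
    """Separate the result from the optional trailing time profile."""
    if not return_time_profile:
        return returned, None
    *result, profile = returned
    # A single result (histograms) is unwrapped; several (edge_df, graph) stay a tuple.
    return (result[0] if len(result) == 1 else tuple(result)), profile

# Histogram-style return: (result, time_profile_df).
hist, prof = split_profile(("hist_df", "profile_df"), True)

# Co-occurrence return: (edge_df, graph, time_profile_df).
(edges, graph), prof2 = split_profile(("edge_df", "graph", "profile_df"), True)

# Flag off: the value passes through and the profile is None.
plain, no_prof = split_profile(("edge_df", "graph"), False)
```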

The output is deterministic across partition counts: inside compute_weighted_coonet, edges are sorted by (source, target) and nodes are sorted alphabetically.

hashtag_weighted_coonet

ds.hashtag_weighted_coonet(return_time_profile=False)

Hashtag co-occurrence: edges connect two hashtags that appeared together in the same tweet; the edge weight is the count of such tweets.

  • Layer: pure Python (via coonet_algs).
  • Underlying library: advertools (extraction) + igraph (graph object).
  • Base primitive: compute_weighted_coonet.
  • Returns: (pandas.DataFrame, igraph.Graph). Edge DataFrame columns: ['source', 'target', 'weight'].

Tutorial: 05 — Hashtag co-occurrence network.
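The edge definition above (one unit of weight per tweet in which both tags occur) can be sketched with the standard library. The regular expression is a simplified stand-in for advertools' extractor, and lowercasing and de-duplicating tags within a tweet are assumptions; the function name is illustrative.

```python
import re
from collections import Counter
from itertools import combinations

HASHTAG = re.compile(r"#\w+")  # simplified pattern, not advertools' extractor

def weighted_cooccurrence(texts):
    """Return sorted (source, target, weight) edges; weight = tweets containing both tags."""
    weights = Counter()
    for text in texts:
        tags = sorted({tag.lower() for tag in HASHTAG.findall(text)})
        # Each unordered pair is counted once per tweet, with source < target.
        weights.update(combinations(tags, 2))
    # Sorting by (source, target) makes the output independent of input order.
    return sorted((s, t, w) for (s, t), w in weights.items())

# Made-up posts for illustration.
posts = ["#a #b #c", "#b #a", "#c only", "#a #a #b"]
edges = weighted_cooccurrence(posts)
```

Triples in this form could then be loaded into a graph library, for instance with igraph.Graph.TupleList(edges, weights=True); whether whistlerlib builds its graph that way is not stated here.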

mention_weighted_coonet

ds.mention_weighted_coonet(return_time_profile=False)

Mention co-occurrence: edges connect two @handles mentioned together in the same tweet.

  • Layer: pure Python (via coonet_algs).
  • Underlying library: advertools + igraph.
  • Base primitive: compute_weighted_coonet.
  • Returns: (pandas.DataFrame, igraph.Graph). Edge DataFrame columns: ['source', 'target', 'weight'].

Tutorial: 06 — Mention co-occurrence network.

TweetDataset utilities

These methods reshape or summarize the dataset; they don't run analytics.

tweet_count

ds.tweet_count(return_time_profile=False) -> int

Number of rows in the dataset. Triggers a Dask len(...) (which materializes partition lengths). With return_time_profile=True, returns (count, time_profile_df).

group_by_date

ds.group_by_date() -> pandas.DataFrame

Returns a pandas.DataFrame with columns ['Date', 'Count']: posts per calendar day, computed via a Dask groupby(date_column.dt.date).size().
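The same grouping can be sketched on a plain pandas DataFrame. The data is made up and the example uses pandas rather than Dask, so it shows the shape of the result, not the distributed execution.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01 09:00", "2024-01-01 18:30", "2024-01-02 07:15"]),
    "text": ["a", "b", "c"],
})

# Group by the calendar day of each timestamp and count rows per day.
per_day = (
    df.groupby(df["Date"].dt.date)
      .size()
      .reset_index(name="Count")
)
```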

range_by_dates

ds.range_by_dates(start_date, end_date) -> TweetDataset

Returns a new TweetDataset filtered to rows whose date column lies in [start_date, end_date], repartitioned and persisted. The original ds is unchanged.
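A closed-interval date filter has one subtlety worth seeing on a small pandas example: a bare date such as 2024-01-02 parses to midnight, so rows later that day fall outside the interval. Whether range_by_dates treats end_date as a midnight instant or as a whole day is not stated here, so the second filter below is only one possible workaround.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-01 09:00", "2024-01-02 00:00", "2024-01-02 07:15"]),
    "text": ["a", "b", "c"],
})

start, end = pd.Timestamp("2024-01-01"), pd.Timestamp("2024-01-02")

# Series.between is inclusive on both ends by default.
in_range = df[df["Date"].between(start, end)]

# To cover the whole final day, compare against the start of the next day instead.
whole_days = df[(df["Date"] >= start) & (df["Date"] < end + pd.Timedelta(days=1))]
```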

repartition

ds.repartition(num_partitions) -> None

Re-partitions the underlying Dask DataFrame in place. Mutates ds.

get_num_partitions

ds.get_num_partitions() -> int

Returns the current partition count.

create_index

ds.create_index() -> None

Mutates ds to add an integer index across all partitions (used for paging). Persists the result, so the cluster keeps the indexed dataset in memory.
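One common way to obtain a dataset-wide integer index over partitioned data is to offset each partition's local row numbers by the total length of the partitions before it. The sketch below shows that idea in plain Python; it is an assumption about the general technique, not a description of how create_index is implemented, and both function names are hypothetical.

```python
from itertools import accumulate

def partition_offsets(partition_lengths):
    """Starting global row number of each partition (exclusive prefix sum)."""
    return [0] + list(accumulate(partition_lengths))[:-1]

def global_index(partition_lengths):
    """One range of consecutive integers per partition, unique across the dataset."""
    return [
        range(start, start + length)
        for start, length in zip(partition_offsets(partition_lengths), partition_lengths)
    ]

# Three partitions holding 3, 2 and 4 rows.
ranges = global_index([3, 2, 4])
```

With such an index, a page of rows can be addressed by a contiguous integer interval regardless of which partition holds it.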

Internal modules

These are stable internal modules that power the public surface. They are not the user-facing API; they're listed here for orientation if you're contributing or reading source.

| Module | What lives here |
| --- | --- |
| whistlerlib | Context (re-exported from whistlerlib.context). |
| whistlerlib.context | Context class. |
| whistlerlib.dataset | TweetDataset class, all analytic methods. |
| whistlerlib.dask.alt_python_algs | compute_* wrappers for the pure-Python algorithm family. |
| whistlerlib.dask.r_algs | compute_* wrappers for the R-bridged algorithm family. |
| whistlerlib.dask.coonet_algs | to_graph, compute_*_weighted_coonet. |
| whistlerlib.dask.base_algs | The four base Dask primitives (compute_vector_histogram, compute_vector_range, compute_matrix_nz_histogram_and_sum, compute_weighted_coonet). |

whistlerlib.clients, whistlerlib.config, whistlerlib.logger, whistlerlib.time_profile, and every funcs/ subpackage are private.

Roadmap

This page is a curated, hand-maintained catalog. Per-symbol generated pages (via pdoc) are planned: a CI workflow on tag pushes will run uvx pdoc -o website/docs/api/generated whistlerlib and commit the per-module reference alongside this catalog. Until then, the source is the ground truth for signatures.