01. Quickstart: top hashtags
The minimum-viable Whistlerlib workflow. Connects to a running Dask cluster, loads a CSV of tweets, computes the top-k hashtags as a distributed histogram.
What you'll see
Loaded 10 tweets.
Top 5 hashtags:
tag freq
#ai 5
#python 4
#data 3
#ml 2
#climate 1
The code
The tutorial ships ten inline rows of fake tweets so it runs without any external data:
_ROWS = [
('2022-01-01T00:00:00', 'morning model release #ai #python'),
('2022-01-01T01:00:00', 'dataset paper accepted #ai #data'),
('2022-01-01T02:00:00', 'climate insights from satellites #ai #climate'),
('2022-01-01T03:00:00', 'training loop in python #python #ml'),
# ...5 more rows...
('2022-01-01T09:00:00', 'jupyter notebook collection'),
]
These rows get written to a tempfile (_write_csv()) so the cluster's workers can read them through the Compose bind-mount on host /tmp. The analytical work itself is six lines:
from whistlerlib import Context
ctx = Context('processes', 'localhost', 8786)
ds = ctx.load_csv(
filen=csv_path,
meta={
'column_mapping': {'date_column': 'Date', 'text_column': 'text'},
'file_encoding': 'utf-8',
},
num_partitions=2,
)
print(f'Loaded {ds.tweet_count()} tweets.')
histogram = ds.hashtag_histogram_alt_python(k=5)
print(histogram.to_string(index=False))
Context(...) opens a Dask client against the scheduler exposed by the master service in docker/docker-compose.yml. load_csv(...) wraps dask.dataframe.read_csv, reads only the two columns named in column_mapping, and partitions the result into two Dask partitions. hashtag_histogram_alt_python(k=5) ships a map_partitions closure to each worker (using advertools to extract hashtags), the scheduler merges the partial frequency tables, and the top-5 by frequency is returned as a pandas DataFrame.
The full file (including the tempfile setup and CLI shim) is at
examples/01-quickstart-hashtag-histogram/example.py.
How it works
The three calls above form the canonical Whistlerlib pipeline: Context is the entry point, load_csv returns a TweetDataset backed by a partitioned Dask DataFrame, and the analytic method (hashtag_histogram_alt_python) is what actually does the distributed work. Every other tutorial in this series follows the same skeleton with a different analytic.
Run it
# From examples/01-quickstart-hashtag-histogram/, bring up a local Dask cluster, run the example, tear it down.
docker compose -f ../../docker/docker-compose.yml up -d
python example.py
docker compose -f ../../docker/docker-compose.yml down