
Quickstart

The minimum-viable Whistlerlib workflow: connect to a running Dask cluster, load a CSV of tweets, and compute the top-k hashtags as a distributed histogram.

What you'll see

Loaded 10 tweets.
Top 5 hashtags:
tag freq
#cdmx 3
#política 2
#méxico 2
#noticias 2
#ciencia 1

How it works

  1. Context('processes', host, port) opens a Dask client against the scheduler exposed by the master service in docker/docker-compose.yml.
  2. load_csv(...) wraps dask.dataframe.read_csv and returns a TweetDataset over a Dask DataFrame partitioned across the cluster's workers.
  3. hashtag_histogram_alt_python(k=5) ships a map_partitions closure to each worker. Each worker runs advertools-style hashtag extraction on its slice, the scheduler merges the partial frequency tables, and the top 5 hashtags by frequency are returned as a pandas DataFrame.
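
The partition-then-merge pattern in step 3 can be sketched without a cluster. The snippet below is a plain-Python illustration, not Whistlerlib's implementation: the function names (partition_histogram, merge_histograms, top_k_hashtags), the simple `#\w+` regex, and the lowercase normalization are all assumptions made for the example, and in-memory lists stand in for Dask partitions.

```python
import re
from collections import Counter
from functools import reduce

# Assumption: a simple regex stands in for advertools-style extraction.
HASHTAG_RE = re.compile(r"#\w+")


def partition_histogram(texts):
    """Count hashtags in one partition (the per-worker step)."""
    counts = Counter()
    for text in texts:
        # Lowercasing is an illustrative normalization choice.
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts


def merge_histograms(partials):
    """Sum the partial frequency tables (the merge step)."""
    return reduce(lambda a, b: a + b, partials, Counter())


def top_k_hashtags(partitions, k):
    """Return the k most frequent (tag, freq) pairs, ties broken alphabetically."""
    merged = merge_histograms(partition_histogram(p) for p in partitions)
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Two hypothetical partitions, as if the CSV had been split across two workers.
partitions = [
    ["Lluvia en #CDMX hoy", "#noticias de #política"],
    ["Tráfico en #cdmx", "#ciencia y #cdmx"],
]
top = top_k_hashtags(partitions, k=2)
print(top)
```

Because each partial table is an independent Counter, the per-partition step needs no shared state, which is what makes it a natural fit for map_partitions; only the small frequency tables have to travel for the merge.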

Run it

docker compose -f ../../docker/docker-compose.yml up -d
python example.py
docker compose -f ../../docker/docker-compose.yml down

Or via pytest:

uv run pytest -m docker tests/integration/test_01_quickstart_hashtag_histogram.py