Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.2.0] - 2026-05-23
First public release of Whistlerlib as a packaged Python library. Brings the
codebase to current Python (3.11+), Dask (2026.3.x), and pandas (2.x); adds a
comprehensive test suite, runnable examples, a portable docs tree, and the
albertogarob/whistlerlib Docker image. Upgrade notes from 0.1.0:
Migration guide.
Available on:
- GitHub: v0.2.0
- Docker Hub: albertogarob/whistlerlib with tags 0.2.0, 0.2, latest (multi-arch: linux/amd64 + linux/arm64)
- PyPI: not yet published.
Added
- Seven runnable examples under examples/ (01-quickstart-hashtag-histogram, 02-mention-histogram, 03-ngram-histogram-bilingual, 04-sentiment-spanish [slow], 05-hashtag-coonet, 06-mention-coonet, 07-r-bridge-mfhashtags). Each example dir contains example.py + README.md. The matching docker-backed integration test for each example lives under tests/integration/test_<slug>.py and shares a session-scoped Docker-cluster fixture.
- tests/integration/conftest.py with a session-scoped whistlerlib_swarm fixture that brings up a real local Docker cluster (locally-built albertogarob/whistlerlib:dev for both scheduler and workers) via Compose, waits for scheduler readiness, yields (host, port), and tears down on session exit. Auto-skips when Docker is unavailable.
- New pytest marker docker, deselected from default pytest run; opt in with -m docker.
- CI docker-examples job (runs on workflow_dispatch and on push to main) builds the worker image and runs the integration suite end-to-end.
- albertogarob/whistlerlib Docker image: built with uv in a multi-stage python:3.11-bookworm base, shipping the whistlerlib package + R + the full R library set (r-cran-tm, r-cran-slam, r-cran-snowballc, r-cran-reshape2, r-cran-dplyr, r-cran-tidyr, r-cran-nlp, r-cran-stringr, r-cran-remotes, r-cran-vctrs, r-cran-arrow) + RWeka / syuzhet / arrow from Posit Package Manager + radvertools from upstream GitHub. This is the only custom image whistlerlib publishes; both the scheduler ("master") and worker services in docker/docker-compose.yml and docker/stack.yml use this image, with the master service overriding the ENTRYPOINT to dask-scheduler.
- docker/Dockerfile.worker, docker/docker-compose.yml (single-host development), docker/stack.yml (Swarm production).
- docker/smoke.py, runs hashtag / mention / coonet + (when R env vars are set) R-bridge calls against a running scheduler. Baked into the worker image at /app/smoke.py.
- .github/workflows/docker-publish.yml, multi-arch (linux/amd64, linux/arm64) buildx workflow for the worker image. Publishes on v* git tags; workflow_dispatch allows manual build-only runs (opt-in push input).
- .dockerignore for lean build context.
- Comprehensive unit test suite under tests/unit/: 119 fast, isolated tests covering the pure-Python funcs/ (cleanText, getHashtags, getMentions, getNgrams, getSentimentScore), the orchestration layer (Context, TweetDataset, config, DatasetRepositoryClient), and the four dask/*/algs.py dispatchers. No Dask cluster, no real models, no network; full run in ~12s.
- Unit-test coverage gate: fail_under = 80 in [tool.coverage.report]. Current measured coverage: 97% (every algs.py and funcs/ module at 100%; dataset.py at 89%, time_profile.py at 95%).
- CI runs pytest tests/unit/ --cov ... --cov-report=html and uploads the HTML report as a build artifact.
- Portable docs tree under docs/ (intro, install/pip, install/docker, concepts/architecture, concepts/context-and-datasets, concepts/algorithm-families, tutorials/, api/ stub, migration/from-0.1.0.md). Inline Mermaid diagrams; no binary asset pipeline. Renders on GitHub today; lifts cleanly into a Docusaurus 3 site later.
- CI doc job: runs scripts/sync_tutorials.py --check (fails on drift between examples/*/README.md and docs/tutorials/) and scripts/check_doc_links.py (verifies every relative link in docs/ resolves). Also enforces the no-em-dash style rule across *.py / *.md / *.yml / *.toml files.
- PEP 621 pyproject.toml with the hatchling build backend.
- src/whistlerlib/ package layout.
- Python 3.11+ requirement declared via requires-python and .python-version.
- CHANGELOG.md and docs/migration/from-0.1.0.md.
- Test suite ported into tests/ (legacy LocalCluster-backed suite under tests/test_*.py, modern unit + docker integration suites under tests/unit/ and tests/integration/).
- pytest-timeout (30s per-test cap) and pytest-cov added to the tests extra.
- emoji>=2 to runtime dependencies (was a transitive untracked need of cleanText).
- Session-scoped LocalCluster test fixture and synthetic 10-row Spanish / English tweet CSV; tests no longer need an external Dask scheduler or proprietary CSVs.
- Bundled NLTK stopwords corpus under tests/fixtures/nltk_data/ so the suite runs offline.
- r_required skip marker and slow marker so contributors without R (or who don't want the 30s+ sentiment-model load) get a clean green default run.
- ruff (lint) configured in pyproject.toml and wired into CI.
- GitHub Actions workflow (.github/workflows/ci.yml) running uv lock --locked + uv sync + ruff + pytest on Python 3.11 and 3.12, plus the docs job and the docker-examples job.
- Citation block in README.md referencing the project's published paper in Multimedia Tools and Applications (DOI 10.1007/s11042-024-19827-z).
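The integration fixture described above auto-skips when Docker is unavailable and waits for scheduler readiness before yielding (host, port). The following is a minimal, standard-library sketch of those two checks, not the code in tests/integration/conftest.py: the helper names docker_available and wait_for_port, the timeouts, and the use of a plain TCP connect as the readiness signal are all assumptions made for illustration.

```python
import shutil
import socket
import subprocess
import time


def docker_available() -> bool:
    """Return True only if a docker CLI is on PATH and its daemon answers."""
    if shutil.which("docker") is None:
        return False
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 0.5) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect is treated as "the scheduler is listening".
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Under these assumptions, a session-scoped fixture could skip the whole session when docker_available() returns False, start the Compose services, and fail fast if wait_for_port() returns False for the scheduler address.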
Changed
- Package source relocated from whistlerlib/ to src/whistlerlib/ (via git mv to preserve history).
- Dependency floors raised to a Dask 2026.x-compatible set: dask>=2024.1, pandas>=2.2, numpy>=1.26,<3, scikit-learn>=1.4.
- whistlerlib/config/config.py: removed module-level assert on WHISTLERLIB_R_* env vars (broke import whistlerlib for non-R workflows). Validation moved to call-time at the R-bridge sites.
- whistlerlib/dask/alt_python_algs/algs.py: nltk.download('stopwords') now skipped when the corpus is already cached (nltk.data.find check first). Important for offline / pre-baked Docker workers.
- whistlerlib/dask/alt_python_algs/funcs/cleanText.py: migrated to emoji.replace_emoji() (emoji 2.0+ API; get_emoji_regexp() was removed upstream).
- whistlerlib/dask/alt_python_algs/funcs/getNgrams.py: migrated to CountVectorizer.get_feature_names_out() (sklearn 1.0+ rename).
- whistlerlib/dask/alt_python_algs/funcs/getHashtags.py: rewritten with rename_axis('tag').reset_index(name='freq') so the result is robust across pandas versions (modern value_counts() returns a 'count'-named Series).
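The config.py change above replaces an import-time assert with validation at the point of use. The sketch below illustrates that pattern in general terms and is not whistlerlib's code: require_env, run_r_algorithm, and the variable name WHISTLERLIB_R_EXAMPLE are hypothetical stand-ins, since the changelog only gives the WHISTLERLIB_R_* prefix.

```python
import os

# Hypothetical variable name for illustration; the real names are not listed here.
R_VARS = ("WHISTLERLIB_R_EXAMPLE",)


def require_env(names, environ=None):
    """Return the named variables as a dict, raising if any is unset or empty."""
    env = os.environ if environ is None else environ
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(
            "R bridge requires these environment variables: " + ", ".join(missing)
        )
    return {name: env[name] for name in names}


def run_r_algorithm(environ=None):
    """Hypothetical R-bridge entry point: validation happens here, not at import."""
    settings = require_env(R_VARS, environ)
    # A real implementation would launch the R subprocess using these settings.
    return settings
```

Because nothing is checked at module import, a pure-Python workflow never touches the R variables; only a call into the R bridge raises when they are missing.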
Removed
- Pinned requirements*.txt files (replaced by pyproject.toml dependencies + extras).
- dask_sql dependency and the SQL surface it backed (TweetDataset.run_query(), Context.dask_sql_context, TweetDataset.dask_sql_context / dask_sql_tablename / query_result / query). dask_sql==2024.5.0 is incompatible with dask>=2025 because Dask folded dask_expr into the main namespace, and upstream dask_sql is unmaintained. See Migration guide for the workaround.
[0.1.0] - 2024-05-01
Initial public commit of Whistlerlib on GitHub (observatoriogeo/whistlerlib, commit 75565fd). Snapshot of the library at the time the public repository was opened. Not packaged: consumed by cloning the repo and adding whistlerlib/ to PYTHONPATH.
Added
- Full library source under whistlerlib/:
  - Context and TweetDataset (the entry classes).
  - whistlerlib.dask.alt_python_algs (pure-Python implementations of hashtag, mention, n-gram, sentiment, and co-occurrence-network analytics).
  - whistlerlib.dask.r_algs (R-subprocess implementations of the same analytics via the tm, RWeka, syuzhet, and radvertools R packages).
  - whistlerlib.dask.base_algs (the four distributed primitives: compute_vector_histogram, compute_vector_range, compute_matrix_nz_histogram_and_sum, compute_weighted_coonet).
  - whistlerlib.dask.coonet_algs (igraph-based co-occurrence-network builders).
  - whistlerlib.time_profile (per-stage timing breakdowns threaded through every base primitive).
- README.md with library overview, intended use cases, and basic-usage code sample.
- LICENSE (GPL-3.0-or-later).
- .gitignore.