Contributing

Thank you for your interest in contributing to the mmh3 project. We appreciate your support and look forward to your contributions.

Please read the README for an overview of the mmh3 project, and follow our Code of Conduct (the ACM Code of Ethics and Professional Conduct).

Submitting issues

We welcome contributions of all kinds, including bug reports and feature suggestions submitted through the issue tracker.

Before creating a new issue, please check the Known Issues section in the README to see if the problem has already been noted.

Project structure

As of version 5.0.0-dev, the project layout is structured as follows:

  • src/mmh3

    • mmh3module.c: the main file that serves as the interface between Python and the MurmurHash3 C implementations.

    • murmurhash3.c: implementations of the MurmurHash3 family. Auto-generated from Austin Appleby’s original code. DO NOT edit this file manually. See the README in the util directory for details.

    • murmurhash3.h: headers and macros for MurmurHash3. Auto-generated by util/refresh.py. DO NOT edit this file manually.

    • hashlib.h: taken from CPython’s code base.

  • util

    • refresh.py: script that generates src/mmh3/murmurhash3.c and src/mmh3/murmurhash3.h from the original MurmurHash3 C++ code. Edit this script to change the contents of those files.

  • benchmark

    • benchmark.py: script to run benchmarks.

    • plot_graph.py: script to plot benchmark results.

  • docs: project documentation directory

  • .github/workflows: GitHub Actions workflows

Project setup

Clone the repository:

git clone https://github.com/hajimes/mmh3.git

This project uses tox to automate testing and other tasks. You can install tox by running:

pipx install tox

In addition, npx (included with npm >= 5.2.0) is required within the tox environments to run linters.

Testing and linting

Before submitting your changes, make sure to run the project’s tests to ensure everything is working as expected.

To run all tests, use the following command:

tox

During development, you can run a single tox environment by passing its name with -e. For example, to run the tests on Python 3.12 only, use:

tox -e py312

For type checking, run:

tox -e type

To run linters with automated formatting, use:

tox -e lint

(Optional) Testing on s390x

If your changes could introduce endianness issues, you may want to test locally on s390x, the only big-endian platform officially supported by Python.

“Emulating a big-endian s390x with QEMU” by Simon Willison is a good introduction to the Docker/QEMU settings needed to emulate s390x.

If the above does not work, you may also want to try the following:

docker run --rm --privileged tonistiigi/binfmt --install all
docker buildx create --name mybuilder --use
docker run -it multiarch/ubuntu-core:s390x-focal /bin/bash
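Once inside the emulated container, it can help to confirm that the environment really is big-endian before running the tests. The following is a minimal sketch, not part of the project's test suite; the sample bytes are arbitrary and chosen only for illustration. It prints the host byte order and shows how the same four bytes decode to different integers depending on the order in which they are read:

```python
import struct
import sys

# Four arbitrary bytes, as they might appear in an input buffer.
block = bytes([0x01, 0x02, 0x03, 0x04])

# "=I" reads an unsigned 32-bit integer in the host's native byte order.
native = struct.unpack("=I", block)[0]

# "<I" and ">I" force little-endian and big-endian reads on any host.
little = struct.unpack("<I", block)[0]
big = struct.unpack(">I", block)[0]

# On x86-64 this reports "little"; on s390x it should report "big".
print("byte order:", sys.byteorder)
print("native:", hex(native), "little:", hex(little), "big:", hex(big))
```

If the native value matches the big-endian value, the container is emulating a big-endian host, and code that reads multi-byte blocks in native order will be exercised accordingly.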

Pull request

Once you’ve pushed your changes to your fork, you can create a pull request (PR) on the main project repository. Please provide a clear and detailed description of your changes in the PR, and reference any related issues.

util directory

Algorithm implementations used by the mmh3 module

The util directory contains the tooling that generates the project’s C files from the SMHasher C++ project by Austin Appleby.

The layout of this subproject directory loosely follows that of CPython’s hashlib implementation.

Updating mmh3 core C code

Run tox -e build_cfiles. This will fetch Appleby’s original SMHasher project as a git submodule and then generate PEP 7-compliant C code from the original project.

To make further changes, add transformation code to the refresh.py script instead of editing the murmurhash3.* files manually. Then run tox -e build_cfiles again to regenerate the murmurhash3.* files.

Local files

  1. ./util/README.md

  2. ./util/refresh.py

  3. ./util/FILE_HEADER

Generated files

  1. ./src/mmh3/murmurhash3.c

  2. ./src/mmh3/murmurhash3.h

Benchmarking

To run benchmarks locally, try the following command:

tox -e benchmark -- -o OUTPUT_FILE \
            --test-hash HASH_NAME --test-buffer-size-max HASH_SIZE

where OUTPUT_FILE is the name of the output file (JSON-formatted), HASH_NAME is the name of the hash, and HASH_SIZE is the maximum buffer size to be tested, in bytes.

For example,

mkdir -p _results
tox -e benchmark -- -o _results/mmh3_128.json \
            --test-hash mmh3_128 --test-buffer-size-max 262144

As of version 4.2.0, the following hash function identifiers are available for benchmarking: mmh3_32, mmh3_128, xxh_32, xxh_64, xxh3_64, xxh3_128, pymmh3_32, pymmh3_128, md5, and sha1.
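To benchmark every identifier in one go, the command line shown earlier can be generated per hash. The sketch below only builds and prints the commands; the helper name benchmark_command, the _results directory, and the default maximum buffer size are illustrative choices taken from the example above, not part of the project:

```python
import shlex

# Identifiers listed above as available for benchmarking.
HASH_NAMES = [
    "mmh3_32", "mmh3_128", "xxh_32", "xxh_64", "xxh3_64",
    "xxh3_128", "pymmh3_32", "pymmh3_128", "md5", "sha1",
]


def benchmark_command(hash_name, result_dir="_results", max_size=262144):
    """Build the tox argument list for one hash (hypothetical helper)."""
    return [
        "tox", "-e", "benchmark", "--",
        "-o", f"{result_dir}/{hash_name}.json",
        "--test-hash", hash_name,
        "--test-buffer-size-max", str(max_size),
    ]


# Print one shell-ready command per identifier.
for name in HASH_NAMES:
    print(shlex.join(benchmark_command(name)))
```

Each list can also be passed to subprocess.run(cmd, check=True) to execute the benchmarks sequentially, provided the output directory already exists.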

The owner of the repository can run the benchmark on GitHub Actions by using the workflow defined in .github/workflows/benchmark.yml.

After obtaining the benchmark results, you can plot graphs with plot_graph.py. The following is an example of how to run the script:

tox -e plot -- --output-dir docs/_static RESULT_DIR/*.json

where RESULT_DIR is the directory containing the benchmark results. The names of the JSON files should follow the format HASH_IDENTIFIER.json, e.g., mmh3_128.json.
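Before plotting, you may want to check which result files in a directory follow the naming convention above. The sketch below assumes the hash identifier is simply the file name without its .json extension; plot_graph.py itself may interpret names differently, and the helper hash_identifiers is hypothetical:

```python
import tempfile
from pathlib import Path


def hash_identifiers(result_dir):
    """Return the sorted file stems of all *.json files in result_dir."""
    return sorted(path.stem for path in Path(result_dir).glob("*.json"))


# Demonstration with a throwaway directory standing in for RESULT_DIR.
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("mmh3_128.json", "md5.json", "notes.txt"):
        (Path(tmp) / file_name).write_text("{}")
    found = hash_identifiers(tmp)

print(found)
```

Files without a .json extension, such as notes.txt in the demonstration, are ignored.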

Documentation

Project documentation files are mainly written in Markdown and are located in the docs directory. The documentation is automatically built and hosted on Read the Docs.

To build the documentation locally, use the following command:

tox -e docs

To view the built documentation, open docs/_build/html/index.html in your browser.