Contributing Guidelines
Thank you for choosing to contribute to AgML!
Contributing Data
If you've found (or already have) a new dataset and you want to contribute it to AgML, the instructions below will help you format and contribute the data to the AgML standard.
Dataset Formats
Currently, we have image classification, object detection, and semantic segmentation datasets available in AgML. Each source dataset is converted to a standard annotation format, namely the following:
- Image Classification: Image-To-Label-Number
- Object Detection: COCO JSON
- Semantic Segmentation: Dense Pixel-Wise
Image Classification
Image classification datasets are organized in the following directory tree:
<dataset name>
├── <label 1>
│   ├── image1.png
│   ├── image2.png
│   └── image3.png
└── <label 2>
    ├── image1.png
    ├── image2.png
    └── image3.png
The AgMLDataLoader generates a mapping between each of the label names ("label 1", "label 2", etc.) and a numerical value.
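As a rough, hypothetical sketch of that idea (this is not the loader's actual internals, and the exact numbering scheme may differ), the mapping can be thought of as assigning an index to each class folder name:

import os

def build_label_mapping(dataset_root):
    # Each subdirectory of the dataset root is one class; assign each
    # class name a number based on its position in the sorted name list.
    label_names = sorted(
        d for d in os.listdir(dataset_root)
        if os.path.isdir(os.path.join(dataset_root, d))
    )
    return {name: idx for idx, name in enumerate(label_names)}

# For the tree above, this would produce something like
# {'<label 1>': 0, '<label 2>': 1}.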
Object Detection
Object detection datasets are constructed using COCO JSON formatting. For a general overview, see https://cocodataset.org/#format-data. Another good resource is https://docs.aws.amazon.com/rekognition/latest/customlabels-dg/cd-transform-coco.html. Once you have the images and the bounding box annotations, creating the annotation file involves generating a dictionary with four keys:
- images: A list of dictionaries, one per image, each containing:
  - The image file name (without the parent directory!) in file_name.
  - The image ID (a unique number, usually from 1 to num_images) in id.
  - The height and width of the image in height and width, respectively.
- annotations: A list of dictionaries, with each dictionary representing a unique bounding box (do not stack multiple bounding boxes into a single dictionary, even if they are for the same image!), and containing:
  - The area of the bounding box in area.
  - The bounding box itself in bbox. Note: the bounding box has four values; the first two are the x, y coordinates of its top-left corner, and the other two are its width and height.
  - The numerical class label of the annotation in category_id.
  - The ID (NOT the file name) of the image it corresponds to in image_id.
  - The ID of the bounding box in id. For instance, if a unique image has six corresponding bounding boxes, then each of them would be given an id from 1-6.
  - iscrowd, which should be set to 0 by default, unless the dataset explicitly comes with iscrowd as 1.
  - ignore, which should be 0 by default.
  - segmentation, which only applies to instance segmentation datasets. If converting an instance segmentation dataset to object detection, you can leave the polygonal segmentation as is. Otherwise, set this to an empty list.
- categories: A list of dictionaries, one per class, each containing:
  - The human-readable name of the class (e.g., "strawberry") in name.
  - The supercategory of the class, if there are nested classes, in supercategory. Otherwise, just leave this as the string "none".
  - The numerical ID of the class in id.
- info: A single dictionary with metadata and information about the dataset:
  - description: A basic description of the dataset.
  - url: The URL from which the dataset was acquired.
  - version: The dataset version. Set to 1.0 if unknown.
  - year: The year in which the dataset was released.
  - contributor: The author(s) of the dataset.
  - date_created: The date when the dataset was published. Give an approximate year if unknown.
The dictionary containing this information should be written to a file called annotations.json, and the file structure will be:
<dataset name>
├── annotations.json
└── images
    ├── image1.png
    ├── image2.png
    └── image3.png
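To make the structure concrete, here is a minimal, hypothetical Python sketch that assembles such a dictionary for a single image with one bounding box and writes it out; the file names, IDs, and values are purely illustrative:

import json

annotations = {
    "images": [
        # One entry per image; "id" is referenced by the annotations below.
        {"file_name": "image1.png", "id": 1, "height": 1080, "width": 1920}
    ],
    "annotations": [
        # One entry per bounding box; "bbox" is [x, y, width, height].
        {"area": 5000, "bbox": [100, 150, 50, 100], "category_id": 1,
         "image_id": 1, "id": 1, "iscrowd": 0, "ignore": 0, "segmentation": []}
    ],
    "categories": [
        {"name": "strawberry", "supercategory": "none", "id": 1}
    ],
    "info": {
        "description": "A basic description of the dataset.",
        "url": "https://example.com/original-dataset",
        "version": "1.0",
        "year": 2023,
        "contributor": "Dataset author(s)",
        "date_created": "2023"
    }
}

with open("annotations.json", "w") as f:
    json.dump(annotations, f, indent=4)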
Semantic Segmentation
Semantic segmentation datasets are constructed using pixel-wise annotation masks. Each image in the dataset has a corresponding annotation mask. These masks have the following properties:
- Two-dimensional, so no channel dimension. Their complete shape will be (image_height, image_width).
- Each pixel is a numerical class label, or 0 for background (see the sketch below).
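As a small, illustrative sketch of a valid mask (the image size and class values here are made up):

import numpy as np

# A mask for a 480x640 image: shape is (image_height, image_width),
# with no channel dimension.
mask = np.zeros((480, 640), dtype=np.uint8)

# Pixels belonging to class 1 and class 2; every other pixel stays 0 (background).
mask[100:200, 150:300] = 1
mask[300:400, 350:500] = 2

assert mask.ndim == 2                      # two-dimensional, no channels
assert set(np.unique(mask)) <= {0, 1, 2}   # integer class labels only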
The directory tree should look as follows:
<dataset name>
├── annotations
│   ├── mask1.png
│   ├── mask2.png
│   └── mask3.png
└── images
    ├── image1.png
    ├── image2.png
    └── image3.png
Contributing a Dataset
If you've found a new dataset that isn't already being used in AgML and you want to add it, there are a few things you need to do.
Any preprocessing code being used for the dataset can be kept in agml/_internal/preprocess.py, by adding an elif statement with the dataset name to the preprocess() method. If there is no preprocessing code, then just put a pass statement in the block.
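As an illustrative sketch only (the surrounding method signature and the existing branch are assumptions, not the actual contents of preprocess.py), the new branch might look like this:

def preprocess(self, dataset_name):
    if dataset_name == 'an_existing_dataset':
        ...  # existing preprocessing branches live here
    elif dataset_name == 'my_new_dataset':
        # Preprocessing code for the new dataset goes here. If the dataset
        # needs no preprocessing, use a single pass statement instead.
        pass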
Some Things to Check
- Make sure each image is stored as integers in the range 0-255, not as floats in the range 0-1; this prevents any loss of data that could adversely affect training (see the sketch below).
- For a semantic segmentation dataset, save the masks in png format as opposed to jpg or another lossy format.
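For example, a hypothetical snippet that rescales a float image in the range 0-1 to 8-bit integers before saving it (the helper function and file names are illustrative):

import numpy as np
from PIL import Image

def to_uint8(image):
    # Rescale a float image in [0, 1] to integers in [0, 255].
    if np.issubdtype(image.dtype, np.floating):
        image = np.round(image * 255)
    return image.astype(np.uint8)

image = to_uint8(np.random.rand(480, 640, 3))  # a stand-in float image
Image.fromarray(image).save('image1.png')      # masks should likewise be PNG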
Compiling the Dataset
After processing and standardizing the dataset, make sure it is organized in one of the formats above, then go to the parent directory of the dataset directory (for example, if the dataset is in /root/my_new_dataset, go to /root). Then run the following command:
zip -r my_new_dataset.zip my_new_dataset -x ".*"
If running on macOS, use the following command instead:
zip -r my_new_dataset.zip my_new_dataset -x ".*" -x "__MACOSX"
Updating the Source Files
Next, you need to update the public_datasources.json and source_citations.json files, both of which can be found in the agml/_assets folder. Update the public_datasources.json file in the following way:
"my_new_dataset": {
"classes": {
"1": "class_1",
"2": "class_2",
"3": "class_3"
},
"ml_task": "See the table for the different dataset types.",
"ag_task": "The agricultural task that is associated with the dataset.",
"location": {
"continent": "The continent the dataset was collected on.",
"country": "The country the dataset was collected in."
},
"sensor_modality": "Usually rgb, but can include other image modalities.",
"real_synthetic": "Are the images real or synthetically generated?",
"platform": "handheld or ground",
"input_data_format": "See the table for the different dataset types.",
"annotation_format": "See the table for the different dataset types.",
"n_images": "The total number of images in the dataset.",
"docs_url": "Where can the user find the most clear information about the dataset?"
}
Note: If the dataset is captured in multiple countries or you don't know where it is from, then put "worldwide" for both "continent" and "country".
Dataset Format | ml_task | annotation_format
---|---|---
Image Classification | image_classification | directory_names
Object Detection | object_detection | coco_json
Semantic Segmentation | semantic_segmentation | image
The source_citations.json file should be updated this way:
"my_new_dataset": {
"license": "The license being used by the dataset.",
"citation": "The paper/library to cite for the dataset."
}
If the dataset has no license or no citation, leave the corresponding value as an empty string.
Uploading the Dataset
Once you've readied the dataset, create a new pull request on the AgML repository. We will then review the changes and follow up with the next steps for adding the dataset to AgML's public data storage.
Development Guidelines
Installing uv
Dependencies and admin actions are managed using uv. To install uv, follow the guidelines at https://docs.astral.sh/uv/getting-started/installation/; it is recommended to use the standalone installer.
Building the Project
To sync the dependencies and create a local environment that satisfies the requirements, simply run:
make install
This will install both the requirements and the necessary development dependencies, such as the dev and docs dependency groups. To build the wheels:
make build
Running tests
To run all the tests with associated coverage:
make test
Running scripts
For running scripts or one-offs using the project's installed environment:
uv run python <script>