FluxA
Harness for Agent Payment

Send us your predictions.

Drop your predictions file here
or click to browse · max 50 MB
predictions.json: instance_id + decision
By submitting, you agree to the benchmark license and to being listed publicly.

Quick Start

1. Download the evaluation set

MSAB-Eval-v2.2-Hard (Payment, Data-driven): 1,020 payment events across 15 attack families. The families fall into two groups: Group A (8 semantic/URL families) and Group B (7 data-driven families built from L2 host reputation, L3 payTo reputation, the L4 (host, payTo) on-chain alignment audit, and a low-reputation host denylist). For each event, decide whether to approve or reject the payment, described by its (amount, description, resource, pay_to, host), against its mandate.

MSAB-v2.2-Hard eval set

2. Run your model

MSAB-v2.2-Hard:

import json

with open("msab_eval_v2_2_hard_unlabeled.json") as f:
    events = json.load(f)  # list of 1020 events

predictions = []
for event in events:
    p = event["event_snapshot"]["event_property"]
    payment = {
        "amount": p["amount"],
        "description": p["description"],
        "resource": p["resource"],
        "pay_to": p["pay_to"],
        "host": p["host"],
        "network": p["network"],
    }
    mandate = {
        "natural_language": p["mandate_natural_language"],
        "limit_amount": p["mandate_limit_amount"],
        "currency": p["mandate_currency"],
    }

    # Replace with your model's prediction
    decision = your_model.predict(payment, mandate)  # "approve" or "reject"

    predictions.append({
        "instance_id": event["event_id"],
        "decision": decision
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f)
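
The `your_model` object in the loop above is a placeholder. As a minimal sketch of something that could stand in for it while you wire up the pipeline, the hypothetical `LimitBaseline` class below approves a payment only when its amount is non-negative and within the mandate limit. It assumes `amount` and `limit_amount` can be parsed as decimal numbers in the same unit, which this page does not specify, and it ignores every other field, so it will not catch attacks that stay under the limit.

```python
from decimal import Decimal, InvalidOperation


class LimitBaseline:
    """Hypothetical baseline: approve only if 0 <= amount <= mandate limit."""

    def predict(self, payment, mandate):
        try:
            # Assumption: both values parse as decimals in the same unit.
            amount = Decimal(str(payment["amount"]))
            limit = Decimal(str(mandate["limit_amount"]))
            within_limit = Decimal(0) <= amount <= limit
        except (InvalidOperation, KeyError):
            # Missing, unparseable, or NaN values: fail closed.
            return "reject"
        return "approve" if within_limit else "reject"


your_model = LimitBaseline()
```

Rejecting on any parsing problem is a design choice made for this sketch: an unreadable amount is treated as unsafe to approve.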

3. Upload your predictions

Upload your predictions.json file above. We score it server-side against the 1,020 held-out labels and publish your results on the leaderboard. Predictions that are not exactly "approve" or "reject" count as wrong.
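
Because a decision with the wrong spelling or casing counts as wrong, it can help to check the file locally before uploading. The sketch below is one possible pre-upload check, not the server's validation logic: `check_predictions` is a hypothetical helper, and flagging duplicate IDs or an unexpected entry count is a precaution, since this page does not say how the server treats them.

```python
import json

ALLOWED = ("approve", "reject")


def check_predictions(predictions, expected_count=1020):
    """Return a list of problems found; an empty list means none were found."""
    if not isinstance(predictions, list):
        return ["top-level JSON value must be a list"]
    problems = []
    if len(predictions) != expected_count:
        problems.append(
            f"expected {expected_count} entries, found {len(predictions)}"
        )
    seen = set()
    for i, row in enumerate(predictions):
        if not isinstance(row, dict):
            problems.append(f"entry {i}: not a JSON object")
            continue
        if row.get("instance_id") is None:
            problems.append(f"entry {i}: missing instance_id")
        else:
            # Serialize the ID so any JSON value can be used as a set key.
            key = json.dumps(row["instance_id"], sort_keys=True)
            if key in seen:
                problems.append(f"entry {i}: duplicate instance_id {key}")
            seen.add(key)
        if row.get("decision") not in ALLOWED:
            problems.append(
                f"entry {i}: decision must be exactly 'approve' or 'reject'"
            )
    return problems


# Small in-memory example with a duplicate ID and a miscased decision.
sample = [
    {"instance_id": "evt-1", "decision": "approve"},
    {"instance_id": "evt-1", "decision": "Approve"},
]
for problem in check_predictions(sample, expected_count=2):
    print(problem)
```

To check a real file, load it with `json.load` and pass the result to `check_predictions` with the default count of 1,020.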