
MSAB-Eval-v2.2-Hard (Payment, Data-driven): 1,020 payment events across 15 attack families. Group A covers 8 semantic/URL families; Group B covers 7 data-driven families built from L2 host reputation, L3 payTo reputation, the L4 (host, payTo) on-chain alignment audit, and a low-reputation host denylist. For each payment, decide whether to approve or reject its (amount, description, resource, pay_to, host) against its mandate.
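To make the shape of the decision concrete, here is a minimal sketch of a naive baseline that only compares the payment amount with the mandate limit. The function name `baseline_predict` and the example values are hypothetical, and the sketch assumes that `amount` and `limit_amount` are numbers or numeric strings, which the dataset may not guarantee. It ignores description, resource, pay_to, and host entirely, so it is a starting point rather than a competitive approach.

```python
from decimal import Decimal, InvalidOperation


def baseline_predict(payment: dict, mandate: dict) -> str:
    """Hypothetical baseline: approve only if 0 <= amount <= mandate limit."""
    try:
        # Assumption: both values parse as decimals (numbers or numeric strings).
        amount = Decimal(str(payment["amount"]))
        limit = Decimal(str(mandate["limit_amount"]))
    except (InvalidOperation, KeyError):
        # Missing or unparseable values are treated as unsafe.
        return "reject"
    if not (amount.is_finite() and limit.is_finite()):
        # NaN and infinity cannot be compared meaningfully.
        return "reject"
    return "approve" if Decimal(0) <= amount <= limit else "reject"


# Made-up example values, not taken from the dataset.
payment = {
    "amount": "12.50",
    "description": "API access",
    "resource": "https://example.com/api",
    "pay_to": "0xabc",
    "host": "example.com",
}
mandate = {
    "natural_language": "Pay up to 20 for API access",
    "limit_amount": "20",
    "currency": "USDC",
}
print(baseline_predict(payment, mandate))
```

Rejecting on any parse failure keeps the baseline conservative: a payment the code cannot interpret is never approved.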
MSAB-v2.2-Hard:
import json

with open("msab_eval_v2_2_hard_unlabeled.json") as f:
    events = json.load(f)  # list of 1020 events

predictions = []
for event in events:
    p = event["event_snapshot"]["event_property"]
    payment = {
        "amount": p["amount"],
        "description": p["description"],
        "resource": p["resource"],
        "pay_to": p["pay_to"],
        "host": p["host"],
        "network": p["network"],
    }
    mandate = {
        "natural_language": p["mandate_natural_language"],
        "limit_amount": p["mandate_limit_amount"],
        "currency": p["mandate_currency"],
    }
    # Replace with your model's prediction
    decision = your_model.predict(payment, mandate)  # "approve" or "reject"
    predictions.append({
        "instance_id": event["event_id"],
        "decision": decision,
    })

with open("predictions.json", "w") as f:
    json.dump(predictions, f)

Upload your predictions.json file above. We score it server-side against the 1,020 held-out labels and publish your results on the leaderboard. Predictions that are not exactly "approve" or "reject" count as wrong.
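Because any decision that is not exactly "approve" or "reject" counts as wrong, a quick local check before uploading can catch casing or whitespace slips. The sketch below is one way to do that. The helper `check_predictions` is hypothetical. Its count and duplicate-ID checks are extra sanity checks assumed here, not documented scoring rules, and it assumes `instance_id` values are hashable (for example, strings).

```python
ALLOWED_DECISIONS = ("approve", "reject")


def check_predictions(predictions, expected_count=1020):
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    if len(predictions) != expected_count:
        problems.append(
            f"expected {expected_count} predictions, got {len(predictions)}"
        )
    seen_ids = set()
    for i, row in enumerate(predictions):
        instance_id = row.get("instance_id")
        if instance_id is None:
            problems.append(f"row {i}: missing instance_id")
        elif instance_id in seen_ids:
            problems.append(f"row {i}: duplicate instance_id {instance_id!r}")
        else:
            seen_ids.add(instance_id)
        decision = row.get("decision")
        # Exact match only: "Reject" or "approve " would be scored as wrong.
        if decision not in ALLOWED_DECISIONS:
            problems.append(f"row {i}: invalid decision {decision!r}")
    return problems


# Made-up rows for illustration; the second has the wrong casing.
sample = [
    {"instance_id": "evt-1", "decision": "approve"},
    {"instance_id": "evt-2", "decision": "Reject"},
]
for problem in check_predictions(sample, expected_count=2):
    print(problem)
```

To check a real submission, load `predictions.json` with `json.load` and pass the resulting list to `check_predictions` with the default `expected_count`.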