🔧 issue-grooming-env

An OpenEnv-compatible Reinforcement Learning environment that trains AI agents to do open-source issue triage — the way a real maintainer would.

OpenEnv · RL Meta Hackathon · Python 3.10+ · Pydantic v2

What is this project?

Small open-source repositories (under 500 GitHub stars) accumulate issue debt fast — duplicates, vague bug reports, stale PRs, and requests the maintainer will never action. Maintainers spend their weekends cleaning this up by hand. This project trains an LLM agent to do it for them.

The agent reads a backlog of GitHub-style issues and must: (1) triage each item as keep, close, need-info, or duplicate, (2) link duplicates to their canonical issue, and (3) assign a release priority. A programmatic grader scores the agent against a hidden ground-truth label set.

👥

Real-world domain

1–20 open-source maintainers. A problem they face every weekend. Not a toy benchmark.

📏

Deterministic grading

Graders return a float in [0.0, 1.0] based on hidden ground truth. Same input always produces the same score.

🔄

Multi-step consistency

Global consistency matters — the agent must maintain coherent decisions across the entire backlog, not just per-item.

📈

Incremental reward

Reward is emitted on every step, not just at the end. Good actions earn +0.08 to +0.12. Bad ones penalize up to −0.15.

Architecture — how the files connect

Project structure & data flow
- inference.py: entry point, reads env vars
- Environment vars: HF_TOKEN · API_BASE_URL · MODEL_NAME
- LLM API: Groq · HF Router · OpenRouter
- IssueGroomingEnv (env/environment.py): reset / step / grade
- models.py: Issue · Action · Observation · Reward
- tasks/: task_easy · task_medium · task_hard
- graders/: grader_easy · medium · hard
- Ground truth label sets: hidden correct triage + priority per issue
- openenv.yaml: metadata for the hackathon validator
- Dockerfile: packages everything for submission
- stdout: [START] [STEP] [END] lines, the required hackathon output format

Why does each file exist?

inference.py — the entry point

This is the only file the hackathon validator cares about. It must live at the project root. It reads three environment variables, connects to the LLM, runs the agent loop over all three tasks, and prints the required [START] / [STEP] / [END] lines to stdout. Every other file exists to support it.

The hackathon requires inference.py in the root directory. If it's inside a subfolder like scripts/, submission validation fails automatically.

env/environment.py — the RL environment

Implements the OpenEnv interface: reset() loads a task's issues into memory, step(action) applies one agent action and returns (observation, reward, done, info), and grade() calls the task's grader and returns a final score. It also handles loop detection — if the agent keeps repeating the same action, penalty accumulates at −0.05 per repeat.
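
The loop-detection rule can be sketched as follows. This is a minimal illustration, assuming a simple "consecutive identical action" definition of a repeat; the class name LoopDetector and the accumulation scheme are assumptions, not the environment's actual implementation.

```python
class LoopDetector:
    """Accumulates a penalty each time the agent repeats its previous action."""

    def __init__(self, penalty_per_repeat: float = -0.05):
        self.penalty_per_repeat = penalty_per_repeat
        self.last_action = None
        self.repeats = 0

    def penalty_for(self, action) -> float:
        # A "repeat" is an action identical to the immediately preceding one.
        if action == self.last_action:
            self.repeats += 1
        else:
            self.repeats = 0
            self.last_action = action
        # Penalty grows with each consecutive repeat: -0.05, -0.10, ...
        return self.penalty_per_repeat * self.repeats
```

In this sketch a fresh action costs nothing, while each consecutive repeat adds another −0.05, which matches the "accumulates per repeat" behavior described above.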

env/models.py — why Pydantic?

Pydantic is imported because it provides data validation with zero boilerplate. Issue, Action, Observation, and Reward are all BaseModel subclasses. This means:

- Every field is validated at construction time, so malformed agent output fails loudly instead of propagating silently.
- Values are coerced to their declared types (e.g. Reward.value is always a float, Action.action_type is always a string).
- Validation errors report the exact field and reason, which makes debugging bad LLM responses straightforward.
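
As an illustration, here is what the Action and Reward models could look like. This is a sketch based on the field names mentioned in this README (action_type, payload, value, message); the real models.py may differ.

```python
from pydantic import BaseModel, ValidationError

class Action(BaseModel):
    action_type: str   # e.g. "triage_item", "mark_duplicate", "done"
    payload: dict = {}  # action arguments, e.g. {"item_id": 2, "decision": "keep"}

class Reward(BaseModel):
    value: float   # Pydantic coerces compatible values and rejects the rest
    message: str = ""

# Malformed agent output fails loudly instead of causing silent bugs:
try:
    Action(action_type=None, payload="not a dict")
except ValidationError as exc:
    print(f"rejected with {len(exc.errors())} validation errors")
```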

env/tasks/task_easy/medium/hard.py

Each file contains two things: a list of Issue objects (the synthetic backlog) and a GROUND_TRUTH dict mapping item IDs to the correct triage decision, priority, and duplicate target. The ground truth is never shown to the agent — it's only used by the grader.
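
The shape of a task file can be sketched like this. The IDs and text are invented, and plain dicts stand in for the real Issue objects:

```python
# Synthetic backlog shown to the agent (real files use Issue model objects).
ISSUES = [
    {"id": 1, "title": "KeyError in config loader", "body": "..."},
    {"id": 2, "title": "Crash when config key missing", "body": "..."},
]

# Hidden labels consumed only by the grader, never shown to the agent.
GROUND_TRUTH = {
    1: {"decision": "keep",      "priority": "next_release", "duplicate_of": None},
    2: {"decision": "duplicate", "priority": "backlog",      "duplicate_of": 1},
}
```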

env/graders/grader_*.py

Each grader receives the agent's trajectory and final state, then computes a weighted score. All graders are deterministic — same input always gives the same output. The weights are: triage accuracy 55%, priority accuracy 35%, duplicate detection 10%. Hard difficulty adds extra penalties for deprioritizing security issues.
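
The weighted score can be written out directly. The weights and the clamp to [0.0, 1.0] come from this README; the function name and signature are illustrative, not the grader's real API:

```python
def grade(triage_acc: float, priority_acc: float, dup_acc: float,
          penalties: float = 0.0) -> float:
    """Weighted score: 55% triage, 35% priority, 10% duplicates.

    Penalties are subtracted from the raw score before clamping to [0, 1].
    """
    raw = 0.55 * triage_acc + 0.35 * priority_acc + 0.10 * dup_acc
    return min(1.0, max(0.0, raw - penalties))
```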

openenv.yaml

Metadata consumed by the openenv validate CLI and the hackathon submission system. Lists task IDs, difficulty levels, action space description, and reward range. Without this file, openenv validate . fails.

Dockerfile

Packages the entire project into a container with Python 3.11 and the two required dependencies (openai, pydantic). The hackathon runs submissions inside Docker with 2 vCPU / 8 GB RAM. The entrypoint is python inference.py.

Why are these libraries imported?

| Library | Where used | Why it's needed |
| --- | --- | --- |
| openai | inference.py | The hackathon mandates the OpenAI client for all LLM calls. It supports any OpenAI-compatible endpoint via base_url, so the same code works with Groq, HuggingFace, OpenRouter, etc. |
| pydantic | models.py, all tasks | Strict data validation. Ensures Reward.value is always a float, Action.action_type is always a string, etc. Prevents silent bugs from malformed agent output. |
| os | inference.py | Reads environment variables (HF_TOKEN, API_BASE_URL, MODEL_NAME) via os.getenv(). |
| json | inference.py | Parses the LLM's JSON response into an Action object. Also handles stripping markdown code fences that some models include. |
| sys | task_*.py | Adds the project root to Python's module search path so tasks can import from env.models without needing a package install. |
| copy | environment.py | Deep-copies issue lists on reset() so each episode starts from a clean state without mutating the original template. |
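
The fence-stripping JSON parsing mentioned for json above can be sketched like this; the helper name parse_action_json is illustrative:

```python
import json

def parse_action_json(raw: str) -> dict:
    """Parse an LLM reply into a dict, tolerating markdown code fences."""
    text = raw.strip()
    # Some models wrap their JSON in fences like ```json ... ```
    if text.startswith("```"):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit("```", 1)[0]
    return json.loads(text)
```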

Environment variables

HF_TOKEN
Required ยท no default
Your API key. Passed as api_key to the OpenAI client. For HuggingFace Router this is your HF token. For Groq it's your gsk_... key. For OpenRouter it's your sk-or-... key.
API_BASE_URL
Optional ยท default: https://api.openai.com/v1
The LLM endpoint. Change this to point at Groq, HuggingFace Router, or any other OpenAI-compatible server.
MODEL_NAME
Optional ยท default: gpt-4.1-mini
The model identifier. Must match exactly what the chosen provider expects. E.g. llama-3.1-8b-instant for Groq, Qwen/Qwen2.5-7B-Instruct for HF Router.

How the agent loop works

Per-task inference loop
1. env.reset(task_id)
2. format_observation(obs): shows the next untriaged item to the model
3. client.chat.completions.create(): full conversation history in context
4. parse_action(raw_text): extracts the first valid JSON block
5. env.step(action): returns obs, reward, done, info and prints [STEP]
6. If not done, loop back to step 2; once done, call env.grade()
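
The loop above can be sketched in a few lines. The helper implementations here are placeholders, the env/client signatures follow the file descriptions in this README but are assumptions, and max_steps=50 mirrors the step cap mentioned later:

```python
import json

def format_observation(obs) -> str:
    # Placeholder: the real helper renders the next untriaged item.
    return str(obs)

def parse_action(raw: str) -> dict:
    # Placeholder: the real helper extracts the first valid JSON block.
    return json.loads(raw)

def run_task(env, client, model: str, task_id: str, max_steps: int = 50) -> float:
    obs = env.reset(task_id)
    messages = [{"role": "system", "content": "You are an issue-triage agent."}]
    for step in range(1, max_steps + 1):
        # Show the next item while keeping the full conversation in context.
        messages.append({"role": "user", "content": format_observation(obs)})
        raw = client.chat.completions.create(
            model=model, messages=messages
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": raw})
        obs, reward, done, info = env.step(parse_action(raw))
        print(f"[STEP] step={step} reward={reward.value:+.2f} "
              f"done={str(done).lower()}")  # simplified [STEP] line
        if done:
            break
    return env.grade()
```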

The three tasks explained

| Task | Items | What the agent faces | Key challenges |
| --- | --- | --- | --- |
| easy | 10 | Clean backlog with clear labels. One obvious duplicate pair (#3 and #7, both report the same KeyError). Two items to close outright (Python 2 request, GUI request). | Learning the basic signal: what's a real bug vs noise |
| medium | 30 | Noisy descriptions, borderline cases, multiple duplicate clusters, stale items (365-day-old question). Two open PRs that need to stay in the backlog. | Distinguishing "need-info" from "close", finding the right duplicate_of target |
| hard | 61 | Two-release milestone scope (v2.6 + v3.0), security issues that carry extra penalties if deprioritized, 6 cascading duplicate clusters, stale PRs from 2 years ago, conflicting community signals. | Global consistency, security sensitivity, release scoping, cascading duplicates |

How reward is calculated

✅ Correct triage (+0.10)

Agent assigns the exact right decision: keep, close, need-info, or duplicate. Most common reward in a good run.

🔗 Correct duplicate (+0.12)

Agent not only marks item as duplicate but also links it to the right canonical issue ID. Slightly higher reward because it requires cross-backlog awareness.

โญ Correct priority (+0.08)

Assigns the right bucket: next_release, backlog, or wont_fix. Adjacent errors get partial credit (+0.02).

โŒ Wrong close (โˆ’0.15)

Closing a valid issue or a need-info item. The harshest single-action penalty — maintainers hate when their valid bugs get closed.

๐Ÿ” Loop penalty (โˆ’0.05/repeat)

Each time the agent repeats an identical action, penalty accumulates. Forces the model to make progress rather than getting stuck.

๐Ÿ›ก๏ธ Security penalty (โˆ’0.06, hard only)

In the hard task, failing to triage security issues or deprioritizing them carries double the normal penalty.
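
For reference, the per-action rewards above can be collected into one table. This is a sketch for readability; the real values live inside the environment's step logic:

```python
STEP_REWARDS = {
    "correct_triage":    +0.10,  # exact keep / close / need-info / duplicate
    "correct_duplicate": +0.12,  # also linked to the right canonical issue
    "correct_priority":  +0.08,  # right bucket; adjacent errors get +0.02
    "adjacent_priority": +0.02,  # partial credit for a near-miss bucket
    "wrong_close":       -0.15,  # closing a valid or need-info item
    "loop_repeat":       -0.05,  # per identical repeated action, accumulates
    "security_miss":     -0.06,  # hard task only
}
```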

Understanding the output you saw

Your run produced scores of easy: 1.0000, medium: 0.9554, hard: 0.2828. Here's what those numbers mean and why the gap exists:

Easy — 10 issues, clear signals: 1.0000 (perfect)
The LLM correctly identified all triage decisions, both duplicates, all priorities. No surprises.

Medium — 30 issues, noisy signals: 0.9554 (excellent)
A small number of borderline cases or near-miss priority assignments caused the slight drop from perfect.

Hard — 61 issues, cascading complexity: 0.2828 (needs work)
The hard task has 6 duplicate clusters, security penalties, and a 61-item backlog. Smaller models struggle to maintain global consistency over many items. The 50-step limit (from the max_steps cap) also forces truncation before all items are triaged.
The hard score gap is expected with a 7B model like Qwen2.5-7B. Larger models (70B+) score significantly higher on hard because they can reason about cross-item consistency across a longer context.

The required stdout format

The hackathon validator reads stdout line by line and parses these three line types exactly:

[START] task=easy env=issue-grooming-env model=Qwen/Qwen2.5-7B-Instruct
[STEP] step=1 action=triage_item({"item_id":1,"decision":"keep"}) reward=0.10 done=false error=null
[STEP] step=2 action=mark_duplicate({"item_id":3,"duplicate_of":2}) reward=0.12 done=false error=null
[STEP] step=10 action=done({}) reward=-0.20 done=true error=null
[END] success=true steps=10 rewards=0.10,0.12,0.08,...,-0.20

Rules: one [START] per task, one [STEP] per env.step() call, one [END] always emitted even on exception. success=true means the final grade was ≥ 0.5.

Real-world use cases

🤖

GitHub Actions bot

Run the trained agent as a scheduled workflow. It reads open issues via the GitHub API, posts triage decisions as comments, and applies labels — with a maintainer approval step before closing anything.

💬

Slack maintainer assistant

When a new issue is filed, the bot posts a suggested triage card in a Slack channel. The maintainer clicks ✅ or ✏️ to override. Over time the model learns from corrections.

📋

Weekly grooming report

Every Sunday, the agent processes the backlog and generates a prioritized report: "12 items to close, 3 need-info responses to send, 5 ready to ship in v2.6." The maintainer reviews before acting.

🔍

Duplicate detector

Used as a focused duplicate-detection step: when a new issue is filed, instantly check if it matches an existing open issue and suggest a merge. Reduces noise before it accumulates.

Setup and running

Prerequisites

- Python 3.10+ (the Docker image uses 3.11)
- An API key for an OpenAI-compatible provider (Groq, HuggingFace Router, or OpenRouter)
- Docker, only if you build the hackathon submission image

Install and run (PowerShell / Windows)

python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt

$env:HF_TOKEN     = "gsk_your_groq_key"
$env:API_BASE_URL = "https://api.groq.com/openai/v1"
$env:MODEL_NAME   = "llama-3.1-8b-instant"

python inference.py

Install and run (bash / macOS / Linux)

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

export HF_TOKEN="gsk_your_groq_key"
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"

python inference.py

Docker (for hackathon submission)

docker build -t issue-grooming-env .
docker run -e HF_TOKEN=$env:HF_TOKEN `
           -e API_BASE_URL=$env:API_BASE_URL `
           -e MODEL_NAME=$env:MODEL_NAME `
           issue-grooming-env

Use as a Python library

from env import IssueGroomingEnv, Action

env = IssueGroomingEnv(task_id="easy")
obs = env.reset()

action = Action(
    action_type="triage_item",
    payload={"item_id": 2, "decision": "keep"}
)
obs, reward, done, info = env.step(action)
print(reward.message)  # "Triaged #2 as 'keep'. Score: +0.10"

final_score = env.grade()  # float in [0.0, 1.0]

Grading formula

Score composition (all three tasks)
- Triage accuracy (keep / close / need-info / duplicate): 55% weight
- Priority: 35% weight
- Duplicates: 10% weight

Penalties subtract from the raw score before clamping to [0.0, 1.0]. The hard grader adds an extra penalty if security issues are deprioritized.

What's currently implemented vs what's next

| Feature | Status |
| --- | --- |
| 3 tasks (easy / medium / hard) with synthetic issues | Done |
| Deterministic graders with partial credit | Done |
| Loop detection and penalties | Done |
| Hackathon-compliant stdout format | Done |
| OpenEnv interface (reset / step / state / grade) | Done |
| Improve hard task score with larger model | In progress |
| Real GitHub issue data via API (instead of synthetic) | Planned |
| Maintainer feedback loop for ground truth correction | Planned |
| Fine-tuning on successful trajectories | Planned |
issue-grooming-env · Meta OpenEnv Hackathon submission · Python 3.11 · openai · pydantic