An OpenEnv-compatible Reinforcement Learning environment that trains AI agents to do open-source issue triage the way a real maintainer would.
Small open-source repositories (under 500 GitHub stars) accumulate issue debt fast: duplicates, vague bug reports, stale PRs, and requests the maintainer will never action. Maintainers spend their weekends cleaning this up by hand. This project trains an LLM agent to do it for them.
The agent reads a backlog of GitHub-style issues and must: (1) triage each item as keep, close, need-info, or duplicate, (2) link duplicates to their canonical issue, and (3) assign a release priority. A programmatic grader scores the agent against a hidden ground-truth label set.
1 to 20 open-source maintainers. A problem they face every weekend. Not a toy benchmark.
Graders return a float in [0.0, 1.0] based on hidden ground truth. Same input always produces the same score.
Global consistency matters: the agent must maintain coherent decisions across the entire backlog, not just per-item.
Reward is emitted on every step, not just at the end. Good actions earn +0.08 to +0.12. Bad ones penalize up to -0.15.
**inference.py** (the entry point). This is the only file the hackathon validator cares about. It must live at the project root. It reads three environment variables, connects to the LLM, runs the agent loop over all three tasks, and prints the required [START] / [STEP] / [END] lines to stdout. Every other file exists to support it.
Keep inference.py in the root directory. If it's inside a subfolder like scripts/, submission validation fails automatically.
**env/environment.py** (the RL environment). Implements the OpenEnv interface: reset() loads a task's issues into memory, step(action) applies one agent action and returns (observation, reward, done, info), and grade() calls the task's grader and returns a final score. It also handles loop detection: if the agent keeps repeating the same action, a penalty accumulates at -0.05 per repeat.
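A minimal sketch of how such a per-repeat loop penalty could accumulate (the function name and signature are assumptions for illustration, not the actual environment.py code):

```python
def loop_penalty(history: list[str], action: str, per_repeat: float = -0.05) -> float:
    """Return the accumulated penalty for repeating `action`.

    `history` holds the signatures of previously applied actions; each
    consecutive trailing repeat of the same action adds another -0.05.
    Illustrative sketch only.
    """
    repeats = 0
    for prev in reversed(history):
        if prev == action:
            repeats += 1
        else:
            break
    return per_repeat * repeats
```

The first occurrence of an action costs nothing; only identical repeats are punished, which nudges the model toward making progress.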
**env/models.py** (why Pydantic?). Pydantic is imported because it provides data validation with zero boilerplate. Issue, Action, Observation, and Reward are all BaseModel subclasses. This means:

- .model_dump() converts any model to a plain dict (needed by graders)
- .model_copy(deep=True) deep-copies an issue without manual recursion

**env/tasks/task_easy/medium/hard.py**. Each file contains two things: a list of Issue objects (the synthetic backlog) and a GROUND_TRUTH dict mapping item IDs to the correct triage decision, priority, and duplicate target. The ground truth is never shown to the agent; it's only used by the grader.
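The two Pydantic conveniences can be demonstrated with a toy model (the field names here are hypothetical; the real definitions live in env/models.py):

```python
from pydantic import BaseModel

class Issue(BaseModel):
    """Toy stand-in for the real Issue model; fields are illustrative."""
    item_id: int
    title: str
    labels: list[str] = []

issue = Issue(item_id=3, title="KeyError in config loader", labels=["bug"])

# .model_dump() -> plain dict, the shape the graders consume.
as_dict = issue.model_dump()

# .model_copy(deep=True) -> fully independent copy; mutating the clone's
# list does not touch the original.
clone = issue.model_copy(deep=True)
clone.labels.append("duplicate")
```

Because the copy is deep, `issue.labels` still holds only `["bug"]` after the clone is mutated.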
**env/graders/grader_*.py**. Each grader receives the agent's trajectory and final state, then computes a weighted score. All graders are deterministic: same input always gives same output. The weights are: triage accuracy 55%, priority accuracy 35%, duplicate detection 10%. Hard difficulty adds extra penalties for deprioritizing security issues.
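The weighting scheme can be sketched as follows (illustrative; the real graders also subtract penalties before clamping):

```python
def weighted_score(triage_acc: float, priority_acc: float, dup_acc: float) -> float:
    """Combine per-category accuracies using the documented weights:
    triage 55%, priority 35%, duplicate detection 10%.
    Sketch only; the actual graders apply additional penalties.
    """
    raw = 0.55 * triage_acc + 0.35 * priority_acc + 0.10 * dup_acc
    # Final scores are clamped to the documented [0.0, 1.0] range.
    return max(0.0, min(1.0, raw))
```

A perfect triage pass with mediocre priorities still scores well, which matches the emphasis: getting keep/close/need-info/duplicate right is most of the grade.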
**openenv.yaml**. Metadata consumed by the openenv validate CLI and the hackathon submission system. Lists task IDs, difficulty levels, action space description, and reward range. Without this file, openenv validate . fails.
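A sketch of what such a file might contain; the field names below are guesses based on the description and should be checked against the actual openenv validate schema:

```yaml
# Hypothetical shape -- verify against `openenv validate .`
name: issue-grooming-env
tasks:
  - id: easy
  - id: medium
  - id: hard
action_space: "triage_item | mark_duplicate | set_priority | done"
reward_range: [-1.0, 1.0]
```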
**Dockerfile**. Packages the entire project into a container with Python 3.11 and the two required dependencies (openai, pydantic). The hackathon runs submissions inside Docker with 2 vCPU / 8 GB RAM. The entrypoint is python inference.py.
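A Dockerfile matching this description might look like the following sketch (not necessarily the project's actual file):

```dockerfile
# Sketch consistent with the description above.
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir openai pydantic
ENTRYPOINT ["python", "inference.py"]
```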
| Library | Where used | Why it's needed |
|---|---|---|
| openai | inference.py | The hackathon mandates the OpenAI client for all LLM calls. It supports any OpenAI-compatible endpoint via base_url, so the same code works with Groq, HuggingFace, OpenRouter, etc. |
| pydantic | models.py, all tasks | Strict data validation. Ensures Reward.value is always a float, Action.action_type is always a string, etc. Prevents silent bugs from malformed agent output. |
| os | inference.py | Reads environment variables (HF_TOKEN, API_BASE_URL, MODEL_NAME) via os.getenv(). |
| json | inference.py | Parses the LLM's JSON response into an Action object. Also handles stripping markdown code fences that some models include. |
| sys | task_*.py | Adds the project root to Python's module search path so tasks can import from env.models without needing a package install. |
| copy | environment.py | Deep-copies issue lists on reset() so each episode starts from a clean state without mutating the original template. |
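The fence-stripping JSON parse mentioned for json can be sketched like this (the function name is an assumption for illustration):

```python
import json

def parse_action(raw: str) -> dict:
    """Parse an LLM reply into a dict, tolerating markdown code fences.

    Some models wrap their JSON in ```json ... ``` fences; strip the
    opening and closing fence lines before calling json.loads.
    Illustrative sketch of the approach described above.
    """
    text = raw.strip()
    if text.startswith("```"):
        text = text.split("\n", 1)[1]    # drop the opening ```json line
        text = text.rsplit("```", 1)[0]  # drop the closing ``` fence
    return json.loads(text)
```

Plain JSON passes through untouched; fenced JSON is unwrapped first, so both model styles parse to the same Action payload.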
HF_TOKEN is passed as api_key to the OpenAI client. For HuggingFace Router this is your HF token. For Groq it's your gsk_... key. For OpenRouter it's your sk-or-... key. MODEL_NAME is the model identifier, e.g. llama-3.1-8b-instant for Groq or Qwen/Qwen2.5-7B-Instruct for HF Router.

| Task | Items | What the agent faces | Key challenges |
|---|---|---|---|
| easy | 10 | Clean backlog with clear labels. One obvious duplicate pair (#3 and #7, both report the same KeyError). Two items to close outright (Python 2 request, GUI request). | Learning the basic signal: what's a real bug vs noise |
| medium | 30 | Noisy descriptions, borderline cases, multiple duplicate clusters, stale items (365-day-old question). Two open PRs that need to stay in the backlog. | Distinguishing "need-info" from "close", finding the right duplicate_of target |
| hard | 61 | Two-release milestone scope (v2.6 + v3.0), security issues that carry extra penalties if deprioritized, 6 cascading duplicate clusters, stale PRs from 2 years ago, conflicting community signals. | Global consistency, security sensitivity, release scoping, cascading duplicates |
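For reference, the three environment variables described earlier can be read into a client configuration like this (the fallback values shown are examples only, not defaults the project guarantees):

```python
import os

def load_config() -> dict:
    """Read the three required environment variables.

    Fallbacks are illustrative; in the hackathon these are always set.
    """
    return {
        "api_key": os.getenv("HF_TOKEN", ""),
        "base_url": os.getenv("API_BASE_URL", "https://api.groq.com/openai/v1"),
        "model": os.getenv("MODEL_NAME", "llama-3.1-8b-instant"),
    }

# inference.py would then construct the client along these lines:
# client = OpenAI(api_key=cfg["api_key"], base_url=cfg["base_url"])
```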
Agent assigns the exact right decision: keep, close, need-info, or duplicate. Most common reward in a good run.
Agent not only marks item as duplicate but also links it to the right canonical issue ID. Slightly higher reward because it requires cross-backlog awareness.
Assigns the right bucket: next_release, backlog, or wont_fix. Adjacent errors get partial credit (+0.02).
Closing a valid issue or a need-info item. The harshest single-action penalty โ maintainers hate when their valid bugs get closed.
Each time the agent repeats an identical action, penalty accumulates. Forces the model to make progress rather than getting stuck.
In the hard task, failing to triage security issues or deprioritizing them carries double the normal penalty.
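Taken together, the reward shaping above can be sketched as a lookup table. The event names are hypothetical; the values come from the figures quoted in this section (+0.10 triage, +0.12 duplicate link, +0.02 adjacent priority, -0.15 worst single penalty, -0.05 per repeat, doubled penalties on security issues):

```python
# Hypothetical event names; values taken from the figures quoted above.
REWARDS = {
    "correct_triage": 0.10,       # most common reward in a good run
    "duplicate_linked": 0.12,     # higher: needs cross-backlog awareness
    "adjacent_priority": 0.02,    # partial credit for a near-miss bucket
    "closed_valid_issue": -0.15,  # harshest single-action penalty
    "repeated_action": -0.05,     # per identical repeat (loop detection)
}

def step_reward(event: str, security_issue: bool = False) -> float:
    """Return the shaped reward for one event (illustrative sketch)."""
    r = REWARDS[event]
    # Hard task: mishandling security issues carries double the penalty.
    if security_issue and r < 0:
        r *= 2
    return r
```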
A sample run produced scores of easy: 1.0000, medium: 0.9554, hard: 0.2828. The gap tracks task difficulty: the hard task's 61 items, cascading duplicate clusters, security penalties, and two-release scoping demand a level of global consistency that small models struggle to maintain.
The hackathon validator reads stdout line by line and parses these three line types exactly:
```
[START] task=easy env=issue-grooming-env model=Qwen/Qwen2.5-7B-Instruct
[STEP] step=1 action=triage_item({"item_id":1,"decision":"keep"}) reward=0.10 done=false error=null
[STEP] step=2 action=mark_duplicate({"item_id":3,"duplicate_of":2}) reward=0.12 done=false error=null
[STEP] step=10 action=done({}) reward=-0.20 done=true error=null
[END] success=true steps=10 rewards=0.10,0.12,0.08,...,-0.20
```
Rules: one [START] per task, one [STEP] per env.step() call, one [END] always emitted even on exception. success=true means the final grade was ≥ 0.5.
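Helpers that format these three line types might look like the following sketch (field layout taken from the examples above; this is illustrative, not the project's actual code):

```python
def start_line(task: str, env_name: str, model: str) -> str:
    """Format the one [START] line emitted per task."""
    return f"[START] task={task} env={env_name} model={model}"

def step_line(step: int, action: str, reward: float, done: bool, error=None) -> str:
    """Format one [STEP] line per env.step() call."""
    err = error if error else "null"
    return (f"[STEP] step={step} action={action} "
            f"reward={reward:.2f} done={str(done).lower()} error={err}")

def end_line(success: bool, steps: int, rewards: list[float]) -> str:
    """Format the [END] line; emitted exactly once, even on exception."""
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={str(success).lower()} steps={steps} rewards={joined}"
```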
Run the trained agent as a scheduled workflow. It reads open issues via the GitHub API, posts triage decisions as comments, and applies labels โ with a maintainer approval step before closing anything.
When a new issue is filed, the bot posts a suggested triage card in a Slack channel. The maintainer clicks to approve or override. Over time the model learns from corrections.
Every Sunday, the agent processes the backlog and generates a prioritized report: "12 items to close, 3 need-info responses to send, 5 ready to ship in v2.6." The maintainer reviews before acting.
Used as a focused duplicate-detection step: when a new issue is filed, instantly check if it matches an existing open issue and suggest a merge. Reduces noise before it accumulates.
```powershell
python -m venv .venv
.venv\Scripts\activate
pip install -r requirements.txt
$env:HF_TOKEN = "gsk_your_groq_key"
$env:API_BASE_URL = "https://api.groq.com/openai/v1"
$env:MODEL_NAME = "llama-3.1-8b-instant"
python inference.py
```
```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
export HF_TOKEN="gsk_your_groq_key"
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="llama-3.1-8b-instant"
python inference.py
```
```powershell
docker build -t issue-grooming-env .
docker run -e HF_TOKEN=$env:HF_TOKEN `
    -e API_BASE_URL=$env:API_BASE_URL `
    -e MODEL_NAME=$env:MODEL_NAME `
    issue-grooming-env
```
```python
from env import IssueGroomingEnv, Action

env = IssueGroomingEnv(task_id="easy")
obs = env.reset()

action = Action(
    action_type="triage_item",
    payload={"item_id": 2, "decision": "keep"},
)
obs, reward, done, info = env.step(action)
print(reward.message)  # "Triaged #2 as 'keep'. Score: +0.10"

final_score = env.grade()  # float in [0.0, 1.0]
```
Penalties subtract from the raw score before clamping to [0.0, 1.0]. The hard grader adds an extra penalty if security issues are deprioritized.
| Feature | Status |
|---|---|
| 3 tasks (easy / medium / hard) with synthetic issues | Done |
| Deterministic graders with partial credit | Done |
| Loop detection and penalties | Done |
| Hackathon-compliant stdout format | Done |
| OpenEnv interface (reset / step / state / grade) | Done |
| Improve hard task score with larger model | In progress |
| Real GitHub issue data via API (instead of synthetic) | Planned |
| Maintainer feedback loop for ground truth correction | Planned |
| Fine-tuning on successful trajectories | Planned |