
stock trader rl environment

#python #fastapi #pydantic #docker #websocket #openenv #hugging face spaces

openenv-compliant reinforcement learning environment for evaluating llm trading agents on indian equity markets. qualified for the meta pytorch openenv hackathon finale (top 800 out of 32,000+ teams).

/ what it does

  • simulates daily stock trading on 68 nifty stocks using ~5 years of real historical ohlcv data
  • agents connect via http/websocket, receive market observations with technical indicators, and respond with plain-text trade actions (buy, sell, hold); a minimal client loop is sketched after this list
  • three difficulty tiers: single stock (20 days), portfolio (30 days), full autonomous (40 days) — each with escalating constraints like transaction costs, slippage, position limits, and regime gates
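a minimal agent loop over the environment's http api, assuming a local deployment. the endpoint paths ("/reset", "/step") and payload fields here are illustrative guesses, not the project's confirmed contract:

```python
# sketch of an agent loop; endpoints and payload fields are assumptions.
import requests

BASE_URL = "http://localhost:8000"  # assumed local deployment

def run_episode(task: str = "single_stock", seed: int = 42) -> float:
    """play one episode with a trivial hold-only policy."""
    obs = requests.post(f"{BASE_URL}/reset", json={"task": task, "seed": seed}).json()
    total_reward = 0.0
    while not obs.get("done", False):
        # a real agent would feed obs["market_summary"] to an llm here;
        # we just hold to show the shape of the protocol.
        action = "hold"
        obs = requests.post(f"{BASE_URL}/step", json={"action": action}).json()
        total_reward += obs.get("reward", 0.0)
    return total_reward

if __name__ == "__main__":
    print("episode reward:", run_episode())
```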

/ how it works

  • market simulator replays historical price windows with a 50-day lookback buffer for indicator computation
  • feature engine computes rsi, macd, bollinger bands, volume spikes, trend, momentum, and volatility — served as human-readable text summaries for llm agents
  • step-level reward shaping: a pnl reward plus a discipline bonus, minus penalties for regime gate and trade limit violations (sketched after this list)
  • task-specific graders score the full trajectory on sharpe ratio, discipline, regime compliance, and risk management
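a rough sketch of the step-level shaping described above; the weights and step fields are illustrative, only the four components come from the list:

```python
# illustrative reward shaping; field names and weights are assumptions.
def shape_reward(step: dict) -> float:
    reward = step["pnl"]                        # realized + unrealized pnl for the day
    if step["followed_plan"]:                   # hypothetical discipline signal
        reward += 0.1                           # discipline bonus
    if step["traded_against_regime_gate"]:      # e.g. buying into a gated downtrend
        reward -= 0.5                           # regime gate penalty
    if step["trades_today"] > step["trade_limit"]:
        reward -= 1.0                           # trade limit violation
    return reward
```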

/ why it matters — rlvr & grpo

  • the grading system is designed as a verifiable reward function in the rlvr sense: deterministic scores that replace a learned reward model
  • this enables grpo-based training: generate a group of rollouts through the environment, score each with the grader, and update model weights toward the trajectories that beat the group average (see the sketch after this list)
  • no separate reward model or critic is needed; the environment's graders are the reward signal
  • the trader agent project provides the target policy model; the goal is to train it on this environment's verifiable rewards to improve its trading decisions
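the core grpo step, sketched in plain python: score a group of rollouts with the grader, then normalize the scores within the group to get advantages, with no learned critic involved:

```python
# grpo-style group-relative advantages from deterministic grader scores.
import statistics

def group_advantages(grader_scores: list[float]) -> list[float]:
    """(score - group mean) / group std for each rollout in the group."""
    mean = statistics.mean(grader_scores)
    std = statistics.pstdev(grader_scores) or 1.0  # guard against zero spread
    return [(s - mean) / std for s in grader_scores]

# e.g. four rollouts sampled from the same seeded episode:
print(group_advantages([0.82, 0.47, 0.91, 0.55]))
# trajectories above the group mean get positive advantage and are
# reinforced; those below get pushed down.
```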

/ what's next (in progress)

  • integrating with unsloth/trl for grpo-based rl training of the trader agent (a sketch of the trl side follows this list)
  • vllm deployment for inference optimization during rollout generation
  • data pipeline for collecting and processing thousands of training rollouts
  • the environment (phase 1) is complete — now building the training loop (phase 2)
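a sketch of what the planned trl integration could look like: wrap the environment's grader as a grpo reward callable. `run_and_grade`, the model id, and the one-row dataset below are placeholders; the real reward function would replay each completion through the environment:

```python
# sketch of a trl grpo loop driven by the environment's grader.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def run_and_grade(prompt: str, completion: str) -> float:
    """stub: replay the completion's trade actions through the environment
    and return the grader's trajectory score. real version calls the env."""
    return 0.0

def env_reward(prompts, completions, **kwargs):
    # trl expects one float per completion
    return [run_and_grade(p, c) for p, c in zip(prompts, completions)]

train_dataset = Dataset.from_list([{"prompt": "day 1 market summary: ..."}])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder policy model
    reward_funcs=env_reward,
    args=GRPOConfig(output_dir="grpo-trader", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```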

/ episode flow

01 agent connects via http/websocket and resets the environment with a task and seed
02 environment selects a random market window and returns the initial observation
03 agent reads the market summary with technical indicators and submits a trade action
04 environment executes the trade with realistic costs/slippage, computes the reward, and advances to the next day
05 at episode end, the grader scores the full trajectory on task-specific criteria (an illustrative sharpe grader is sketched below)
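an illustrative end-of-episode grader showing just the sharpe component; the annualization and clipping constants are assumptions, and the project's actual graders also weigh discipline, regime compliance, and risk management:

```python
# sharpe-based trajectory score clipped to [0, 1]; constants are illustrative.
import math

def sharpe_score(daily_returns: list[float], risk_free: float = 0.0) -> float:
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / n
    std = math.sqrt(var) or 1e-9                        # avoid divide-by-zero
    sharpe = (mean - risk_free) / std * math.sqrt(252)  # annualized
    return max(0.0, min(1.0, sharpe / 3.0))  # sharpe of 3+ maps to a perfect score

print(sharpe_score([0.01, -0.004, 0.007, 0.002, -0.001]))
```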

/ features

meta pytorch hackathon finale
qualified for the finale (top 800 out of 32,000+ teams). built and presented the environment to meta engineers in bangalore, april 2026.
three difficulty tiers
single stock (easy, 20 days), portfolio (medium, 30 days), and full autonomous (hard, 40 days) with escalating constraints — transaction costs, slippage, position limits, trade caps, and regime gates.
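one way the escalating constraints could be encoded; only the episode lengths (20/30/40 days) come from the description above, every limit value below is made up for illustration:

```python
# hypothetical tier configs; all numeric limits are placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    episode_days: int
    transaction_cost_bps: float   # cost per trade, basis points
    slippage_bps: float
    max_position_pct: float       # cap on single-position exposure
    max_trades_per_day: int
    regime_gate: bool             # block counter-regime entries

TIERS = {
    "single_stock":    TierConfig(20,  5.0,  2.0, 1.00, 3, False),
    "portfolio":       TierConfig(30, 10.0,  5.0, 0.40, 5, True),
    "full_autonomous": TierConfig(40, 10.0, 10.0, 0.25, 5, True),
}
```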
verifiable reward design (rlvr)
deterministic grading functions that score agents on sharpe ratio, discipline, regime compliance, and risk management. designed to serve as verifiable rewards for grpo-based rl training — no separate reward model needed.
real market data & technical analysis
68 nifty stocks with ~5 years of daily ohlcv data. the feature engine computes rsi, macd, bollinger bands, volume spikes, trend, momentum, and volatility from a 50-day lookback buffer.
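for a concrete taste of the indicator math, a minimal rsi over closing prices (simple-average variant; the project may use wilder smoothing instead):

```python
# minimal rsi; simple averages rather than wilder smoothing.
def rsi(closes: list[float], period: int = 14) -> float:
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closes")
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(d, 0.0) for d in deltas[-period:]]
    losses = [max(-d, 0.0) for d in deltas[-period:]]
    avg_gain = sum(gains) / period
    avg_loss = sum(losses) / period
    if avg_loss == 0:
        return 100.0                      # all gains -> max rsi
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```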
llm-native interface
plain-text action space (buy, sell, hold) and human-readable market summaries — any llm can act as an agent without special tooling. invalid actions gracefully default to hold.
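the forgiving parser implied here can be tiny; a sketch:

```python
# extract a valid verb from free-form llm output; default to "hold".
import re

def parse_action(raw: str) -> str:
    match = re.search(r"\b(buy|sell|hold)\b", raw.lower())
    return match.group(1) if match else "hold"

assert parse_action("I would BUY here given the macd crossover") == "buy"
assert parse_action("unsure, maybe wait?") == "hold"  # invalid -> hold
```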
seed-reproducible episodes
fully deterministic episodes for reproducible evaluation. same seed produces same market window and sequence.
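seed-reproducibility usually comes down to deriving everything from one local rng; a sketch of deterministic window selection (tickers and sizes are placeholders):

```python
# same seed -> same stock and same start index, so episodes replay identically.
import random

def pick_window(seed: int, symbols: list[str], n_days: int,
                episode_len: int, lookback: int = 50) -> tuple[str, int]:
    rng = random.Random(seed)   # local rng: no shared global state
    symbol = rng.choice(symbols)
    start = rng.randrange(lookback, n_days - episode_len)
    return symbol, start

assert pick_window(42, ["RELIANCE", "TCS", "INFY"], 1250, 20) == \
       pick_window(42, ["RELIANCE", "TCS", "INFY"], 1250, 20)
```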