
stock trader rl environment

#python #fastapi #pydantic #docker #websocket #openenv #hugging face spaces

openenv-compliant reinforcement learning environment for evaluating llm trading agents on indian equity markets. qualified for the meta pytorch openenv hackathon finale (top 800 out of 32,000+ teams).

/ what it does

  • simulates daily stock trading on 68 nifty stocks using ~5 years of real historical ohlcv data
  • agents connect via http/websocket, receive market observations with technical indicators, and respond with plain-text trade actions (buy, sell, hold); a minimal client loop is sketched after this list
  • three difficulty tiers: single stock (20 days), portfolio (30 days), full autonomous (40 days) — each with escalating constraints like transaction costs, slippage, position limits, and regime gates
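a minimal agent loop over the environment's http api, assuming a local deployment. the endpoint paths ("/reset", "/step") and payload fields here are illustrative guesses, not the project's confirmed contract:

```python
# sketch of an agent loop; endpoints and payload fields are assumptions.
import requests

BASE_URL = "http://localhost:8000"  # assumed local deployment

def run_episode(task: str = "single_stock", seed: int = 42) -> float:
    """play one episode with a trivial hold-only policy."""
    obs = requests.post(f"{BASE_URL}/reset", json={"task": task, "seed": seed}).json()
    total_reward = 0.0
    while not obs.get("done", False):
        # a real agent would feed obs["market_summary"] to an llm here;
        # we just hold to show the shape of the protocol.
        action = "hold"
        obs = requests.post(f"{BASE_URL}/step", json={"action": action}).json()
        total_reward += obs.get("reward", 0.0)
    return total_reward

if __name__ == "__main__":
    print("episode reward:", run_episode())
```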

/ how it works

  • market simulator replays historical price windows with a 50-day lookback buffer for indicator computation
  • feature engine computes rsi, macd, bollinger bands, volume spikes, trend, momentum, and volatility — served as human-readable text summaries for llm agents
  • step-level reward shaping: a pnl reward plus a discipline bonus, minus penalties for regime gate and trade limit violations (sketched after this list)
  • task-specific graders score the full trajectory on sharpe ratio, discipline, regime compliance, and risk management
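a rough sketch of the step-level shaping described above; the weights and step fields are illustrative, only the four components come from the list:

```python
# illustrative reward shaping; field names and weights are assumptions.
def shape_reward(step: dict) -> float:
    reward = step["pnl"]                        # realized + unrealized pnl for the day
    if step["followed_plan"]:                   # hypothetical discipline signal
        reward += 0.1                           # discipline bonus
    if step["traded_against_regime_gate"]:      # e.g. buying into a gated downtrend
        reward -= 0.5                           # regime gate penalty
    if step["trades_today"] > step["trade_limit"]:
        reward -= 1.0                           # trade limit violation
    return reward
```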

/ why it matters — rlvr & grpo

  • the grading system is designed as a verifiable reward function in the rlvr sense: deterministic scores that replace a learned reward model
  • this enables grpo-based training: generate a group of rollouts through the environment, score each with the grader, and update model weights toward the trajectories that beat the group average (see the sketch after this list)
  • no separate reward model or critic is needed; the environment's graders are the reward signal
  • the trader agent project provides the target policy model; the goal is to train it on this environment's verifiable rewards to improve its trading decisions
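the core grpo step, sketched in plain python: score a group of rollouts with the grader, then normalize the scores within the group to get advantages, with no learned critic involved:

```python
# grpo-style group-relative advantages from deterministic grader scores.
import statistics

def group_advantages(grader_scores: list[float]) -> list[float]:
    """(score - group mean) / group std for each rollout in the group."""
    mean = statistics.mean(grader_scores)
    std = statistics.pstdev(grader_scores) or 1.0  # guard against zero spread
    return [(s - mean) / std for s in grader_scores]

# e.g. four rollouts sampled from the same seeded episode:
print(group_advantages([0.82, 0.47, 0.91, 0.55]))
# trajectories above the group mean get positive advantage and are
# reinforced; those below get pushed down.
```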

/ what's next (in progress)

  • integrating with unsloth/trl for grpo-based rl training of the trader agent (a sketch of the trl side follows this list)
  • vllm deployment for inference optimization during rollout generation
  • data pipeline for collecting and processing thousands of training rollouts
  • the environment (phase 1) is complete — now building the training loop (phase 2)
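a sketch of what the planned trl integration could look like: wrap the environment's grader as a grpo reward callable. `run_and_grade`, the model id, and the one-row dataset below are placeholders; the real reward function would replay each completion through the environment:

```python
# sketch of a trl grpo loop driven by the environment's grader.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

def run_and_grade(prompt: str, completion: str) -> float:
    """stub: replay the completion's trade actions through the environment
    and return the grader's trajectory score. real version calls the env."""
    return 0.0

def env_reward(prompts, completions, **kwargs):
    # trl expects one float per completion
    return [run_and_grade(p, c) for p, c in zip(prompts, completions)]

train_dataset = Dataset.from_list([{"prompt": "day 1 market summary: ..."}])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # placeholder policy model
    reward_funcs=env_reward,
    args=GRPOConfig(output_dir="grpo-trader", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```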

/ episode flow

01 agent connects via http/websocket and resets the environment with a task and seed
02 environment selects a random market window and returns the initial observation
03 agent reads the market summary with technical indicators and submits a trade action
04 environment executes the trade with realistic costs/slippage, computes the reward, and advances to the next day
05 at episode end, the grader scores the full trajectory on task-specific criteria (an illustrative sharpe grader is sketched below)
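an illustrative end-of-episode grader showing just the sharpe component; the annualization and clipping constants are assumptions, and the project's actual graders also weigh discipline, regime compliance, and risk management:

```python
# sharpe-based trajectory score clipped to [0, 1]; constants are illustrative.
import math

def sharpe_score(daily_returns: list[float], risk_free: float = 0.0) -> float:
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / n
    std = math.sqrt(var) or 1e-9                        # avoid divide-by-zero
    sharpe = (mean - risk_free) / std * math.sqrt(252)  # annualized
    return max(0.0, min(1.0, sharpe / 3.0))  # sharpe of 3+ maps to a perfect score

print(sharpe_score([0.01, -0.004, 0.007, 0.002, -0.001]))
```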

/ features

meta pytorch hackathon finale
qualified for the finale (top 800 out of 32,000+ teams). built and presented the environment to meta engineers in bangalore, april 2026.
three difficulty tiers
single stock (easy, 20 days), portfolio (medium, 30 days), and full autonomous (hard, 40 days) with escalating constraints — transaction costs, slippage, position limits, trade caps, and regime gates.
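one way the escalating constraints could be encoded; only the episode lengths (20/30/40 days) come from the description above, every limit value below is made up for illustration:

```python
# hypothetical tier configs; all numeric limits are placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    episode_days: int
    transaction_cost_bps: float   # cost per trade, basis points
    slippage_bps: float
    max_position_pct: float       # cap on single-position exposure
    max_trades_per_day: int
    regime_gate: bool             # block counter-regime entries

TIERS = {
    "single_stock":    TierConfig(20,  5.0,  2.0, 1.00, 3, False),
    "portfolio":       TierConfig(30, 10.0,  5.0, 0.40, 5, True),
    "full_autonomous": TierConfig(40, 10.0, 10.0, 0.25, 5, True),
}
```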
verifiable reward design (rlvr)
deterministic grading functions that score agents on sharpe ratio, discipline, regime compliance, and risk management. designed to serve as verifiable rewards for grpo-based rl training — no separate reward model needed.
real market data & technical analysis
68 nifty stocks with ~5 years of daily ohlcv data. the feature engine computes rsi, macd, bollinger bands, volume spikes, trend, momentum, and volatility from a 50-day lookback buffer.
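for a concrete taste of the indicator math, a minimal rsi over closing prices (simple-average variant; the project may use wilder smoothing instead):

```python
# minimal rsi; simple averages rather than wilder smoothing.
def rsi(closes: list[float], period: int = 14) -> float:
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closes")
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    gains = [max(d, 0.0) for d in deltas[-period:]]
    losses = [max(-d, 0.0) for d in deltas[-period:]]
    avg_gain = sum(gains) / period
    avg_loss = sum(losses) / period
    if avg_loss == 0:
        return 100.0                      # all gains -> max rsi
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```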
llm-native interface
plain-text action space (buy, sell, hold) and human-readable market summaries — any llm can act as an agent without special tooling. invalid actions gracefully default to hold.
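the forgiving parser implied here can be tiny; a sketch:

```python
# extract a valid verb from free-form llm output; default to "hold".
import re

def parse_action(raw: str) -> str:
    match = re.search(r"\b(buy|sell|hold)\b", raw.lower())
    return match.group(1) if match else "hold"

assert parse_action("I would BUY here given the macd crossover") == "buy"
assert parse_action("unsure, maybe wait?") == "hold"  # invalid -> hold
```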
seed-reproducible episodes
fully deterministic episodes for reproducible evaluation. same seed produces same market window and sequence.
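seed-reproducibility usually comes down to deriving everything from one local rng; a sketch of deterministic window selection (tickers and sizes are placeholders):

```python
# same seed -> same stock and same start index, so episodes replay identically.
import random

def pick_window(seed: int, symbols: list[str], n_days: int,
                episode_len: int, lookback: int = 50) -> tuple[str, int]:
    rng = random.Random(seed)   # local rng: no shared global state
    symbol = rng.choice(symbols)
    start = rng.randrange(lookback, n_days - episode_len)
    return symbol, start

assert pick_window(42, ["RELIANCE", "TCS", "INFY"], 1250, 20) == \
       pick_window(42, ["RELIANCE", "TCS", "INFY"], 1250, 20)
```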