
Dojo: Crowdsourcing Human Intelligence on Bittensor

11 min read

tl;dr

  • highlights: ~1% of network emissions, backed by YZi Labs, Dojo Interface Coder 7B, SFT and DPO datasets, signature verification for multisig setups, Kami
  • status: archived
  • years: 2024-2025
  • role: architect, product owner, sole developer → tech lead
  • stack: Python, Go, TypeScript, Next.js, React, LiteLLM, Instructor, Langfuse, OpenRouter, Hugging Face, Playwright, typescript-lsp, Postgres, Milvus, Redis, Docker, AWS, PM2, FastAPI, Gin, cohere.ai, SIWS
  • challenges: game theory, incentive mechanism design, anti-gaming mechanisms in an adversarial open-source environment, embracing new AI tooling before it had matured, designing CI/CD flows to prevent vibe-coded slop
  • lessons: build the product first, then decentralize; narrow scope beats broad ambition; without evals, every architecture decision is a judgment call; assume everything will be reverse-engineered in open-source adversarial environments; technical decisions need to be made by the people who understand the engineering
  • links: Synthetic API, Dojo V1, Worker API, messaging, Kami, docs

Origin Story

GPT-3.5 had just dropped. AI hype was peaking, and OpenAI's APIs were outputting around 15-30 tokens per second (TPS). I remember that at the time sweep.dev had figured out how to increase their TPS by load balancing across multiple Azure regions. It was very clever, though obviously not a concern anymore today (2026).

I built two RAG chatbots (Maiko.ai and frenscanner), and finally Tensorplex Stream, which was used alongside our LST product to successfully raise seed funding. But did the world really need another AI chatbot?

That's when my team and I started asking what bigger problems we could tackle. I noticed that everyone was working on compute problems, but few teams were working on data problems - even though high-quality, human-provided data is arguably why models are so capable today.

Another key insight was that most teams on this frontier were using models to evaluate other models, relying on LLM-as-a-judge rather than having humans rate response quality. At the time, LLMs had no eyes, no way of interacting with webpages, and little spatial awareness of web components, so I knew there was a gap to be filled.

This wasn't a straightforward decision. At first we tried to figure out if we could work on reward modeling. I drafted a few designs but couldn't make it work - the incentive mechanisms were too easily gameable, and we wanted to keep humans in the loop (HITL). I landed on data labeling, specifically for code generation, as we believed reasoning abilities in code generation would translate into other domains.

We were drawn to Bittensor because it was permissionless. Anyone with enough TAO tokens could purchase a slot on the network to register a subnet, build for their use case, and be subsidized by on-chain incentives.

During research I looked at Scale AI as a reference point, but they were heavily enterprise-focused and not suited for someone with a smaller dataset wanting to label their data. I looked at other projects like Argilla and Prolific as well, but what if we needed more bespoke solutions? What if we wanted to let users define the type of data they required? What if methods beyond Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO) were to arise?

What Dojo Is and How It Works

Dojo is a platform that crowdsources human intelligence for AI/ML development on Bittensor. For context on how Bittensor works, their docs explain subnet architecture and incentive mechanisms well. The key thing to understand is that incentive mechanism design is the most important part of a subnet because participants (miners, validators) are businesses that aim to extract maximum value while minimizing effort.

Participants and Roles

  • Validators generate synthetic code interfaces for labellers to rate, validate the outputs, and assign scores to each miner for incentive distribution and penalties.
  • Miners host the decentralized platform and industrialize their operations by finding labellers through any means - their goal is to maximize incentive with the least amount of work possible.
  • Labellers (sometimes the same entity as a miner) label data by any means available.

Technical Approach and Being Ahead of the Curve

Our work was ahead of its time. I looked at how I was coding day-to-day and tried to incorporate those methods into the pipeline: defining the problem, researching tech stack, drafting designs, writing code, checking the language server protocol (LSP) for diagnostics, then iterating based on that feedback.

That's essentially what Cursor and Claude Code are doing now, using LSP diagnostics and chain-of-thought (CoT) reasoning to autonomously fix errors. Early on we integrated typescript-lsp, linted synthetically generated code, then performed CoT reasoning to fix issues - a feedback loop between code generation and static analysis, with execution via a code sandbox in between.
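The generate → lint → fix loop can be sketched as follows. Everything here is a toy stand-in: `generate_code`, `get_diagnostics`, and `fix_with_cot` are hypothetical placeholders for the real LLM call, the typescript-lsp diagnostics pass, and the CoT repair prompt.

```python
# Toy sketch of the generate -> lint -> CoT-fix feedback loop.
# All three helpers are hypothetical stand-ins, not the real pipeline.

MAX_ITERATIONS = 3

def generate_code(prompt: str) -> str:
    # Stand-in for the LLM generation call.
    return "var count = 0\nconsole.log(count)"

def get_diagnostics(code: str) -> list[str]:
    # Stand-in for typescript-lsp diagnostics; here we just flag `var` usage.
    return [
        f"line {i}: prefer 'const'/'let' over 'var'"
        for i, line in enumerate(code.splitlines(), 1)
        if line.strip().startswith("var ")
    ]

def fix_with_cot(code: str, diagnostics: list[str]) -> str:
    # Stand-in for the chain-of-thought repair prompt fed back to the LLM.
    return code.replace("var ", "let ")

def generate_with_feedback(prompt: str) -> tuple[str, list[str]]:
    """Generate code, then loop lint -> fix until clean or budget exhausted."""
    code = generate_code(prompt)
    for _ in range(MAX_ITERATIONS):
        diagnostics = get_diagnostics(code)
        if not diagnostics:
            break
        code = fix_with_cot(code, diagnostics)
    return code, get_diagnostics(code)

code, remaining = generate_with_feedback("a click counter component")
print(remaining)  # -> []
```

The real loop also routed the candidate through a sandbox run between generation and static analysis; that execution step is omitted here.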

We focused on generating frontend interfaces because we wanted to give human labellers something interactive to evaluate. It was also easy for humans but hard for LLMs - a deliberate anti-gaming mechanism. This was late 2023/2024 - the approach was to take generated code and run it through a sandbox to actually render and test it.

We started with codesandbox.io, which was completely free. We were sending so many requests that they started adding rate-limiting and using redirects. So we moved to e2b.dev, an agent sandbox, but it was limited to Jupyter kernels at the time. Eventually we gave up on both and built our own sandbox so we could control the security policies on our frontend.

It's not surprising we got rate-limited. We were sending out a task every 15 minutes. Each task went to roughly 240 miners on the network (256 maximum participants per subnet, shared between validators and miners). Each miner produced 4 code outputs. That's 960 outputs per task - and every validator was generating its own task. With ~10 validators, that's around 9,600 outputs every 15 minutes, or roughly 640 sandbox calls per minute. No free sandbox tier was going to survive that.
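The arithmetic behind those numbers is straightforward:

```python
# Back-of-the-envelope load estimate from the figures above.
miners = 240            # active miners on the subnet
outputs_per_miner = 4   # code outputs per miner per task
validators = 10         # each validator generates its own task
task_interval_min = 15  # one task per validator every 15 minutes

outputs_per_task = miners * outputs_per_miner          # 960
outputs_per_interval = outputs_per_task * validators   # 9,600 every 15 min
sandbox_calls_per_min = outputs_per_interval / task_interval_min  # 640.0

print(outputs_per_task, outputs_per_interval, sandbox_calls_per_min)
```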

Release Phases

  1. Bootstrap: Synthetically generated problems using SOTA methods (PersonaHub + data augmentation in the negative direction), 1 validator running (our own), a few miners running, low QPS on a centralized platform we built.
  2. Scale: All validators onboarded and running, plenty of labelling tasks for miners, all miner slots filled, max QPS on centralized platform.
  3. Decentralize: Distributed hosting by miners, rewrite messaging layer to facilitate higher throughput.
  4. Cross-subnet: Use validators as webservers to interface with other subnets, allowing cross-subnet validation.

The Hard Parts

The most important and most difficult part of designing the subnet was the incentive mechanism. With core Bittensor issues outside our control (like weight copying), we had to steer participants in the right direction while assuming they would find exploits and game our mechanisms no matter what anti-gaming measures we put in place.

The fundamental questions were:

  • How do you get participants to do useful work honestly?
  • How do you get validators to generate useful data instead of duplicate synthetic code interfaces?
  • How do you verify that work?
  • How do you create a feedback loop to verify the quality and integrity of the data?

Sybil Attacks

Registration was permissionless, so miners could register multiple wallet addresses and perform a many-to-one mapping: compute the answers via LLMs (or do the honest work) once, then share the answer with every other one of their machines. Each slot costs a certain number of TAO tokens but earns its own emissions, so doing the work once and duplicating it across slots minimized compute costs and maximized ROI - miners/labellers are inherently lazy.

In other subnets you might detect sybil attacks via a combination of IP, port, and wallet coldkey (where a coldkey has associated child hotkeys). But the dimensionality of the data we received from miners - four floating point values - made it impossible to differentiate them, and this was a fundamental flaw in the design of the subnet.
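A typical first-pass sybil heuristic, sketched below with hypothetical field names, groups registrations by shared coldkey and flags clusters. The catch, per the above: this only works when identity metadata leaks; our actual per-response signal of four floats carried nowhere near enough entropy to fingerprint a miner.

```python
from collections import defaultdict

# Hypothetical registration records; field names are illustrative only.
registrations = [
    {"hotkey": "hk1", "coldkey": "ck_A", "ip": "1.2.3.4"},
    {"hotkey": "hk2", "coldkey": "ck_A", "ip": "1.2.3.4"},
    {"hotkey": "hk3", "coldkey": "ck_A", "ip": "1.2.3.5"},
    {"hotkey": "hk4", "coldkey": "ck_B", "ip": "9.8.7.6"},
]

def flag_sybil_clusters(regs, max_hotkeys_per_coldkey=1):
    """Group hotkeys by coldkey and flag coldkeys controlling many slots."""
    by_coldkey = defaultdict(list)
    for r in regs:
        by_coldkey[r["coldkey"]].append(r["hotkey"])
    return {ck: hks for ck, hks in by_coldkey.items()
            if len(hks) > max_hotkeys_per_coldkey}

flagged = flag_sybil_clusters(registrations)
print(flagged)  # -> {'ck_A': ['hk1', 'hk2', 'hk3']}
```

Four floats from a colluding cluster look identical to four floats from honest agreement, so nothing like this was available at the response layer.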

Synthetic Code Generation & Ordinal Ranking

The synthetic API output four interactive code samples, with prompts used to synthetically generate a base output and then augment it to be worse or better, producing a ground-truth ordinal ranking. This was extremely difficult - it was hard to calibrate the augmentations so that the quality differences were consistent without being so obvious that raters (humans or LLMs) could immediately figure out the ordering.
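Once the augmentation pipeline gives you a ground-truth ordering, a labeller's submitted ranking can be scored against it with a rank correlation. The exact scoring formula we used isn't reproduced here; Spearman's rho is the standard choice for comparing two orderings of four items, so this sketch uses that:

```python
def spearman_rho(ground_truth: list[int], submitted: list[int]) -> float:
    """Spearman rank correlation for two permutations of 0..n-1.

    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    """
    n = len(ground_truth)
    assert sorted(ground_truth) == sorted(submitted) == list(range(n))
    d2 = sum((g - s) ** 2 for g, s in zip(ground_truth, submitted))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Ground truth from the augmentation pipeline: output 0 is best ... 3 is worst.
truth = [0, 1, 2, 3]
print(spearman_rho(truth, [0, 1, 2, 3]))  # perfect agreement -> 1.0
print(spearman_rho(truth, [3, 2, 1, 0]))  # fully reversed    -> -1.0
print(spearman_rho(truth, [1, 0, 2, 3]))  # adjacent swap     -> 0.8
```

A score near 1.0 indicates honest, attentive labelling; scores near 0 look like random clicking.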

Since everything was open source, anyone could clone the repo and inspect the code. Even if we had compiled our Python code with something like Nuitka, someone could still monitor outbound API calls to read the OpenRouter requests.

The Decentralization Paradox

You have no way of stopping sybil attacks unless you centralize your platform - but if you centralize, it goes against the principles of Bittensor. That tension defined most of our design constraints. The architecture was complex: the design had to cater to both a centralized platform hosted by us and, eventually, a distributed platform hosted by miners, with a correspondingly complex Docker Compose setup.

Core SDK Issues

Bittensor's core layers lacked proper abstractions and flexibility. Substrate is written in Rust, while Bittensor's SDK is written in Python, with Subtensor producing Python bindings via PyO3. On top of that, some teams decided to write their subnets in Go or TypeScript. Documentation was poor and duplicate Substrate definitions were everywhere. It was also frustrating that the underlying py-polkadot-sdk had no async support. So I drew inspiration from LSPs - there had to be a way to abstract the nitty-gritty of Substrate calls behind a RESTful API so that teams could focus on business logic. Hence Kami was born. Additionally, everyone was writing their own messaging layer, including us. I was frustrated by the lack of typing in Bittensor's SDK, as well as the lack of compression (e.g. zstd) for the copious amounts of data being shipped around.
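The shape of the idea behind Kami - typed models over a REST boundary instead of raw Substrate calls - looks roughly like this. The endpoint path, field names, and the `SubnetInfo` model are all hypothetical; only the pattern is the point.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SubnetInfo:
    netuid: int
    max_uids: int
    validators: int

class KamiClient:
    """Typed facade over a (hypothetical) Kami REST endpoint.

    The transport is injected, so business logic never touches Substrate
    internals - it only ever sees typed responses.
    """

    def __init__(self, transport: Callable[[str], dict]):
        self._get = transport

    def subnet_info(self, netuid: int) -> SubnetInfo:
        payload = self._get(f"/subnet/{netuid}")  # hypothetical route
        return SubnetInfo(**payload)

# In tests or offline runs, the transport can be a stub:
fake_transport = lambda path: {"netuid": 52, "max_uids": 256, "validators": 10}
client = KamiClient(fake_transport)
info = client.subnet_info(52)
print(info.max_uids)  # -> 256
```

Because the chain sits behind an HTTP boundary, subnet teams writing in Go or TypeScript get the same abstraction for free - they just need an HTTP client.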

What Happened

Dojo was shipped at like 2am on a Wednesday night. We launched on testnet in August 2024 and hit mainnet on December 12, 2024.

We reached phases 1 through 3. Our subnet ran with ~240 miners and ~10 validators across the 256 available slots. Each miner either labelled tasks manually or industrialized their operations - in total, roughly 3,000 labellers across major miner groups located in China, Russia, Vietnam, and other countries across APAC, the US, and Europe. We collected 3.2 million tasks and distilled them into a 12.5k DPO dataset and a 25k SFT dataset, which were used to fine-tune Dojo Interface Coder 7B. The team was trying to reach phase 4 (cross-subnet integrations), but the cost of daily maintenance was very high, leaving little time for collaboration.

V1 Design Flaws

The core issue with V1 was architectural: why should validators be the ones generating synthetic code? Each validator was spending ~$900/month in OpenRouter credits to generate outputs that miners would then label. A better design would have miners produce the code directly, collect all samples in a vector database to evaluate uniqueness, and use LLMs as a judge to score readability, modularity, and diagnostic error count.
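The uniqueness check in that alternative design could be as simple as a nearest-neighbour similarity test against previously collected samples. A toy sketch, using character trigram vectors as a stand-in for a real embedding model and Milvus:

```python
from collections import Counter
from math import sqrt

def trigram_vector(code: str) -> Counter:
    # Toy embedding: character trigram counts stand in for a real model.
    return Counter(code[i:i + 3] for i in range(len(code) - 2))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def is_unique(candidate: str, corpus: list[str], threshold: float = 0.9) -> bool:
    """Reject submissions too similar to anything already collected."""
    cv = trigram_vector(candidate)
    return all(cosine(cv, trigram_vector(seen)) < threshold for seen in corpus)

corpus = ["function Counter() { return <button>click</button> }"]
print(is_unique("function Counter() { return <button>click</button> }", corpus))  # False
print(is_unique("const App = () => <input placeholder='name' />", corpus))        # True
```

In production this would be an approximate nearest-neighbour query against the vector database rather than a linear scan, but the gating logic is the same.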

V2 and Why It Stalled

I designed V2 with several ideas I wanted to push forward: centralizing the task API so we were the sole authority (eliminating code injection and prompt attacks that polluted the dataset), restructuring the miner/validator mechanisms, and writing our own messaging layer for higher throughput. But these proposals were consistently blocked.

The team structure was nominally flat, but in practice decision-making authority sat with people removed from the day-to-day engineering reality of the subnet. My role had evolved from sole developer to architect and tech lead - I was designing the systems, consulting the team, and driving decisions forward - but I didn't have the corresponding authority to ship them. I'd gather input from engineers, collate concerns, and present proposals in team meetings, but the proposals were consistently deprioritized. Eventually, the gap between the decisions being made and the engineering reality became too wide.

After I left Tensorplex in August 2025, the V2 design was approved. However, the team decided to stop the development of Dojo.

Reflections

In hindsight, we were tackling a problem that was too big. Our goal was to support every modality and task (text, audio, video, 3D assets) - we were working with one subnet that output 3D assets as LiDAR point clouds for the Unity marketplace, and another that did bounding box tracking for soccer prediction markets. Each modality required its own agreed-upon data format and its own domain expertise. The massive scope made it impossible to do anything well, and we clearly didn't have a domain expert for each of those domains.

For the synthetic data generator, we used CoT. I also experimented with ReWOO, but it was ultimately discarded by the team since we had no evals in place to measure whether performance would increase or decrease with new systems. That itself was a lesson - without evals, you can't make informed decisions about architecture changes, and everything becomes a judgment call.

It was also difficult to coordinate between our research arm and the engineering team. We were constantly trying to integrate new methods like Rich Human Feedback for the image modality, and would need to pioneer new methods of data labelling for each new modality we wanted to support.

I knew going into this that our product was racing against time - before the big players (OpenAI, Anthropic, Google) released features that allowed LLMs to browse the web, and before progress on VLMs closed the gap. Still, it was an extremely challenging and rewarding problem to work on, although it is a pity that the tangible outcomes of Dojo couldn't have been more.

The bigger mistake was that we were looking at the problem and trying to fit it into Bittensor, instead of solving the problem first and then using Bittensor to bootstrap distribution. Build the product, then decentralize - not the other way around.