AI/ML | 2026-04-06 | 12 min read | By Abhishek Nair - Fractional CTO for Deep Tech & AI

I Ran Karpathy's Autoresearch on a $1,299 MacBook -- Here's What Happened

#AI/ML#Autoresearch#Apple Silicon#MLX#Machine Learning#LLM Training#Karpathy#MacBook

Running Karpathy's Autoresearch on an M2 MacBook Pro: Real Numbers, Real Limits

What happens when you run 700-experiment AI research infrastructure on a $1,299 laptop instead of an $80,000 GPU cluster.

12 min read | AI Engineering & ML Experiments


When Andrej Karpathy released autoresearch on March 7, 2026, the premise was irresistible: an autonomous agent that designs, runs, and evaluates ML experiments in a loop. Point it at a problem, go to sleep, wake up to a better model. He ran 700 experiments on an H100 in two days, found 20 improvements, and squeezed out an 11% speedup. Shopify CEO Tobi Lutke ran it overnight on company data and got a 19% gain. The repo crossed 21K GitHub stars before most people had finished reading the README.

I don't have an H100. I have an M2 MacBook Pro with 16 GB of unified memory. So I did what any reasonable engineer would do: I ran the experiments anyway and measured everything.

This post is the full accounting. Two community forks, five stability runs, an agent loop with four experiments, and a cost analysis that explains why this matters even at ~96x slower than an H100. Every number here comes from actual runs on my machine.


🧪 The Setup: Two Forks, Two Approaches

Autoresearch's original codebase targets CUDA. It won't run on Apple Silicon out of the box. But within days of release, the community produced two macOS forks, each taking a different approach to the portability problem.

MLX Fork (thenamangoyal/autoresearch)

This fork rewrites the training loop in MLX, Apple's native ML framework for Apple Silicon. MLX operates directly on the unified memory architecture -- no driver translation layer, no CPU fallback. The model architecture is a compact 11.5M parameter network: depth 4, 256 embedding dimensions, 2 attention heads, using SSSL attention (a sliding-window variant optimized for small models).

MPS Fork (miolini/autoresearch-macos)

This fork keeps PyTorch but routes computation through Apple's Metal Performance Shaders (MPS) backend. It uses a different architecture: full-context attention (labeled "L"), Muon + AdamW optimizer, and trains in float32. The model is architecturally different enough that direct performance comparisons between the forks are misleading -- which is itself an interesting finding.

Getting started with the MLX fork takes about two minutes:

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
git clone https://github.com/thenamangoyal/autoresearch.git && cd autoresearch
uv sync
uv run prepare.py
uv run train.py
```

That's it. No CUDA toolkit, no driver debugging, no Docker containers. uv resolves dependencies, prepare.py downloads and tokenizes the dataset, and train.py runs a full training loop. The entire setup-to-first-result pipeline is under five minutes.
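Before kicking off a long run, it's worth confirming that MLX actually imported and picked up a device. A minimal sanity check (my own snippet, not part of the fork; `mx.default_device()` is the mlx.core call that reports the active device):

```python
def describe_device() -> str:
    """Report which device MLX will use, or a hint if it isn't installed."""
    try:
        import mlx.core as mx  # provided by the fork's `uv sync`
        return f"MLX default device: {mx.default_device()}"
    except ImportError:
        return "MLX not installed -- run `uv sync` inside the repo first"

print(describe_device())
```

On Apple Silicon this should report a GPU device; if it falls back or errors, fix the environment before trusting any throughput numbers.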


🔬 Head-to-Head: MLX vs MPS on the Same Hardware

I ran both forks on the same machine under identical conditions. Here's what the numbers look like.

| Metric | MLX Fork | MPS Fork |
| --- | --- | --- |
| val_bpb | 2.352 | 1.773* |
| tok/sec (steady state) | ~17,000-18,000 | ~14,000-15,000 |
| Steps in 5 min | 81 | 76 |
| Peak memory | 11,024 MB (~10.8 GB) | Similar |
| Wall clock (total) | 5m 40s | 13m 27s |
| Training time | 302s | 302s |
| Eval time | 30s | 505s |
| Compile overhead | 4s | N/A |
| MPS CPU fallback warnings | N/A | 0 |
| Precision | Mixed | float32 |
| Attention type | SSSL (sliding window) | Full-context "L" |
| Optimizer | Default | Muon + AdamW |

*The MPS fork's lower val_bpb is NOT directly comparable -- it uses a fundamentally different architecture, optimizer, and attention mechanism. Comparing these two numbers would be like comparing lap times between a go-kart and a sedan on different tracks.

What the Numbers Tell Us

Throughput. MLX is about 20% faster on raw token throughput (17-18K vs 14-15K tok/sec). This makes sense -- MLX talks directly to Apple Silicon without the MPS translation layer. No CPU fallback warnings on either fork, which means both are genuinely GPU-accelerated.

Wall clock. The MLX fork finishes in 5m 40s. The MPS fork takes 13m 27s. The training phase is identical (302s), but evaluation is where they diverge dramatically: 30 seconds for MLX versus 505 seconds for MPS. The MPS fork's evaluation pipeline is the bottleneck -- 16x slower than MLX's. For an agent loop that runs dozens of experiments, this gap compounds. MLX gives you roughly 2.4x more experiments per hour.
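The 2.4x figure falls straight out of the wall-clock numbers; a few lines of arithmetic make it concrete:

```python
# Wall-clock per experiment, from the measured runs above.
mlx_wall_s = 5 * 60 + 40    # 340 s total (302 s train + 30 s eval + overhead)
mps_wall_s = 13 * 60 + 27   # 807 s total (302 s train + 505 s eval)

mlx_per_hour = 3600 / mlx_wall_s   # ~10.6 experiments/hour
mps_per_hour = 3600 / mps_wall_s   # ~4.5 experiments/hour

print(f"MLX: {mlx_per_hour:.1f}/h, MPS: {mps_per_hour:.1f}/h, "
      f"ratio: {mlx_per_hour / mps_per_hour:.2f}x")
```

For a single run the eval-time gap is an annoyance; over an overnight agent session it is the difference between roughly 85 and 35 completed experiments.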

Memory. At 10.8 GB peak, the MLX fork consumes most of the 16 GB unified memory. This is tight. You won't be running a browser with 40 tabs, Slack, and Spotify alongside this. Close everything nonessential before running.

Think of the MLX fork like a Formula E car -- purpose-built for the track it's racing on. The MPS fork is more like running stock PyTorch through a compatibility shim. Both work. One is native.


๐Ÿ” Stability: Can You Trust Overnight Runs?

Before handing control to an agent for a full night, I needed to answer one question: will this thing crash at 3 AM?

I ran the MLX fork five consecutive times. Same config, same seed conditions, back-to-back. Here are the results.

| Run | val_bpb | Peak Memory | Wall Clock |
| --- | --- | --- | --- |
| 1 | 2.291 | 11,024 MB | 5m 37s |
| 2 | 2.355 | 11,024 MB | 5m 38s |
| 3 | 2.320 | 11,024 MB | 5m 36s |
| 4 | 2.347 | 11,024 MB | 5m 37s |
| 5 | 2.308 | 11,024 MB | 5m 38s |

val_bpb spread: 0.064 (2.291 to 2.355)
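The spread doubles as a noise floor for everything later in this post: any single-run val_bpb difference well inside it should be treated with suspicion. Computing the summary statistics from the five runs:

```python
import statistics

# val_bpb from the five stability runs above
runs = [2.291, 2.355, 2.320, 2.347, 2.308]

spread = max(runs) - min(runs)   # 0.064
mean = statistics.mean(runs)     # ~2.324
stdev = statistics.pstdev(runs)  # ~0.024 (population std dev)

print(f"spread={spread:.3f} mean={mean:.3f} stdev={stdev:.3f}")
```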

Memory: constant 11,024 MB across all runs. No leak. The allocator is well-behaved -- memory goes up during training and comes back down cleanly. This matters enormously for overnight runs. A slow memory leak that adds 100 MB per experiment would crash the machine after 50 runs.

Wall clock: constant ~5m 37s. No thermal throttling. The M2 ran at consistent speed across all five runs. This surprised me -- sustained compute workloads on laptops often show degradation as the chassis heats up. The M2's thermal design handles this workload without throttling.

Zero crashes. Zero NaN values. Every run completed cleanly and produced valid metrics.

The verdict: the MLX fork is stable enough for unattended overnight runs. I'd estimate you can safely chain 100+ experiments without intervention. The only constraint is power -- plug in the charger.
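Chaining runs unattended is just a loop around the training command with a failure budget so one bad run at 3 AM doesn't burn the whole night. A minimal sketch (my own wrapper, not part of either fork; `run_chain` and its arguments are hypothetical names):

```python
import subprocess
import sys

def run_chain(cmd, n_runs, max_failures=3):
    """Run `cmd` back-to-back n_runs times, stopping early once the
    failure budget is exhausted. Returns (completed, failed) counts."""
    completed = failed = 0
    for _ in range(n_runs):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            completed += 1
        else:
            failed += 1
            if failed >= max_failures:
                break
    return completed, failed

# Trivial demo command; swap in ["uv", "run", "train.py"] for a real chain.
print(run_chain([sys.executable, "-c", "pass"], 3))
```

The failure budget matters more than the loop: with zero crashes in five runs you expect clean nights, but a guard against a repeatedly failing config costs nothing.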


🤖 The Agent Loop: Four Experiments, One Question

The point of autoresearch isn't running a single training job -- it's having an agent propose modifications, run experiments, evaluate results, and decide what to keep. I set up a four-experiment loop using the MLX fork to test whether a Mac can meaningfully participate in this optimization cycle.

Baseline: val_bpb = 2.559

The agent generated four hypotheses:

Experiment 1: Full-Context Attention "L"

Hypothesis: Replacing SSSL sliding-window attention with full-context attention might capture longer-range dependencies.

Result: val_bpb = 2.582 (worse by 0.023)

Decision: DISCARD. Full-context attention added compute cost without improving quality. At this model size, the sliding-window approach is more parameter-efficient. Interesting result -- the MPS fork uses full-context attention by default, which may explain its different performance profile.

Experiment 2: Higher Learning Rate + Warmup

Hypothesis: Increase matrix LR from 0.04 to 0.06 and add 10% warmup steps.

Result: val_bpb = 2.551 (improved by 0.008)

Decision: KEEP. Small but consistent improvement. The warmup prevents early instability at the higher learning rate. This is the kind of optimization that compounds with other changes.

Experiment 3: SwiGLU Activation

Hypothesis: SwiGLU has shown benefits in larger transformers. Apply it to the feed-forward layers.

Result: val_bpb = 2.563 (worse by 0.004)

Decision: DISCARD. SwiGLU adds parameters in the FFN layer, and at 11.5M total parameters, the model is too small to absorb the cost. The additional expressiveness doesn't compensate for the parameter budget consumed. This is a useful negative result -- it tells you SwiGLU has a minimum effective model size.

Experiment 4: Halve Batch Size

Hypothesis: Reduce batch size from 2^16 to 2^15 to get more gradient updates per epoch.

Result: val_bpb = 2.432 (improved by 0.127)

Decision: KEEP. This is the big win. Halving the batch size gave a 5% improvement in val_bpb -- the single largest gain across all four experiments. The intuition: on a memory-constrained device, smaller batches mean more frequent updates, and the model benefits from the additional optimization steps more than it loses from noisier gradients.
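The "more updates per epoch" intuition is simple arithmetic: with a fixed token budget, halving the batch size doubles the number of optimizer steps. (The token budget below is a hypothetical figure for illustration; the real value depends on the dataset prepare.py produces.)

```python
# Hypothetical fixed token budget per epoch, for illustration only.
token_budget = 2**22                     # ~4.2M tokens

updates_large = token_budget // 2**16    # 64 optimizer steps at batch 65,536
updates_small = token_budget // 2**15    # 128 optimizer steps at batch 32,768

print(updates_large, updates_small)
```

Twice the gradient updates for the same data. The bet the agent made, correctly, is that at this model size the extra steps outweigh the noisier per-step gradient.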

Final Score

2.559 to 2.432 = 5.0% improvement in 4 experiments.

Four experiments, about 25 minutes of compute, and a meaningful improvement. The agent correctly identified two keepers and two discards. The pattern here matters: the Mac finds different optimizations than the H100. Karpathy's cluster experiments optimize for throughput at scale. The Mac experiments naturally surface step-efficiency improvements -- optimizations that squeeze more learning per gradient update rather than more updates per second.
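The keep/discard rule the agent applied reduces to a comparison against the baseline. A minimal sketch using this run's numbers (the labels are mine, not autoresearch's API):

```python
baseline = 2.559  # val_bpb before any modification

# (label, measured val_bpb) for the four experiments above
experiments = [
    ("full_context_attention", 2.582),
    ("higher_lr_plus_warmup", 2.551),
    ("swiglu_ffn", 2.563),
    ("halved_batch_size", 2.432),
]

kept = [name for name, bpb in experiments if bpb < baseline]
best = min(bpb for _, bpb in experiments)

print(kept, best)
```

A real agent loop layers more on top (re-running near-ties, combining keepers, respecting the noise floor from the stability runs), but the core decision is exactly this comparison.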

The Mac and the H100 aren't in competition -- they surface different parts of the optimization landscape.


💰 The Economics: $0 Compute, Real Results

Let's talk about what this costs.

| Approach | Compute Cost | Agent API Cost | Experiments/Hour | Overnight (~8h) |
| --- | --- | --- | --- | --- |
| M2 MacBook Pro | $0 | $2-5 | ~12 | ~100 |
| H100 cloud (Lambda/RunPod) | $16-24 | $2-5 | ~1,100+ | ~9,000+ |
| H100 owned | $80K+ capital | $2-5 | ~1,100+ | ~9,000+ |

The M2 is roughly 96x slower than an H100 on raw throughput. That sounds devastating until you consider the context.

For exploration and prototyping, you don't need 9,000 experiments. You need 50-100 to understand your search space, validate your approach, and identify which directions are worth scaling. The Mac handles that in a single overnight run at zero compute cost.

For education and understanding, there's no substitute for running experiments yourself. Watching the training curves, seeing which modifications work and which don't, building intuition about hyperparameter sensitivity -- you can't get this from reading papers. The Mac makes this free to iterate on.

For cost-sensitive teams, the math is clear. A cloud H100 for 8 hours costs $16-24 before agent API fees. If you're a solo developer, a small startup, or a student, that adds up fast. Running initial explorations on your Mac and only scaling to cloud for final optimization runs is the rational strategy.

It's like prototyping a circuit on a breadboard before ordering a PCB. The breadboard is slower and messier, but it's free and it's on your desk. You don't need a fab to validate your design.

The agent API cost ($2-5 per session) comes from the LLM calls that propose and evaluate experiments. This is the same regardless of hardware. The compute is free. The intelligence costs money. That ratio favors the Mac -- you're paying the same for the agent brain whether it runs experiments in 5 minutes or 5 seconds.
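Putting midpoint figures from the table above into a per-session total makes the comparison concrete (the midpoints are my own rounding of the table's ranges):

```python
# Midpoints of the ranges in the cost table above.
agent_api = 3.5            # $2-5 LLM API cost per session, same on any hardware

mac_session = 0 + agent_api      # $3.50 all-in: compute is free
cloud_session = 20 + agent_api   # $16-24 H100 compute midpoint + same API cost

print(f"Mac: ${mac_session:.2f}, cloud H100: ${cloud_session:.2f}, "
      f"difference: ${cloud_session - mac_session:.2f}/night")
```

A $20/night difference is trivial for a funded team and decisive for a student running explorations several nights a week.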


โš ๏ธ Honest Limitations

I want to be specific about where this setup falls short, because the limitations matter as much as the results.

16 GB is the floor, not comfortable. At 10.8 GB peak memory, you have about 5 GB left for the OS and background processes. The system never swapped during my tests, but I had everything closed. If you're the type to keep 30 Chrome tabs open, you'll hit swap, and training performance will crater. 32 GB or 64 GB unified memory would be significantly more comfortable. The 8 GB M2 models cannot run this at all.

96x slower is real. An H100 churns through experiments at a pace the Mac simply cannot match. For production optimization -- the final push for those last few percentage points of performance -- you need cloud GPUs. The Mac is for exploration, not exploitation.

Small models only. The 11.5M parameter model used here fits in memory with room to breathe. A 100M parameter model would not. Autoresearch's design space is intentionally small-model-focused, which works in the Mac's favor, but don't extrapolate these results to larger architectures.

No multi-GPU scaling. An H100 cluster can distribute work across multiple GPUs. The Mac has one GPU, period. Parallelism isn't available.

Thermal management matters. My five-run stability test showed no throttling, but that's in a climate-controlled room. Running overnight in a warm apartment with the laptop on a soft surface could produce different results. Use a hard, flat surface with good airflow.


🧭 When to Use Your Mac vs. Cloud GPUs

Based on my experiments, here's my honest recommendation:

Use your Mac when:

  • You're exploring a new search space and need 50-100 experiments to build intuition
  • You're learning how autoresearch works and want to iterate without cost pressure
  • You're prototyping experiment configurations before scaling to cloud
  • You're a solo developer or student and $20/night of cloud compute adds up
  • You want overnight runs with zero per-experiment cost

Scale to cloud when:

  • You've identified promising directions and want to run 500+ experiments to converge
  • You need models larger than ~15M parameters
  • You're optimizing for production and every fraction of a percent matters
  • You're running against a deadline and can't wait for 5-minute experiment cycles

The optimal workflow: prototype on Mac, validate on cloud, iterate on Mac, ship from cloud. Each stage uses hardware that matches the cost-sensitivity and throughput requirements of that phase.


๐Ÿ› ๏ธ Getting Started: The Five-Minute Version

If you have a MacBook Pro with Apple Silicon (M1/M2/M3/M4, 16 GB+ RAM), here's the fastest path to your first result:

```shell
# 1. Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone the MLX fork
git clone https://github.com/thenamangoyal/autoresearch.git
cd autoresearch

# 3. Install dependencies
uv sync

# 4. Download and prepare the dataset
uv run prepare.py

# 5. Run your first training experiment
uv run train.py
```

Expected output: ~81 steps in 5 minutes, val_bpb around 2.35, peak memory ~10.8 GB. Total time from zero to first result: under 10 minutes including downloads.

For the agent loop (autonomous experiment iteration), you'll need to configure an LLM API key and run the agent wrapper. The specifics vary by which agent framework you use, but the core concept is: the agent reads the training output, proposes a modification to the config, runs another training loop, and evaluates whether the modification helped.
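The "agent reads the training output" step usually means scraping the final metric out of the log. A small sketch (the log format here is an assumption; adjust the pattern to whatever your fork actually prints):

```python
import re

def last_val_bpb(log_text: str):
    """Pull the final reported val_bpb out of a training log.
    Assumes lines like 'step 81 | val_bpb: 2.352' -- the exact
    format varies by fork, so treat this pattern as a starting point."""
    matches = re.findall(r"val_bpb[:=\s]+([0-9]+\.[0-9]+)", log_text)
    return float(matches[-1]) if matches else None

sample = "step 40 | val_bpb: 2.410\nstep 81 | val_bpb: 2.352\n"
print(last_val_bpb(sample))  # 2.352
```

Returning `None` on a missing metric (rather than raising) lets the agent loop treat a crashed or malformed run as a discard instead of dying mid-session.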

If you're interested in how autonomous agent loops work at a deeper level -- what's feasible, what fails, and how to design guardrails around them -- I wrote a detailed analysis in The Agent Harness Inflection Point: What's Actually Feasible.


🔮 What This Means for the ML Research Bottleneck

Karpathy's core claim with autoresearch: the bottleneck in ML research is not ideas -- it's experiment throughput. If you can run 700 experiments in two days instead of 10 experiments in a week, you compress the research cycle by orders of magnitude.

The Mac doesn't change the argument. It changes who gets to participate.

Before autoresearch, running ML experiments required either institutional compute access or significant cloud spend. Now, anyone with a 16 GB MacBook can run 100 meaningful experiments overnight for a few dollars in API calls. The experiments are slower, but the insights are just as real.

This connects to a pattern I keep seeing in robotics and ML: the most valuable experiments are rarely the ones that require the most compute. They're the ones that test the right hypothesis. An H100 running 700 poorly-designed experiments produces less insight than a MacBook running 50 well-designed ones. The hardware accelerates execution. The human accelerates understanding.

Autoresearch on a Mac won't win any speed benchmarks. But it makes the experimentation loop accessible to anyone willing to close their Chrome tabs and plug in their charger. That's a meaningful shift.


Where This Lands

The M2 MacBook Pro is a legitimate autoresearch platform. Not the fastest, not the most capable, but stable, free to run, and good enough for exploration and prototyping. The numbers don't lie: 5% improvement in 4 experiments, zero crashes across 5 stability runs, ~12 experiments per hour with no compute cost.

The practical workflow is clear: explore on your Mac, scale to cloud when you've found something worth optimizing. Use both.

If you're evaluating whether autoresearch -- or autonomous agent loops more broadly -- makes sense for your team's ML workflow, I help companies assess exactly this kind of feasibility. And if you need a technical partner to guide the broader AI strategy, from experiments like these to production deployment, that's the fractional CTO engagement.

The practical limit isn't hardware or cost. It's knowing which hypotheses are worth testing. The Mac gets you into the loop fast enough to find out.

  • Agentic AI Assessment: /services/agentic-ai-assessment
  • Fractional CTO: /services/fractional-cto
Abhishek Nair - Fractional CTO for Deep Tech & AI
Robotics & AI Engineer
About & contact