
Meta Muse Spark: First Look at MSL's New Reasoning Model (2026)

Dmytro Chaban
April 8, 2026 · 12 min read

Meta shipped Muse Spark on April 8, 2026. Their first frontier model since Llama 4 tanked a year ago. The first model Meta has ever kept closed-source. And the first real output from Meta Superintelligence Labs (MSL). One day earlier, Anthropic dropped Claude Mythos Preview, their most powerful model ever, and immediately said you can't have it. Two frontier models in 24 hours, and developers can't properly use either one.

Here's what's real, what's still a press release, and what this means if you're building with these models.

Key Takeaways

  • Muse Spark scored 52 on the Artificial Analysis Intelligence Index v4.0. That's 4th place behind Gemini 3.1 Pro (57), GPT-5.4 (57), and Claude Opus 4.6 (53)
  • It leads on health benchmarks (HealthBench Hard: 42.8%) and figure reasoning (CharXiv: 86.4), beating every other frontier model
  • It trails badly on coding (Terminal-Bench 59.0 vs GPT-5.4's 75.1) and abstract reasoning (ARC-AGI-2: 42.5 vs 76+)
  • Access is gated: private API preview only. Free via meta.ai chat with a Facebook or Instagram login
  • Meta's first proprietary model ever. A clean break from Llama's open-weights legacy
  • Claude Mythos Preview dropped one day earlier. Anthropic's most powerful model, withheld for security. Neither model is usable at production scale right now
  • I haven't run hands-on tests yet. This post will be updated once API access opens up

TL;DR: Should You Care Today?

  • If you're building coding agents: No. Claude Opus 4.6 and GPT-5.4 still own this space. Muse Spark is 16 points behind on Terminal-Bench.
  • If you're building health or vision apps: Worth watching closely. Muse Spark genuinely leads HealthBench Hard at 42.8% and MMMU-Pro vision at 80.5%.
  • If you're waiting for a cheaper frontier alternative: Also worth watching. Muse Spark is free via meta.ai today, but no API pricing exists yet.
  • If you want to try it right now: Go to meta.ai, log in with Facebook or Instagram, pick "Thinking" mode. That's all you get.

What Meta Actually Shipped

The Three Modes: Instant, Thinking, Contemplating

Muse Spark runs in three modes on meta.ai:

  • Instant: fast default, quick answers
  • Thinking: visible extended reasoning, similar to Claude's extended thinking or Gemini's thinking mode
  • Contemplating: the headline feature. Multiple parallel reasoning agents competing to produce the best answer. Meta positions this against Gemini Deep Think and GPT-5.4 Pro

Here's the thing: Contemplating mode is rolling out gradually. Most users won't see it on day one. Meta published benchmarks for Contemplating mode (58% on Humanity's Last Exam, 38% on FrontierScience Research), but you can't independently verify those numbers yet.
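Meta hasn't published how Contemplating works under the hood, but "multiple parallel reasoning agents competing" maps onto a familiar best-of-n pattern: sample several candidate answers concurrently, then keep the highest-scoring one. A minimal sketch of that pattern, where `generate` and `score` are hypothetical stand-ins for a model call and a judge (not Meta's actual implementation):

```python
import concurrent.futures

def generate(prompt: str, seed: int) -> str:
    # Stand-in for one reasoning agent's attempt. A real system would
    # call the model with different sampling seeds or temperatures.
    return f"candidate answer {seed} for: {prompt}"

def score(prompt: str, answer: str) -> float:
    # Stand-in judge: a real system might use a verifier model or
    # self-consistency voting to pick a winner.
    return float(len(answer))

def contemplate(prompt: str, n_agents: int = 4) -> str:
    """Run n reasoning attempts in parallel and keep the best-scoring one."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_agents) as pool:
        candidates = list(pool.map(lambda s: generate(prompt, s), range(n_agents)))
    return max(candidates, key=lambda a: score(prompt, a))
```

The design trade-off is visible even in the sketch: n parallel agents means roughly n times the tokens per answer, which is why this mode is positioned against premium tiers like Gemini Deep Think and GPT-5.4 Pro.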

16 Tools and a Full-Stack Rebuild

Simon Willison probed the system and found 16 tools inside meta.ai, including web search, social semantic search, product catalog, image generation, Python execution (3.9.25, which is EOL), visual grounding, sub-agent spawning, and a code interpreter. Credit to Meta for not hiding this. The model disclosed its full tool list when asked.

Muse Spark accepts text, image, and voice input. Meta rebuilt their entire pretraining stack and claims it reaches Llama 4 Maverick capability with over an order of magnitude less compute, using a technique called "thought compression." The internal jump tells the story: Llama 4 Maverick scored 18 on the Artificial Analysis Intelligence Index. Muse Spark scores 52.

Muse Spark Benchmarks: Where It Wins, Where It Loses

Intelligence Index: 4th Place (and Why That's Still Notable)

Artificial Analysis Intelligence Index v4.0

| Model | Score | Rank |
| --- | --- | --- |
| Gemini 3.1 Pro | 57 | #1 |
| GPT-5.4 | 57 | #1 |
| Claude Opus 4.6 | 53 | #3 |
| Muse Spark | 52 | #4 |
| Llama 4 Maverick (2025) | 18 | — |

4th place doesn't sound exciting until you remember Llama 4 Maverick scored 18 and Scout scored 13. The gap between 4th (52) and 1st (57) is now smaller than the gap between Muse Spark and Meta's last attempt.

Wins: Health, Vision, Figure Reasoning

Where Muse Spark Wins

| Benchmark | Muse Spark | Claude Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| HealthBench Hard | 42.8% | 14.8% | 40.1% | 20.6% |
| MMMU-Pro (vision) | 80.5% | 82.4% | — | — |
| CharXiv Reasoning | 86.4 | 82.8 | 80.2 | — |
| HLE (Thinking) | 50.2% | 43.9% | 48.4% | — |

HealthBench Hard at 42.8% vs Claude Opus 4.6's 14.8%, nearly 3x. Meta trained with over 1,000 physicians on health domain data. CharXiv figure reasoning at 86.4 also leads the pack. Muse Spark is genuinely strong at charts, graphs, and visual data.

Losses: Coding, Agentic Tasks, Abstract Reasoning

Where Muse Spark Falls Short

| Benchmark | Muse Spark | Leader | Gap |
| --- | --- | --- | --- |
| Terminal-Bench 2.0 | 59.0 | GPT-5.4: 75.1 | -16.1 |
| ARC-AGI-2 | 42.5 | Gemini: 76.5 | -34.0 |
| GDPval-AA (agentic) | 1,427 ELO | GPT-5.4: 1,672 | -245 ELO |
| GPQA Diamond | 89.5% | Gemini: 94.3% | -4.8% |

Terminal-Bench 59.0 vs 75.1 is not close. Meta acknowledged it themselves: they "continue to invest in areas with current performance gaps, such as long-horizon agentic systems and coding workflows." ARC-AGI-2 at 42.5 vs 76+ is a 34-point canyon. If you're using AI for coding, Muse Spark is not the answer today.

The Quiet Story: Token Efficiency

Here's where it gets interesting for cost-conscious builders. Running the full Artificial Analysis Intelligence Index:

  • Muse Spark: 58M output tokens
  • Claude Opus 4.6 (adaptive/max): 157M tokens
  • GPT-5.4 (xhigh): 120M tokens
  • GLM-5: 110M tokens

Muse Spark gets competitive results while using roughly a third of the output tokens Claude Opus 4.6 burns (58M vs 157M, about 2.7x fewer). When Meta ships API pricing, that efficiency could mean real cost savings, assuming the per-token rate is reasonable.
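A quick back-of-envelope on that gap, using the token counts listed above. The $25-per-million output rate is Claude Opus 4.6's published price; applying the same rate to the other models is purely an illustrative assumption, since Muse Spark has no pricing yet:

```python
# Output tokens (millions) burned on the full Intelligence Index run.
tokens_m = {"Muse Spark": 58, "Claude Opus 4.6": 157, "GPT-5.4": 120, "GLM-5": 110}

baseline = tokens_m["Muse Spark"]
for model, toks in tokens_m.items():
    print(f"{model}: {toks}M tokens ({toks / baseline:.1f}x Muse Spark)")

# At Opus 4.6's $25/M output rate, its 157M-token run costs ~$3,925.
# Muse Spark's 58M tokens at that same hypothetical rate would be ~$1,450.
print(f"Opus 4.6 run: ${157 * 25:,}")   # $3,925
print(f"Muse Spark run: ${58 * 25:,}")  # $1,450
```

Same benchmark suite, a ~$2,500 difference at one assumed rate. That's the quiet story: efficiency compounds at scale even before Meta announces a price.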


Muse Spark vs Claude Mythos Preview: Both Announced, Neither Usable

Claude Mythos Preview: Anthropic's Best, Under Lock and Key

One day before Muse Spark, Anthropic dropped Claude Mythos Preview. Their most powerful model ever: 93.9% on SWE-bench Verified, 181 autonomous Firefox exploits (Opus 4.6 managed 2), and 595 OSS-Fuzz crashes including 10 full control-flow hijacks.

Anthropic won't release it publicly. Over 99% of the vulnerabilities it found remain unpatched. Instead they launched Project Glasswing, giving ~40 organizations (Amazon, Apple, CrowdStrike, Microsoft, etc.) access for defensive security. The rest of us wait.

The Access Parallel

Access Reality: Muse Spark vs Claude Mythos Preview

| | Muse Spark | Claude Mythos Preview |
| --- | --- | --- |
| Announced | April 8, 2026 | April 7, 2026 |
| Free consumer access | Yes (meta.ai chat) | No |
| Public API | No (private preview) | No (40 orgs via Glasswing) |
| Can you build on it? | No | No |
| Pricing | TBD | TBD |
| Why gated? | Not ready / rollout | Security risk (unpatched vulns) |
| Open source? | No (first closed Meta model) | No |

Different reasons, same result. Meta is still rolling out infrastructure, Anthropic is worried about security. You can read about these models, but you can't build with them.

For developers who need a frontier model today: Claude Opus 4.6 ($5/$25, 1M context, SWE-bench ~80.8%), GPT-5.4 (75.1 Terminal-Bench), or Gemini 3.1 Pro (#1 Intelligence Index, tiered pricing). Nothing about this week changes what you should ship with today.
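To make the Opus 4.6 price tag quoted above concrete, here's a per-request cost helper at its $5/$25 per-million-token rates. The 20k-in/4k-out request size is an illustrative assumption for an agentic call, not a benchmark figure:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate: float = 5.0, out_rate: float = 25.0) -> float:
    """Cost in USD at per-million-token rates (Claude Opus 4.6: $5 in / $25 out)."""
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# An illustrative agentic request: 20k tokens in, 4k tokens out.
print(f"${request_cost(20_000, 4_000):.2f}")  # $0.20
```

Swap in other rates as vendors publish them; until Muse Spark has an API price, there's nothing to plug into the comparison.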

The Closed-Source Pivot and Why It Matters

Every previous Meta frontier model shipped as open weights. Llama through Llama 4. Kimi K2.5, GLM-5, and dozens of other models built on that open foundation. Muse Spark breaks the streak. Meta says open weights "may follow." Treat that as a maybe.

Internally codenamed Avocado, Muse Spark was previously delayed for underperforming on reasoning and coding. Alexandr Wang (former Scale AI CEO, now Meta's Chief AI Officer) led a nine-month ground-up rebuild under MSL, which Zuckerberg formed in June 2025 after the $14.3 billion Scale AI deal. All that money and rebuilding produced a model that is competitive, but not leading.

The bigger picture: open-source advocates lost their biggest benefactor, and the gap between "announced" and "usable" keeps growing. Anthropic has Mythos locked up, Meta shipped without an API, and OpenAI reportedly has "Spud" coming.

How to Use Muse Spark Today

Go to meta.ai, log in with Facebook or Instagram, pick "Thinking" mode for harder tasks. That's it. No pip install, no API key, no integration path. Contemplating mode is rolling out gradually. For API access, join the private preview waitlist.

What I'll Update When I Have Hands-On Access

This is a first-look based on published benchmarks and third-party analysis. Once API access opens, I'll add: coding benchmarks on real tasks, head-to-head with Claude Opus 4.6, API pricing, Contemplating mode in practice, and whether Muse Spark can replace anything in my current stack.

Last updated: April 8, 2026, initial first look (no hands-on testing yet)


Written by

Dmytro Chaban

AI engineer writing about agentic systems, MCP integration, and LLM comparisons. 10+ years building production software, 4+ focused on AI.
