🟢 Anthropic Updates: Claude 3.5 Haiku and Upgraded Sonnet

🟢 This article is rated easy

Reading Time: 14 minutes

Last updated on October 24, 2024

Last updated on October 24, 2024 by Valeriia Kuka

What is Claude 3.5 Haiku and Upgraded Sonnet?

Claude 3.5 Haiku and the upgraded Claude 3.5 Sonnet are two newly released models in the Claude 3 family developed by Anthropic. These models showcase improved reasoning, coding, and visual processing capabilities. They also introduce notable advancements in tool use and autonomous task completion.

The upgraded Claude 3.5 Sonnet model's standout feature is its ability to use computers, interpreting screenshots of graphical user interfaces (GUIs) and generating tool calls to perform tasks. This innovation enables Claude to interact with websites and applications, completing multi-step tasks.

On the other hand, Claude 3.5 Haiku is a text-only model that excels in reasoning and instruction following, achieving comparable results to the Claude 3.5 Sonnet.

How the upgraded Claude 3.5 Sonnet works:

Screenshot Interpretation: The model analyzes screenshots and understands visual elements on the screen.
Tool Calls: Based on this understanding, it generates tool calls (commands like clicking buttons, typing, etc.) to perform tasks within the graphical user interface.
Complex Task Execution: Claude navigates applications, making decisions along the way to accomplish multi-step processes, such as filling out forms or completing a workflow.

What's New in These Models?

Enhanced Visual Processing: The upgraded Sonnet surpasses earlier models by processing visual data directly from screenshots, unlike prior models that relied solely on text-based inputs.
Autonomous Tool Use: While earlier Claude models demonstrated reasoning and coding skills, the new Sonnet model extends its abilities to autonomous GUI interaction, making it unique in handling real-world tasks on a computer interface.
Agentic Task Completion: Both models show improved performance in agentic tasks (tasks requiring decision-making and self-correction), particularly with coding and complex workflow automation.
Claude 3.5 Haiku’s Refinement: Despite being text-based, Claude 3.5 Haiku matches or even surpasses previous models, focusing on tasks requiring structured reasoning and instruction adherence.

How to Use Claude 3.5 Haiku and Upgraded Sonnet

Practical Applications

Automation: The upgraded Sonnet can handle tasks like navigating websites, filling forms, and interacting with web applications.
Software Development: These models excel at agentic coding tasks, allowing them to autonomously fix bugs, generate code, and resolve issues in development environments.
Customer Service Automation: Both models can assist in customer service scenarios, automating complex interactions by following predefined protocols.

Example Prompt for Sonnet’s GUI Use:

Prompt

Please fill out the vendor request form for Ant Equipment Co. using data from either the vendor spreadsheet or search portal tabs in window one. List & verify each field as you complete the form in window two.

Here's this prompt tested:

Results of Claude 3.5 Haiku and the upgraded Claude 3.5 Sonnet

Key Improvements:

Sonnet's GUI Tasks: 22% success rate in OSWorld tests with 50 interaction steps (up from 14.9%).
Agentic Coding: Sonnet (New) achieved a pass@1 of 49% on SWE-bench, a significant improvement from its predecessor.

Computer Use Results of Claude 3.5 Sonnet (New)

Category	Claude 3.5 Sonnet (New) - 15 steps	Claude 3.5 Sonnet (New) - 50 steps	Human Success Rate
Success Rate	95% CI	Success Rate	95% CI
OS	54.2% [34.3, 74.1]%	41.7% [22.0, 61.4]%	75.00%
Office	7.7% [2.9, 12.5]%	17.9% [11.0, 24.8]%	71.79%
Daily	16.7% [8.4, 25.0]%	24.4% [14.9, 33.9]%	70.51%
Professional	24.5% [12.5, 36.5]%	40.8% [27.0, 54.6]%	73.47%
Workflow	7.9% [2.6, 13.2]%	10.9% [4.9, 17.0]%	73.27%
Overall	14.9% [11.3, 18.5]%	22% [17.8, 26.2]%	72.36%

Results of Claude 3.5 Sonnet (New) on Multimodal Evaluation

Task	Claude 3.5 Sonnet (New)	Claude 3.5 Sonnet	Claude 3 Opus	Claude 3 Sonnet	GPT-4o	Gemini 1.5 Pro
Visual Question Answering	70.4%	68.3%	59.4%	53.1%	69.1%	65.9%
MathVista (Testmini)	70.7%	67.7%	50.5%	47.9%	63.8%	68.1%
AI2D (Test)	95.3%	94.7%	88.1%	88.7%	94.2%	—
ChartQA (Test, Relaxed Accuracy)	90.8%	90.8%	80.8%	81.1%	85.7%	—
DocVQA (Test, ANLS Score)	94.2%	95.2%	89.3%	89.5%	92.8%	—

Results of Claude 3.5 Sonnet (New) on Reasoning, Math, Coding, and Q&A Evaluations

Task	Claude 3.5 Sonnet (New)	Claude 3.5 Sonnet	Claude 3 Opus	Claude 3 Sonnet	GPT-4o	Gemini 1.5 Pro	Llama 3.1 (405B)
Graduate Level Q&A (0-shot CoT)	65.0%	59.4%	50.4%	40.4%	53.6%	59.1%	51.1%
MMLU (5-shot CoT)	90.5%	90.4%	88.2%	81.5%	—	—	—
MMLU Pro (0-shot CoT)	78.0%	75.1%	67.9%	54.9%	—	75.8%	73.3%
MATH (0-shot CoT)	78.3%	71.1%	60.1%	43.1%	76.6%	86.5%	73.8%
HumanEval (Python Coding Tasks)	93.7%	92.0%	84.9%	73.0%	90.2%	—	89.0%

Evaluation Results for Claude 3.5 Haiku and Peer Models

Task	Claude 3.5 Haiku	Claude 3 Haiku	GPT-4o mini	Gemini 1.5 Flash
Graduate Level Q&A (0-shot CoT)	41.6%	33.3%	40.2%	51.0%
MMLU (General Reasoning 5-shot CoT)	80.9%	76.7%	—	—
MMLU (General Reasoning 5-shot)	77.6%	75.2%	—	—
MMLU (General Reasoning 0-shot CoT)	80.3%	74.0%	82.0%	—
MMLU Pro (General Reasoning 0-shot CoT)	65.0%	49.0%	—	67.3%
MATH (Mathematical Problem Solving 0-shot CoT)	69.2%	38.9%	70.2%	77.9% (4-shot CoT)
HumanEval (Python Coding Tasks 0-shot)	88.1%	75.9%	87.2%	—
MGSM (Multilingual Math 0-shot CoT)	85.6%	75.1%	87.0%	—
DROP (Reading Comprehension, Arithmetic F1 Score, 3-shot)	83.1	78.4	79.7	—
BIG-Bench Hard (Mixed Evaluations 3-shot CoT)	86.6%	73.7%	—	—
AIME 2024 (High School Math Competition 0-shot CoT)	5.3%	0.8%	—	—
Maj@64 (0-shot CoT)	10.1%	0.4%	—	—
IFEval (Instruction Following)	85.9%	77.2%	—	—

Valeriia Kuka

Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.

Edit this page

Word count: 0