Model List

234 models


Qwen: Qwen3 Coder
GPU TEE
Updated a day ago
Qwen3-Coder-480B-A35B-Instruct is a Mixture-of-Experts (MoE) code generation model developed by the Qwen team. It is optimized for agentic coding tasks such as function calling, tool use, and long-context reasoning over repositories. The model features 480 billion total parameters, with 35 billion active per forward pass (8 out of 160 experts).

Pricing for the Alibaba endpoints varies by context length: once a request exceeds 128K input tokens, the higher rate applies.
by phala|262K context|$1.9/M input tokens|$1.9/M output tokens
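The per-million-token rates shown on each listing convert to a request cost by simple arithmetic. The following is a minimal sketch, assuming the listed rates apply uniformly to every token in a request; `estimate_cost_usd` and `uses_higher_tier` are hypothetical helper names, not part of any provider API. The tier check assumes "128k" means 128,000 tokens, which the listing does not specify, and the higher-tier rates themselves are not listed here, so they are left as inputs.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Estimate the USD cost of one request from per-million-token rates."""
    # Convert via str() so float rates such as 1.9 do not carry binary rounding noise.
    input_rate = Decimal(str(input_price_per_m))
    output_rate = Decimal(str(output_price_per_m))
    input_cost = Decimal(input_tokens) * input_rate / TOKENS_PER_MILLION
    output_cost = Decimal(output_tokens) * output_rate / TOKENS_PER_MILLION
    return input_cost + output_cost


def uses_higher_tier(input_tokens, threshold=128_000):
    """Return True when a request crosses the assumed context-length threshold."""
    return input_tokens > threshold


# Example: 200,000 input tokens and 4,000 output tokens at $1.9/M for both.
cost = estimate_cost_usd(200_000, 4_000, "1.9", "1.9")
print(f"Estimated cost: ${cost}")
print("Higher tier applies:", uses_higher_tier(200_000))
```

Using `Decimal` rather than floats keeps the result exact for billing-style arithmetic.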
Meta: Llama 3.1 70B Instruct
GPU TEE
Updated 7 days ago
Meta's latest class of models (Llama 3.1) launched in a variety of sizes and flavors. This 70B instruct-tuned version is optimized for high-quality dialogue use cases.

It has demonstrated strong performance compared to leading closed-source models in human evaluations.

Usage of this model is subject to the model's license terms.
by phala|131K context|$0.89/M input tokens|$0.89/M output tokens
Qwen2.5 VL 72B Instruct
GPU TEE
Updated 7 days ago
Qwen2.5-VL is proficient in recognizing common objects such as flowers, birds, fish, and insects. It is also highly capable of analyzing texts, charts, icons, graphics, and layouts within images.
by phala|128K context|$0.59/M input tokens|$0.59/M output tokens
DeepSeek: DeepSeek V3 0324
GPU TEE
Updated a month ago
DeepSeek V3, a 685B-parameter, mixture-of-experts model, is the latest iteration of the flagship chat model family from the DeepSeek team.

It succeeds the original DeepSeek V3 model and performs well across a variety of tasks.
by phala|163K context|$0.78/M input tokens|$1.14/M output tokens
OpenAI: GPT-4.1 Nano
Updated 2 months ago
For tasks that demand low latency, GPT‑4.1 nano is the fastest and cheapest model in the GPT-4.1 series. It delivers exceptional performance at a small size with its 1 million token context window, and scores 80.1% on MMLU, 50.3% on GPQA, and 9.8% on Aider polyglot coding – even higher than GPT‑4o mini. It’s ideal for tasks like classification or autocompletion.
by openai|1047K context|$0.1/M input tokens|$0.4/M output tokens
OpenAI: GPT-4.1 Mini
Updated 2 months ago
GPT-4.1 Mini is a mid-sized model delivering performance competitive with GPT-4o at substantially lower latency and cost. It retains a 1 million token context window and scores 45.1% on hard instruction evals, 35.8% on MultiChallenge, and 84.1% on IFEval. Mini also shows strong coding ability (e.g., 31.6% on Aider’s polyglot diff benchmark) and vision understanding, making it suitable for interactive applications with tight performance constraints.
by openai|1047K context|$0.4/M input tokens|$1.6/M output tokens
OpenAI: GPT-4.1
Updated 2 months ago
GPT-4.1 is a flagship large language model optimized for advanced instruction following, real-world software engineering, and long-context reasoning. It supports a 1 million token context window and outperforms GPT-4o and GPT-4.5 across coding (54.6% SWE-bench Verified), instruction compliance (87.4% IFEval), and multimodal understanding benchmarks. It is tuned for precise code diffs, agent reliability, and high recall in large document contexts, making it ideal for agents, IDE tooling, and enterprise knowledge retrieval.
by openai|1047K context|$2/M input tokens|$8/M output tokens
Google: Gemini 2.5 Flash Preview 05-20
Updated 2 months ago
Gemini 2.5 Flash May 20th Checkpoint is Google's state-of-the-art workhorse model, specifically designed for advanced reasoning, coding, mathematics, and scientific tasks. It includes built-in "thinking" capabilities, enabling it to provide responses with greater accuracy and nuanced context handling.

Note: This model is available in two variants: thinking and non-thinking. The output pricing varies significantly depending on whether the thinking capability is active. If you select the standard variant (without the ":thinking" suffix), the model will explicitly avoid generating thinking tokens.

To utilize the thinking capability and receive thinking tokens, you must choose the ":thinking" variant, which will then incur the higher thinking-output pricing.

Additionally, Gemini 2.5 Flash is configurable through the "max tokens for reasoning" parameter, as described in the documentation (https://redpill.ai/models/docs/use-cases/reasoning-tokens#max-tokens-for-reasoning).
by google|1048K context|$0.15/M input tokens|$0.6/M output tokens
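The variant rule described above comes down to whether the model identifier carries the ":thinking" suffix. The following is a minimal sketch of that selection; the helper name `select_variant` and the base identifier used in the example are assumptions for illustration, so check the provider's model page for the exact identifier.

```python
THINKING_SUFFIX = ":thinking"


def select_variant(base_model_id: str, thinking: bool) -> str:
    """Return the model identifier for the thinking or standard variant."""
    # Normalize first so an already-suffixed identifier is not suffixed twice.
    if base_model_id.endswith(THINKING_SUFFIX):
        base_model_id = base_model_id[: -len(THINKING_SUFFIX)]
    return base_model_id + THINKING_SUFFIX if thinking else base_model_id


# Hypothetical identifier, used only to show the suffix handling.
example_id = "google/gemini-2.5-flash-preview-05-20"
print(select_variant(example_id, thinking=True))
print(select_variant(example_id, thinking=False))
```

Keeping the choice in one helper makes it harder to incur thinking-output pricing by accident.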
Google: Gemini 2.5 Pro Preview 06-05
Updated 2 months ago
Gemini 2.5 Pro is Google’s state-of-the-art AI model designed for advanced reasoning, coding, mathematics, and scientific tasks. It employs “thinking” capabilities, enabling it to reason through responses with enhanced accuracy and nuanced context handling. Gemini 2.5 Pro achieves top-tier performance on multiple benchmarks, including first-place positioning on the LMArena leaderboard, reflecting superior human-preference alignment and complex problem-solving abilities.
by google|1048K context|$1.25/M input tokens|$10/M output tokens
Anthropic: Claude Opus 4
Updated 2 months ago
Claude Opus 4 is benchmarked as the world’s best coding model, at time of release, bringing sustained performance on complex, long-running tasks and agent workflows. It sets new benchmarks in software engineering, achieving leading results on SWE-bench (72.5%) and Terminal-bench (43.2%). Opus 4 supports extended, agentic workflows, handling thousands of task steps continuously for hours without degradation.

by anthropic|200K context|$15/M input tokens|$75/M output tokens
Anthropic: Claude Sonnet 4
Updated 2 months ago
Claude Sonnet 4 significantly enhances the capabilities of its predecessor, Sonnet 3.7, excelling in both coding and reasoning tasks with improved precision and controllability. Achieving state-of-the-art performance on SWE-bench (72.7%), Sonnet 4 balances capability and computational efficiency, making it suitable for a broad range of applications from routine coding tasks to complex software development projects. Key enhancements include improved autonomous codebase navigation, reduced error rates in agent-driven workflows, and increased reliability in following intricate instructions. Sonnet 4 is optimized for practical everyday use, providing advanced reasoning capabilities while maintaining efficiency and responsiveness in diverse internal and external scenarios.

by anthropic|200K context|$3/M input tokens|$15/M output tokens
Qwen2.5 7B Instruct
GPU TEE
Updated 3 months ago
Qwen2.5 7B belongs to the latest series of Qwen large language models. Qwen2.5 brings the following improvements over Qwen2:

Significantly more knowledge and greatly improved capabilities in coding and mathematics, thanks to specialized expert models in these domains.

Significant improvements in instruction following, generating long texts (over 8K tokens), understanding structured data (e.g., tables), and generating structured outputs, especially JSON. The model is also more resilient to diverse system prompts, which improves role-play implementation and condition-setting for chatbots.

Long-context support for up to 128K tokens, with generation of up to 8K tokens.

Multilingual support for over 29 languages, including Chinese, English, French, Spanish, Portuguese, German, Italian, Russian, Japanese, Korean, Vietnamese, Thai, Arabic, and more.

Usage of this model is subject to the model's license terms.
by phala|32K context|$0.04/M input tokens|$0.1/M output tokens
RedPill Auto Router
Updated 4 months ago
Depending on their size, subject, and complexity, your prompts will be routed to the most appropriate AI model from our selection of Claude 3.5, GPT-4o, Llama 3.1/3.3, Mistral, or other models to optimize for both performance and cost-efficiency.
by redpill|200K context|$0/M input tokens|$0/M output tokens
Anthropic: Claude 3.7 Sonnet
Updated 5 months ago
Claude 3.7 Sonnet is an advanced large language model with improved reasoning, coding, and problem-solving capabilities. It introduces a hybrid reasoning approach, allowing users to choose between rapid responses and extended, step-by-step processing for complex tasks. The model demonstrates notable improvements in coding, particularly in front-end development and full-stack updates, and excels in agentic workflows, where it can autonomously navigate multi-step processes.

Claude 3.7 Sonnet maintains performance parity with its predecessor in standard mode while offering an extended reasoning mode for enhanced accuracy in math, coding, and instruction-following tasks.

by anthropic|200K context|$3/M input tokens|$15/M output tokens
DeepSeek: R1 Distill 70B
GPU TEE
Updated 6 months ago
DeepSeek R1 Distill 70B is a distilled large language model based on Llama-3.3-70B-Instruct, using outputs from DeepSeek R1.
by phala|16K context|$0.23/M input tokens|$0.69/M output tokens
DeepSeek: DeepSeek V3
Updated 6 months ago
DeepSeek-V3 is the latest model from the DeepSeek team, building on the instruction-following and coding abilities of the previous versions. The model was pre-trained on nearly 15 trillion tokens, and reported evaluations show that it outperforms other open-source models and rivals leading closed-source models.

by deepseek|64K context|$0.14/M input tokens|$0.28/M output tokens
DeepSeek: DeepSeek R1
Updated 6 months ago
DeepSeek R1 is here: performance on par with OpenAI o1, but open-sourced and with fully open reasoning tokens. It has 671B parameters, with 37B active in an inference pass.

Fully open-source model & technical report.

MIT licensed: Distill & commercialize freely!
by deepseek|163K context|$7/M input tokens|$7/M output tokens
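For the mixture-of-experts models in this list, the share of parameters active per forward pass follows directly from the figures quoted in the descriptions. The following is a small sketch of that arithmetic, using the numbers stated for DeepSeek R1 (37B of 671B) and Qwen3 Coder (35B of 480B); the helper name `active_fraction` is hypothetical.

```python
def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Return the fraction of parameters active in one forward pass."""
    if total_params_b <= 0:
        raise ValueError("total_params_b must be positive")
    return active_params_b / total_params_b


# Figures in billions of parameters, taken from the listing descriptions.
deepseek_r1 = active_fraction(37, 671)
qwen3_coder = active_fraction(35, 480)

print(f"DeepSeek R1: {deepseek_r1:.1%} of parameters active")   # 5.5%
print(f"Qwen3 Coder: {qwen3_coder:.1%} of parameters active")   # 7.3%
```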
Meta: Llama 3.3 70B Instruct
GPU TEE
Updated 8 months ago
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

by phala|131K context|$0.12/M input tokens|$0.3/M output tokens
Meta: Llama 3.3 70B Instruct
Updated 8 months ago
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction tuned text only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks.

Supported languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

by meta-llama|131K context|$0.13/M input tokens|$0.4/M output tokens
Amazon: Nova Lite 1.0
Updated 8 months ago
Amazon Nova Lite 1.0 is a very low-cost multimodal model from Amazon focused on fast processing of image, video, and text inputs to generate text output. Amazon Nova Lite can handle real-time customer interactions, document analysis, and visual question-answering tasks with high accuracy.

With an input context of 300K tokens, it can analyze multiple images or up to 30 minutes of video in a single input.
by amazon|300K context|$0.06/M input tokens|$0.24/M output tokens
