OpenAI GPT-4

OpenAI has made significant strides in the field of deep learning with the creation of GPT-4. This latest development is a large multimodal model that is capable of accepting both image and text inputs and producing text outputs. While it falls short of human capabilities in many real-world scenarios, GPT-4 has demonstrated human-level performance on various professional and academic benchmarks. For instance, it achieved a score in the top 10% of test takers on a simulated bar exam, whereas GPT-3.5 only managed a score in the bottom 10%.

OpenAI spent six months refining GPT-4, drawing on lessons from its adversarial testing program and from ChatGPT. This effort produced OpenAI's best results to date on factuality, steerability, and staying within guardrails, although there is still room for improvement.

Over the last two years, OpenAI has entirely rebuilt their deep learning stack and partnered with Azure to design a supercomputer optimized for their workload. They used GPT-3.5 as a “test run” of the system, working out the kinks while enhancing their theoretical foundations. Consequently, GPT-4’s training run was exceptionally stable, and they were able to accurately predict its performance in advance, which is critical for ensuring safety as they continue to focus on scaling reliably.

OpenAI is releasing GPT-4’s text input capability through ChatGPT and the API, although there is a waitlist for API access. It is also collaborating with a single partner to begin expanding the image input capability, with broader availability to follow. In addition, OpenAI is open-sourcing OpenAI Evals, its framework for automated evaluation of AI model performance, so that anyone can report shortcomings in its models and help guide further improvements.

Enhanced Capabilities of GPT-4 over GPT-3.5

In a casual conversation, it can be difficult to distinguish between the capabilities of GPT-3.5 and GPT-4. However, the difference becomes more evident when the task complexity reaches a certain threshold. GPT-4 surpasses GPT-3.5 in terms of reliability, creativity, and ability to handle more nuanced instructions.

To quantify the contrast between these two language models, OpenAI evaluated them on a variety of benchmarks, including exams originally designed for humans. They used the most recent publicly available tests (in the case of the Olympiads and AP free-response questions) or purchased the 2022-2023 editions of practice exams.

Incorporating Visual Inputs in GPT-4

Unlike its predecessor, GPT-3.5, GPT-4 is capable of accepting prompts that include both text and images. This means that users can now specify any language or vision-based task, and the model will generate text outputs (such as natural language and code) based on these inputs.

GPT-4 has been tested across various domains, including documents containing text and photographs, diagrams, or screenshots. Impressively, the model exhibits similar capabilities on text-only and mixed text-image inputs. Furthermore, it can be enhanced with test-time techniques that were originally developed for text-only language models, such as few-shot and chain-of-thought prompting.
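To make "few-shot" and "chain-of-thought" prompting concrete, here is a minimal sketch against the chat completions endpoint, using the pre-1.0 Python openai package available at GPT-4's launch. The example task, messages, and placeholder API key are all illustrative, not taken from OpenAI's evaluations.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder; supply your own key

# Few-shot: a demonstration question/answer pair shows the model the task.
# Chain-of-thought: the final turn asks the model to reason step by step.
messages = [
    {"role": "system", "content": "You answer arithmetic word problems."},
    # One-shot demonstration pair
    {"role": "user", "content": "Q: I have 3 apples and buy 2 more. How many apples?"},
    {"role": "assistant", "content": "I start with 3 and add 2, so 3 + 2 = 5. Answer: 5"},
    # Actual question, prompting step-by-step reasoning
    {"role": "user", "content": "Q: A train leaves with 120 passengers; 45 get off "
                                "and 30 board. How many remain? Think step by step."},
]

response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
print(response["choices"][0]["message"]["content"])
```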

It is important to note, however, that incorporating image inputs into GPT-4 is still in the research phase and is not publicly available at this time.

Predictable Scaling

A major focus of the GPT-4 project was building a deep learning stack that scales predictably, because extensive model-specific tuning is not feasible for very large training runs like GPT-4’s. OpenAI developed infrastructure and optimization methods with predictable behavior across multiple scales. To verify this, GPT-4’s final loss on OpenAI’s internal codebase was accurately predicted in advance by extrapolating from models trained with the same methodology but using 10,000x less compute.

Now that the metric optimized during training (loss) can be accurately predicted, OpenAI is developing methodology to predict more interpretable metrics as well. For example, the pass rate on a subset of the HumanEval dataset was successfully predicted by extrapolating from models trained with 1,000x less compute.
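OpenAI has not published its exact fitting procedure, but the underlying idea can be sketched as fitting a power law with an irreducible-loss floor to small-scale runs and extrapolating to the target compute budget. Every number below is invented for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical (compute, final loss) pairs from small training runs.
# Compute is in arbitrary units; all values are invented for illustration.
compute = np.array([1e0, 1e1, 1e2, 1e3, 1e4])
loss = np.array([4.10, 3.37, 2.85, 2.49, 2.23])

def scaling_law(c, a, b, floor):
    """Power law with an irreducible-loss floor: L(C) = a * C**(-b) + floor."""
    return a * c ** (-b) + floor

params, _ = curve_fit(scaling_law, compute, loss, p0=[3.0, 0.2, 1.5], maxfev=10000)

# Extrapolate 10,000x beyond the largest small run, mirroring the GPT-4 setup.
print(f"Predicted loss at 10,000x more compute: {scaling_law(1e8, *params):.3f}")
```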

However, some capabilities remain hard to predict. The Inverse Scaling Prize, for example, was a competition to find tasks on which performance gets worse as model compute increases, and hindsight neglect was one of the winning tasks. GPT-4 reverses this trend, succeeding on hindsight neglect where smaller models degrade.

OpenAI Evals Framework

OpenAI has open-sourced its software framework called “OpenAI Evals”, which is designed for creating and running benchmarks to evaluate models like GPT-4 while inspecting their performance sample by sample. This framework is used by OpenAI to guide the development of their models and prevent regressions. Users can apply it to track performance across model versions and evolving product integrations. For instance, Stripe has utilized Evals to measure the accuracy of their GPT-powered documentation tool.

Since the code is open-source, Evals supports writing new classes to implement custom evaluation logic. The framework also includes templates that have proven useful internally, such as a template for “model-graded evals”; OpenAI has found that GPT-4 is surprisingly capable of checking its own work. The most effective way to build a new evaluation is often to instantiate one of these templates and provide data.
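Rather than guess at the exact interface of the Evals base classes (best studied in the repository itself), the framework-free sketch below illustrates the model-graded idea: one call produces an answer and a second call asks GPT-4 to grade that answer against an ideal response, sample by sample. The dataset and grading prompt are invented for illustration.

```python
import openai

def complete(model, prompt):
    """Single-turn helper around the chat completions endpoint."""
    resp = openai.ChatCompletion.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp["choices"][0]["message"]["content"]

# Tiny hand-written dataset; a real eval would load JSONL samples.
samples = [
    {"question": "What is the capital of France?", "ideal": "Paris"},
]

for sample in samples:
    answer = complete("gpt-4", sample["question"])
    # Model-graded step: GPT-4 judges the answer against the ideal response.
    verdict = complete(
        "gpt-4",
        f"Question: {sample['question']}\n"
        f"Ideal answer: {sample['ideal']}\n"
        f"Submitted answer: {answer}\n"
        "Does the submitted answer match the ideal answer? Reply PASS or FAIL.",
    )
    print(sample["question"], "->", verdict.strip())
```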

OpenAI hopes that Evals becomes a vehicle for sharing and crowd-sourcing benchmarks that represent a wide range of failure modes and difficult tasks. As an example, OpenAI has created a logic-puzzles evaluation containing ten prompts where GPT-4 fails. Evals can also be used to implement existing benchmarks; OpenAI has included several notebooks implementing academic benchmarks, as well as a few variations integrating small subsets of CoQA, as examples.

ChatGPT Plus

OpenAI will offer access to GPT-4 through its ChatGPT Plus subscription service on chat.openai.com. However, there will be a usage cap on the number of GPT-4 messages a subscriber can send, and the exact cap will be adjusted based on demand and system performance. OpenAI expects to have limited capacity initially but plans to scale up and optimize over the coming months.

Depending on traffic patterns, OpenAI may introduce a new subscription level for higher-volume GPT-4 usage, and may also offer some free GPT-4 queries for those without a subscription to try it out.

API

To gain access to the GPT-4 API, developers can sign up for the waitlist. OpenAI will gradually invite developers from the waitlist while balancing capacity and demand. Researchers studying the societal impact of AI or AI alignment issues can also apply for subsidized access through the Researcher Access Program.

Once granted access, developers can make text-only requests to the GPT-4 model (image inputs are still in limited alpha). OpenAI will automatically update the model to the recommended stable version as new versions are released over time. Developers can also pin the current version by calling gpt-4-0314, which will be supported until June 14, 2023.
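In practice, pinning comes down to which model name a request specifies; here is a minimal sketch using the pre-1.0 Python openai package (the prompt is illustrative):

```python
import openai

# "gpt-4" tracks the recommended stable version and is updated automatically;
# "gpt-4-0314" pins the March 14 snapshot (supported until June 14, 2023).
for model in ("gpt-4", "gpt-4-0314"):
    resp = openai.ChatCompletion.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello."}],
    )
    print(model, "->", resp["choices"][0]["message"]["content"])
```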

The pricing for GPT-4 API usage is $0.03 per 1k prompt tokens and $0.06 per 1k completion tokens. The default rate limits are 40k tokens per minute and 200 requests per minute. The context length for GPT-4 is 8,192 tokens.

OpenAI is also providing limited access to the 32,768-token context version of GPT-4, called gpt-4-32k. The current version is gpt-4-32k-0314, and it will also be updated automatically over time. The pricing for gpt-4-32k is $0.06 per 1k prompt tokens and $0.12 per 1k completion tokens. OpenAI is still improving model quality for long context and welcomes feedback on its performance.
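Because prompt and completion tokens are billed at different rates, the cost of a request is a two-term sum. The small helper below encodes the published launch prices; it is an illustration of the arithmetic, not an official calculator.

```python
# Published per-1k-token launch prices (USD).
PRICES = {
    "gpt-4":     {"prompt": 0.03, "completion": 0.06},  # 8,192-token context
    "gpt-4-32k": {"prompt": 0.06, "completion": 0.12},  # 32,768-token context
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated cost in USD for a single request."""
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + \
           (completion_tokens / 1000) * p["completion"]

# Example: a 1,500-token prompt with a 500-token completion on each model.
for model in PRICES:
    print(model, f"${estimate_cost(model, 1500, 500):.3f}")
```

For instance, the 1,500/500-token request above costs $0.075 on gpt-4 and exactly twice that, $0.150, on gpt-4-32k, since both rates double.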

OpenAI is processing requests for the 8K and 32K engines at different rates based on capacity, so developers may receive access to them at different times.

In summary, GPT-4 holds great promise, and we are eager to see how it will transform the landscape of artificial intelligence.
