
Cache Usage in LLMs: LangChain Cache and OpenAI Prompt Caching

Written on October 17, 2024 by Pedro Medinilla.

4 min.

Introduction

If you're a developer using LLMs in your applications, you're likely aware that the prompts you send are essential for obtaining valuable and goal-oriented responses. But did you know there's a way to optimize your requests and improve the overall efficiency of your project? This is where the concept of Prompt Caching comes into play.

Prompt Caching is a technique that allows you to store the responses generated by the API for certain prompts and reuse them when needed. This can be a game changer in how you optimize your application, reducing costs and wait times, and ensuring a smoother experience for users.

What is Prompt Caching?

Imagine you have an application that constantly asks the same questions to the API, like an AI for answering frequently asked questions. Instead of making the same request over and over again (which consumes resources and can lead to unnecessary wait times), Prompt Caching allows you to save those responses for reuse.

In other words, when a repeated query is detected, the API can return a previously saved response instead of generating a new one. This not only optimizes processing time but also helps reduce costs if you're working with usage limits or a tight budget.
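To make the mechanism concrete, here is a minimal, purely illustrative sketch of exact-match caching with a plain Python dictionary. The names call_llm and cached_completion are hypothetical; a real cache, like the ones shown later, would also account for model parameters and persistence.

# Illustrative only: a naive exact-match prompt cache
_cache: dict[str, str] = {}

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in; a real app would send the prompt to the LLM API here
    return f"<model answer to: {prompt}>"

def cached_completion(prompt: str) -> str:
    if prompt not in _cache:              # cache miss: generate and store the response
        _cache[prompt] = call_llm(prompt)
    return _cache[prompt]                 # cache hit: reuse the stored response

cached_completion("What are your opening hours?")  # generated by the model
cached_completion("What are your opening hours?")  # returned from the cache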

Why should you use it?

There are several reasons why Prompt Caching is a valuable tool:

  • Cost savings: By not having to generate a new response every time a repeated request is made, you reduce the number of tokens consumed, which translates into lower costs.
  • Improved speed: By retrieving a stored response, wait times are significantly reduced, as there's no need for the LLM to process the prompt from scratch.
  • Consistency: If your application relies on consistent responses, Prompt Caching ensures that users receive the same answers to the same questions, avoiding unintentional variations.

Implementing cache with LangChain

In-memory cache

from langchain_core.globals import set_llm_cache
from langchain_core.caches import InMemoryCache
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")
set_llm_cache(InMemoryCache())

%%time
# First call: the cache is empty, so the model generates a fresh answer
llm.invoke("Who created the concept of Transformer in AI?")

CPU times: total: 31.2 ms
Wall time: 2.39 s

%%time
# Second call: the identical prompt is served from the in-memory cache
llm.invoke("Who created the concept of Transformer in AI?")

CPU times: total: 0 ns
Wall time: 0 ns
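Note that this cache matches on the exact prompt, so even a small change in wording produces a cache miss, and because it lives in memory it is lost when the process ends.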

SQLite Cache

# Now we switch to a SQLite-backed cache stored on disk
from langchain_community.cache import SQLiteCache

set_llm_cache(SQLiteCache(database_path=".langchain.db"))

%%time
# First call: nothing in the SQLite cache yet, so it is slower
llm.invoke("Who created the concept of Transformer in AI?")

CPU times: total: 15.6 ms
Wall time: 2.92 s

%%time
# Second call: the response is read back from the SQLite database
llm.invoke("Who created the concept of Transformer in AI?")

CPU times: total: 609 ms
Wall time: 609 ms
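Because the responses are written to the .langchain.db file, this cache survives application restarts; the trade-off is the disk access on each hit, which is why the cached call above still takes around 609 ms instead of being practically instantaneous.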

Using OpenAI Prompt Caching

The OpenAI API applies Prompt Caching automatically, reducing latency by up to 80% and cutting the price of cached input tokens by 50% for prompts longer than 1,024 tokens.

Prompt Caching is active for:

  • gpt-4o (excluding gpt-4o-2024-05-13 and chatgpt-4o-latest)
  • gpt-4o-mini
  • o1-preview
  • o1-mini

Structuring the prompt

To take advantage of this cache, OpenAI checks whether the prompt’s prefix matches a previously cached entry. This means the unchanging part, such as the instructions or the system prompt, should always stay identical and come first, with the information that is unique to each request added at the end.

Figure: recommended prompt structure, from the OpenAI Platform guide.

Here’s how you can use OpenAI’s prompt cache:

  • Ensure you're using a model that supports it
  • Keep the prompt length at or above 1,024 tokens
  • Check that the response usage reports a cached_tokens value and that the cost drops accordingly, as in the sketch below
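As a reference, here is a minimal sketch of this structure with the official openai Python SDK. LONG_INSTRUCTIONS and ask are hypothetical names, and the long system prompt is only hinted at; in a real application it would be your reusable instructions or context pushing the shared prefix past 1,024 tokens. Reading prompt_tokens_details.cached_tokens assumes a reasonably recent SDK version.

from openai import OpenAI

client = OpenAI()

# Hypothetical static prefix: in practice this would be your long, reusable
# system prompt or context (over 1,024 tokens), identical on every call.
LONG_INSTRUCTIONS = "You are a support assistant for ACME. ..."

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": LONG_INSTRUCTIONS},  # unchanging part first
            {"role": "user", "content": question},             # request-specific part last
        ],
    )
    # How many prompt tokens were served from OpenAI's cache on this request
    details = response.usage.prompt_tokens_details
    print("cached_tokens:", details.cached_tokens if details else 0)
    return response.choices[0].message.content

ask("What is your refund policy?")    # first call: cached_tokens is 0
ask("Do you ship internationally?")   # later calls: cached_tokens > 0 once the prefix is cached

Only the portion of the prompt before the first difference can be served from the cache, which is why keeping the variable content at the end matters.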

Nothing else is needed; with this, you’ll get faster LLM responses while saving costs.


Enjoying the post?

Don't hesitate to contact me if you have any questions, suggestions, or a project I can help you bring to life.

Contact