
GPT-4 and GPT-3.5 Tokenizer (More Accurate than OpenAI's)

Our free online tokenizer for GPT-4 and GPT-3.5 will give you the token count of any text you enter. We built this because OpenAI's tokenizer tool is for legacy models and is not accurate for GPT-4 or GPT-3.5.

For example, the tool's default sample text ("Welcome to our free online tokenizer tool for modern GPT models!" followed by "Replace this with your text to see how tokenization works.") comes out to 25 tokens, 124 characters, and 21 words.

How it Works

Our tokenizer is built on the gpt-tokenizer package, which uses the modern cl100k_base encoding. This is the same encoding that the GPT-4 and GPT-3.5 models use.

Many other online tokenization tools, including OpenAI's official tokenizer, use the older p50k_base encoding, which is only accurate for legacy GPT-3-era models such as text-davinci-003.
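If you want to count tokens in your own code, the same gpt-tokenizer package can be used directly. Here is a minimal sketch; it assumes the package's default import targets the cl100k_base encoding, so check the package docs if you need a different model:

import { encode } from "gpt-tokenizer";

// encode() turns a string into an array of token IDs
const tokens = encode("Welcome to our free online tokenizer tool!");

// The token count is simply the length of that array
console.log(tokens.length);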

What is a Tokenizer?

When working with language models like GPT-3.5 and GPT-4, it is important to understand the concept of a tokenizer. Simply put, a tokenizer breaks text into smaller chunks called tokens. The language model operates on those tokens when interpreting your input, and it generates its output one token at a time.

The number of tokens processed directly impacts the cost of using the model. Also, all models have a maximum token limit (both for input and output), so it is important to keep in mind how many tokens you are sending to the model. Sending too many tokens will result in an error or the output being truncated.
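For instance, you can count tokens locally before calling a model to make sure a prompt fits. A rough sketch (the 8,192-token figure is only illustrative; substitute the actual context window of the model you are calling):

import { encode } from "gpt-tokenizer";

// Illustrative limit only; use your model's real context window
const MAX_TOKENS = 8192;

const prompt = "My long prompt...";
const tokenCount = encode(prompt).length;

if (tokenCount > MAX_TOKENS) {
  console.warn(`Prompt is ${tokenCount} tokens, over the ${MAX_TOKENS}-token limit`);
}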

Impact of Language on Tokenization

Text written in English will almost always result in fewer tokens than the equivalent text in non-English languages.

This is because tokenization varies significantly across languages. English and many Western languages, using the Latin alphabet, typically tokenize around words and punctuation.

In contrast, logographic systems like Chinese often treat each character as a distinct token, leading to higher token counts. Similarly, agglutinative languages like Turkish might produce long words that increase token counts.

The difference can be significant. For example, "hello" is only a single token, while the equivalent word in Thai, "สวัสดี", is 6 tokens!
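You can check this kind of difference yourself. A small sketch using the gpt-tokenizer package (the exact counts depend on the encoding in use):

import { encode } from "gpt-tokenizer";

// Compare token counts for the same greeting in different languages
for (const greeting of ["hello", "สวัสดี"]) {
  console.log(greeting, encode(greeting).length);
}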

Hopefully there will be improvements to tokenization in the future to help reduce the costs associated with non-English languages.

Free GPT-4 and GPT-3.5 Tokenizer API

If you are a developer, you can use our API to get the token count of any text.

Simply make a POST request to https://koala.sh/api/tokens/ with a JSON body containing the text to tokenize, e.g. {"text": "My text to tokenize"}.

Here is an example using JavaScript:

// Send the text as JSON and read the token count from the response
const response = await fetch("https://koala.sh/api/tokens/", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "My text to tokenize",
  }),
});
const { tokens } = await response.json();

Response:

{
    "tokens": 4
}

How to Tokenize in JavaScript for GPT-4 or GPT-3.5

In JavaScript, the easiest way to tokenize text for the GPT-4 or GPT-3.5 models is with the js-tiktoken library. It works in all major JS environments, including Node.js, the browser, and edge runtimes such as Vercel and Cloudflare Workers.

First, install the package:

npm install js-tiktoken

Next, you can import the package and tokenize a string:

import { encodingForModel } from "js-tiktoken";

// GPT-4 and GPT-3.5 both map to the cl100k_base encoding
const encoder = await encodingForModel("gpt-4");
// encode() returns an array of token IDs; its length is the token count
const tokens = encoder.encode("My text to tokenize");
console.log(tokens.length);

How to Tokenize in Python for GPT-4 or GPT-3.5

You can use OpenAI's official Python tokenizer library, tiktoken, which supports the cl100k_base encoding used by GPT-4 and GPT-3.5.
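A minimal sketch with tiktoken (install it with pip install tiktoken; encoding_for_model resolves GPT-4 and GPT-3.5 to cl100k_base):

import tiktoken

# encoding_for_model returns the encoding a given model uses
encoding = tiktoken.encoding_for_model("gpt-4")

# encode() returns a list of token IDs; its length is the token count
tokens = encoding.encode("My text to tokenize")
print(len(tokens))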

Try KoalaChat

Need more advanced AI capabilities? KoalaChat offers GPT-4o, real-time data, and more - all with a generous free tier.


Try KoalaWriter

Need help with longer content? KoalaWriter helps you create blog posts, articles, and more with AI assistance.
