
GPT-4 and GPT-3.5 Tokenizer (More Accurate than OpenAI's)

Our free online tokenizer for GPT-4 and GPT-3.5 will give you the token count of any text you enter. We built this because OpenAI's tokenizer tool is for legacy models and is not accurate for GPT-4 or GPT-3.5.

For example, the tool's default sample text ("Welcome to our free online tokenizer tool for modern GPT models!" followed by "Replace this with your text to see how tokenization works.") comes out to 25 tokens, 124 characters, and 21 words.

How it Works

Our tokenizer is built on the gpt-tokenizer package, which uses the modern cl100k_base encoding. This is the same encoding that the GPT-4 and GPT-3.5 models use.

Many other online tokenization tools, including OpenAI's official tokenizer, use the older p50k_base encoding, which is only accurate for legacy GPT-3-era models such as text-davinci-003.
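If you want to count tokens in your own code, the same gpt-tokenizer package can be used directly. Here is a minimal sketch; it assumes the package's default import targets the cl100k_base encoding, so check the package docs if you need a different model:

import { encode } from "gpt-tokenizer";

// encode() turns a string into an array of token IDs
const tokens = encode("Welcome to our free online tokenizer tool!");

// The token count is simply the length of that array
console.log(tokens.length);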

What is a Tokenizer?

When working with language models like GPT-3.5 and GPT-4, it is important to understand the concept of a tokenizer. Simply put, a tokenizer breaks text into smaller chunks called tokens. The language model operates on those tokens when interpreting your input, and it generates its output one token at a time.

The number of tokens processed directly impacts the cost of using the model. Also, all models have a maximum token limit (both for input and output), so it is important to keep in mind how many tokens you are sending to the model. Sending too many tokens will result in an error or the output being truncated.
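For instance, you can count tokens locally before calling a model to make sure a prompt fits. A rough sketch (the 8,192-token figure is only illustrative; substitute the actual context window of the model you are calling):

import { encode } from "gpt-tokenizer";

// Illustrative limit only; use your model's real context window
const MAX_TOKENS = 8192;

const prompt = "My long prompt...";
const tokenCount = encode(prompt).length;

if (tokenCount > MAX_TOKENS) {
  console.warn(`Prompt is ${tokenCount} tokens, over the ${MAX_TOKENS}-token limit`);
}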

Impact of Language on Tokenization

Text written in English will almost always result in fewer tokens than the equivalent text in non-English languages.

This is because tokenization varies significantly across languages. English and many Western languages, using the Latin alphabet, typically tokenize around words and punctuation.

In contrast, logographic systems like Chinese often treat each character as a distinct token, leading to higher token counts. Similarly, agglutinative languages like Turkish might produce long words that increase token counts.

The difference can be significant. For example, "hello" is only a single token, while the equivalent word in Thai, "สวัสดี", is 6 tokens!
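You can check this kind of difference yourself. A small sketch using the gpt-tokenizer package (the exact counts depend on the encoding in use):

import { encode } from "gpt-tokenizer";

// Compare token counts for the same greeting in different languages
for (const greeting of ["hello", "สวัสดี"]) {
  console.log(greeting, encode(greeting).length);
}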

Hopefully there will be improvements to tokenization in the future to help reduce the costs associated with non-English languages.

Free GPT-4 and GPT-3.5 Tokenizer API

If you are a developer, you can use our API to get the token count of any text.

Simply make a POST request to https://koala.sh/api/tokens/ with a JSON body containing the text to tokenize, e.g. {"text": "My text to tokenize"}.

Here is an example using JavaScript:

// Send the text as JSON and read the token count from the response
const response = await fetch("https://koala.sh/api/tokens/", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    text: "My text to tokenize",
  }),
});
const { tokens } = await response.json();

Response:

{
    "tokens": 4
}

How to Tokenize in JavaScript for GPT-4 or GPT-3.5

In JavaScript, the easiest way to tokenize text for the GPT-4 or GPT-3.5 models is with the js-tiktoken library. It works in all major JS environments, including Node.js, the browser, and edge runtimes such as Vercel and Cloudflare Workers.

First, install the package:

npm install js-tiktoken

Next, you can import the package and tokenize a string:

import { encodingForModel } from "js-tiktoken";

// GPT-4 and GPT-3.5 both map to the cl100k_base encoding
const encoder = await encodingForModel("gpt-4");
// encode() returns an array of token IDs; its length is the token count
const tokens = encoder.encode("My text to tokenize");
console.log(tokens.length);

How to Tokenize in Python for GPT-4 or GPT-3.5

You can use OpenAI's official Python tokenizer library, tiktoken, which supports the cl100k_base encoding used by GPT-4 and GPT-3.5.
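A minimal sketch with tiktoken (install it with pip install tiktoken; encoding_for_model resolves GPT-4 and GPT-3.5 to cl100k_base):

import tiktoken

# encoding_for_model returns the encoding a given model uses
encoding = tiktoken.encoding_for_model("gpt-4")

# encode() returns a list of token IDs; its length is the token count
tokens = encoding.encode("My text to tokenize")
print(len(tokens))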

Try KoalaChat

Need more advanced AI capabilities? KoalaChat offers GPT-4o, real-time data, and more - all with a generous free tier.


Try KoalaWriter

Need help with longer content? KoalaWriter helps you create blog posts, articles, and more with AI assistance.
