Text Vectorization

Vuk Dukic
Founder, Senior Software Engineer
October 9, 2023

Introduction

Machine Learning (ML) models are used for making predictions. Predictions could be about the weather, whether a user will click on an ad, movie, or song, the answer to a question, and so on. To make a prediction, the model needs to be provided with input data that contains information it can use.

The way input data is presented to a model is quite critical and can determine how easy it is for the model to extract information from it. LLMs are no different; today we'll dive into how we need to present input data to them.

Text Vectorization: Converting text to numbers

On receiving input, ML models perform a series of operations, like multiplications, to produce a numerical output that is translated into a prediction. In the world of LLMs the model is provided with a prompt made up of text; however, running the mathematical operations associated with the internal workings of an LLM requires converting that text to numerical values.

The conversion of text into numerical values is called text vectorization. A vector is a sequence of numbers and is analogous to an array of numbers in the context of programming. When dealing with ML libraries it’s common for arrays of numbers to be converted to vector objects, since they make mathematical operations run more efficiently. For example, in numpy you'd turn a plain array of numbers into a vector like this:

import numpy as np

# Array of numbers
a = [1,2,3]

# Convert array to a vector
vector_a = np.asarray(a)

Tokenization

Tokenization is the process of breaking a piece of text into units called tokens. Depending on the methodology, tokens can be individual characters, words, or groups of characters that make up words, often called subwords. A tokenizer is the algorithm responsible for tokenizing text.

The simplest tokenizer one can imagine (for English) is one that splits a document at every space or punctuation character.

import re

def tokenize_document(document):
    # Regular expression pattern that matches runs of
    # whitespace and punctuation, which we treat as separators
    pattern = r"[\s.,;!?()]+"

    # Use re.split to break the document at each separator
    tokens = re.split(pattern, document)

    # Remove empty tokens left over at the boundaries
    tokens = [token for token in tokens if token]

    return tokens

text = 'sample sentence. It contains punctuation!'

tokenize_document(text)

>>> ['sample', 'sentence', 'It', 'contains', 'punctuation'] 

Building a vocabulary

A vocabulary is the set of all tokens that an ML model would be able to recognize. The English language has around 170k words. Imagine how huge this number would be for a multilingual use case.

We need to put a cap on the size of our vocabulary to ensure computational efficiency. The vocabulary size is often capped by counting the frequency of tokens in a huge corpus of text and choosing the top-k tokens, where k corresponds to the vocabulary size.

from collections import Counter

def build_top_k_vocab(corpus, k):
    # Initialize a Counter to count token frequencies
    token_counter = Counter()

    # Tokenize and count tokens in each document
    for document in corpus:
        tokens = tokenize_document(document)
        token_counter.update(tokens)

    # Get the top k tokens by frequency
    top_k_tokens = [token for token, _ in 
                   token_counter.most_common(k)]

    return set(top_k_tokens)

# Example usage:
corpus = [
    "This is a sample sentence with some words.",
    "Another sample sentence with some repeating words.",
    "And yet another sentence to build the vocabulary.",
]

build_top_k_vocab(corpus, k=5)
>>> {'sample', 'sentence', 'some', 'with', 'words'}

Converting Tokens to Numerical Values

Now that we have a vocabulary, we assign an id to each token in it. This can be done through simple enumeration, maintaining a map/dictionary between the ids and the corresponding tokens. The id map will help us keep track of which tokens are present in a document.
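
For instance, continuing with the toy corpus and build_top_k_vocab from above, a minimal id map could be built like this (token_to_id and id_to_token are just illustrative names):

vocab = build_top_k_vocab(corpus, k=5)

# Assign an integer id to each token via simple enumeration
# (sorting first makes the mapping deterministic)
token_to_id = {token: i for i, token in enumerate(sorted(vocab))}

# Reverse map from ids back to tokens
id_to_token = {i: token for token, i in token_to_id.items()}

token_to_id
>>> {'sample': 0, 'sentence': 1, 'some': 2, 'with': 3, 'words': 4}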

The identified tokens now have to be translated into features that'll help an ML model extract information. While a single token can be represented as a scalar or a vector, a document is always represented as a vector built from the representations of the tokens it is made up of.

Some of the common ways to encode a document into features are:

  1. Binary Document-Term Vector: Each document is represented as a vector whose size is equal to the size of the vocabulary. The id of each token in the vocabulary corresponds to an index position in the vector. A value of 1 is assigned to an index position when the corresponding token is present in the document, and 0 when it’s absent (see the code sketch after this list, which also covers approaches 2 and 4).

  2. Bag of Words (BoW): Similar to approach 1, but each index position holds the frequency of the corresponding token in the document.

  3. N-gram vectors: This approach extends the previous ones by allowing index positions that correspond to bi-grams, tri-grams, etc. (sequences of two, three, or more adjacent tokens).

  4. Tf-idf: Similar to BoW, but instead of just the frequency, a token is assigned a value based on its frequency in a document and how many unique documents in the corpus the token occurs in. The intuition is that tokens that are rare in general but appear many times in a specific document are important, while tokens that are plentiful across all documents (like "and", "an", "the", "it", etc.) are not.

  5. Embeddings: This approach is used in most deep neural networks, including LLMs. Each id is mapped to a unique n-dimensional vector called an embedding. The advantage of this approach is that rather than relying on a single hand-crafted feature per token such as the ones mentioned above, an LLM can learn a richer high-dimensional feature via backpropagation. An embedding is meant to capture the meaning of a token or the context in which it occurs (a minimal lookup sketch follows below).
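
To make approaches 1, 2, and 4 concrete, here is a minimal sketch that continues with the toy corpus, tokenize_document, and token_to_id from above (the function names and the log(N / df) idf formula are just one common, illustrative choice; the same pattern extends to n-grams by counting adjacent pairs or triples of tokens instead of single tokens):

import math

def vectorize_binary(document, token_to_id):
    # 1 if the token appears in the document, 0 otherwise
    vec = [0] * len(token_to_id)
    for token in tokenize_document(document):
        if token in token_to_id:
            vec[token_to_id[token]] = 1
    return vec

def vectorize_bow(document, token_to_id):
    # Each index position holds the token's frequency in the document
    vec = [0] * len(token_to_id)
    for token in tokenize_document(document):
        if token in token_to_id:
            vec[token_to_id[token]] += 1
    return vec

def vectorize_tfidf(document, corpus, token_to_id):
    # Term frequency weighted by log(N / document frequency)
    tf = vectorize_bow(document, token_to_id)
    vec = [0.0] * len(token_to_id)
    for token, idx in token_to_id.items():
        df = sum(1 for doc in corpus
                 if token in tokenize_document(doc))
        if df:
            vec[idx] = tf[idx] * math.log(len(corpus) / df)
    return vec

doc = "A sample sentence about a sample vocabulary."

vectorize_binary(doc, token_to_id)
>>> [1, 1, 0, 0, 0]

vectorize_bow(doc, token_to_id)
>>> [2, 1, 0, 0, 0]

[round(v, 2) for v in vectorize_tfidf(doc, corpus, token_to_id)]
>>> [0.81, 0.0, 0.0, 0.0, 0.0]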

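Finally, a minimal sketch of approach 5. Here the embedding table is simply randomly initialized with numpy for illustration; in a real LLM its values are learned during training via backpropagation, and the embedding dimension is far larger than 4:

import numpy as np

vocab_size = len(token_to_id)
embedding_dim = 4  # real models use hundreds or thousands of dimensions

# Embedding table of shape (vocab_size, embedding_dim); each row is
# the embedding of the token with that id
embedding_table = np.random.randn(vocab_size, embedding_dim)

def embed_document(document, token_to_id, embedding_table):
    # Map each in-vocabulary token to its id, then look up its row;
    # the document becomes a sequence (matrix) of token embeddings
    ids = [token_to_id[token]
           for token in tokenize_document(document)
           if token in token_to_id]
    return embedding_table[ids]

embed_document("A sample sentence", token_to_id, embedding_table).shape
>>> (2, 4)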