Changyu Lee

What is Vector Similarity Search? : 1. Introduction

Published at
2024/02/26
Last edited time
2024/02/26 00:18
Created
2024/02/18 19:01
Section
Prompt Engineering
Status
Done
Series
Tags
Theory
AI summary
Keywords
Vector Similarity Search
LLM
Language

Vector Similarity Search

Vector Similarity Search refers to the process of finding vectors (arrays of numbers) that are closest to a given query vector in a vector space.
This process is fundamental in various applications such as recommendation systems, search engines, image retrieval, and natural language processing tasks. The idea is to represent data (like text, images, or user preferences) as vectors in a high-dimensional space where the similarity between these data points can be calculated based on their vector representations.
The similarity between vectors is often measured using metrics such as Euclidean distance, cosine similarity, or Manhattan distance, among others. The choice of metric depends on the specific application and the nature of the data.
Euclidean Distance measures the straight-line distance between two points in Euclidean space, which is suitable for numerical data.
Cosine Similarity measures the cosine of the angle between two vectors. It is particularly useful for text data represented in vector space models (such as TF-IDF or word embeddings) because it captures the orientation of the vectors rather than their magnitude, so two documents pointing in the same direction score as similar regardless of their length.
Manhattan Distance (or L1 distance) measures the sum of the absolute differences of their coordinates, which can be useful for certain types of data distributions or applications where the grid-like distance matters.
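The three metrics above can be sketched in plain Python (a minimal illustration; production systems typically use vectorized libraries such as NumPy):

```python
import math

def euclidean_distance(a, b):
    # Straight-line (L2) distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences (L1 distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors: direction, not magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

q = [1.0, 2.0, 3.0]
d = [2.0, 4.0, 6.0]  # same direction as q, twice the magnitude
print(euclidean_distance(q, d))  # 3.741...
print(manhattan_distance(q, d))  # 6.0
print(cosine_similarity(q, d))   # 1.0 (identical direction)
```

Note how cosine similarity reports the two vectors as identical (1.0) even though both distance metrics report them as far apart, which is exactly the magnitude-insensitivity described above.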
Vector Similarity Search can be computationally expensive, especially as the size of the dataset grows. To address this, various indexing and search optimization techniques such as k-d trees, locality-sensitive hashing (LSH), and approximate nearest neighbor (ANN) algorithms are employed to speed up the search process without significantly compromising accuracy.
These techniques enable efficient retrieval of high-dimensional data by approximating the nearest neighbors to a query vector, making Vector Similarity Search a critical component in handling and making sense of large datasets in machine learning and data science.
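As a toy illustration of one such technique, random-hyperplane LSH buckets vectors by the sign pattern of a few random projections, so a query is only compared against the vectors in its own bucket. This is a hedged sketch, not a production implementation; real systems use tuned libraries such as FAISS or Annoy:

```python
import random
from collections import defaultdict

random.seed(0)
DIM, N_PLANES = 8, 4

# Random hyperplanes; each vector is hashed by the signs of its projections.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_PLANES)]

def lsh_key(v):
    # One bit per hyperplane: which side of the plane the vector falls on.
    return tuple(1 if sum(p, 0) >= 0 else 0 for p in
                 (sum(pl * x for pl, x in zip(plane, v)) for plane in planes)) if False else tuple(
        1 if sum(pl * x for pl, x in zip(plane, v)) >= 0 else 0 for plane in planes)

# Index a small random dataset into hash buckets.
data = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(100)]
buckets = defaultdict(list)
for i, v in enumerate(data):
    buckets[lsh_key(v)].append(i)

query = data[0]
candidates = buckets[lsh_key(query)]  # only this bucket is searched
print(len(candidates), "candidates instead of", len(data))
```

Nearby vectors tend to fall on the same side of most hyperplanes and thus share a bucket, which is how LSH trades a small amount of accuracy for a large reduction in comparisons.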

Prob. to Solve

1.
Curse of Dimensionality
2.
Ineffective Keyword-Based Search
3.
Scalability
4.
Searching Unstructured or Semi-Structured Data

1. Curse of Dimensionality

The curse of dimensionality is a phenomenon that occurs in high-dimensional spaces, where the number of dimensions is large. In the context of vector similarity search, the curse of dimensionality refers to the difficulty of accurately comparing vectors in high-dimensional spaces.
As the number of dimensions increases, the volume of the space grows exponentially, making it increasingly difficult to find similar vectors. This is because distances tend to concentrate in high-dimensional spaces: the nearest and farthest neighbors of a query become almost equally far away, so the concept of "closeness" becomes less meaningful.
There are several reasons why the curse of dimensionality affects vector similarity search:
1.
Increased computational complexity: each distance computation costs more as the number of dimensions grows, and exhaustive search must compare the query against every vector in the dataset, so the total cost grows quickly with both dimensionality and dataset size.
2.
Reduced accuracy: In high-dimensional spaces, vectors can be very close to each other in terms of distance, but have very different characteristics. This makes it difficult to accurately identify similar vectors.
3.
Data sparsity: a fixed number of data points occupies a vanishing fraction of a high-dimensional space, so the data becomes sparse relative to the space it lives in, making it harder to identify patterns and similarities.
4.
Overfitting: In machine learning, high-dimensional spaces can lead to overfitting, where models are too complex and fit the training data too closely, but perform poorly on new data.
To mitigate the curse of dimensionality in vector similarity search, several techniques can be used, such as dimensionality reduction, feature selection, and similarity metrics that are specifically designed for high-dimensional spaces.
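The distance-concentration effect can be observed directly: for random points, the gap between the nearest and farthest neighbor shrinks, relative to the distances themselves, as dimensionality grows. A small illustrative experiment in plain Python:

```python
import math
import random

random.seed(42)

def contrast(dim, n=200):
    # Relative spread (max - min) / min of distances from a random query
    # to n random points; small values mean "near" and "far" blur together.
    query = [random.random() for _ in range(dim)]
    dists = []
    for _ in range(n):
        p = [random.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p))))
    return (max(dists) - min(dists)) / min(dists)

print(contrast(2))     # large relative spread in low dimensions
print(contrast(1000))  # much smaller spread in high dimensions
```

In two dimensions the nearest point is many times closer than the farthest; in a thousand dimensions all points sit at nearly the same distance from the query, which is why naive distance-based retrieval degrades.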

2. Ineffective Keyword-Based Search

Keyword-based search relies on exact matches or closely matching terms between the query and the document or item. This approach often fails to capture the true intent or semantic meaning behind a user's query, especially when synonyms, context, or the nuanced meaning of words come into play.
Vector similarity search addresses this by representing queries and documents as high-dimensional vectors in a semantic space, where the distance between vectors indicates their semantic similarity. This allows the search system to understand and retrieve items that are contextually related to the query, even if they do not share common keywords. It effectively captures synonyms, related terms, and the nuanced meaning of words, providing more relevant and accurate results.
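A toy example of the difference, using hand-made three-dimensional vectors as stand-ins for the embeddings a real model (such as a sentence encoder) would produce:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical embeddings; dimensions loosely mean (vehicle, animal, finance).
docs = {
    "automobile repair": [0.9, 0.1, 0.0],
    "dog grooming":      [0.0, 0.9, 0.1],
}
query_text = "car maintenance"   # shares no keywords with either document
query_vec = [0.85, 0.05, 0.05]   # but embeds near "automobile repair"

# Keyword search finds nothing: no term overlap at all.
keyword_hits = [d for d in docs if set(d.split()) & set(query_text.split())]
print(keyword_hits)  # []

# Vector search still ranks the semantically related document first.
best = max(docs, key=lambda d: cosine(query_vec, docs[d]))
print(best)  # automobile repair
```

The keyword matcher returns nothing for "car maintenance", while the vector comparison correctly surfaces "automobile repair" because their embeddings point in the same direction.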

3. Scalability

As the volume of data grows, traditional search engines that rely on keyword matching and inverted index structures can struggle to maintain performance without significant computational resources.
Vector similarity search scales more effectively with large datasets because it can leverage efficient indexing and search algorithms designed for high-dimensional spaces, such as approximate nearest neighbor (ANN) search. These algorithms, combined with modern hardware accelerations (e.g., GPUs), allow for rapid querying over massive datasets with minimal loss in accuracy, making vector search both scalable and efficient.

4. Searching Unstructured or Semi-Structured Data

Unstructured data (e.g., text, images, audio) and semi-structured data (e.g., JSON, XML documents) pose significant challenges for traditional search engines, as they often lack a clear schema or structure that can be easily indexed or queried using keywords.
Vector similarity search excels in handling unstructured and semi-structured data by converting these data types into vector representations. Natural Language Processing (NLP) models can be used to generate vectors for textual data, while other deep learning models can handle images, audio, and more. This conversion process allows the inherent information and relationships within the data to be captured semantically, enabling effective search capabilities across diverse data types.
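As a minimal stand-in for a learned embedding model, the data-to-vector step can be illustrated with a simple bag-of-words scheme (a real system would use an NLP model; this only shows how unstructured text becomes a fixed-length vector that similarity metrics can operate on):

```python
from collections import Counter

def bow_vector(text, vocab):
    # Count occurrences of each vocabulary word in the text.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

corpus = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock prices rose today",
]

# Vocabulary = sorted set of all words; one vector dimension per word.
vocab = sorted({w for doc in corpus for w in doc.lower().split()})
vectors = [bow_vector(doc, vocab) for doc in corpus]
print(vocab)
print(vectors[0])
```

Each document, regardless of its original structure, becomes a point in the same vector space, after which any of the distance metrics from the introduction can be applied.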
