MySQL and PostgreSQL for Advanced Full-Text Search.

This article introduces the advantages of full-text search, focusing on MySQL and PostgreSQL. It details the features, indexing, and search approaches of both systems, and touches on how to handle PostgreSQL B-tree index row size errors.

Dmitry Narizhnykh

Feb 8, 2024 • 6 min read

This article explores various search methods within databases, going beyond conventional approaches. As data expands, traditional methods encounter challenges, prompting an exploration of the advantages of full-text search. Concentrating on MySQL and PostgreSQL, we'll delve into the techniques employed by both database systems, aiding in selecting the most suitable one for particular projects.

`LIKE` condition

Searching for data in a relational database management system (RDBMS) often begins with the most straightforward approach: using a `LIKE` condition. The `LIKE` condition compares a string against a pattern in which a percent sign (%) matches any sequence of characters and an underscore (_) matches exactly one character. For instance:

SELECT *
FROM articles
WHERE title LIKE '%tech%';

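This query fetches records where the title contains "tech". However, as the application grows and accumulates substantial data (ranging from hundreds of thousands to millions of rows), relying solely on `LIKE` conditions can result in slow queries. A pattern that starts with a wildcard cannot use an ordinary B-tree index, so the database has to scan every row and test each value.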
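The slowdown can be observed with EXPLAIN. The following is a minimal sketch, assuming an articles table with a title column; the index name idx_articles_title is hypothetical, and the plan you see depends on the database, its version, and the data.

```sql
-- Hypothetical B-tree index on the title column.
CREATE INDEX idx_articles_title ON articles (title);

-- Prefix pattern: an ordinary index on title can usually narrow the search.
-- (In PostgreSQL this also depends on the collation or on using the
-- text_pattern_ops operator class.)
EXPLAIN SELECT * FROM articles WHERE title LIKE 'tech%';

-- Leading wildcard: the B-tree index cannot narrow the search,
-- so expect the plan to report a full table scan.
EXPLAIN SELECT * FROM articles WHERE title LIKE '%tech%';
```

Comparing the two plans on a realistically sized table is a quick way to check whether a given `LIKE` query is a candidate for a full-text or trigram index.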

Full-Text Search Capabilities and Advantages over `LIKE` Condition

Full-text search is a powerful feature offering distinct capabilities that make it superior to a regular LIKE condition.

Precision and Recall: Full-text search balances precision and recall better than the LIKE operator. LIKE offers 100% precision because it returns only rows that contain the exact character sequence, but it can lack recall, meaning it misses relevant results such as other forms of the same word. In contrast, full-text search allows for more flexibility, offering a trade-off between precision and recall based on configured parameters.
Speed and Performance: In MySQL, full-text searching with MATCH() ... AGAINST() on an indexed column is typically much faster than LIKE with a leading wildcard, and the gap widens as the dataset grows. The performance of LIKE degrades as data grows, making full-text search a more efficient solution for large datasets.
Flexibility in Querying: Full-text search allows for more advanced query capabilities. It supports features like phrase, proximity, and linguistic search, enabling more nuanced and context-aware queries than the basic pattern matching of LIKE.
Indexing for Improved Performance: Full-text search often utilizes specialized indexes, like inverted indexes, to optimize search speed. This indexing strategy enhances performance by efficiently narrowing the search space, making it well-suited for scenarios where speed is crucial.

Introduction to MySQL Full-Text Index

A MySQL Full-Text Index is a specialized index designed to enhance the efficiency of text searches within a database. Here are key aspects of MySQL Full-Text Index:

The full-text search index in MySQL is denoted by the type FULLTEXT.
Full-text indexes are supported by the MyISAM storage engine and, starting from MySQL 5.6, by InnoDB as well.
Full-text indexes can be applied to columns with data types such as CHAR, VARCHAR, and TEXT, allowing for comprehensive text search capabilities.

Index Creation:

MySQL offers flexibility in creating full-text indexes. You can include the FULLTEXT index in the CREATE TABLE statement during the table's creation.

CREATE TABLE articles (
	id INT AUTO_INCREMENT PRIMARY KEY,
	title VARCHAR(255) NOT NULL,
	body TEXT,
	FULLTEXT KEY (title, body)
);

create table in MySQL

Alternatively, you can add a FULLTEXT index to an existing table using ALTER TABLE or CREATE FULLTEXT INDEX statements.
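Both statements can be sketched as follows, assuming the articles table from above was created without the FULLTEXT key. The index name idx_articles_fulltext is a hypothetical choice.

```sql
-- Option 1: add the index with ALTER TABLE.
ALTER TABLE articles
  ADD FULLTEXT INDEX idx_articles_fulltext (title, body);

-- Option 2: the equivalent CREATE FULLTEXT INDEX form.
-- Run only one of the two options; the second would fail
-- because the index name is already taken.
CREATE FULLTEXT INDEX idx_articles_fulltext
  ON articles (title, body);
```

The column list matters: a later MATCH() call has to name the same set of columns that the index covers.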

Full-Text Search:

MySQL offers dedicated full-text search functions like MATCH() and AGAINST() to search for specific terms within columns covered by the full-text indexes.
Example of a full-text search to find articles containing the word 'database':

SELECT id, title, body
FROM articles
WHERE
MATCH (title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE);
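Two common variations are sketched below, assuming the FULLTEXT index on (title, body) exists. The search terms are illustrative only.

```sql
-- Relevance ranking: MATCH ... AGAINST in the select list returns
-- a relevance score, which can be used for ordering.
SELECT id, title,
       MATCH (title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE) AS score
FROM articles
WHERE MATCH (title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE)
ORDER BY score DESC;

-- Boolean mode: + marks a required word, - marks an excluded word,
-- and double quotes mark a phrase (optional here, it only raises the rank).
SELECT id, title
FROM articles
WHERE MATCH (title, body)
      AGAINST ('+database -oracle "full text"' IN BOOLEAN MODE);
```

Natural language mode suits free-form user input, while boolean mode gives the caller explicit control over which words must or must not appear.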

PostgreSQL Full-Text Search Approaches:

In PostgreSQL, there are two primary approaches to implementing full-text search. First, let's create an 'articles' table similar to the one we created for MySQL.

CREATE TABLE articles (
	id SERIAL PRIMARY KEY,
	title VARCHAR(255) NOT NULL,
	body TEXT
);

create table in PostgreSQL

Indexing tsvector columns for Postgres full-text search.

According to the Postgres documentation, the tsvector GIN index (generalized inverted index) is designed around lexemes:

"GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows lacking additional words."
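To see what a lexeme is in practice, the conversion can be run directly on a literal string. This is a small sketch using the built-in 'english' configuration; the sample sentence is illustrative.

```sql
-- Normalize a sentence into lexemes with their positions.
-- Stop words such as 'a' and 'on' are dropped from the result.
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat');

-- The query term is normalized the same way, so the plural 'rats'
-- matches the lexeme 'rat' in the document.
SELECT to_tsvector('english', 'a fat cat sat on a mat and ate a fat rat')
       @@ to_tsquery('english', 'rats') AS matches;
```

This normalization is what lets full-text search find word variants that a `LIKE` pattern would miss.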

GIN indexes are the optimal initial choice for implementing Postgres Full Text Search. An illustration of setting up a full-text search index is as follows:

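-- Adding a full-text index over title and body.
-- coalesce() keeps a NULL body from turning the whole document into NULL.
CREATE INDEX idx_articles_document
ON articles
USING gin(to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, '')));

Create full-text search index in postgres

When searching for articles whose title or body contains "database" in PostgreSQL, you can use the following query:

SELECT * FROM articles
WHERE to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
      @@ to_tsquery('english', 'database');

This query leverages the PostgreSQL full-text search capabilities. It uses the to_tsvector function to convert the title and body into a tsvector, then checks whether it matches the tsquery generated from the search term 'database' by the to_tsquery function. The expression in the WHERE clause has to be the same as the one in the index definition, including the 'english' configuration; otherwise the planner cannot use the index and falls back to scanning the table.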
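An alternative to indexing an expression is to store the tsvector in its own column, so queries do not have to repeat the expression. The following is a sketch, assuming PostgreSQL 12 or later (for stored generated columns); the column name document and the index name idx_articles_document_col are hypothetical.

```sql
-- Keep a tsvector column in sync with title and body automatically.
ALTER TABLE articles
  ADD COLUMN document tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  ) STORED;

-- Index the stored column directly.
CREATE INDEX idx_articles_document_col
  ON articles USING gin (document);

-- Search with web-style query syntax and order by relevance.
SELECT id, title, ts_rank(document, query) AS rank
FROM articles,
     websearch_to_tsquery('english', 'database index') AS query
WHERE document @@ query
ORDER BY rank DESC
LIMIT 10;
```

The stored column uses extra disk space, but it avoids recomputing to_tsvector when rows are rechecked or ranked.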

Trigram Approach for PostgreSQL Full Text Search

Trigram indexes in PostgreSQL offer a complementary approach that supports substring and similarity matching. These indexes operate by dissecting text into trigrams, which are sequences of three consecutive characters.

To implement the Trigram indexes, you can follow these steps:

1. Enable the pg_trgm extension if not already enabled:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

This extension introduces functions, operators, and trigram index support (specifically using GIN or GiST indexes). The following query demonstrates how to discover the trigrams generated by PostgreSQL:

SELECT show_trgm('hello');

The result of this query displays the trigrams for the word 'hello':

show_trgm
---------------------------------
{"  h"," he",ell,hel,llo,"lo "}

2. Create a GIN index using the Trigram operator class on the desired column of the "articles" table:

CREATE INDEX IDX_articles_data_index
ON articles
USING GIN (title gin_trgm_ops, body gin_trgm_ops);

Postgres will split the row values into trigrams, allowing indexed searches:

SELECT * FROM articles WHERE title LIKE '%base';

This approach improves text similarity searches, enabling more flexible and effective pattern matching within the specified column.

The effectiveness of this method varies with the specific dataset. However, when it proves effective, it can significantly accelerate searches on arbitrary text data.
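Beyond `LIKE`, the same index can serve similarity searches that tolerate typos. This is a sketch, assuming the pg_trgm extension and the trigram index from the steps above; the misspelled search term is illustrative.

```sql
-- % is the pg_trgm similarity operator. It is true when the similarity
-- of the two strings exceeds pg_trgm.similarity_threshold (0.3 by default).
SELECT id, title, similarity(title, 'databse') AS score
FROM articles
WHERE title % 'databse'
ORDER BY score DESC
LIMIT 10;

-- Case-insensitive substring search can also use the trigram index.
SELECT id, title
FROM articles
WHERE title ILIKE '%base%';
```

Note that similarity() compares the search term with the whole column value, so long titles tend to score lower than short ones; the threshold may need tuning for a particular dataset.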

tsvector Indexes vs. Trigram Indexes

When deciding between tsvector indexes and trigram indexes in PostgreSQL, it's essential to understand their distinctions.

Trigram indexes are great for implementing text search, especially for substring searches and matching similar patterns. They work by breaking up words into consecutive triplets of characters, making them ideal for "fuzzy" matching. However, they don't understand text semantics such as word stems or stop words, and ranking results across large documents is harder with them.

On the other hand, tsvector indexes are designed for full-text search operations, making them ideal for complex linguistic search scenarios and natural language queries.

	tsvector Indexes (Full Text Search)	Trigram Indexes
Functionality	Designed for full-text search operations	Works by breaking text into trigrams (three consecutive characters)
Usage	Ideal for complex linguistic search scenarios, supporting natural language queries	Effective for substring searches and matching similar patterns
Index Type	GIN (Generalized Inverted Index) or GiST (Generalized Search Tree) indexes are common	GIN or GiST indexes using the gin_trgm_ops or gist_trgm_ops operator class from the pg_trgm extension

Conclusion.

Effective database search methods extend beyond traditional LIKE conditions, especially as datasets expand. MySQL's Full-Text Index proves advantageous with its superior speed, performance, and flexibility in querying, making it a robust choice for efficient text searches across various data types.

PostgreSQL offers two primary full-text search approaches — tsvector indexes and trigram indexes — each catering to specific needs. Tsvector indexes excel in complex linguistic searches, while trigram indexes optimize substring searches and fuzzy matching.

Choosing between these methods depends on project requirements and data characteristics. Embracing advanced search techniques becomes crucial for maintaining efficient and responsive database operations as data grows. Whether opting for the linguistic awareness of full-text search or the substring matching capability of trigram indexes, informed decisions empower developers to meet the unique needs of their projects.
