Introduction
PostgreSQL has support for full text search. The basic idea is to create a column of type tsvector
and then run full text queries (represented as tsquery
strings) against it using the @@
operator. This is different from LIKE
queries using %string%
patterns, since full text search is language aware and can provide things like ranking. As an example,
SELECT 'I am satisfied with postgresql' LIKE '%satisfied%' as found;
will return
found
-------
t
However, if we use '%satisfy%'
as the query string, it will fail, since LIKE does plain substring matching.
SELECT 'I am satisfied with postgresql' LIKE '%satisfy%' as found;
found
-------
f
If we use a full text query, however, we get a match.
SELECT to_tsvector('I am satisfied with postgresql') @@ to_tsquery('satisfy') as found;
found
-------
t
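The reason the full text query matches is that both "satisfied" and "satisfy" are reduced to the same lexeme before comparison. Here's a toy sketch of the idea; this naive suffix-stripper is only an illustration, not the Snowball stemmer PostgreSQL actually uses.

```python
# Toy illustration of stemming: both words reduce to a common lexeme,
# which is why the tsquery matches where LIKE fails.
# This naive suffix-stripper only approximates PostgreSQL's real stemmer.
def toy_stem(word):
    for suffix in ("ied", "ies", "ed", "es", "y", "s"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(toy_stem("satisfied"))  # satisf
print(toy_stem("satisfy"))    # satisf
```

Both inputs reduce to the lexeme "satisf", so a comparison at the lexeme level succeeds.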
SQLAlchemy is the Python database toolkit of choice. It supports most of the popular databases out there and has dialect specific features for MySQL, PostgreSQL, etc. The tsvector type is not natively supported. It's not very hard to add support for this but I couldn't find a single reference that helps me do it. After some trial and error, I managed to get this to work (at least for my purposes) so I'm going to write out what I did. I'll put links to the articles, posts and other materials on the net which helped me get this to work.
Feedback is welcome as are suggestions on how to get this fully feature complete. If sufficiently done, I’ll contribute this as a patch back to SQLAlchemy.
The table we want
First, I create a table like so
CREATE TABLE example (
name VARCHAR(10),
details TEXT
);
And then, I insert 1000 rows into this using
COPY example FROM '/home/noufal/projects/scratch/sa/items.csv' (FORMAT csv);
Now, I have a database to play with. I can do full text searches like so
SELECT * FROM example where to_tsvector('english', details) @@ to_tsquery('life') limit 3;
name | details
---------+-------------------------------------------------------------------------------------------------------------------------------------------------
item-20 | Life is a grand adventure -- or it is nothing. -- Helen Keller
item-46 | Life is a gamble at terrible odds; if it was a bet you wouldn't take it. -- Tom Stoppard; Rosencrantz and Guildenstern are Dead
item-63 | Life is like a 10 speed bicycle. Most of us have gears we never use. -- C. Schultz
The 'english'
in the to_tsvector
is optional. If I skip it, it'll use the database's default text search configuration.
Full text items that can be searched are referred to as documents
in PostgreSQL. Now, I create an extra column that holds the document to be searched like so.
ALTER TABLE example ADD COLUMN details_tsvector TSVECTOR;
Then, I run an update on the table that creates the tsvector documents and puts them into this column like so
UPDATE example SET details_tsvector = to_tsvector(details);
Now, if I query the new column, I can see the actual tsvector documents.
SELECT name, details_tsvector FROM example limit 3;
name | details_tsvector
--------+---------------------------------------------------------
item-1 | 'bathroom':7 'left':2 'wallet':4
item-2 | 'bit':7 'could':11 'difficult':3 'effort':9 'imposs':13
item-3 | 'avoid':3 'hedg':4 'least':6 'think':11
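Each document above is a set of lexeme:position pairs. As a quick sketch, the textual form psql prints can be picked apart like this (illustrative only; real tsvectors may carry multiple positions and weights per lexeme).

```python
# Parse the textual tsvector form shown by psql into a dict of
# lexeme -> position. Real tsvectors can have several positions and
# optional weights per lexeme; this sketch handles only the simple case.
def parse_tsvector(text):
    out = {}
    for entry in text.split():
        lexeme, _, pos = entry.partition(':')
        out[lexeme.strip("'")] = int(pos)
    return out

doc = parse_tsvector("'bathroom':7 'left':2 'wallet':4")
print(doc)  # {'bathroom': 7, 'left': 2, 'wallet': 4}
```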
I can search the table like so
SELECT name, details FROM example WHERE details_tsvector @@ to_tsquery('life') limit 3;
name | details
---------+-------------------------------------------------------------------------------------------------------------------------------------------------
item-20 | Life is a grand adventure -- or it is nothing. -- Helen Keller
item-46 | Life is a gamble at terrible odds; if it was a bet you wouldn't take it. -- Tom Stoppard; Rosencrantz and Guildenstern are Dead
item-63 | Life is like a 10 speed bicycle. Most of us have gears we never use. -- C. Schultz
This, of course, is faster than the other approach since the documents have already been generated. However, running EXPLAIN shows that the query still performs a sequential scan.
EXPLAIN SELECT name, details FROM example WHERE details_tsvector @@ to_tsquery('life');
QUERY PLAN
----------------------------------------------------------
Seq Scan on example (cost=0.00..53.30 rows=26 width=92)
Filter: (details_tsvector @@ to_tsquery('life'::text))
One advantage of tsvector columns (over LIKE
queries) is that they can be indexed. If I add an index like so,
CREATE INDEX details_idx ON example USING gin(details_tsvector);
The output of the EXPLAIN command changes.
EXPLAIN SELECT name, details FROM example WHERE details_tsvector @@ to_tsquery('life');
QUERY PLAN
----------------------------------------------------------------------------
Bitmap Heap Scan on example (cost=12.20..50.17 rows=26 width=92)
Recheck Cond: (details_tsvector @@ to_tsquery('life'::text))
-> Bitmap Index Scan on details_idx (cost=0.00..12.20 rows=26 width=0)
Index Cond: (details_tsvector @@ to_tsquery('life'::text))
This bitmap index scan will be quicker than the sequential scan, especially on large tables.
One problem with this is that the details_tsvector
column is not automatically updated when new rows are inserted. I can fix this by creating a trigger that automatically computes and stores the value when a row is INSERTed or UPDATEd.
CREATE TRIGGER details_tsvector_update BEFORE INSERT OR UPDATE
ON example
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger('details_tsvector', 'pg_catalog.english', 'details');
Now, if I run an INSERT
statement, the details_tsvector
will get updated.
The table definition looks like this
Table "public.example"
Column | Type | Modifiers
------------------+-----------------------+-----------
name | character varying(10) |
details | text |
details_tsvector | tsvector |
Indexes:
"details_idx" gin (details_tsvector)
Triggers:
details_tsvector_update BEFORE INSERT OR UPDATE ON example FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('details_tsvector', 'pg_catalog.english', 'details')
This is what I need for my application but I need to do it using SQLAlchemy inside my application rather than in SQL. The rest of the article will cover this.
Scaffold
To play with the whole thing, I have a tiny program that takes command line arguments to run various database operations. It’s what I use to manually test the code. Here it is. There are a few imports which are not necessary at this point but which we’ll use later.
import subprocess
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, VARCHAR, create_engine, func, MetaData, Table, Index, event, DDL
from sqlalchemy.orm import sessionmaker
engine = create_engine('postgresql://noufal:abcdef@localhost/test', echo = True)
Base = declarative_base()
Session = sessionmaker(bind = engine)
session = Session()
class Example(Base):
    __tablename__ = 'example'

    name = Column(VARCHAR(10), primary_key = True)
    details = Column(String)

def create_tables():
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

if __name__ == '__main__':
    import sys
    for i in sys.argv[1:]:
        print "\n", "=================", i, "==================="
        dict(create = create_tables)[i]()
This should be familiar to you if you've used SQLAlchemy before. It simply defines a table in the new declarative format. One difference between this and our original setup is that the name
field is now a primary key and therefore carries a uniqueness constraint and a NOT NULL modifier. The file is called sample.py
If we run
python sample.py create
We’ll get the example
table. The schema is very plain now and doesn’t have the tsvector
type.
Creating the type
. This is mostly based on the postgis example in the SQLAlchemy source tree. We simply create a new type derived from UserDefinedType and give it a name. It has just one method, get_col_spec
.
class TsVector(UserDefinedType):
    "Holds a TsVector column"
    name = "TSVECTOR"

    def get_col_spec(self):
        return self.name
The get_col_spec
function is used by the expression compiler to decide what the name of the type will be in the DDL. Since it's called TSVECTOR
, that's what we return here. The core types, as far as I know, have their names coded directly into the compiler (e.g. the SQLAlchemy provided Boolean
type translates to the BOOLEAN
type in the DDL). For user defined types, the compiler explicitly calls the get_col_spec
function to get the type name. This is the bare minimum needed to create the table.
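As a sanity check, the type can be exercised without a database at all by compiling a CREATE TABLE statement against the postgresql dialect. This is a sketch: it assumes SQLAlchemy is installed, and the table name demo is arbitrary.

```python
# Compile a CREATE TABLE for a table that uses our custom type; the
# string returned by get_col_spec appears as the column's type in the
# DDL. No database connection is needed for this.
from sqlalchemy import Column, MetaData, Table
from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateTable
from sqlalchemy.types import UserDefinedType

class TsVector(UserDefinedType):
    "Holds a TsVector column"
    name = "TSVECTOR"

    def get_col_spec(self):
        return self.name

metadata = MetaData()
demo = Table('demo', metadata, Column('doc', TsVector))
ddl = str(CreateTable(demo).compile(dialect=postgresql.dialect()))
print(ddl)  # the column is emitted as "doc TSVECTOR"
```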
Creating the tables
First, we add the above snippet to our code and then we add a column of type TsVector
to our Example
class and run the script again. This time, we’ll get the table with the tsvector
column. Our code looks like this now.
import subprocess
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, VARCHAR, create_engine, func, MetaData, Table, Index, event, DDL
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import UserDefinedType
engine = create_engine('postgresql://noufal:abcdef@localhost/test', echo = True)
Base = declarative_base()
Session = sessionmaker(bind = engine)
session = Session()
class TsVector(UserDefinedType):
    "Holds a TsVector column"
    name = "TSVECTOR"

    def get_col_spec(self):
        return self.name

class Example(Base):
    __tablename__ = 'example'

    name = Column(VARCHAR(10), primary_key = True)
    details = Column(String)
    details_tsvector = Column(TsVector)

def create_tables():
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

if __name__ == '__main__':
    import sys
    for i in sys.argv[1:]:
        print "\n", "=================", i, "==================="
        dict(create = create_tables)[i]()
Add a few helper functions
I’m adding a new function called insert_data
like so
def insert_data():
    for i in range(1, 21):
        u = Example(name = "name-{}".format(i),
                    details = subprocess.Popen("/usr/games/fortune", stdout = subprocess.PIPE).stdout.read())
        session.add(u)
        print ".",
    session.commit()
And another called dump_data
like so
def dump_data():
    for i in session.query(Example):
        print "name: ", i.name
        print "details: \n------------\n", i.details
        print "details_tsvector: \n--------\n", i.details_tsvector
        print "=================================================="
and making them available from the command line using
dict(create = create_tables,
     insert = insert_data,
     dump = dump_data)[i]()
Now, I can run
python sample.py create insert
and I get a database with 20 rows in it. Running
python sample.py dump
will show me what's in there. The details_tsvector
column is always empty at this point since nothing populates it yet.
Adding an index for this column
As it stands, the details_tsvector
column doesn't have its own index. We can alter the table definition to add an index by adding
__table_args__ = (Index('details_tsvector_idx', 'details_tsvector', postgresql_using = 'gin'),)
to the Example
table.
__table_args__
allows you to add extra items like constraints to the table. It expects this to be a tuple or a dictionary, so we wrap our index in parentheses (along with the notoriously ugly trailing comma needed for single element tuples). The Index('details_tsvector_idx', 'details_tsvector', postgresql_using = 'gin')
line creates a GIN index called details_tsvector_idx
on the details_tsvector
column. The PostgreSQL dialect module interprets the postgresql_using
parameter to create a GIN index. Different index types have different tradeoffs and you should select one that works for you.
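The index definition, too, can be checked without a live database by compiling it against the postgresql dialect. This is a sketch assuming SQLAlchemy is installed; a plain TEXT column stands in for the tsvector one, since compilation doesn't validate column types.

```python
# Compile the CREATE INDEX statement to confirm postgresql_using='gin'
# ends up as a USING gin clause. No database connection is needed.
from sqlalchemy import Column, Index, MetaData, Table, Text
from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateIndex

metadata = MetaData()
example = Table('example', metadata, Column('details_tsvector', Text))
idx = Index('details_tsvector_idx', example.c.details_tsvector,
            postgresql_using='gin')
sql = str(CreateIndex(idx).compile(dialect=postgresql.dialect()))
print(sql)
```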
Automatically updating the column using a trigger
I need to automatically update this column when a new row is inserted into the table. We can accomplish this inside postgresql itself using a trigger. This is done like so.
trigger_snippet = DDL("""
CREATE TRIGGER details_tsvector_update BEFORE INSERT OR UPDATE
ON example
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger('details_tsvector', 'pg_catalog.english', 'details')
""")
event.listen(Example.__table__, 'after_create', trigger_snippet.execute_if(dialect = 'postgresql'))
This automatically updates the rows that were modified by an insert or update operation. tsvector_update_trigger
is a function provided by PostgreSQL that takes three arguments: the name of the column to update, the text search configuration to use for the conversion, and a list of columns to include in the document. It also takes care of NULL
columns.
I think it's also possible to do this using the SQLAlchemy event system but I'm not sufficiently familiar with it.
Adding elements now will automatically populate the index column.
The code now, looks like this.
import subprocess
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, VARCHAR, create_engine, func, MetaData, Table, Index, event, DDL
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import UserDefinedType
engine = create_engine('postgresql://noufal:abcdef@localhost/test', echo = True)
Base = declarative_base()
Session = sessionmaker(bind = engine)
session = Session()
class TsVector(UserDefinedType):
    "Holds a TsVector column"
    name = "TSVECTOR"

    def get_col_spec(self):
        return self.name

class Example(Base):
    __tablename__ = 'example'

    name = Column(VARCHAR(10), primary_key = True)
    details = Column(String)
    details_tsvector = Column(TsVector)

    __table_args__ = (Index('details_tsvector_idx', 'details_tsvector', postgresql_using = 'gin'),)

trigger_snippet = DDL("""
CREATE TRIGGER details_tsvector_update BEFORE INSERT OR UPDATE
ON example
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger('details_tsvector', 'pg_catalog.english', 'details')
""")

event.listen(Example.__table__, 'after_create', trigger_snippet.execute_if(dialect = 'postgresql'))

def create_tables():
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def insert_data():
    for i in range(1, 21):
        u = Example(name = "name-{}".format(i),
                    details = subprocess.Popen("/usr/games/fortune", stdout = subprocess.PIPE).stdout.read())
        session.add(u)
        print ".",
    session.commit()

def dump_data():
    for i in session.query(Example):
        print "name: ", i.name
        print "details: \n------------\n", i.details
        print "details_tsvector: \n--------\n", i.details_tsvector
        print "=================================================="

if __name__ == '__main__':
    import sys
    for i in sys.argv[1:]:
        print "\n", "=================", i, "==================="
        dict(create = create_tables,
             insert = insert_data,
             dump = dump_data)[i]()
Now, you can run python sample.py create
and get the whole thing done. The defined table looks like this.
Table "public.example"
Column | Type | Modifiers
------------------+-----------------------+-----------
name | character varying(10) | not null
details | character varying |
details_tsvector | tsvector |
Indexes:
"example_pkey" PRIMARY KEY, btree (name)
"details_tsvector_idx" gin (details_tsvector)
Triggers:
details_tsvector_update BEFORE INSERT OR UPDATE ON example FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('details_tsvector', 'pg_catalog.english', 'details')
So, we have what we need. The next thing is querying.
Querying
In SQL, it is now possible to run queries like so.
select * from example where details_tsvector @@ to_tsquery('life') limit 1;
which finds rows with the lexeme life
in our full text search column.
I’d like to do this in Python using my custom type. Something like this.
session.query(Example).filter(Example.details_tsvector == "life")
And this should translate into
SELECT * FROM example WHERE details_tsvector @@ to_tsquery('life');
Overloading ==
for this isn't ideal semantically, since ==
is distinct from the @@
operator, but it illustrates the comparator factory and lets us use a native Python comparison operator in our expressions.
We can accomplish this by adding a
class comparator_factory(UserDefinedType.Comparator):
    """Defines custom operators for tsvectors.

    Specifically, the ability to search for ts_query strings using
    the @@ operator.

    On the Python side, this is implemented simply as a `==` operation.
    So, you can do

      Table.tsvector_column == "string"

    to get the same effect as

      tsvector_column @@ to_tsquery('string')

    in SQL.
    """

    def __eq__(self, other):
        return self.op('@@')(func.to_tsquery(other))
to our TsVector
class. This defines how the Python ==
operator is translated into the @@
operator in the SQL world.
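We can watch the translation happen, again without a database, by compiling the comparison expression. This is a sketch; it assumes SQLAlchemy is installed and uses literal_binds purely to render the query string inline.

```python
# Compile `column == "life"` and confirm the comparator factory turns
# it into an @@ to_tsquery(...) expression. No database connection
# is needed for this.
from sqlalchemy import Column, MetaData, Table, func
from sqlalchemy.dialects import postgresql
from sqlalchemy.types import UserDefinedType

class TsVector(UserDefinedType):
    "Holds a TsVector column"
    name = "TSVECTOR"

    def get_col_spec(self):
        return self.name

    class comparator_factory(UserDefinedType.Comparator):
        def __eq__(self, other):
            return self.op('@@')(func.to_tsquery(other))

metadata = MetaData()
example = Table('example', metadata, Column('details_tsvector', TsVector))
expr = example.c.details_tsvector == "life"
sql = str(expr.compile(dialect=postgresql.dialect(),
                       compile_kwargs={"literal_binds": True}))
print(sql)  # example.details_tsvector @@ to_tsquery('life')
```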
We also add a query_data
function to our harness so that we can try to run a query. It looks like this
def query_data():
    vals = session.query(Example.name, Example.details).filter(Example.details_tsvector == "life")
    for i in vals:
        print "name: ", i.name
        print "details: ", i.details
        print "=================================================="
And update the commands dictionary with a query
key so that I can run python sample.py query
. The generated query looks like this
SELECT example.name AS example_name, example.details AS example_details
FROM example
WHERE example.details_tsvector @@ to_tsquery('life')
which is exactly what I want.
Full code
The entire program looks like this now
#!/usr/bin/env python
import subprocess
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, VARCHAR, create_engine, func, MetaData, Table, Index, event, DDL
from sqlalchemy.orm import sessionmaker
from sqlalchemy.types import UserDefinedType
engine = create_engine('postgresql://noufal:abcdef@localhost/test', echo = True)
Base = declarative_base()
Session = sessionmaker(bind = engine)
session = Session()
class TsVector(UserDefinedType):
    "Holds a TsVector column"
    name = "TSVECTOR"

    def get_col_spec(self):
        return self.name

    class comparator_factory(UserDefinedType.Comparator):
        """Defines custom operators for tsvectors.

        Specifically, the ability to search for ts_query strings using
        the @@ operator.

        On the Python side, this is implemented simply as a `==` operation.
        So, you can do

          Table.tsvector_column == "string"

        to get the same effect as

          tsvector_column @@ to_tsquery('string')

        in SQL.
        """

        def __eq__(self, other):
            return self.op('@@')(func.to_tsquery(other))

class Example(Base):
    __tablename__ = 'example'

    name = Column(VARCHAR(10), primary_key = True)
    details = Column(String)
    details_tsvector = Column(TsVector)

    __table_args__ = (Index('details_tsvector_idx', 'details_tsvector', postgresql_using = 'gin'),)

trigger_snippet = DDL("""
CREATE TRIGGER details_tsvector_update BEFORE INSERT OR UPDATE
ON example
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger('details_tsvector', 'pg_catalog.english', 'details')
""")

event.listen(Example.__table__, 'after_create', trigger_snippet.execute_if(dialect = 'postgresql'))

def create_tables():
    Base.metadata.drop_all(engine)
    Base.metadata.create_all(engine)

def insert_data():
    for i in range(1, 21):
        u = Example(name = "name-{}".format(i),
                    details = subprocess.Popen("/usr/games/fortune", stdout = subprocess.PIPE).stdout.read())
        session.add(u)
        print ".",
    session.commit()

def dump_data():
    for i in session.query(Example):
        print "name: ", i.name
        print "details: \n------------\n", i.details
        print "details_tsvector: \n--------\n", i.details_tsvector
        print "=================================================="

def query_data():
    vals = session.query(Example.name, Example.details).filter(Example.details_tsvector == "life")
    for i in vals:
        print "name: ", i.name
        print "details: ", i.details
        print "=================================================="

if __name__ == '__main__':
    import sys
    for i in sys.argv[1:]:
        print "\n", "=================", i, "==================="
        dict(create = create_tables,
             insert = insert_data,
             dump = dump_data,
             query = query_data)[i]()
Things to implement
Table reflection doesn’t work (I don’t need it now). I haven’t put this into production so I don’t know what problems I’ll face then.
Feedback is welcome.
Update (10 Dec 2013)
This was accepted as a patch into SQLAlchemy so you don't need to do this anymore. The ==
overloading was not used. Instead, the Python .match
method implements the @@
operator on the SQL side. I'll leave this article up here though as a tutorial on adding custom types to SQLAlchemy.
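For completeness, here is a sketch of the post-patch API. It assumes a reasonably recent SQLAlchemy; note that the exact tsquery function that .match() compiles to has varied between releases.

```python
# With the patch merged, the postgresql dialect ships a TSVECTOR type
# and .match() compiles to the @@ operator. No database connection is
# needed to verify the generated SQL.
from sqlalchemy import Column, MetaData, Table
from sqlalchemy.dialects import postgresql

metadata = MetaData()
example = Table('example', metadata,
                Column('details_tsvector', postgresql.TSVECTOR))
expr = example.c.details_tsvector.match('life')
sql = str(expr.compile(dialect=postgresql.dialect()))
print(sql)
```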