pattern.db
The pattern.db module contains wrappers for databases (SQLite, MySQL), Unicode CSV files and Python's datetime. It offers a convenient way to work with tabular data, for example retrieved with the pattern.web module.
It can be used by itself or with other pattern modules: web | db | en | search | vector | graph.
Documentation
Database
A database is a collection of tables. A table has rows of data with a specific data type (e.g., string, float) for each field or column. A database engine provides an interface to the database, using SQL statements (Structured Query Language). Python 2.5+ comes bundled with the SQLite engine. The MySQL engine requires the MySQL-Python bindings. Note that a 32-bit Python requires a 32-bit MySQL.
The Database() constructor creates (if necessary) and returns an SQLITE or MYSQL database. With SQLITE, it will create a file with the given name in the current folder.
db = Database(
    name,
    host = 'localhost',
    port = 3306,
    username = 'root',
    password = '',
    type = SQLITE
)
db.type                 # SQLITE | MYSQL
db.name                 # Database name.
db.host                 # Database host (MySQL).
db.port                 # Database port (MySQL).
db.username             # Database username (MySQL).
db.password             # Database password (MySQL).
db.tables               # Dictionary of (name, Table)-items.
db.relations            # List of relations, see Database.link().
db.query                # Last executed SQL query.
db.connected            # True after Database.connect().
db.connect()            # Happens automatically.
db.disconnect()
db.create(table, fields=[])
db.remove(table)
db.link(table1, field1, table2, field2, join=LEFT)
db.execute(SQL, commit=False)
db.commit()
db.escape(value)        # "a cat's tail" => "'a cat\'s tail'"
- Database.execute() returns an iterator of rows for the given SQL query.
- Database.commit() commits the changes of pending INSERT, UPDATE, DELETE queries.
- Database.escape() safely quotes and escapes field values.
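For comparison, the same operations can be sketched with Python's built-in sqlite3 module (an in-memory database used purely for illustration). Note that parameter placeholders make manual escaping unnecessary, which is the job Database.escape() does when SQL strings are built by hand:

```python
import sqlite3

# An in-memory database for illustration.
db = sqlite3.connect(':memory:')
db.execute('create table pets (id integer primary key, name text)')

# Parameter placeholders (?) quote and escape values safely,
# the equivalent of calling Database.escape() on each value.
db.execute('insert into pets (name) values (?)', ("a cat's tail",))
db.commit()  # Commits the pending INSERT.

# execute() returns an iterator of rows.
for row in db.execute('select * from pets'):
    print(row)  # (1, "a cat's tail")
```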
Create table
Database.create() creates a new table in the database. It takes a table name and a list of row fields, where each field is defined with the field() function. Each field has a name (a-z + underscores) and a type, with an optional default value for new rows. The pk() function can be used for primary keys.
field(name, type=STRING, default=None, index=False, optional=True)
pk(name='id') # field('id', INTEGER, index=PRIMARY, optional=False)
Type | Value | Example |
STRING | str, unicode (1-255 characters) | u'Schrödinger' |
INTEGER | int | 42 |
FLOAT | float | 3.14159 |
TEXT | str, unicode | open('file.txt').read() |
BLOB | str (binary, e.g., PDF, PNG) | db.binary(open('img.jpg', 'rb').read()) |
BOOLEAN | bool | True, False |
DATE | Date | date('1999-12-31 23:59:59') |
A STRING field can contain up to 100 characters by default. The length (1-255) can be changed by calling STRING as a function, e.g., type=STRING(255). For longer strings, use TEXT. The default value for a DATE field is NOW.
With index=True, the field is indexed for faster search. The index can also be set to UNIQUE (no duplicates) or PRIMARY. A table must have a primary key field that uniquely identifies each row (i.e., an id). Integer primary keys are auto-numbered, so there is no need to set the value manually in new rows.
With optional=True, the field is allowed to contain None.
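Under the hood, such field definitions map to a CREATE TABLE statement. A rough sketch of the equivalent SQL using the built-in sqlite3 module (illustrative only; the SQL that pattern.db actually generates may differ in detail):

```python
import sqlite3

# Simplified rendition of what pk() + field() declarations become in SQL.
sql = '''
create table pets (
    id           integer primary key autoincrement,
    name         varchar(80),
    type         varchar(20),
    tail         boolean,
    date_birth   date default null,
    date_created date default current_timestamp
)
'''
db = sqlite3.connect(':memory:')
db.execute(sql)

# Integer primary keys are auto-numbered; booleans are stored as 0/1.
db.execute('insert into pets (name, type, tail) values (?, ?, ?)',
           ('Schrödinger', 'cat', True))
row = db.execute('select id, name, type, tail from pets').fetchone()
print(row)  # (1, 'Schrödinger', 'cat', 1)
```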
>>> from pattern.db import Database, field, pk, STRING, BOOLEAN, DATE, NOW
>>>
>>> db = Database('my_stuff')
>>> db.create('pets', fields=(
>>>     pk(),
>>>     field('name', STRING(80), index=True),
>>>     field('type', STRING(20)),
>>>     field('tail', BOOLEAN),
>>>     field('date_birth', DATE, default=None),
>>>     field('date_created', DATE, default=NOW)
>>> ))
>>> db.pets.append(name=u'Schrödinger', type='cat', tail=True)
>>> print db.pets.rows()[0]
(1, u'Schrödinger', u'cat', True, None, Date('2013-12-11 10:09:08'))
Create table from XML
Database.create() can also take a Table.xml or Query.xml. It creates a new table and copies the row data from the given XML string. An optional name parameter can be used to rename the new table. In Query.xml, a field name may contain a period. It will be replaced with an underscore (e.g., pets.name → pets_name). Alternatively, an alias can be defined in the Query.aliases dictionary.
Table
A Table is a list of rows, with one or more fields (i.e., table columns) of a certain type (i.e., string or number). A new table can be created with Database.create(). A TableError is raised if a table with the given name exists. An existing table can be retrieved with Database.tables[name], Database[name] or Database.<name>.
table = Database.tables[name]
table.db                # Parent Database.
table.name              # Table name (a-z + underscores).
table.fields            # List of field names (i.e., columns).
table.schema            # Dictionary of (field, Schema)-items.
table.default           # Dictionary of (field, value)-items for new rows.
table.pk                # Primary key field name.
table.count()           # Total number of rows (len(table) also works).
table.rows()            # List of rows, each a tuple of fields.
table.record(row) # Dictionary of (field, value)-items for given row.
table.append(fields={}, commit=True)
table.update(id, fields={}, commit=True)
table.remove(id, commit=True)
table.filter(*args, **kwargs)
table.search(*args, **kwargs)
table.xml               # XML string with the table schema and rows.
table.datasheet         # Datasheet object (see below).
- Table.rows() returns a list of all rows. To iterate rows memory-efficiently, use iter(Table).
- Table.append(), update() and remove() modify the table contents.
  With commit=False, changes are only committed after Database.commit() (= faster in batch).
- Table.filter() returns a subset of rows with a subset of fields.
  For example: table.filter('name', type='cat').
Table schema
The Table.schema dictionary contains field name → Schema items.
schema = Table.schema[fieldname]
schema.name             # Field name.
schema.type             # STRING, INTEGER, FLOAT, TEXT, BLOB, BOOLEAN, DATE
schema.length           # STRING field length.
schema.default          # Default value.
schema.index            # PRIMARY | UNIQUE | True | False
schema.optional         # True or False.
>>> from pattern.db import Database
>>>
>>> db = Database('my_stuff')
>>>
>>> print db.pets.fields
>>> print db.pets.schema['name'].type
>>> print db.pets.schema['name'].length
['id', 'name', 'tail', 'date_birth', 'date_created']
STRING
80
Append row
Table.append() adds a new row with the given field values. It returns the row id, if the table has a primary key generated with pk(). Field values can be given as optional parameters, a dictionary or a tuple. Field values for a BLOB field must be wrapped in Database.binary().
>>> db.pets.append(name=u'Schrödinger', date_birth=date('2009-08-12'))
>>> db.pets.append({'name': u'Schrödinger', 'date_birth': date('2009-08-12')})
>>> db.pets.append((u'Schrödinger', 'cat', True, date('2009-08-12'))) # in-order
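Table.append() returning the new row's primary key corresponds to the cursor's lastrowid in plain SQL. A sketch with the built-in sqlite3 module:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('create table pets (id integer primary key, name text, date_birth date)')

# Equivalent of db.pets.append(name=..., date_birth=...): an INSERT
# whose auto-numbered primary key is available as lastrowid.
cursor = db.execute('insert into pets (name, date_birth) values (?, ?)',
                    ('Schrödinger', '2009-08-12'))
print(cursor.lastrowid)  # 1
```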
Update row
Table.update() updates values in the row with the given primary key. A batch of rows can be updated using a filter, or a chain of filters with any() or all(). In the last example, all rows with type='cat' will have their tail field set to True.
>>> db.pets.update(1, type='cat') # set type='cat' in row with id=1.
>>> db.pets.update(1, {'type': 'cat'})
>>> db.pets.update(eq('type', 'cat'), tail=True)
Remove row
Table.remove() removes the row with the given primary key:
>>> db.pets.remove(1)
>>> db.pets.remove(ALL)
>>> db.pets.remove(all(eq('type', 'cat'), lt(year('date_birth'), 1990)))
The last example removes all rows that have type='cat' AND year of birth before 1990.
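The chained filter corresponds to a DELETE with an AND'ed WHERE clause. A sketch with sqlite3, where strftime() extracts the year the way year() does for a DATE field:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('create table pets (id integer primary key, type text, date_birth date)')
db.executemany('insert into pets (type, date_birth) values (?, ?)', [
    ('cat', '1985-06-01'),
    ('cat', '1995-06-01'),
    ('dog', '1985-06-01'),
])

# type='cat' AND year(date_birth) < 1990.
db.execute("delete from pets where type = 'cat' "
           "and cast(strftime('%Y', date_birth) as integer) < 1990")
db.commit()
print(db.execute('select count(*) from pets').fetchone()[0])  # 2 rows remain
```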
Filter rows
Table.filter() returns a list of rows filtered by field value(s), where each row is a tuple of fields. The first parameter defines which fields to return. It can be a single field name, a list of field names or ALL. The following parameters are optional and define field constraints. They can also be given as a dictionary:
>>> db.pets.filter('name') # all rows, name
>>> db.pets.filter(('id', 'name')) # all rows, name + id
>>> db.pets.filter(ALL, type='cat') # type='cat', all fields
>>> db.pets.filter(ALL, type=('cat', 'dog')) # type='cat' OR type='dog'
>>> db.pets.filter(ALL, type='*at') # type='cat' OR 'hat' OR 'brat', ...
>>> db.pets.filter(ALL, type='cat', tail=True) # type='cat' AND tail=True
>>> db.pets.filter('id', {'type': 'cat', 'tail': True})
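In SQL terms, these filters become WHERE clauses: a tuple of values maps to IN (...), and the * wildcard maps to LIKE with %. A sketch with sqlite3:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('create table pets (id integer primary key, name text, type text)')
db.executemany('insert into pets (name, type) values (?, ?)', [
    ('Schrödinger', 'cat'),
    ('Taxi', 'dog'),
    ('Fluffy', 'hat'),
])

# type=('cat', 'dog')  =>  type IN ('cat', 'dog')
print(db.execute("select name from pets where type in ('cat', 'dog')").fetchall())

# type='*at'  =>  type LIKE '%at'  (the * wildcard becomes SQL %)
print(db.execute("select name from pets where type like '%at'").fetchall())
```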
More complex queries can be constructed with a Query.
Query
Table.search() returns a new Query with options for filtering, sorting and ordering rows by field value(s). It can include fields from other, related tables.
query = Table.search(
    fields = ALL,
    filters = [],
    relations = [],
    sort = None,
    order = ASCENDING,
    group = None,
    function = FIRST,
    range = None
)
query.table             # Parent Table.
query.fields            # Field name, list of field names, or ALL.
query.aliases           # Dictionary of (field name, alias)-items.
query.filters           # List of filter() objects.
query.relations         # List of rel() objects.
query.sort              # Field name or list of field names.
query.order             # ASCENDING | DESCENDING
query.group             # Field name or list of field names.
query.function          # FIRST, LAST, COUNT, MIN, MAX, SUM, AVG, CONCATENATE
query.range             # (start, stop)-tuple, e.g. rows 11-20.
query.sql() # SQL string, can be used with Database.execute().
query.rows() # List of rows, each a tuple of fields.
query.record(row) # Dictionary of (field, value)-items for given row.
query.xml # XML string with the query schema and rows.
To iterate rows memory-efficiently, use iter(Query) instead of Query.rows().
Query filter
The filter() function creates a field-value constraint that matches certain rows in a table. A list of filters can be passed to the filters parameter of a Query.
filter(field, value, comparison='=')
Comparison | Description | Example | Alias |
= | equal to | filter('type', ('cat', 'dog'), '=') | eq() |
i= | equal to (case-insensitive) | filter('name', 'tig*', 'i=') | eqi() |
!= | not equal to | filter('name', '*y', '!=') | ne() |
> | greater than | filter('weight', 10, '>') | gt() |
< | less than | filter('weight', 10, '<') | lt() |
>= | greater than or equal to | filter(year('date'), 1999, '>=') | gte() |
<= | less than or equal to | filter(year('date'), 2002, '<=') | lte() |
: | between (inclusive) | filter(year('date'), (1999, 2002), ':') | rng() |
The field name of a DATE field can be passed to the year(), month(), day(), hour(), minute() or second() function. The short aliases of filter() have a preset comparison operator.
Query filter chain
Filters can be chained together. The all() function returns a list with AND logic. The any() function returns a list with OR logic. In the example below, the first query matches all cats named Taxi. The second and third query match any pet that is a cat OR that is named Taxi.
all(filter1, filter2, ...) # Rows must match ALL of the filters.
any(filter1, filter2, ...) # Rows must match ANY of the filters.
>>> from pattern.db import Database, eq, all, any
>>>
>>> db = Database('my_stuff')
>>>
>>> db.pets.search(filters=all(eq('name', 'Taxi'), eq('type', 'cat')))
>>> db.pets.search(filters=any(eq('name', 'Taxi'), eq('type', 'cat')))
>>> db.pets.search(filters=any(name='Taxi', type='cat'))
Lists created with all() and any() can be nested to define complex search criteria. The example below matches all pets that are cats and whose name starts with Fluff OR ends with y:
>>> f = any(eq('name', 'Fluff*'), eq('name', '*y')) # OR
>>> f = all(eq('type', 'cat'), f)                   # AND
>>>
>>> for row in db.pets.search(filters=f):
>>>     print row
The syntax can even be more concise:
>>> for row in db.pets.search(filters=all(name=('Fluff*', '*y'), type='cat')):
>>>     print row
Query relation
The rel() function defines a relation between two fields in different tables (usually id's).
rel(field1, field2, table, join=LEFT) # LEFT | INNER
The optional join parameter defines how rows are matched. LEFT takes all rows from the base table, with additional fields from the related table. For a row with no match between field1 and field2, these fields have value None. INNER takes the subset of rows that have a match between field1 and field2.
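The difference between LEFT and INNER shows up directly in SQL. A sketch with sqlite3: the left join keeps the order with no matching product (its fields become None), the inner join drops it:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('create table products (id integer primary key, name text)')
db.execute('create table orders (id integer primary key, product integer)')
db.execute("insert into products (name) values ('pizza')")
db.executemany('insert into orders (product) values (?)', [(1,), (99,)])  # 99: no such product

# LEFT: all orders, with None where no product matches.
left = db.execute('select orders.id, products.name from orders '
                  'left join products on orders.product = products.id').fetchall()
print(left)   # [(1, 'pizza'), (2, None)]

# INNER: only the orders with a matching product.
inner = db.execute('select orders.id, products.name from orders '
                   'inner join products on orders.product = products.id').fetchall()
print(inner)  # [(1, 'pizza')]
```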
A well-known example is a database app that processes invoices. Say we have a products table and an orders table. Each order has a product id – instead of all product details. Each product id can occur in multiple orders. This approach is called database normalization. It avoids duplicate data. To generate an invoice, we can combine product details and order details using a query relation.
The following example demonstrates a simple products + customers + orders database app:
>>> from pattern.db import Database, field, pk, INTEGER as I
>>>
>>> db = Database('pizza_delivery')
>>>
>>> db.create( 'products', (pk(), field('name'), field('price', I)))
>>> db.create('customers', (pk(), field('name')))
>>> db.create(   'orders', (pk(), field('product', I), field('customer', I)))
>>> db.products.append(name='pizza', price=15)
>>> db.products.append(name='garlic bread', price=3)
>>> db.customers.append(name=u'Schrödinger')
>>> db.customers.append(name=u'Hofstadter')
Hofstadter orders a pizza.
>>> db.orders.append(product=1, customer=2)
>>> from pattern.db import Database, rel
>>>
>>> db = Database('pizza_delivery')
>>>
>>> f = ('orders.id', 'customers.name', 'products.name', 'products.price')
>>> q = db.orders.search(f, relations=(
>>>     rel('orders.customer', 'customers.id', 'customers'),
>>>     rel('orders.product', 'products.id', 'products'))
>>> )
>>> for row in q:
>>>     print q.record(row)
{ 'orders.id': 1,
  'customers.name': u'Hofstadter',
  'products.name': u'pizza',
  'products.price': 15 }
Grouping rows
A Query has an optional parameter group that can be used to merge rows on duplicate field values. The given function is applied to the other fields. It can also be a list with a function for each field.
Function | Field type | Description |
FIRST | any | The first row field in the group. |
LAST | any | The last row field in the group. |
COUNT | any | The number of rows in the group. |
MIN | INTEGER + FLOAT | The lowest field value in the group. |
MAX | INTEGER + FLOAT | The highest field value in the group. |
SUM | INTEGER + FLOAT | The sum of all field values in the group. |
AVG | INTEGER + FLOAT | The average of all field values in the group. |
STDEV | INTEGER + FLOAT | The standard deviation (= variation from average). |
CONCATENATE | STRING | Joins all field values with a comma. |
For example, to get the total revenue per ordered product:
>>> print db.orders.search(
>>>     fields = ('products.name', 'products.price'),
>>>     relations = rel('product', 'products.id', 'products'),
>>>     group = 'products.name', # Merge orders with same product name.
>>>     function = SUM           # Sum of product prices.
>>> ).rows()
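The same grouping in plain SQL is a GROUP BY with an aggregate function. A sketch with sqlite3 (ORDER BY added so the result order is deterministic):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('create table orders (product text, price integer)')
db.executemany('insert into orders values (?, ?)', [
    ('pizza', 15), ('pizza', 15), ('garlic bread', 3),
])

# group='products.name', function=SUM  =>  GROUP BY ... with SUM(...).
rows = db.execute('select product, sum(price) from orders '
                  'group by product order by product').fetchall()
print(rows)  # [('garlic bread', 3), ('pizza', 30)]
```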
Datasheet
A Datasheet is a matrix of rows and columns, where each row and column can be retrieved as a list. The data can be imported from or exported to a CSV file. Optionally, fields is a list of (name, type) column headers, where type can be STRING, TEXT, INTEGER, FLOAT, BOOLEAN, BLOB or DATE.
datasheet = Datasheet(rows=[], fields=None)
datasheet = Datasheet.load(path, separator=',', decoder=lambda v: v, headers=False)
datasheet.rows          # List of rows (each row = list of values).
datasheet.columns       # List of columns (each column = list of values).
datasheet.fields        # List of (name, type) column headers.
datasheet.<field>       # List of column values.
datasheet[i]            # Row at index i.
datasheet[i, j]         # Value in row i at column j.
datasheet[i1:i2, j]     # Slice of column j from rows i1-i2.
datasheet[i, j1:j2]     # Slice of columns j1-j2 from row i.
datasheet[i1:i2, j1:j2] # Datasheet with columns j1-j2 from rows i1-i2.
datasheet[:]            # Datasheet copy.
datasheet.insert(i, row, default=None)
datasheet.append(row, default=None)
datasheet.extend(rows, default=None)
datasheet.copy(rows=ALL, columns=ALL)
datasheet.group(j, function=FIRST, key=lambda v: v)
datasheet.save(path, separator=',', encoder=lambda v: v, headers=False)
datasheet.json # JSON-formatted string.
- Datasheet.insert() and append() fill missing columns with the default value.
- Datasheet.columns.insert() and append() fill missing rows with the default value.
  An optional field parameter can be used to supply a (name, type) column header.
- Datasheet.copy() returns a new Datasheet from a selective list of row and/or column indices.
- To rotate a datasheet 90 degrees, use datasheet = flip(datasheet).
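The flip() rotation is essentially a matrix transpose. A minimal sketch of the idea with plain lists and zip(), not the actual pattern.db implementation:

```python
# Transpose rows <-> columns, which is what flip(datasheet) does conceptually.
rows = [
    (1, 'Schrödinger', 'cat'),
    (2, 'Hofstadter', 'cat'),
    (3, 'Taxi', 'dog'),
]
flipped = [list(column) for column in zip(*rows)]
print(flipped)  # [[1, 2, 3], ['Schrödinger', 'Hofstadter', 'Taxi'], ['cat', 'cat', 'dog']]
```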
For example:
>>> from pattern.db import Datasheet
>>>
>>> ds = Datasheet()
>>> ds.append((u'Schrödinger', 'cat'))
>>> ds.append((u'Hofstadter', 'cat'))
>>> ds.save('pets.csv')
>>>
>>> ds = Datasheet.load('pets.csv')
>>> print ds
[[u'Schrödinger', 'cat'],
 [ u'Hofstadter', 'cat']]
Grouping rows
Datasheet.group(j) returns a new Datasheet with unique values in column j. It merges rows using a given function that takes a list of column values and returns a single value. Predefined functions are FIRST, LAST, COUNT, MIN, MAX, SUM, AVG, STDEV and CONCATENATE. It can also be a list of functions.
The optional key can be used to compare the values in column j. For example, lambda date: date.year groups a column of Date objects by year.
>>> from pattern.db import Datasheet, pprint, COUNT
>>>
>>> ds = Datasheet(rows=[
>>>     (1, u'Schrödinger', 'cat'),
>>>     (2, u'Hofstadter', 'cat'),
>>>     (3, u'Taxi', 'dog')
>>> ])
>>>
>>> g = ds.copy(columns=[2, 0]) # A copy with type & id.
>>> g = g.group(0, COUNT)       # Group type, count rows per type.
>>> pprint(g, fill='')
cat 2
dog 1
Sorting rows & columns
Datasheet.columns[j].sort() sorts the rows according to the values in column j.
Datasheet.columns.sort() can be used to change the column order:
>>> ds.columns.sort(order=[0, 2, 1])
>>> pprint(ds, fill='')
1 cat Schrödinger
2 cat Hofstadter
3 dog Taxi
Datasheet.columns.swap(j1, j2) swaps the two columns with the given indices.
CSV import & export
Datasheet.save() exports the matrix as a CSV file. Datasheet.load() returns a Datasheet from a given CSV file. CSV (comma-separated values) is a simple text format for tabular data, where each line is a row and each value is separated by a comma.
datasheet = Datasheet.load(path, separator=',', decoder=lambda v: v, headers=False)
datasheet.save(path, separator=',', encoder=lambda v: v, headers=False)
On export, all str, int, float, bool and Date values are converted to Unicode. An encoder can be given for other data types. On import, all values in the datasheet will be Unicode unless a decoder is given.
With headers=True, the Datasheet.fields headers are exported and imported (first line in CSV). In this case, the data type for each column (STRING, INTEGER, FLOAT, BOOLEAN or DATE) is explicitly known and no encoder or decoder is needed.
>>> from pattern.db import Datasheet, STRING, DATE, date
>>>
>>> ds = Datasheet(fields=(('name', STRING), ('date', DATE)))
>>> ds.append((u'Schrödinger', date('1887-08-12')))
>>> ds.append((u'Hofstadter', date('1945-02-15')))
>>>
>>> ds.save('pets.csv', headers=True)
>>>
>>> ds = Datasheet.load('pets.csv', headers=True)
>>> print ds[0]
[u'Schrödinger', Date('1887-08-12 00:00:00')]
The csv() function can also be used instead of Datasheet.load():
>>> from pattern.db import csv
>>>
>>> for name, date in csv('pets.csv', separator=',', headers=True):
>>>     print name, date
Date
The date() function returns a new Date, a convenient subclass of Python's datetime.datetime. It takes an integer (Unix timestamp), a string or NOW. An optional string input format and output format can be given (e.g., "%d/%m/%y"). The default output format is "YYYY-MM-DD hh:mm:ss".
d = date(int)
d = date(NOW, format=DEFAULT)
d = date(string)
d = date(string, format=DEFAULT)
d = date(string, inputformat, format=DEFAULT)
d = date(year, month, day, format=DEFAULT)
d = date(year, month, day, hours, minutes, seconds, format=DEFAULT)
d.year
d.month                 # 1-12
d.week                  # 1-52
d.weekday               # 1-7
d.day                   # 1-31
d.hour                  # 0-23
d.minute                # 0-59
d.second                # 0-59
d.timestamp             # Seconds elapsed since 1/1/1970.
If no string input format is given, a number of common formats will be tried:
Format | Example |
%Y-%m-%d %H:%M:%S | 2010-09-21 09:27:01 |
%a, %d %b %Y %H:%M:%S %z | Tue, 9 Sep 2010 17:58:28 +0000 |
%Y-%m-%dT%H:%M:%SZ | 2010-09-20T09:27:01Z |
%Y-%m-%dT%H:%M:%S+0000 | 2010-09-20T09:27:01+0000 |
%Y-%m-%d %H:%M | 2010-09-20 09:27 |
%Y-%m-%d | 2010-09-20 |
%d/%m/%Y | 20/09/2010 |
%d %B %Y | 9 september 2010 |
%B %d %Y | September 9 2010 |
%B %d, %Y | September 09, 2010 |
All date formats used in pattern.web (e.g., Twitter search result) are automatically detected.
For an overview of date format syntax, see: http://docs.python.org/library/time.html#time.strftime.
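The multi-format detection can be sketched with datetime.strptime, trying each candidate format in turn. This is a simplified illustration, not pattern.db's actual parser:

```python
from datetime import datetime

# A few of the candidate formats from the table above.
FORMATS = ('%Y-%m-%d %H:%M:%S', '%Y-%m-%dT%H:%M:%SZ', '%d/%m/%Y', '%B %d, %Y')

def parse_date(string):
    # Return the first successful parse; raise if no format matches.
    for fmt in FORMATS:
        try:
            return datetime.strptime(string, fmt)
        except ValueError:
            pass
    raise ValueError('unknown date format: %r' % string)

print(parse_date('20/09/2010'))  # 2010-09-20 00:00:00
```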
Date calculations
The time() function can be used to add or subtract time to a Date:
time(days=0, seconds=0, minutes=0, hours=0)
>>> from pattern.db import date, time
>>>
>>> d = date('23 august 2011')
>>> d += time(days=2, hours=5)
>>> print type(d)
>>> print d
>>> print d.year, d.month, d.day
<class 'pattern.db.Date'>
2011-08-25 05:00:00
2011 8 25
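Since Date subclasses datetime.datetime, time() behaves like a datetime.timedelta. The same calculation with the standard library:

```python
from datetime import datetime, timedelta

# time(days=2, hours=5) corresponds to timedelta(days=2, hours=5).
d = datetime(2011, 8, 23)
d += timedelta(days=2, hours=5)
print(d)  # 2011-08-25 05:00:00
```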