DataStore extension¶
The CKAN DataStore extension provides an ad hoc database for storage of structured data from CKAN resources. Data can be pulled out of resource files and stored in the DataStore.
When a resource is added to the DataStore, you get:
- Automatic data previews on the resource’s page, using the Data Explorer extension
- The DataStore API: search, filter and update the data, without having to download and upload the entire data file
The DataStore is integrated into the CKAN API and authorization system.
The DataStore is generally used alongside the DataPusher, which will automatically upload data to the DataStore from suitable files, whether uploaded to CKAN’s FileStore or externally linked.
Relationship to FileStore¶
The DataStore is distinct from, but complementary to, the FileStore (see FileStore and file uploads). The FileStore provides ‘blob’ storage of whole files, with no way to access or query parts of a file. The DataStore, by contrast, is like a database in which individual data elements are accessible and queryable. To illustrate this distinction, consider storing a spreadsheet file like a CSV or Excel document. In the FileStore this file would be stored directly, and to access it you would download the file as a whole. If the spreadsheet data is stored in the DataStore, you can instead access individual spreadsheet rows via a simple web API and run queries over the spreadsheet contents.
Setting up the DataStore¶
Note
The DataStore (like CKAN) requires PostgreSQL 9.2 or later. This version was released in 2012 and is widely available. At the time of writing, 9.1 is the only PostgreSQL version unsupported by CKAN that the PostgreSQL community has not yet declared end-of-life.
Changed in version 2.6: Previous CKAN (and DataStore) versions were compatible with earlier versions of PostgreSQL.
2. Set-up the database¶
Warning
Make sure that you follow the steps in Set Permissions below correctly. Wrong settings could lead to serious security issues.
The DataStore requires a separate PostgreSQL database to save the DataStore resources to.
List existing databases:
sudo -u postgres psql -l
Check that the encoding of the databases is UTF8; if it is not, internationalisation may be a problem. Since changing the encoding of PostgreSQL may mean deleting existing databases, it is suggested that this is fixed before continuing with the DataStore setup.
Create users and databases¶
Tip
If your CKAN database and DataStore databases are on different servers, then you need to create a new database user on the server where the DataStore database will be created. As in Installing CKAN from source we’ll name the database user ckan_default:
sudo -u postgres createuser -S -D -R -P -l ckan_default
Create a database user called datastore_default. This user will be given read-only access to your DataStore database in the Set Permissions step below:
sudo -u postgres createuser -S -D -R -P -l datastore_default
Create the database (owned by ckan_default), which we’ll call datastore_default:
sudo -u postgres createdb -O ckan_default datastore_default -E utf-8
Set URLs¶
Now, uncomment the ckan.datastore.write_url and ckan.datastore.read_url lines in your CKAN config file and edit them if necessary, for example:
ckan.datastore.write_url = postgresql://ckan_default:pass@localhost/datastore_default
ckan.datastore.read_url = postgresql://datastore_default:pass@localhost/datastore_default
Replace pass with the passwords you created for your ckan_default and datastore_default database users.
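A quick way to double-check the two settings is to parse them and compare the users and databases they name. The following is a minimal sketch in Python using only the standard library; the function name describe_datastore_urls is made up for this example and is not part of CKAN.

```python
from urllib.parse import urlsplit


def describe_datastore_urls(write_url, read_url):
    """Summarise which user and database each connection URL names."""
    write, read = urlsplit(write_url), urlsplit(read_url)
    return {
        "write_user": write.username,
        "read_user": read.username,
        # Both URLs should point at the same DataStore database ...
        "same_database": (write.hostname, write.path) == (read.hostname, read.path),
        # ... but through different users (read-write and read-only).
        "distinct_users": write.username != read.username,
    }


info = describe_datastore_urls(
    "postgresql://ckan_default:pass@localhost/datastore_default",
    "postgresql://datastore_default:pass@localhost/datastore_default",
)
print(info)
```

If same_database is false or distinct_users is false, the set-up differs from the one described here, which is worth checking before moving on to the permissions step.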
Set permissions¶
Tip
See Legacy mode: use the DataStore with old PostgreSQL versions if these steps continue to fail or seem too complicated for your set-up. However, keep in mind that the legacy mode is limited in its capabilities.
Once the DataStore database and the users are created, the permissions on the DataStore and CKAN database have to be set. CKAN provides a paster command to help you correctly set these permissions.
If you are able to use the psql command to connect to your database as a superuser, you can use the datastore set-permissions command to emit the appropriate SQL to set the permissions.
For example, if you can connect to your database server as the postgres superuser using:
sudo -u postgres psql
Then you can use this connection to set the permissions:
sudo ckan datastore set-permissions | sudo -u postgres psql --set ON_ERROR_STOP=1
Note
If you performed a source install, you will need to replace all references to sudo ckan ... with paster --plugin=ckan ... and provide the path to the config file, e.g. paster --plugin=ckan datastore set-permissions -c /etc/ckan/default/development.ini
If your database server is not local, but you can access it over SSH, you can pipe the permissions script over SSH:
sudo ckan datastore set-permissions |
ssh dbserver sudo -u postgres psql --set ON_ERROR_STOP=1
If you can’t use the psql command in this way, you can simply copy and paste the output of:
sudo ckan datastore set-permissions
into a PostgreSQL superuser console.
3. Test the set-up¶
The DataStore is now set up. To test the set-up, (re)start CKAN and run the following command to list all DataStore resources:
curl -X GET "http://127.0.0.1:5000/api/3/action/datastore_search?resource_id=_table_metadata"
This should return a JSON page without errors.
To test whether the set-up allows writing, you can create a new DataStore resource. To do so, run the following command:
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_create -H "Authorization: {YOUR-API-KEY}" -d '{"resource": {"package_id": "{PACKAGE-ID}"}, "fields": [ {"id": "a"}, {"id": "b"} ], "records": [ { "a": 1, "b": "xyz"}, {"a": 2, "b": "zzz"} ]}'
Replace {YOUR-API-KEY} with a valid API key and {PACKAGE-ID} with the id of an existing CKAN dataset.
A table named after the resource id should have been created on your DataStore database. Visiting this URL should return a response from the DataStore with the records inserted above:
http://127.0.0.1:5000/api/3/action/datastore_search?resource_id={RESOURCE-ID}
Replace {RESOURCE-ID} with the resource id that was returned as part of the response of the previous API call.
You can now delete the DataStore table with:
curl -X POST http://127.0.0.1:5000/api/3/action/datastore_delete -H "Authorization: {YOUR-API-KEY}" -d '{"resource_id": "{RESOURCE-ID}"}'
To find out more about the DataStore API, see The DataStore API.
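The write test can also be scripted. The sketch below, in Python with only the standard library, builds the same datastore_create request as the curl example but does not send it. CKAN_URL, build_create_request and the placeholder key and package id are illustrative names for this example, and the JSON Content-Type header is an assumption that the curl command does not set.

```python
import json
import urllib.request

CKAN_URL = "http://127.0.0.1:5000"  # assumption: same host as the curl examples


def build_create_request(api_key, package_id):
    """Build (but do not send) the datastore_create test request."""
    payload = {
        "resource": {"package_id": package_id},
        "fields": [{"id": "a"}, {"id": "b"}],
        "records": [{"a": 1, "b": "xyz"}, {"a": 2, "b": "zzz"}],
    }
    return urllib.request.Request(
        CKAN_URL + "/api/3/action/datastore_create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Placeholders: use a valid API key and the id of an existing dataset.
request = build_create_request("YOUR-API-KEY", "PACKAGE-ID")
print(request.get_method(), request.full_url)
# To send it, pass `request` to urllib.request.urlopen() and read the
# resource id from the JSON response.
```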
Legacy mode: use the DataStore with old PostgreSQL versions¶
Tip
The legacy mode can also be used to simplify the set-up since it does not require you to set the permissions or create a separate user.
The DataStore can be used with a PostgreSQL version prior to 9.0 in legacy mode. Due to the lack of some functionality, the datastore_search_sql() and consequently the HTSQL support cannot be used. To enable the legacy mode, remove the declaration of the ckan.datastore.read_url.
The set-up for legacy mode is analogous to the normal set-up as described above with a few changes and consists of the following steps:
- Enable the plugin
- The legacy mode is enabled by not setting the ckan.datastore.read_url
- Set-Up the database
- Create a separate database
- Create a write user on the DataStore database (optional since the CKAN user can be used)
- Test the set-up
There is no need for a read-only user or special permissions. Therefore the legacy mode can be used for simple set-ups as well.
DataPusher: Automatically Add Data to the DataStore¶
Often, one wants data that is added to CKAN (whether it is linked to or uploaded to the FileStore) to be automatically added to the DataStore. This requires some processing, to extract the data from your files and to add it to the DataStore in the format the DataStore can handle.
This task of automatically parsing and then adding data to the DataStore is performed by the DataPusher, a service that runs asynchronously and can be installed alongside CKAN.
To install this please look at the docs here: http://docs.ckan.org/projects/datapusher
The DataStore API¶
The CKAN DataStore offers an API for reading, searching and filtering data without the need to download the entire file first. The DataStore is an ad hoc database, which means that it is a collection of tables with unknown relationships. This allows you to search within one DataStore resource (a table in the database) as well as to query across DataStore resources.
Data can be written incrementally to the DataStore through the API. New data can be inserted, existing data can be updated or deleted. You can also add a new column to an existing table even if the DataStore resource already contains some data.
You will notice that we tried to keep the layer between the underlying PostgreSQL database and the API as thin as possible to allow you to use the features you would expect from a powerful database management system.
A DataStore resource cannot be created on its own. It must always have an associated CKAN resource. If data is stored in the DataStore, it will automatically be previewed by the recline preview extension.
Making a DataStore API request¶
Making a DataStore API request is the same as making an Action API request: you post a JSON dictionary in an HTTP POST request to an API URL, and the API also returns its response in a JSON dictionary. See the API guide for details.
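On the client side, the returned JSON dictionary usually needs unwrapping. The sketch below assumes the usual Action API envelope with success, result and error keys, as described in the API guide. The names ActionError and unwrap and both sample response bodies are made up for this example.

```python
import json


class ActionError(Exception):
    """Raised when the API response reports success = false."""


def unwrap(response_body):
    """Return the "result" of an Action API response, or raise ActionError."""
    envelope = json.loads(response_body)
    if not envelope.get("success"):
        raise ActionError(envelope.get("error"))
    return envelope["result"]


# Hypothetical response bodies, shaped like Action API responses.
ok = '{"success": true, "result": {"total": 2}}'
failed = '{"success": false, "error": {"message": "Not found"}}'

print(unwrap(ok))
try:
    unwrap(failed)
except ActionError as exc:
    print("request failed:", exc)
```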
API reference¶
Note
Lists can always be expressed in different ways. It is possible to use lists, comma separated strings or single items. These are valid lists: ['foo', 'bar'], 'foo, bar', "foo", "bar" and 'foo'. Additionally, there are several ways to define a boolean value. True, on and 1 are all valid boolean values.
Note
The table structure of the DataStore is explained in Internal structure of the database.
- ckanext.datastore.logic.action.datastore_create(context, data_dict)¶
Adds a new table to the DataStore.
The datastore_create action allows you to post JSON data to be stored against a resource. This endpoint also supports altering tables, aliases and indexes and bulk insertion. This endpoint can be called multiple times to initially insert more data, add fields, change the aliases or indexes as well as the primary keys.
To create an empty datastore resource and a CKAN resource at the same time, provide resource with a valid package_id and omit the resource_id.
If you want to create a datastore resource from the content of a file, provide resource with a valid url.
See Fields and Records for details on how to lay out records.
Parameters: - resource_id (string) – resource id that the data is going to be stored against.
- force (bool (optional, default: False)) – set to True to edit a read-only resource
- resource (dictionary) – resource dictionary that is passed to resource_create(). Use instead of resource_id (optional)
- aliases (list or comma separated string) – names for read only aliases of the resource. (optional)
- fields (list of dictionaries) – fields/columns and their extra metadata. (optional)
- records (list of dictionaries) – the data, eg: [{“dob”: “2005”, “some_stuff”: [“a”, “b”]}] (optional)
- primary_key (list or comma separated string) – fields that represent a unique key (optional)
- indexes (list or comma separated string) – indexes on table (optional)
Please note that setting the aliases, indexes or primary_key replaces the existing aliases or constraints. Setting records appends the provided records to the resource.
Results:
Returns: The newly created data object.
Return type: dictionary
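Because aliases, indexes and primary_key replace what is already there, it can help to build the data_dict so that unset options are left out entirely. The following is a minimal sketch of that idea. The helper name create_payload is made up, "RESOURCE-ID" is a placeholder, and it assumes that options missing from the request are left unchanged by the DataStore.

```python
def create_payload(resource_id, fields=None, records=None,
                   primary_key=None, indexes=None, aliases=None):
    """Build a datastore_create data_dict, leaving out unset options."""
    payload = {"resource_id": resource_id}
    optional = {
        "fields": fields,
        "records": records,
        "primary_key": primary_key,
        "indexes": indexes,
        "aliases": aliases,
    }
    # Only send what the caller set: aliases, indexes and primary_key
    # replace the existing ones, while records are appended.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


# "RESOURCE-ID" is a placeholder for an existing CKAN resource id.
first = create_payload(
    "RESOURCE-ID",
    fields=[{"id": "foo", "type": "int4"}, {"id": "bar"}],
    primary_key=["foo"],
    records=[{"foo": 100, "bar": "Here's some text"}],
)
# A later call that only carries more records.
more = create_payload("RESOURCE-ID", records=[{"foo": 42}])
print(sorted(first), sorted(more))
```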
- ckanext.datastore.logic.action.datastore_upsert(context, data_dict)¶
Updates or inserts into a table in the DataStore
The datastore_upsert API action allows you to add or edit records to an existing DataStore resource. In order for the upsert and update methods to work, a unique key has to be defined via the datastore_create action. The available methods are:
- upsert
- Update if record with same key already exists, otherwise insert. Requires unique key.
- insert
- Insert only. This method is faster than upsert, but will fail if any inserted record matches an existing one. Does not require a unique key.
- update
- Update only. An exception will occur if the key that should be updated does not exist. Requires unique key.
Parameters: - resource_id (string) – resource id that the data is going to be stored under.
- force (bool (optional, default: False)) – set to True to edit a read-only resource
- records (list of dictionaries) – the data, eg: [{“dob”: “2005”, “some_stuff”: [“a”,”b”]}] (optional)
- method (string) – the method to use to put the data into the datastore. Possible options are: upsert, insert, update (optional, default: upsert)
Results:
Returns: The modified data object. Return type: dictionary
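When writing data incrementally, a client might split a large list of records into several datastore_upsert calls. The sketch below shows one way to prepare those calls. The helper name upsert_payloads, the batch size and the sample rows are made up for this example, and nothing here is sent to a server.

```python
METHODS = ("upsert", "insert", "update")


def upsert_payloads(resource_id, records, method="upsert", batch_size=2):
    """Yield one datastore_upsert data_dict per batch of records."""
    if method not in METHODS:
        raise ValueError("method must be one of: " + ", ".join(METHODS))
    for start in range(0, len(records), batch_size):
        yield {
            "resource_id": resource_id,
            "method": method,
            "records": records[start:start + batch_size],
        }


rows = [{"foo": 1}, {"foo": 2}, {"foo": 3}]  # hypothetical records
batches = list(upsert_payloads("RESOURCE-ID", rows, method="insert"))
print(len(batches), "requests")
```

Remember that upsert and update need a unique key defined via datastore_create, whereas insert does not.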
- ckanext.datastore.logic.action.datastore_info(context, data_dict)¶
Returns information about the data imported, such as column names and types.
Return type: A dictionary describing the columns and their types. Parameters: id (A UUID) – Id of the resource we want info about
- ckanext.datastore.logic.action.datastore_delete(context, data_dict)¶
Deletes a table or a set of records from the DataStore.
Parameters: - resource_id (string) – resource id that the data will be deleted from. (optional)
- force (bool (optional, default: False)) – set to True to edit a read-only resource
- filters (dictionary) – filters to apply before deleting (eg {“name”: “fred”}). If missing, the whole table and all dependent views are deleted. (optional)
Results:
Returns: Original filters sent. Return type: dictionary
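Since leaving out filters deletes the whole table, client code may want a guard against doing so by accident. The following is a hedged sketch of such a guard. The helper delete_payload and its whole_table flag are inventions of this example, not parameters of the API.

```python
def delete_payload(resource_id, filters=None, whole_table=False):
    """Build a datastore_delete data_dict with a guard against full deletes."""
    if filters:
        return {"resource_id": resource_id, "filters": filters}
    if not whole_table:
        # No filters means the whole table (and dependent views) is deleted,
        # so require the caller to ask for that explicitly.
        raise ValueError("pass filters, or whole_table=True to delete the table")
    return {"resource_id": resource_id}


print(delete_payload("RESOURCE-ID", filters={"name": "fred"}))
print(delete_payload("RESOURCE-ID", whole_table=True))
```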
- ckanext.datastore.logic.action.datastore_search(context, data_dict)¶
Search a DataStore resource.
The datastore_search action allows you to search data in a resource. DataStore resources that belong to a private CKAN resource can only be read if you have access to the CKAN resource and send the appropriate authorization.
Parameters: - resource_id (string) – id or alias of the resource to be searched against
- filters (dictionary) – matching conditions to select, e.g {“key1”: “a”, “key2”: “b”} (optional)
- q (string or dictionary) – full text query. If it’s a string, it’ll search on all fields on each row. If it’s a dictionary as {“key1”: “a”, “key2”: “b”}, it’ll search on each specific field (optional)
- distinct (bool) – return only distinct rows (optional, default: false)
- plain (bool) – treat as plain text query (optional, default: true)
- language (string) – language of the full text query (optional, default: english)
- limit (int) – maximum number of rows to return (optional, default: 100)
- offset (int) – offset this number of rows (optional)
- fields (list or comma separated string) – fields to return (optional, default: all fields in original order)
- sort (string) – comma separated field names with ordering e.g.: “fieldname1, fieldname2 desc”
Setting the plain flag to false enables the entire PostgreSQL full text search query language.
A listing of all available resources can be found at the alias _table_metadata.
If you need to download the full resource, read Download resource as CSV.
Results:
The result of this action is a dictionary with the following keys:
Return type: A dictionary with the following keys
Parameters: - fields (list of dictionaries) – fields/columns and their extra metadata
- offset (int) – query offset value
- limit (int) – query limit value
- filters (list of dictionaries) – query filters
- total (int) – number of total matching records
- records (list of dictionaries) – list of matching results
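The limit, offset and total values can be combined to page through a large resource. The sketch below prepares one datastore_search data_dict per page. The helper name search_pages is made up, and the total of 250 stands in for the value a first response would report.

```python
def search_pages(resource_id, total, limit=100, **options):
    """Yield datastore_search data_dicts that together cover `total` rows."""
    for offset in range(0, total, limit):
        page = {"resource_id": resource_id, "limit": limit, "offset": offset}
        page.update(options)  # e.g. filters, q, fields, sort
        yield page


# `total` would come from the "total" key of a first response; 250 is made up.
pages = list(search_pages("RESOURCE-ID", total=250,
                          filters={"key1": "a"}, sort="fieldname1 desc"))
print([page["offset"] for page in pages])
```

Passing a sort option, as in this example, is a sensible precaution so that rows keep the same order from one page to the next.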
- ckanext.datastore.logic.action.datastore_search_sql(context, data_dict)¶
Execute SQL queries on the DataStore.
The datastore_search_sql action allows a user to search data in a resource or connect multiple resources with join expressions. The underlying SQL engine is the PostgreSQL engine. There is an enforced timeout on SQL queries to avoid an unintended DOS. DataStore resources that belong to a private CKAN resource cannot be searched with this action. Use datastore_search() instead.
Note
This action is only available when using PostgreSQL 9.X and using a read-only user on the database. It is not available in legacy mode.
Parameters: sql (string) – a single SQL select statement
Results:
The result of this action is a dictionary with the following keys:
Return type: A dictionary with the following keys
Parameters: - fields (list of dictionaries) – fields/columns and their extra metadata
- records (list of dictionaries) – list of matching results
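Because each table is named after its resource id, the table name in the SQL statement has to be written as a quoted identifier. The following is a minimal sketch of building such a statement. The helper names and the resource id are made up for illustration.

```python
def quote_identifier(name):
    """Quote a table or column name as a PostgreSQL identifier."""
    return '"' + name.replace('"', '""') + '"'


def sql_payload(resource_id, columns, limit):
    """Build a datastore_search_sql data_dict holding a single SELECT."""
    column_list = ", ".join(quote_identifier(column) for column in columns)
    sql = "SELECT {} FROM {} LIMIT {}".format(
        column_list, quote_identifier(resource_id), int(limit))
    return {"sql": sql}


# Hypothetical resource id; real ids come from your CKAN instance.
payload = sql_payload("5a3b7c1e-0000-4000-8000-000000000000", ["foo", "bar"], 5)
print(payload["sql"])
```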
- ckanext.datastore.logic.action.datastore_make_private(context, data_dict)¶
Deny access to the DataStore table through datastore_search_sql().
This action is called automatically when a CKAN dataset becomes private or a new DataStore table is created for a CKAN resource that belongs to a private dataset.
Parameters: resource_id (string) – id of resource that should become private
- ckanext.datastore.logic.action.datastore_make_public(context, data_dict)¶
Allow access to the DataStore table through datastore_search_sql().
This action is called automatically when a CKAN dataset becomes public.
Parameters: resource_id (string) – id of resource that should become public
Download resource as CSV¶
A DataStore resource can be downloaded in the CSV file format from {CKAN-URL}/datastore/dump/{RESOURCE-ID}.
Fields¶
Fields define the column names and the type of the data in a column. A field is defined as follows:
{
"id": # a string which defines the column name
"type": # the data type for the column
}
Field types are optional and will be guessed by the DataStore from the provided data. However, setting the types ensures that future inserts will not fail because of wrong types. See Field types for details on which types are valid.
Example:
[
{
"id": "foo",
"type": "int4"
},
{
"id": "bar"
# type is optional
}
]
Records¶
A record is the data to be inserted in a DataStore resource and is defined as follows:
{
"<id>": # data to be set
# .. more data
}
Example:
[
{
"foo": 100,
"bar": "Here's some text"
},
{
"foo": 42
}
]
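Before sending records, a client can check that every record key matches one of the declared field ids. This is only a client-side sanity check sketched for illustration, not the validation the DataStore itself performs; the helper name unknown_keys is made up.

```python
def unknown_keys(fields, records):
    """Return the record keys that match no field id, sorted."""
    known = {field["id"] for field in fields}
    used = {key for record in records for key in record}
    return sorted(used - known)


# The fields and records from the examples above.
fields = [{"id": "foo", "type": "int4"}, {"id": "bar"}]
records = [{"foo": 100, "bar": "Here's some text"}, {"foo": 42}]

print(unknown_keys(fields, records))
print(unknown_keys(fields, [{"foo": 1, "baz": 2}]))
```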
Field types¶
The DataStore supports all types supported by PostgreSQL as well as a few additions. A list of the PostgreSQL types can be found in the type section of the documentation. Below you can find a list of the most common data types. The json type has been added as a storage for nested data.
In addition to the types listed below, you can also use array types. They are defined by prepending a _ or appending [] or [n], where n denotes the length of the array. An arbitrarily long array of integers would be defined as int[].
- text
- Arbitrary text data, e.g. Here's some text.
- json
- Arbitrary nested json data, e.g. {"foo": 42, "bar": [1, 2, 3]}. Please note that this type is a custom type that is wrapped by the DataStore.
- date
- Date without time, e.g. 2012-5-25.
- time
- Time without date, e.g. 12:42.
- timestamp
- Date and time, e.g. 2012-10-01T02:43Z.
- int
- Integer numbers, e.g. 42, 7.
- float
- Floats, e.g. 1.61803.
- bool
- Boolean values, e.g. true, 0
You can find more information about the formatting of dates in the date/time types section of the PostgreSQL documentation.
Resource aliases¶
A resource in the DataStore can have multiple aliases that are easier to remember than the resource id. Aliases can be created and edited with the datastore_create() API endpoint. All aliases can be found in a special view called _table_metadata. See Internal structure of the database for full reference.
HTSQL support¶
The ckanext-htsql extension adds an API action that allows a user to search data in a resource using the HTSQL query expression language. Please refer to the extension documentation to know more.
Comparison of different querying methods¶
The DataStore supports querying with multiple API endpoints. They are similar but support different features. The following table gives an overview of the different methods.
Criterion | datastore_search() | datastore_search_sql() | HTSQL |
---|---|---|---|
Ease of use | Easy | Complex | Medium |
Flexibility | Low | High | Medium |
Query language | Custom (JSON) | SQL | HTSQL |
Join resources | No | Yes | No |
Internal structure of the database¶
The DataStore is a thin layer on top of a PostgreSQL database. Each DataStore resource belongs to a CKAN resource. The name of a table in the DataStore is always the resource id of the CKAN resource for the data.
As explained in Resource aliases, a resource can have mnemonic aliases which are stored as views in the database.
All aliases (views) and resources (tables, that is, relations) of the DataStore can be found in a special view called _table_metadata. To access the list, open http://{YOUR-CKAN-INSTALLATION}/api/3/action/datastore_search?resource_id=_table_metadata.
_table_metadata has the following fields:
- _id
- Unique key of the relation in _table_metadata.
- alias_of
- Name of the relation that this alias points to. This field is null if and only if the name is not an alias.
- name
- Contains the name of the alias if alias_of is not null. Otherwise, this is the resource id of the CKAN resource for the DataStore resource.
- oid
- The PostgreSQL object ID of the table that belongs to name.
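The name and alias_of fields are enough to turn an alias back into a resource id. The sketch below works on rows shaped like _table_metadata records. The rows themselves, the alias name and the helper resource_id_for are hypothetical.

```python
def resource_id_for(rows, name):
    """Return the resource id that `name` (an alias or an id) refers to."""
    for row in rows:
        if row["name"] == name:
            # alias_of is null (None) unless the row describes an alias.
            return row["alias_of"] or row["name"]
    raise KeyError(name)


# Hypothetical rows, reduced to the two fields used here.
rows = [
    {"name": "5a3b7c1e-0000-4000-8000-000000000000", "alias_of": None},
    {"name": "air_quality", "alias_of": "5a3b7c1e-0000-4000-8000-000000000000"},
]

print(resource_id_for(rows, "air_quality"))
```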