Using sqlite3
as a notekeeping document graph with automatic reference indexing
The full-text functionality of sqlite3
along with the powerful SQL indexing and trigger features allows us to easily keep notes with references in an sqlite3
database. In this document I present a workflow for doing so.
First, let’s examine the pros and cons of this workflow:
Pros:
- Your notes are kept in one file, and are portable on every OS and CPU architecture
sqlite3
supports. - You can write your notes in any kind of plain text format, for example
troff
ormarkdown
. - You can compose, edit, download your files with the
sqlite3
CLI and your editor of choice. - You get reference link calculations for free.
- You get full-text search for free.
- You can group notes into collections.
- You can optionally tag your notes with keywords.
- You can attach any (binary or not) file to those collections and refer to them from your notes.
Cons:
- Your database may get corrupted (versus one note file getting corrupted) but it’s mostly recoverable. Always backup in anything you do.
- You will need familiarity with the command line and SQL, but with the proper mindset this shouldn’t be a problem.
- References in text files do not obey
FOREIGN KEY
constraints; if you delete a note, the dangling reference in text and in indices remains. You can easily search for and fix them inAFTER INSERT
triggers, of course.
The schema
You can use anything you like as long as it has a basic property: your notes
table must have a unique id that you can reference in plain text.
For this demo, I use the bibliothecula
schema which has UUIDs for primary keys and allows you to tag or add other arbitrary metadata (and files) to each document. In this model, the document is our notes collection and the files of this document can include plain text ones that are our notes.
The table used for files in bibliothecula
is BinaryMetadata
; since it’s binary it can also hold plain text data. This is the CREATE
statement for BinaryMetadata
:
CREATE TABLE IF NOT EXISTS "BinaryMetadata" (
"uuid" CHARACTER(32) NOT NULL PRIMARY KEY,
"name" TEXT NULL,
"data" BLOB NOT NULL,
"compressed" BOOLEAN NOT NULL DEFAULT (0),
"created" DATETIME NOT NULL DEFAULT (strftime ('%Y-%m-%d %H:%M:%f', 'now')),
"last_modified" DATETIME NOT NULL DEFAULT (strftime ('%Y-%m-%d %H:%M:%f', 'now')),
CONSTRAINT uniqueness UNIQUE ("name", "data")
);
The name
column can hold our filename. What about mime type? Furthermore, what if I want to know the size of a file, do I have to calculate the data
length every time? †
The default sqlite
distribution includes the JSON1
extension which allows us to place structured data in a column, so I chose to store filename, mime type and size in bytes in the name
column. Examples:
[{"content_type":"text/markdown","filename":"2021-06-20.md","size":0},
{"content_type":"text/markdown","filename":"dataintegrity.md","size":566},
{"content_type":"text/markdown","filename":"exports.md","size":34},
{"content_type":"text/markdown","filename":"generate_tool.md","size":229},
{"content_type":"text/markdown","filename":"shell.md","size":240},
{"content_type":"text/plain","filename":"","size":97632},
{"content_type":"text/plain","filename":"test.txt","size":91} ]
Again, this is only for convenience. Our notes don’t have to have filenames if they already have a unique identifier, and there’s no restriction for filename UNIQUENESS
anywhere.
You can create JSON objects with the json_object
SQL function, and extract fields with the json_extract
SQL function:
SELECT json_extract(name, '$.content_type') FROM BinaryMetadata WHERE json_valid(name) LIMIT 1;
INSERT INTO BinaryMetadata(uuid,name,data) VALUES ('623fec5beac242fcb0b0d17ada20e2b5',json_object('content_type','text/plain','filename','file.txt','size',LENGTH(readfile('file.txt'))),readfile('file.txt'));
Note the use of json_valid
to ignore non-JSON names, and also the use of readfile
: this is a CLI-only function allowing you to read files as BLOB
s. We can use it to quickly attach files to our note database.
The indices
Full-text search
I will use the fts5
extension, included by default nowadays in sqlite3
. To create an fts5
index, I issue:
Note that this doesn’t seem limited to our text notes; indeed I can produce the full text of other attached binary files like PDFs and index them too, or maybe at a dedicated fts5
table as well.
The fts5
index needs to be filled manually by us, and we can use SQL triggers to automate this.
An INSERT
trigger for BinaryMetadata
might look like:
CREATE TRIGGER fts_insert
AFTER INSERT ON BinaryMetadata
WHEN json_valid(NEW.name)
BEGIN
INSERT INTO
zettel_fts(uuid, title, filename, full_text)
VALUES (NEW.uuid, NEW.name, json_extract(NEW.name, '$.filename'), NEW.data);
END;
I insert some dummy values:
INSERT INTO
BinaryMetadata(uuid,name,data)
VALUES
('623fec5beac242fcb0b0d17ada20e2b5',
json_object('content_type','text/plain','filename','file.txt','size',5),
'sun bicycle trigger journal'),
('37a3ff02c8cd4d7fb3280e5b160d1389',
json_object('content_type','text/plain','filename','book_ref.md','size',1),
'I have no references and I must scream'),
('b0697d8d76ae41bf8e942d505aff8963',
json_object('content_type','text/plain','filename','note.md','size',1),
'I refer to 623fec5b-eac2-42fc-b0b0-d17ada20e2b5 and also 37a3ff02c8cd4d7fb3280e5b160d1389');
Querying the index is as simple as SELECT
ing from it:
SELECT uuid, snippet(zettel_fts, -1, '<mark>', '</mark>', '[...]', 10) AS snippet FROM zettel_fts('journal');
uuid snippet
-------------------------------- ----------------------------------------
623fec5beac242fcb0b0d17ada20e2b5 sun bicycle trigger <mark>journal</mark>
Read the fts5
documentation here.
Reference index
First we need a way to recognize UUIDs in text. For this purpose I create a text tokenizer using the fts3
text tokenizers that spouts tokens that look like UUIDs:
CREATE VIRTUAL TABLE IF NOT EXISTS uuidtok USING fts3tokenize(
'unicode61',
"tokenchars=-1234567890abcdefABCDEF",
"separators= "
);
The UUIDs are spouted when you query the tokenizer. Querying a tokenizer in general is done with a special SELECT
:
token
-------
sun
bicycle
trigger
journal
Now, to get stuff that look like UUIDs from the tokenizer:
SELECT DISTINCT REPLACE(token, '-', '') as ref FROM uuidtok
WHERE input =
(SELECT data FROM BinaryMetadata WHERE uuid =
'b0697d8d76ae41bf8e942d505aff8963')
AND LENGTH(REPLACE(token, '-', '')) = 32
This returns:
ref
--------------------------------
623fec5beac242fcb0b0d17ada20e2b5
37a3ff02c8cd4d7fb3280e5b160d1389
Note the use of REPLACE
to exclude any hyphens from our processing.
Now we can create a reference index that we can update on insert/update/delete with triggers:
We can make triggers that use the SELECT DISTINCT
above along with a check that the reference target exists by adding
By having two columns in refs_fts
, referrer
and target
we can get all references inside a note and all back references from other notes.
Examples
INSERT INTO refs_fts(target, referrer)
SELECT DISTINCT REPLACE(tok.token, '-', '') AS target,
b.uuid AS referrer
FROM uuidtok AS tok,
(SELECT uuid,
data,
json_extract(name, '$.content_type') AS _type
FROM BinaryMetadata
WHERE json_valid(name)
AND _type LIKE "%text/%") AS b
WHERE tok.input=b.data
AND LENGTH(REPLACE(tok.token, '-', '')) = 32
AND EXISTS (SELECT * FROM BinaryMetadata WHERE uuid = REPLACE(tok.token, '-', ''));
referrer
--------------------------------
b0697d8d76ae41bf8e942d505aff8963
target
--------------------------------
623fec5beac242fcb0b0d17ada20e2b5
37a3ff02c8cd4d7fb3280e5b160d1389
Miscellanea
We can read text from the
sqlite3
CLI by justSELECT
ing the data. To save the data, text or any binary into a file use thewritefile
CLI function:To insert a file as a
BLOB
, use thereadfile
CLI function (example fromsqlite3
documentation:To edit a file, use
edit()
(again, CLI only):Yes, that means you can use your editor of choice without problem. You can also just view any file with
edit()
by selecting it without doing anyUPDATE
.sqlite
supports Common Table Expressions, an SQL standard that allows you to query hierarchical relationships like nodes in a graph. That means you can easily find all the notes you can reach with references and back references. For info see the documentationThe full schema required for this article’s examples is here
There’s a web demo of sociologist Niklas Luhmann's zettelkasten you can explore online using
sql.js
,sqlite3
compiled to webassembly (total compressed asset size:16MB
). source code
Epilogue
You can check out the bibliothecula
project if you are interested in small tools to support tagged storage inside sqlite3
databases.