New philosophy articles are at Critical Fallibilism. Since 2021, Curiosity is my personal site for informal or off-topic articles.
Philosopher. Fan of Ayn Rand, Karl Popper & Eli Goldratt.
this site rocks
This article describes some performance problems associated with using the OFFSET SQL keyword for pagination. It also describes a more efficient, but more complicated, approach which the author calls "keyset pagination". Keyset pagination involves selecting only the items that are come after the last item shown on the previous page (in whatever sort order you are using). For example, instead of *OFFSET 30*, where 29 is the item number of the last result on the previous page, you would write *where (sale.date, sale.id) > (?, ?)*, in which case the placeholders represent the date and id of the last sale on the previous page.
Some things to be aware of:
- The SQL for keyset pagination is more complicated than the SQL for offset pagination.
- Most frameworks support only offset pagination, not keyset pagination.
- A SQL query for keyset pagination must be deterministic; that is, it must return the same rows no matter what execution plan is used. This can be achieved by including some unique identifier (e.g. sale.id) in the comparison.
- To simplify the comparison for keyset pagination, you can use a neat SQL feature called "row values" (supported in Postgres and SQLite, among other databases). A row value lets you combine multiple values into a single thing that can be compared like any other value. For instance you can write: *where (sale.date, sale.id) <= (?, ?)*. This combines sale.date and sale.id into a single tuple that can be compared to the given variables. (5, 7) <= (5, 8), but (5, 13) > (5,8).
- To paginate backward you need to change the order of the comparison operator(s).
- You can't easily fetch arbitrary pages with keyset pagination the way you can with offset pagination.
why would you WHERE on *both* date and id?
I wrote a script to remove watermarks from PDF documents. The exact code for the removal depends on the way the watermark is done, but the general outline may be helpful.
> why would you WHERE on *both* date and id?
In my example, id is included because the pagination boundary isn't guaranteed to exactly correspond to changes in the date.
I was assuming that you're paginating through sales in date order, and that the date field was just a date (no time). If you have more fine-grained timestamps that are unique (i.e. no two sales have the exact same timestamp), then you could just use that instead.
To flesh out my example a bit, say your first sale was on 2016-10-01. Suppose you had 15 sales that day, and you're showing 10 sales per page. The SQL query for the first page would have no filter (i.e. no "where" clause), it'd just be:
select * from sale order by date, id limit 10
Say the ID of the 10th sale (the last one) returned from that query is 12345. That ID gets fed back in to the query for page 2, so that page 2 includes the last 5 sales on 2016-10-01 as well as the next 5 sales made after that. The query for the second page looks like
select * from sale where (date,id) > ('2016-10-01', 12345) order by date, id limit 10.
ok i get why not to do date only if it's just a date not a datetime. but if they are in the correct order by ID, why not just WHERE on ID without date?
> ok i get why not to do date only if it's just a date not a datetime. but if they are in the correct order by ID, why not just WHERE on ID without date?
If "they are in the correct order by ID" means that id1 >= id2 just when date1 >= date2, then yes, your idea would work. However, that wasn't assumed in my example and it might not be the case. For example suppose that ...
- you are paginating sales in date order, with 10 items per page
- sale.ID is an auto-incrementing integer
- you insert 10 sales for the date 2016-10-02, and then subsequently you insert 10 sales for the date 2016-10-01
In this case, the 2016-10-01 sales would all have higher IDs than the 2016-10-02 sales, so just filtering with WHERE ID > ? would incorrectly exclude the 2016-10-02 sales from showing up on page 2.
i think the case you're talking about (IDs out of order, but never out of order across day boundaries) is rare and shouldn't be assumed when trying to present an approach for general use.
to take this blog as a typical example, IDs are occasionally out of chronological order and when they are it can be by multiple days. and even if the IDs are only out of order by a few minutes, those minutes can go over a day boundary.
> i think the case you're talking about (IDs out of order, but never out of order across day boundaries) is rare and shouldn't be assumed when trying to present an approach for general use.
I agree. Maybe date wasn't a good example to show why ID is necessary. How about this: people want to paginate on all sorts of things that won't correspond to ID. What if you view your sales in order by amount? The amount won't correspond to the ID, and yet you'll still need to include ID in the where clause to break ties. Another example would be when you want to order by multiple fields.
The general principle is that for keyset pagination, your query needs to be deterministic. That is, it needs to return the same results under all possible execution plans. One way to make the query deterministic is to add a column that makes the results unique and include that column in the sort order and where clause.
i'd rather break ties on the purchase amounts by datetime, not ID. but yes i get the concept of using ID as basically a random but unchanging (never rerolled) tiebreaker. the original example involving chronology confuses this.
> The general principle is that for keyset pagination, your query needs to be deterministic.
Actually, this is true for OFFSET pagination as well.
Want to discuss this? Join my forum.
(Due to multi-year, sustained harassment from David Deutsch and his fans, commenting here requires an account. Accounts are not publicly available. Discussion info.)