choosing an id - Marcel's blog

# Choosing an identifier for a PostgreSQL table FIXME: - consider vidual ambigious chars https://gajus.com/blog/avoiding-visually-ambiguous-characters-in-ids - Ideally, an Id doesn't start with a number, I faced the question of choosing an identifier (a primary key) for a low-volume table in PostgreSQL. The table represents a list of physical products, the database is a single-node instance. One might say, [the boring stuff](http://email.mg.buttondown.email/c/eJyNjrGOwyAQRL_GdKAN2OxSUFxzvxGZBWwkGyKH6JS_PydN2kjTjPRGb6KnhGBQFK9BazB6upDBySrngOYMGFzIYImGEfZFhUfvrcb2V1Xa57KJ1ZtMqElbO2Yb4zmlnBMisnGBmUlsfu39dh_Mz6B_z-ycH3VLT8VtPyuvrd2TDO0odZE98Vrb1panOPzLcHqPVGpJh4pJdB-QLMzjKCGClePFsQyTI-kyZYczxsAoPjev75vXEr8bvumv0H_vBl5D). For the _type_ of identifier, it seems to me there are four routes that people typically reach out to: 1. [UUID](http://email.mg.buttondown.email/c/eJyNjsuKxCAURL8m7iI3auJ14WI28xuNj2si5DXGEObvO92b3jYUFBRV1IkWSYPULFsBQoAUfYdS9wM3BtAl0N74BANio2AZuT9r3da4XSunxeWZTVZgEiogIAXTu14Y76iTUkUypJNXbLZTrfvRyJ9G_N66rovv21HHQsffzLcy3mHcwnFbOEuhtb4CV13936k9zxz5VJeZFfu6vEEK5TVT4ZFYtV7jAE6pFiIMrepMaH1vsDUJk9FORx80-3A_3tyPHL8bvttfVZ8dWmRU), such as `a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11` 2. [sequential numeric IDs](http://email.mg.buttondown.email/c/eJyNj7lqxDAYhJ_G6iR-y7KOQoUhGwikCMk2qRadtsBXZBmTt493m7QLAwPDDHzjtQwCGoGSpkApNLStZSNaTpQCaSIIq2wELmXFYOqJ3UtZZr8cMwmTSSMatAHBwQIA81wJw2vrrXWMS0e9ikyhUQ-lrFvVdBV9PXUcB1mXrfQ5bD8jWXJ_hn5x22luzznM5R6YYsrvGvC8TyEnR4YyjRVtXrprd_3-uOCvy-db946yvnOcdDmkOYVMfEBFWyE5GMYweOCY1cph2yqJVZTxhBTeOoH-z9weZ27JPzd8tJ-q_gFJRmog), such as `1`,`2`,`3`,`4`,`5` 3. natural keys, that is, domain-specific values that are guaranteed to be unique, such as manufacturer serial number or book ISBN 4. Youtube-like short IDs, such as `1TNhBqYo-6Q`, that can be generated inside Postgres (for example, [pg-shortkey](http://email.mg.buttondown.email/c/eJyNjs1uxCAMhJ8m3IIc_gwHDr30NVY4QIK6CSuWqOrbl-6l15UszYz02Z7obUKQyIoXIARIoRcrURvuHNiQAclRBmPtpODYOF291zPW75OnI5Q7272MLqqAgoQmlxahUkZAPawdJht293vvj-ckPybxOWYrfb-Ir_UYoV-N6tDHNj_32vpX-mHN_50eD1sqZ0mNx8S6J7QGglIzRDCzWtw6k3Z2dtlmhwEjrcj--91e_W4lvrf4ot9CfwFvvlrH)) or in the application layer (for example [nanoid](http://email.mg.buttondown.email/c/eJyNjr1uxCAQhJ_GdEZrDOxSUJwipc4bnFh-zkhnHDmcorx9yDVpT5piRvpmd5KnjLCiqF6BUrAqs9CKxkrngEIBZMcFLNGkYb9JfvR-tHR8N5n3UO9i84MIxZrCxkQCMlEzukAxLDmZsKC4-633z69pvUzqfehW-_ZgGY99hLcfzuflY7gW2lGTOP3f3fHtzLXVfMqURfeMZCFoPUMCO-vFxZmNo9kVKg4DJo4o_sddn-OuNb1WfNIvob9XVVmp)). Now, which route to take? ## Decision criteria table Since I now try a lot to solve problems by forcing two-dimensional tables onto them (FIXME:link), I tried to do the same for this problem. I was able to gather the following criteria: - Ergonomics: Are identifiers short, easy to read, easy to copy and paste? - Exposing sensitive data: Does the identifier exposes the total number of items in the table or the order of tables? - Distributed multi-master database setup: Can the identifiers be created independently on different masters? (not relevant for our use-case though, putting it there for the record) - The expected risk of collision - Incorrect identifiers (relevant only for natural keys): what happens if the identifiers are entered incorrectly - Same sequence used across different tables: this one is related to the use-case of having different tables reaching out to [the same key sequence](http://email.mg.buttondown.email/c/eJyNjsuqwzAMRL8m3sUoiRNbCy-6ub9RbEl5QB6t45Lfv2433RbECIYZ5rB3YqGzavEttC10bd-4zvaDRgQXRrAR4wiDc5WBbdLxlfOx83HtWrawrGr2bBhljLYDaxwSB3KuIUAmRKKB1OrnnB9n1d2q9q_cdV36cZx5SnI-V32kqZh80Fke6r5osWtKErKc8nzJTqLnvK0q-fdmIUmy7IskzaKyj9YNEIypgWGoTYNUxx5djaMb0QbLkaz6gt8_4PeFfyt-0j9F_wFtwGRi). It’s related to the pattern described in [SQL Antipatterns](http://email.mg.buttondown.email/c/eJyNjs1uwyAQhJ_G3EBrfrzsgUMvfY0IDI5RbUhtKr9-SC69RhpppNGM5ovOJgSFLDsJUoKSZrQKzSSIwPoFMFBYYLJ20LDfRfhrrZZYryLS7vPGVhdwlkizTJFGVMEoiouUpLzRNClt2ObW1h7noL4G-d11XZe41xqP5OMp5rr3LNT60-1c69UNyRAoy8_fjfvS8sO3lo5yssO9TjvKkXLJ6RAxsdYJ7AReaw4RJq5HmnkwZDktdiH0GMOM7J_89ia_5fjZ8N3-qPoEM8xi0Q), chapter 6 on how to support variable attributes. Specifically, this criteria is to capture the implications of using sequence identifiers for the _concrete table_ approach that is described in the book. This list is probably incomplete, but it sufficed for our use-case to explore the problem space a bit deeper and get a feel of the terrain. Here’s an overview of all criteria against all the types: ![Screenshot 2022-03-25T19.29.png](https://ci4.googleusercontent.com/proxy/f_ODjjOBcYiFaPs03FVnD1ltXD6LoMXKqwef-6txQ4KFCWJbpbgRu8oTS41IpTG4_OSviFW_7sipZjSgvLfHfOmEA-pvXFd4spzbf6rQiSiHLj9I2aQ2_Sg7qtJBMCEpng2sMpwl=s0-d-e1-ft#https://buttondown.s3.amazonaws.com/images/c768aec7-92f7-461f-96b0-8244c660c63b.png) You can check the live version of the table here (FIXME: add link) (Feel free to comment on the cells though). ## Takeaways - It seems to me that there’s no ideal solution and the answer depends on the specific use case at hand. - A good starting choice might be the natural keys. The upside of natural keys is that the identifiers are probably much more aligned with what the business or customers use to identify the thing. The cons are the one that needs to deal with incorrect keys, so the choice might be easier if incorrect keys don’t occur or the total cost of having an incorrect key is negligible. - If natural keys are not possible, one needs to reach out to generated keys, that is, UUIDs, sequences or the YouTube-like. - Out of the generated keys, if ergonomics is not relevant, UUIDs seem to be the beaten path. - If ergonomics is relevant and exposing the sequence is not an issue, sequential IDs seem to be a good fit: great ergonomics and no risk of collision. One can even prevent the clients from seeing the sequence by [obfuscating](http://email.mg.buttondown.email/c/eJyNjsEOgyAQRL9GbpItILscOPTS3zCsgJK02ihNf7_qpVeTuUwy82aip4SgURSvQCnQqruRxs5K54BCBmTHGSxRY-A1Sv7Uusxx-c4yvUJ5iskHbenmhsFxHKwxllExYoYuQz5Q4umnWt9bo--NeuyawjaVuMllHXcnVn-AdvyaylzSKmMS1TOShWBMCxFsa_aBljtHrcuUHQaMPKD4v-nPN32J14pn-lL0B08rU-w); however, if such obfuscation is needed, it seems that other types of an identifier might be more suitable for the given use-case. - This brings me to an intriguing approach: the YouTube-like short ID. It seems that with NanoIds, one can [optimize for ergonomics given the collision risk](http://email.mg.buttondown.email/c/eJyNTjtuAyEUPM3SgR5_KCjS5BrWgwdelDUbbbAs-fTZuElraZoZzY9SqB60Zz0pUAq0sjJob52IEQI28DnmBi6ExcDtKvJ9zn3Q_hii3rBvbE2RvNO5ScJiWpGWbKwGpfOoiAgr29I65_fPoj8W9XniWTc8vsS1z_WeRd9PaeDYeSdeyknYkf6qz8Gj9tHrIaiymbIPDtAYDgSOGxkLzzYGHlto0aOnXDz7_3d5_bt0ei_4cr9l_QXkBlsX). This seems to tick all the explored criteria. The downside, however, is that such IDs would be created on the application side; outside of the transactional SQL territory. That might imply difficulties in situations such as complex SQL migrations or complex transactions. Nevertheless, an intriguing choice. - There seem to be other approaches as well. For example, Instagram faced a use-case, where they needed sortable short ids and they went with [a custom-made identifier](http://email.mg.buttondown.email/c/eJyNj71uwzAMhJ_G2mTQsn4HDV36GgElUraAWAlkFX39ulmyBrjlcN-BPIqeHaxO1KhAKViVWfzqjJ1DAI8FXAqpgPV-0nBsc_oZ49Ho8dtmPrDexR7Jr8FZndEa7w3qQkFpDAszAZRkxT3uYzzPaf2a1Pel2s6BW8dDcttqY-61bXN-HFd27tjpsrLSKXHIN7vkYtAtbNBk0eP_8eulzrVV7jOxGDE5bwG1lkBgpV5ClskEL0PxJTh0lLIT7wW314Jbpc-KL_oj9A_4MWcY). Other approaches are mentioned in this [HackerNews discussion](http://email.mg.buttondown.email/c/eJyNjsFqwzAQRL_GulmsV7Z2ddChUPIbQWtJjSCWg6MS-vdRcuk1MIdhmGFe9JwIDKniERDB4DKxocVq54BDBhInGSzzMMP2o-W3tb3G_VF12kK5qot3gDnbTLRydi44FCOJEBjCJGKTuvpLa7f7YL4GPHXV9Ljrv3XfpNTQ9kN32-PS0jaYU4mD-cYFmSdmdfjXS_8-UqklHTom1bwQWwjzPEIEO86TW0dZHI8udwQKFGUl9Y96fqOeS_xs-G5_VH0CGCBdTg)