The Hash impossibility
2022-01-24 22:40 #commit
I want each post to have a snapshot of the blog's code at the time it was created. Until now I did this by storing the hash of the post's commit in a database field. But that has a problem: the hash only exists after the commit is created, so I needed a second commit to store the hash of the first one in the database. The history was turning into something like this:
...
5997d1a Post commit number
8406233 New post
a7b89b6 Post commit number
d7cb3c3 New post
f37c21f Post commit number
a3d8950 New post
...
It would be better to create a single commit that already included its own hash in the database. For that, I would need to know the hash before committing the changes. But that's almost impossible: git computes the hash with a hash function over the commit's content, and the database is part of that content. I've seen this formula here, but I don't know very much about git internals and maybe it's wrong:
(printf "<type> %s\0" $(git cat-file <type> <ref> | wc -c); git cat-file <type> <ref>) | sha1sum
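A minimal sketch to sanity-check that formula without needing a repository. It applies the formula to a blob whose content is made up for the example (the word hello plus a newline). The helper name git_blob_hash is hypothetical, not a git command, and the sketch assumes sha1sum from coreutils, the same tool used later in this post.

```shell
# Hash stdin the way the formula says git hashes a blob:
# a "blob <size>\0" header followed by the raw content.
# git_blob_hash is a made-up helper name, not a git command.
git_blob_hash() {
  tmp=$(mktemp)
  cat > "$tmp"                          # buffer stdin: the size goes first
  size=$(wc -c < "$tmp" | tr -d ' ')    # byte count (tr strips BSD wc padding)
  { printf 'blob %s\0' "$size"; cat "$tmp"; } | sha1sum | cut -d' ' -f1
  rm -f "$tmp"
}

printf 'hello\n' | git_blob_hash
```

This prints ce013625030ba8dba906f756967f9e9ca394464a, which is the same value that git hash-object reports for a file containing only that line, so the formula does seem to hold at least for blobs.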
Hash functions are designed to generate very different outputs for similar inputs. For example:
echo "Hello" | sha1sum
1d229271928d3f9e2bb0375bd6ce5db6c6d348d9 -
echo "hello" | sha1sum
f572d396fae9206628714fb2ce00f72e94f2258f -
And they have another important property: it's almost impossible to find two different inputs that have the same output. If there is a way to do it faster than trying inputs at random, the hash algorithm is considered "broken" and must be replaced.
git uses SHA-1 for hashing. SHA-1 has not been considered secure since 2005 and was deprecated in 2011. That means that, in theory, I should be able to "break" the algorithm and generate content with the same hash. Anyway, I've read that git is introducing SHA-256, so that solution would help me only for some time.
Also, what I really want is a commit hash that, once stored in the database, produces a commit with that very same hash. I don't even know if that's simpler or more difficult than finding a collision, but it won't be easy anyway.
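The circular dependency can be seen with a toy experiment. This is only a sketch: the two-line "content" below is invented for illustration and is not what a real commit object looks like, and hash_with_embedded and example_post are hypothetical names. The idea is to store a guessed hash inside the content, hash it, and then retry with the hash that came out.

```shell
# Toy stand-in for "content that stores its own hash".
# The content format and the names here are invented for illustration.
hash_with_embedded() {
  printf 'post: example_post\nsnapshot: %s\n' "$1" | sha1sum | cut -d' ' -f1
}

guess=0000000000000000000000000000000000000000
for i in 1 2 3 4 5; do
  actual=$(hash_with_embedded "$guess")
  echo "stored $guess -> actual $actual"
  guess=$actual    # retry, storing the hash we just got
done
```

Every round the stored value and the actual hash disagree, because changing the stored value changes the content, and that changes the hash again. Retrying like this is just guessing at random among 2^160 possible values.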
The solution is to use another tool for "marking" the post, and I think that the right approach is to use tags.
I haven't used them very much while working with git. I think the only time I've used them was when I released some version of Ruby's Amplitude gem and, if you look at its history, there is a "list" tag there that doesn't seem to fit.
Anyway, there are two types of tags in git: lightweight and annotated. Usually, it's better to use annotated tags because they carry more information than lightweight ones. I didn't know why they were better, but I've found this post on Stack Overflow explaining it: with annotated tags you know who created the tag and when. In this case, maybe I could use lightweight tags, because I'm the only user of this repository and the post's date is already in the DB, but using annotated ones won't hurt, so I will use them.
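As a rough sketch of what tagging a post could look like, assuming the tag is created right after the commit that adds the post. The helper name tag_post and the message text are hypothetical; only git tag with -a and -m is git's own.

```shell
# tag_post is a hypothetical helper; the message format is just an example.
# Run it inside the blog's repository right after committing the new post.
tag_post() {
  slug=$1
  # -a creates an annotated tag, which stores the tagger, the date
  # and a message. Without -a or -m the tag would be lightweight.
  git tag -a "$slug" -m "Snapshot for post: $slug"
}
```

After something like tag_post example_post, running git cat-file -t example_post should print tag for an annotated tag, while a lightweight tag would print commit.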
So, from now on, snapshots on the blog will be tagged with the post's slug, and the git history will be clearer. If I want to see where the posts are, I can just run git log --tags --simplify-by-decoration --pretty="format:%ci %d", which prints:
2022-01-24 21:53:01 +0100 (tag: new_year_resoultion_2022)
2022-01-08 13:57:47 +0100 (tag: retaking_the_blog)
2020-03-08 22:16:08 +0100 (tag: creating_a_ssl_certificate)
2020-03-01 17:18:59 +0100 (tag: ssl_concepts)
2020-03-01 00:04:58 +0100 (tag: an_agents_interface)
2020-03-01 00:02:27 +0100 (tag: a_little_cypress)
2020-02-29 23:16:53 +0100 (tag: markdown_format)
2020-02-29 13:29:49 +0100 (tag: blogs_database)
2020-02-18 14:49:50 +0100 (tag: importing_things)
2020-01-18 20:22:20 +0100 (tag: mercadona_from_mobile)
2020-01-11 19:28:32 +0100 (tag: flask_and_freeze)
2020-01-01 14:07:23 +0100 (tag: new_years_resolutions)
2019-12-31 17:16:20 +0100 (tag: simpler_things)
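If a commit hash is ever needed again, for example to link a post to its snapshot, it can be recovered from the tag. A minimal sketch, assuming the tag name is exactly the post's slug; snapshot_commit is a hypothetical helper name.

```shell
# snapshot_commit is a hypothetical helper name.
# Prints the hash of the commit that a post's tag points at.
snapshot_commit() {
  # rev-list follows an annotated tag down to its commit,
  # so this works for both lightweight and annotated tags.
  git rev-list -n 1 "$1"
}
```

For example, snapshot_commit simpler_things would print the commit behind the oldest tag in the list above.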