2024 0607 Technical solutions to social problems

Being both disagreeable and pedantic, I am always delighted to find counterexamples to received wisdom. It is often said that there are no technical solutions to social problems. This is not, strictly speaking, true. I can think of at least one, from a previous employer.

At this past job, I was on the systems team, where we managed a fleet of several thousand bare metal Linux systems in datacenters around the world. When I came to work there, the firm was booting these machines over the network. The boot servers sent a kernel and a disk image, and each machine held the disk image in RAM and never loaded an operating system from local disk. Right after configuring networking, a machine would make an HTTP request to a provisioning server, which would run an Ansible playbook against it, configuring it the rest of the way and making it available for applications. The machines did all have local disks, and persistent storage was symlinked or bind-mounted by the Ansible playbooks where required.
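For readers who have not seen this pattern, here is a minimal sketch of what the check-in step might look like from the booted machine's side. It is an illustration under stated assumptions, not the firm's actual code: the URL, the JSON payload, and the function names are all hypothetical, and the original request may have looked quite different.

```python
import json
import socket
import urllib.request

# Hypothetical endpoint: the real provisioning URL is not given in this post.
PROVISION_URL = "http://provision.example.internal/api/register"


def build_checkin_request(hostname: str, url: str = PROVISION_URL) -> urllib.request.Request:
    """Build the POST a freshly booted host could send to ask for configuration."""
    # Assumed payload: just enough for the server to know which host to configure.
    body = json.dumps({"hostname": hostname}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_in(timeout: float = 10.0) -> int:
    """Send the check-in and return the HTTP status code."""
    request = build_checkin_request(socket.gethostname())
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

In a setup like this, something such as `check_in()` would run once from an early boot unit, after networking is up. The provisioning server would then be responsible for running the playbook against the host that called in.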

One of the reasons they built this system, which they spent considerable resources running and maintaining, was to enforce a rule that all changes had to go into Ansible as the config management source of truth. Whether we liked it or not, users who weren’t on the systems team, including application admins, sometimes had root access or other ways to affect the operating system. As everyone’s work was very important and their time in high demand, users would sometimes forget to add their changes to Ansible, which took time to do and a lot of time to test. If this state were allowed to continue for long, these users might entirely forget what they had done, and it would not be possible to recover a system without hours or days of trial and error. This was not a hypothetical problem. But when individually made changes were guaranteed to be reset the next time a system booted, users no longer added their changes to Ansible only if they got around to it; they did so in order to save their work.
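The incentive described above can be shown with a toy model. This is only a sketch: the `Host` class and the setting names are invented for illustration, and plain dictionaries stand in for the netboot image and the Ansible playbook.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    image: dict      # settings baked into the netboot disk image
    playbook: dict   # settings applied by config management on every boot
    running: dict = field(default_factory=dict)

    def boot(self) -> None:
        # The root filesystem lives in RAM, so every boot starts from the
        # image and then applies the playbook; nothing else carries over.
        self.running = {**self.image, **self.playbook}

    def manual_change(self, key: str, value: str) -> None:
        # An ad hoc edit by someone with root: it changes only the running state.
        self.running[key] = value


host = Host(image={"sshd": "on"}, playbook={"ntp": "pool.example.org"})
host.boot()

host.manual_change("vm.swappiness", "10")
host.boot()  # reboot: the manual edit is gone
assert "vm.swappiness" not in host.running

host.playbook["vm.swappiness"] = "10"  # the same change, recorded in the playbook
host.boot()
assert host.running["vm.swappiness"] == "10"
```

The only way to make a change outlast a reboot in this model is to put it in `playbook`, which is the behavior the design was meant to encourage.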

The problem was social: users were supposed to do something in a more careful but slower way, but sometimes did it in a less careful and faster way. The solution was technical: make it not a matter of policy, which exists in the social realm, and which can be overridden by stakeholders in positions of authority during crises, but a matter of whether it works technically at all. Saying “our policy does not permit it” is a social solution; “it is simply impossible” is a technical one.
