Infra/Minutes 20121108

Yesterday, November 8th, we had an informal admin meeting in Pfronten, Germany. Participants were Alin Creţu, Christian Lohmaier, Robert Einsle, Alexander Werner and myself.

We unanimously agreed on the following proposal, which I would like to share with you.

The basic problem is that our infrastructure needs to keep pace with the community's growth, so it needs to grow rapidly. For that to happen, we need to establish better structures than we have now, and better internal communication.

To solve that issue, we propose the following:

= 1. categorizing and prioritizing of services =

We run various services. Some of them are crucial (like gerrit, email and the download page), others are important, but not that crucial. We need to get an overview of all running services and attach a category/priority to them.

For normal services, the "four eyes principle" should be enforced, i.e. at least two people are fully in the know.

For crucial services, a six eyes principle should be enforced, i.e. at least three people are fully in the know.

Exceptions can be granted when needed, but the above should be the general rule.
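
The rule above could be checked automatically once the service overview exists. The following is only a minimal sketch under assumptions: the `Service` structure, the admin names and the category assigned to the wiki are hypothetical placeholders, not part of the proposal.

```python
from dataclasses import dataclass, field

# Minimum number of fully informed admins per category, as stated above:
# "four eyes" means two people, "six eyes" means three people.
REQUIRED_ADMINS = {"normal": 2, "crucial": 3}


@dataclass
class Service:
    name: str
    category: str  # "normal" or "crucial"
    admins: list = field(default_factory=list)  # people fully in the know
    exception_granted: bool = False  # an explicit exception to the rule


def undercovered(services):
    """Return the names of services with fewer informed admins than required."""
    result = []
    for svc in services:
        if svc.exception_granted:
            continue  # exceptions are allowed, so skip these
        # Use a set so the same person is not counted twice.
        if len(set(svc.admins)) < REQUIRED_ADMINS[svc.category]:
            result.append(svc.name)
    return result


# Hypothetical inventory; names and category assignments are placeholders.
inventory = [
    Service("gerrit", "crucial", ["admin_a", "admin_b"]),
    Service("wiki", "normal", ["admin_a", "admin_c"]),
    Service("example-service", "normal", ["admin_b"], exception_granted=True),
]
print(undercovered(inventory))
```

In this sketch only gerrit is reported, because a crucial service needs three informed people and the placeholder data lists two.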

= 2. creation of a core team =

To reflect the actual working areas and available bandwidth, we propose setting up a core team (the name is a working title), composed of those who are experienced and have been involved in many parts of the infrastructure; in other words, those who do not focus on just one aspect, but have "the big picture".

Addendum, November 21st: The core team is:


 * Alin Creţu
 * Florian Effenberger
 * Robert Einsle
 * Christian Lohmaier
 * Alexander Werner

= 3. policy for new software and services =

New software and services should only be installed after the majority of the core team approved them.
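
As a small illustration of the approval rule, the sketch below treats "majority" as strictly more than half of all core team members. That reading is an assumption; the minutes do not say how abstentions or absent members are counted.

```python
def majority_approved(approvals, team_size):
    """Return True when strictly more than half of the core team approved."""
    if team_size <= 0 or not 0 <= approvals <= team_size:
        raise ValueError("approvals must be between 0 and team_size")
    # Integer comparison avoids rounding problems with odd team sizes.
    return approvals * 2 > team_size
```

With the five-member core team listed above, `majority_approved(3, 5)` is true and `majority_approved(2, 5)` is false.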

= 4. defining responsible parties =

For all servers, VMs and services, at least one responsible party needs to be defined. Responsible means that it is their responsibility to keep the service running and to ensure proper updating, especially with regard to security fixes.

= 5. advance update planning =

For updates to be applied, especially those involving a restart of services or the reboot of an entire machine, a proper update and reboot procedure should be put in place. This also includes a fixed update window for regular updates, during which downtime of services can be expected. However, in the case of security updates, there is one major rule: safety and security first. In other words, as soon as a crucial update is available, it will be installed without delay.
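
The decision described above can be sketched as follows. The window itself (Sundays, 02:00 to 04:00) is a hypothetical placeholder, since the minutes do not define one, and the function names are made up for illustration.

```python
from datetime import datetime, time

# Hypothetical fixed window: Sundays, 02:00 to 04:00. Not set by the minutes.
WINDOW_WEEKDAY = 6  # datetime.weekday(): Monday is 0, Sunday is 6
WINDOW_START = time(2, 0)
WINDOW_END = time(4, 0)


def in_update_window(now):
    """Return True if 'now' falls inside the regular update window."""
    return (
        now.weekday() == WINDOW_WEEKDAY
        and WINDOW_START <= now.time() < WINDOW_END
    )


def should_install_now(is_security_update, now):
    """Security updates go in immediately; regular ones wait for the window."""
    if is_security_update:
        return True
    return in_update_window(now)
```

For example, `should_install_now(True, datetime(2012, 11, 8, 15, 0))` is true regardless of the time, while a regular update at that same moment would wait for the next window.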

= 6. documentation =

New services will only be installed after they have been properly documented. Exceptions can be granted by the core team. General rule: no production services without proper documentation.

We will come up with a proposal on how to document. Wiki and ODT have not been working out, so we will evaluate other options. One idea was to use RST (reStructuredText) files with Sphinx. Those text files could be managed in a git repository.
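
If the Sphinx idea is pursued, the documentation repository would need a `conf.py` next to the RST files. The following is only a minimal sketch: the project title and author are hypothetical placeholders, and no decision on tooling has been made.

```python
# conf.py: minimal Sphinx configuration (sketch, values are placeholders)

project = "Infrastructure Documentation"  # hypothetical title
author = "Admin Team"  # hypothetical author string

# No Sphinx extensions are needed for plain RST pages.
extensions = []

# Keep generated output out of the source scan.
exclude_patterns = ["_build"]

# Theme for the HTML output.
html_theme = "alabaster"
```

With such a file and an `index.rst` in the same directory, running `sphinx-build -b html . _build/html` would produce browsable HTML from the RST sources.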

We will come up with a proper template, e.g. for layout, but also with some basic principles (paths, commands, scriptable configuration and the like) for documentation.

In addition, etckeeper and git will be used for tracking changes and managing the configuration.

Furthermore, we will define policies on when to use source packages, distribution packages, or a self-hosted repository.

= 7. OTRS =

We will make more use of OTRS in the future, especially for change management.

= 8. regular meetings =

Since e-mail fills everyone's inbox, it has become very hard to keep up with all important matters. We therefore plan to hold admin phone conferences at least monthly to keep up with recent developments.

In addition, depending on time availability and budget, we plan to have one in-person meeting per quarter.

= 9. todo/task management =

Every admin currently has their very own todo list. We will try to make these lists public, using one common tool.

= 10. housekeeping =

We will check the current recipient list of the internal admin list, removing those who have not been active for months.

= 11. adding new team members =

In addition, we want to do a better job of involving new participants. Due to a lack of time, structure and VMs, we have failed at that so far, but it is important for the future growth of the admin team.

As a general rule, we will be very careful about granting root access to all machines. There is no need to grow the current list of account holders. However, we are trying to virtualize/jail more and more services, so that new team members can, for example, take full responsibility for certain VHosts, the wiki and other parts, without being given too much access.

The rationale is that TDF already hosts a lot of crucial infrastructure as well as confidential items, and we need to find the right balance between being open and inclusive and creating security risks.