Tag Archives: Critical team

Signature server back in operation


The server responsible for signing, on demand, the certificates issued by CAcert has two hard disks in a redundant configuration. When a malfunction occurs, no remote maintenance is possible, as the machine is intentionally not connected to the network. Only a serial cable is used to exchange requests and responses with the rest of our infrastructure; no interactive connection is possible over this link.
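
As a rough illustration of this request/response pattern, the sketch below shows how a front-end host might pass a signing request over a serial line and read back the answer. It uses the pyserial library; the device path, baud rate and length-prefixed framing are assumptions made for this sketch and do not describe CAcert's actual serial protocol.

    # Hypothetical sketch: passing a signing request to an air-gapped signer
    # over a serial line. Device path, baud rate and framing are illustrative
    # assumptions, not CAcert's actual protocol.
    import serial  # pyserial

    def send_signing_request(csr_pem: bytes, port: str = "/dev/ttyS0") -> bytes:
        """Write a CSR to the serial link and read back the signed certificate."""
        with serial.Serial(port, baudrate=115200, timeout=60) as link:
            link.write(len(csr_pem).to_bytes(4, "big") + csr_pem)  # length-prefixed frame
            length = int.from_bytes(link.read(4), "big")           # may be 0 on timeout
            return link.read(length)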

However, since the 2nd of August, we have been seeing all certificate signing requests being put on hold. The Critical Infrastructure team therefore intervened on site on the 21st of August. A problem in the processing of one of the certificates was the cause of the blockage. This problem has been solved, but remains to be precisely diagnosed. This is a series of failures that we have never seen before.

In light of the two other incidents earlier this year related to the file system of our signature server, we needed to increase its resilience. So on 21 August, the Critical Infrastructure team installed a second signature server in the rack as a passive backup to the first. The presence of dedicated serial links to each machine will make it possible in future to switch very quickly to the second signature server in the event of a new problem. In any case, the two servers remain isolated from the network as before.
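
Continuing the previous sketch, the failover idea could look roughly like this: try the dedicated serial link to the primary signer first, and fall back to the standby signer's link if the primary does not answer. The device paths are again only placeholders.

    # Hypothetical failover sketch, reusing send_signing_request() from above.
    # The device paths of the two dedicated serial links are placeholders.
    PRIMARY_PORT = "/dev/ttyS0"   # assumed link to the primary signer
    STANDBY_PORT = "/dev/ttyS1"   # assumed link to the standby signer

    def sign_with_failover(csr_pem: bytes) -> bytes:
        for port in (PRIMARY_PORT, STANDBY_PORT):
            try:
                reply = send_signing_request(csr_pem, port=port)
                if reply:                      # an empty reply means the link timed out
                    return reply
            except (OSError, serial.SerialException):
                pass                           # port missing or unusable; try the next link
        raise RuntimeError("neither signer answered on its serial link")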

We apologise to our members for the inconvenience, and encourage those living in or near the Netherlands to consider working with our Critical Infrastructure team, which would increase our ability to respond quickly.

At the same time, we hope that yesterday’s intervention marks the end of this long and exceptional series.

New signer proves itself in use

Signer is running again


The signer has been running again since yesterday, Friday, at around 13:00 CEST. We then watched the processing for about another hour (while doing other work)… By around 0:30 CEST, all outstanding certificate requests (~3000) had been processed.

Things didn’t quite go as planned in June. As soon as something cannot be done remotely (there is no remote access to critical systems, for security reasons), someone who is authorised to do so has to go to the data centre in the Netherlands. Despite Corona, quarantine, floods, overtime at work and whatever else comes up. That’s maybe two hours of travel, then two hours home again, and in between the actual work, all during the opening hours of the data centre, in their free time, paying for their own train ticket or petrol. It’s not always easy to reconcile all of that. On Friday afternoon, however, the time had come, and the signer has now been running smoothly again for over a day.

As can be seen from the Critical Team’s plan published yesterday, preliminary work is already underway to make the system fully redundant and even more robust, so that users should no longer notice any failures, because nobody has any interest in such failures! We are very sorry that you had to wait so long. At the same time, we thank the small core team who have sacrificed nights and weekends over the last five weeks to get the technology back up and running for the CAcert community!

Engineers nominated

The free certificate authority CAcert is making progress in building up its working groups. In the past few days, the committee approved the appointment of Jan to the post of Critical Engineer.

The appointment of Michaela as Access Engineer was also approved. Both have a broad range of experience and are distinguished by their specialist knowledge and sense of responsibility. We wish both engineers much success and fulfilment in their voluntary work for the CAcert community. These are challenging tasks and come with great responsibility. CAcert offers interested volunteers a variety of tasks, the opportunity to gain exciting experience and stimulating career opportunities.

Software-Assessment-Project reached next milestone

Today’s system log message marks a quantum leap in roughly ten months of project work to make the Software Assessment area auditable.

As many software updates from the software developers are waiting in the queue for testing and review by Software Assessors, the team started this project at the end of last year, in order to:

  • build a new “controlled” test server, operated under the authority of the Software Assessors
  • have it set up by the critical team, doubling as a disaster-recovery test case
  • create a new central repository for all upcoming software projects (including the new software project BirdShack)
  • build a new test team to run the software tests
  • finalise the process with a review of each patch by two Software Assessors (see the sketch after this list)
  • document the patches, the testing, the review and the check by the two Software Assessors
  • bundle the new software revision for transfer to the Critical team
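
As a small illustration of the review gate mentioned in the list above, the sketch below models one possible check: a patch only becomes eligible for the next bundle once it has been tested and signed off by two distinct Software Assessors. The field names are illustrative and do not describe the team's actual tooling.

    # Hypothetical sketch of the release gate: a patch may only be bundled once
    # it has been tested and reviewed by two distinct Software Assessors.
    from dataclasses import dataclass, field

    @dataclass
    class Patch:
        bug_id: str
        tested: bool = False
        reviewers: set = field(default_factory=set)   # names of the reviewing Software Assessors

        def ready_for_bundle(self) -> bool:
            return self.tested and len(self.reviewers) >= 2

    def bundle_candidates(patches):
        """Return the bug ids of all patches that may go into the next revision."""
        return [p.bug_id for p in patches if p.ready_for_bundle()]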

The system log message signals that the first tested and reviewed patches have been received by the critical system webdb and incorporated into production. A new tarball has been generated as the baseline for applying the next patches.
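
For the tarball step itself, one straightforward way to produce such a baseline archive is Python's tarfile module; the directory and file names below are placeholders, not the actual repository layout.

    # Hypothetical sketch: bundle the updated source tree into a tarball that
    # becomes the baseline for the next round of patches. Paths are placeholders.
    import tarfile

    def bundle_revision(source_dir="cacert-devel", out="cacert-release.tar.gz"):
        with tarfile.open(out, "w:gz") as tar:
            tar.add(source_dir, arcname="cacert")   # store the tree under one top-level name
        return out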

My thanks go to all the teams involved:

  • the Software-Assessment-Project team
  • the new software test team
  • the Critical Sysadmins team
  • and last but not least the Software Assessors from the Software Assessment team

Without the assistance of all these people, this project would not have reached this milestone. Thank you, Andreas, for building the project plan and the technical background, and for hosting the current test server. Thank you, Wytze, for all your work building the new test server from scratch, as identical as possible to the production server. Thank you, Michael, for helping us deploy the new git repository and the Testserver-Mgmt-System, so that everybody can start testing without needing console access. Thank you, Markus, for all your time and effort deploying the repository and the test server environment, and for your work, together with Philipp as Software Assessor, to finalise the software update cycle. Thank you, Dirk, for all your suggestions on moving this project forward.

Some more work remains to be done:

  • adding a test signer, so that certificate-related patches can also be tested in the future (Andreas and Markus are working on this)
  • deploying a CI (continuous integration) system for automated testing (Andreas is working on this).

Now the teams have to work through the list of open bugs that need to be pushed through … First of all is the “Thawte” bug … notifying all users who had their Thawte points transferred under the old Tverify programme whether they are affected by the points removal or whether they are safe. Then there is the CCA rollout with a couple of patches, a list of patches related to new policies and sub-policies (e.g. PoJAM, the TTP programme), a list of patches pushed by Arbitration, and so on …

So, folks, let’s have a party tonight: we’ve wiped out one of the biggest audit blockers!