Because a faulty connection cable between www.cacert.org and the signer forced us to travel to the datacenter this weekend, rather than later this year as planned, we were able to finish this part earlier than expected: we completed the last steps of moving CAcert’s critical servers to more modern hardware and software.
This project started, more or less, in May 2020 when the signer’s power board broke just before the Corona lockdown began. During that visit the old signer was replaced by a machine of the same model. Since then we have had several outages, mainly caused by broken hardware; some were noticed by our members, others were only visible in our internal monitoring.
Today the last of the old servers (our signer) was powered down. It has been replaced by two modern machines running a more recent Debian release but keeping the old signer code.
The complete hardware replacement project reduced the power consumption of all CAcert servers by more than 60%.
But that’s not all: we plan to move our signer environment to new software written in Go, and for this we need YOUR help in testing and reviewing the code. Feel free to contact email@example.com to get in touch with our experts.
+++ Update +++ www.cacert.org is now running on a new server and the first tests were successful. Some fine-tuning still needs to be done +++ update +++
During the long weekend around Pentecost (“Pfingsten” as it is called here in Germany) we’re planning the next step in replacing some hardware at the datacenter.
The main reason for the visit to the datacenter on Monday is to plug the serial connection between our webserver and signer into the new machine.
As our main website will move to a new server, which was installed in the datacenter during the last visit, there will be an interruption of service while doing the final copy and reconfiguration of the firewall (hopefully not longer than one hour).
While we’re at the datacenter we’re adding two SSDs to infra02. While the host system is being activated on these SSDs, the services running on infra02 (like blog, wiki etc.) may be inaccessible or slower than usual.
Once all services have been moved from the HDDs to the SSDs (remotely, after the visit), everything should be active again … and most likely faster.
At a later visit (planned for July) the old sun1 server and the old infra02 HDDs will be removed from the rack.
The final step of the hardware upgrade/replacement in the critical environment will be the replacement of the old signer machine(s) by new servers and HSM modules. For this step the software and development teams need some assistance, especially in reviewing and testing the code (written in Go). Feel free to contact us via support@.c.o, the mailing lists or comments on this blog entry.
… we’ve just activated our own OCSP responder on our new arm64 servers.
This may sound unspectacular, but it’s a big milestone in replacing the hardware and software within our environment, as the old OCSP responder software could not be ported to a recent Debian and arm64 environment.
All other critical services (like nameserver and CRL serving) were already moved successfully to our new power-saving machines (two Raspberry Pi 4s) in the last weeks/months. OCSP needed some development and testing.
The virtual machines in the old environment are now stopped. Within the next few days the (power-consuming) sun3 server will get its final shutdown, and it will be removed from the CAcert rack during the next visit to the datacenter.
Our main website and signer-software will still be kept running on dedicated servers.
Today we switched the connection to our main website in preparation for a “bigger” change. Unfortunately this (temporary) setup is not IPv6-capable, so only IPv4 is currently working.
Over the weekend we plan to move www.cacert.org to another server with a more recent environment and to add a second firewall to our rack. During this server transition you may face some issues while using www.cacert.org; after the weekend the services should be back to normal.
Early next week we’ll enable IPv6 again for our main website (maybe with a new IPv6 address, but that’s not yet decided).
All other services (like blog/wiki/bugs/…) should remain active as usual as there is currently no planned update.
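While the main website is reachable over IPv4 only, it can be handy to see which address families a name currently resolves to. The following is a minimal Python sketch, not part of CAcert’s tooling: the helper names `split_by_family` and `resolve` are made up for illustration, and the result depends on the resolver your own machine uses.

```python
import ipaddress
import socket


def split_by_family(addresses):
    """Sort textual IP addresses into (IPv4 list, IPv6 list)."""
    v4, v6 = [], []
    for text in addresses:
        # Drop a possible zone suffix such as "%eth0" on link-local IPv6.
        ip = ipaddress.ip_address(text.split("%", 1)[0])
        (v4 if ip.version == 4 else v6).append(str(ip))
    return v4, v6


def resolve(host, port=443):
    """Return the distinct addresses the local resolver reports for host."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address in text form.
    return sorted({info[4][0] for info in infos})
```

Calling `split_by_family(resolve("www.cacert.org"))` returns two lists; an empty second list would match the IPv4-only state described above.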
Update: The nameserver transition is now finished; new DNSSEC records are set and active. The KSK and ZSK were replaced by a CSK.
In the ongoing process to update hard- and software we’re moving our main domain cacert.org to another master-nameserver-machine (with different nameserver-software) within our rack …
As we’re using DNSSEC to secure our domains, we need to update the KSK and ZSK for our domains during this process, too.
Therefore you may face some DNSSEC errors or issues in resolving cacert.org domains within the next few days, but these should resolve themselves within some hours/days.
As soon as the nameserver move is finished, I’ll update this post.
Todo: Give ns1.cacert.org the “old” nameserver-address again (after next hardware-change onsite) so secondary-nameserver ns3.cacert.org can get back to work. ns3 is currently not listed at our registrar, so not active for CAcert-Domains.
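For readers who want to look at the DNSSEC key change themselves, here is a rough Python sketch. It assumes you have already fetched the DNSKEY records as text lines in the usual presentation format (flags, protocol, algorithm, key), for example from a tool such as `dig`. By convention a flags value of 256 marks a zone signing key and 257 a key with the SEP bit set (typically a KSK or CSK). The function names, the algorithm numbers and the placeholder key data are invented for the example, and the CSK check is only a heuristic, not how our nameserver decides anything.

```python
def key_roles(dnskey_lines):
    """Count DNSKEY records by their flags value (first field of each line)."""
    counts = {}
    for line in dnskey_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        flags = int(fields[0])
        counts[flags] = counts.get(flags, 0) + 1
    return counts


def looks_like_csk(dnskey_lines):
    """Heuristic: at least one SEP-flagged key (257) and no separate 256 key."""
    counts = key_roles(dnskey_lines)
    return counts.get(257, 0) >= 1 and counts.get(256, 0) == 0


# Hypothetical records: algorithm numbers and key data are placeholders.
split_keys = ["257 3 8 PLACEHOLDERKSK=", "256 3 8 PLACEHOLDERZSK="]
single_key = ["257 3 13 PLACEHOLDERCSK="]

print(looks_like_csk(split_keys))  # False
print(looks_like_csk(single_key))  # True
```

During a key rollover a zone can publish old and new keys side by side, so a single snapshot may not look like either pattern.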
Moving www.cacert.org to new hardware was not successful due to some firewall settings, so we decided to keep the old server active.
During the next days/weeks we’ll change some firewall settings remotely, so short downtimes may occur before we try to activate the new server during the next visit in a few weeks.
During the next visit to the datacenter on Friday we’re doing some hardware changes within our rack, especially for our main website www.cacert.org.
As a preparation we will disable most of the services on www.cacert.org on Tuesday evening. The site will be fully operational again after the new server is up and running (most likely during Friday morning).
All other subdomains like blog/wiki/… will only have a short outage while we install a new firewall.
— this post will be updated after returning back from the datacenter —
The activation of the signer machine was successful; all pending certificates were processed within the last few hours.
Short version: There is a visit at the datacenter planned to enable the signer again (and do some other maintenance there).
Unfortunately it was not possible to get the signer back to work during the last visit due to a hardware issue with the hard drive.
Getting the server running on the previously created backup drive failed, too …
Therefore we used the last weeks (when it was not possible to visit the datacenter for various business and personal reasons) to rebuild a test environment on spare hardware and to train ourselves.
We should now be able to take the necessary steps to bring the signer machine back into operation.
In the background we’re currently adjusting our processes to make it easier to visit the datacenter outside office hours (as every trip to the datacenter takes several hours in addition to the time we spend working on the servers).
In the future we plan to set up an additional configuration which can take over in case of a failure in the datacenter, but this will still take time. The exact procedure still needs to be worked out, as the machines must not be connected to the internet but still need to communicate (e.g. for CRL creation, certificate serial numbers etc.).
After a new member was added to the access engineers team it was possible to visit the datacenter following the epidemiological guidelines for SARS-CoV-2, as well as our own security guidelines.
During this visit we applied the long-awaited patch for bug 1438 by adding the serial number to certificate revocation lists.
This visit also provided an opportunity to install a new infrastructure server, courtesy of Abil’I.T., a Luxembourg-based free software service provider. Many thanks again!
… and …
We did the Class 3 re-signing during this visit. We’re currently testing the new Class 3 certificate and will publish it very soon.
A new visit in the summer will be necessary to replace hardware (and maybe apply further patches on the signer).
Today we were able to investigate the signer machine at the datacenter.
As previously assumed, the signer machine was powered off. It was not possible to power it on again, so either both PSUs or other components died.
As we had ordered a replacement machine of the same type, we were able to use the existing hard drives to bring the signer up again.
Currently the signer is catching up, which will take some hours. As soon as your certificate has been processed, you’ll get an email from our server.
The certificate of www.cacert.org is in the queue (together with your certificates and revocations), so we need to wait until it’s ready. It will get updated as soon as possible.
Update 2020-05-05: All pending certificate requests have now been processed; new requests should be processed on the fly again.
CAcert critical admin
Since yesterday evening the main webserver of CAcert has been unavailable.
We’re working hard in the background to get it up and running again.
According to the logfiles the server crashed at (or shortly after) Jan 21 18:07:39
The machine is up again after a hardware restart since: Jan 23 10:55:17
(Software-Restarts of the sun-server did not help yesterday …)
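The length of the outage follows from the two log timestamps above. As a small worked example in Python: the log lines carry no year, so the year used here is an arbitrary placeholder, and the function name is made up for illustration. Parsing the month abbreviation with `%b` also assumes an English or C locale.

```python
from datetime import datetime

# The log timestamps have no year; this value is only a placeholder.
PLACEHOLDER_YEAR = 2000
LOG_FORMAT = "%Y %b %d %H:%M:%S"


def outage_duration(crashed, restarted, year=PLACEHOLDER_YEAR):
    """Return the time between two syslog-style timestamps of the same year."""
    start = datetime.strptime(f"{year} {crashed}", LOG_FORMAT)
    end = datetime.strptime(f"{year} {restarted}", LOG_FORMAT)
    return end - start


downtime = outage_duration("Jan 21 18:07:39", "Jan 23 10:55:17")
print(downtime)  # 1 day, 16:47:38
```

That is a little under 41 hours between the crash and the hardware restart.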
The root cause analysis is yet to be done (and will be done later today).
A detailed investigation of our logfiles did NOT show any intrusion or attack. We were not able to determine why the hardware server stopped responding.
Sorry for any inconvenience.
CAcert Critical Admin