Using Buildbot with NSS on Windows systems 
Wednesday, January 9, 2013, 09:34 PM
Posted by Administrator
Here is how I had success to use Buildbot with NSS on Windows systems.

Because buildbot on Windows isn't able to kill the full set of processes belonging to an active build/test job, each time the connection between buildbot-master and buildbot-slaves (the slave on the Windows client) breaks, the slave system ends up being in a mess, where parts of the old job are still active. In that state, a new started job will misbehave, because old and new job fight against the files on disk.

I tried many approaches, including attempts to kill the jobs by name. For various reasons, most of them didn't work right for me. And by work right, I expect that the system automatically recovers from the failure, and fully restarts the job - without requiring manual cleanup by an admin, without requiring manual restart of jobs.

I heard that some consumers modify buildbot to their needs, but that is not acceptable to me. I need to focus on other things, I don't have time to learn the internals of buildbot. For that reason, approaches like "use a custom logfile parser" are not an option for me.

Here is a solution that seems to work for me, which involves automatically rebooting the Windows slave after each connection failure to the buildmaster. We don't reboot if jobs finish without exceptions (a build failure is fine, it isn't an exception).

It requires a Windows slave with buildbot-slave installed.
The buildbot-master must be configured to allow only one build for the Windows slave, BuildSlave(max_builds=1, ...).
You must setup the slave to automatically login and start the buildbot-slave after a reboot (e.g. using the StartUp program group).
If you use an encrypted tunnel for the connection between buildslave and buildmaster, that tunnel must be automatically started after reboot, too.

Let's say you have a build-and-test.sh script that you use on Linux and other Unix-like platforms, which you run from buildbot master as a "buildstep".

On Windows, we'll use different buildsteps, and one of them will call the build-and-test.sh script. (Note that the NSS build system uses the Mozilla-build tools on Windows, including MSYS, which provide a Unix-like environment, including a bash shell).

Let's say your non-Windows buildsteps are:
* get source code
* build-and-test.sh

On Windows, we'll use:
* maybe-reboot.bat
* get source code
* start-b-and-t.bat

The .bat files have the following contents:

==== maybe-reboot.bat ====
IF EXIST ..\buildbot-is-building (
del ..\buildbot-is-building
shutdown /r /t 0

timeout /t 120
)
==========================

=== start-b-and-t.bat ====
echo running > ..\buildbot-is-building

rem This is an Example.
rem Do whatever you must do to start your build.

"%MOZILLABUILD%\msys\bin\bash" -c "hg/tinder/buildbot/buildprnss.sh %*"

if %errorlevel% neq 0 (
set EXITCODE=1
) else (
set EXITCODE=0
)

del ..\buildbot-is-building


exit /b %EXITCODE%
==========================


This assumes that the user account used for building and testing has permission to reboot the system, executed using Window's "shutdown" tool.

If all goes well, what happens is:
* maybe-reboot will do nothing
* start-b-and-t will create a file to signal that it's active,
and will delete the file after the job has completed.

In case there is an exception, and start-b-and-t is still active, what happens is:
* buildslave will notice the exception and attempt to kill the active job. For advanced build/tests this will partially fail.
* buildmaster and buildslave regain the connection, and buildmaster asks to start the build.
* maybe-reboot will detect that a job is still active, and will prepare to reboot
* it will clean up by removing the status file, to ensure we will proceed to start-b-and-t after the reboot
* it will ask the system to reboot immediately
* because it will take some time before the reboot has killed all active processes, we must make sure that this job won't make things worse by running the job again, in parallel
* the timeout command will wait for 2 minutes, that should be enough for the reboot to happen, including termination of maybe-reboot.bat
* buildmaster will notice an exception, and will remember once again to retry that interrupted job
* as soon as the slave has rebooted and automatically started the buildslave, the job will be restarted in a clean environment.

You might ask, why are we using the status file in the parent directory of the buildbot
slave directory? ("..\")
Because your slave might be configured to build multiple different configurations,
that file must be in a place that is shared by all those build configurations.
The parent directory of an individual configuration is buildbot's main slave directory,
that makes it a good place for that file.

[edit]: Updated start-b-and-t.bat to return the inner script's exit status to buildbot.
view entry ( 3979 views )   |  permalink   |  $star_image$star_image$star_image$star_image$star_image ( 3 / 1666 )


<<First <Back | 1 | 2 | 3 | 4 | 5 | Next> Last>>