Message ID | 6e97e882f080e09a9f7d777cc84e9cd089d92d11.1719927511.git.jan.kiszka@siemens.com |
---|---|
State | Accepted, archived |
Headers | show |
Series | Reworks, fixes and unattended mode for image installer | expand |
On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users wrote: > From: Jan Kiszka <jan.kiszka@siemens.com> > > Pull the reboot out of the script. This allows for cleaner > integration > with different calling environment, may they be a systemd unit, an > initramfs script or simply a shell for testing purposes. > > And if the script exits with an error, wait a minute before rebooting > the system, rather than just trying to re-execute it. This permits to > inspect potential error as well. > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > --- > .../recipes-installer/deploy-image/files/deploy-image-wic.sh | 2 > +- > .../recipes-installer/deploy-image/files/install.override.conf | 2 > +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/meta-isar/recipes-installer/deploy-image/files/deploy- > image-wic.sh b/meta-isar/recipes-installer/deploy-image/files/deploy- > image-wic.sh > index 8043aff1..12c1eea2 100644 > --- a/meta-isar/recipes-installer/deploy-image/files/deploy-image- > wic.sh > +++ b/meta-isar/recipes-installer/deploy-image/files/deploy-image- > wic.sh > @@ -105,4 +105,4 @@ fi > umount "$installdata" > sync > dialog --title "Reboot" --msgbox "Installation is successful. System > will be rebooted. Please remove the USB stick." 7 60 > -reboot > +exit 0 > diff --git a/meta-isar/recipes-installer/deploy- > image/files/install.override.conf b/meta-isar/recipes- > installer/deploy-image/files/install.override.conf > index 73874caa..357d8662 100644 > --- a/meta-isar/recipes-installer/deploy- > image/files/install.override.conf > +++ b/meta-isar/recipes-installer/deploy- > image/files/install.override.conf > @@ -1,5 +1,5 @@ > [Service] > ExecStart= > -ExecStart=/usr/bin/deploy-image-wic.sh > +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting in 60 > s'; sleep 60); reboot" Hello Jan. This change seems to cause a sort of race issue. Few days ago "[PATCH v3 0/2] Cover installer image with tests" was merged. Tests were OK (and "next" is still OK) on two jenkins and one gitlab instances. But on the third jenkins instance, they were failing. Boot log: ``` Got config: installer_unattended=true installer_image_uri=/install/isar-image-ci-debian-bookworm- qemuamd64.wic.zst installer_target_dev=/dev/sda installer_target_overwrite=OVERWRITE bmaptool: ERROR: An error occurred, here is the traceback: Traceback (most recent call last): File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line 116, in open_block_device descriptor = os.open(path, os.O_WRONLY | os.O_EXCL) bmaptool: ERROR: cannot open block device '/dev/sda' in exclusive mode: [Errno 16] Device or resource busy: '/dev/sda' Rebooting in 60 s ``` But either machine is not booted, or there are some "ext4" errors happen. I couldn't understand few things: - why this happens (unable to get exclusive access to block device)$ - why device is written even if there is such error in logs$ - why sometimes we don't see this error in logs$ - why this doesn't happen when "unattended mode" is off. The reason is that we have these two "getty" instances overriden by custom one having the race. While "serial-getty@ttyS0.service.d" does the job, "getty@tty1.service.d" produces this error visible in logs and reboot machine in 60 seconds. In some cases 60sec is enough for first service to finish writing to the disc, but on slower machines we result in incomplete copy and broken system. Before this patch, when "reboot" calls were in the script, "failed" instance simply exited and didn't try to reboot the board. As a quick fix, I would reverted this patch. Also, I think this behaviour may impact on bmap-tools "psplash" patches currently we have on maillist. I would expect that user won't see the progress bar on screen because it's somewhere on serial console. > StandardInput=tty > StandardOutput=tty > -- > 2.43.0 >
On 13.03.25 14:10, Uladzimir Bely wrote: > On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users wrote: >> From: Jan Kiszka <jan.kiszka@siemens.com> >> >> Pull the reboot out of the script. This allows for cleaner >> integration >> with different calling environment, may they be a systemd unit, an >> initramfs script or simply a shell for testing purposes. >> >> And if the script exits with an error, wait a minute before rebooting >> the system, rather than just trying to re-execute it. This permits to >> inspect potential error as well. >> >> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> >> --- >> .../recipes-installer/deploy-image/files/deploy-image-wic.sh | 2 >> +- >> .../recipes-installer/deploy-image/files/install.override.conf | 2 >> +- >> 2 files changed, 2 insertions(+), 2 deletions(-) >> >> diff --git a/meta-isar/recipes-installer/deploy-image/files/deploy- >> image-wic.sh b/meta-isar/recipes-installer/deploy-image/files/deploy- >> image-wic.sh >> index 8043aff1..12c1eea2 100644 >> --- a/meta-isar/recipes-installer/deploy-image/files/deploy-image- >> wic.sh >> +++ b/meta-isar/recipes-installer/deploy-image/files/deploy-image- >> wic.sh >> @@ -105,4 +105,4 @@ fi >> umount "$installdata" >> sync >> dialog --title "Reboot" --msgbox "Installation is successful. System >> will be rebooted. Please remove the USB stick." 7 60 >> -reboot >> +exit 0 >> diff --git a/meta-isar/recipes-installer/deploy- >> image/files/install.override.conf b/meta-isar/recipes- >> installer/deploy-image/files/install.override.conf >> index 73874caa..357d8662 100644 >> --- a/meta-isar/recipes-installer/deploy- >> image/files/install.override.conf >> +++ b/meta-isar/recipes-installer/deploy- >> image/files/install.override.conf >> @@ -1,5 +1,5 @@ >> [Service] >> ExecStart= >> -ExecStart=/usr/bin/deploy-image-wic.sh >> +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting in 60 >> s'; sleep 60); reboot" > > Hello Jan. > > This change seems to cause a sort of race issue. > > Few days ago "[PATCH v3 0/2] Cover installer image with tests" was > merged. Tests were OK (and "next" is still OK) on two jenkins and one > gitlab instances. But on the third jenkins instance, they were failing. > > Boot log: > > ``` > Got config: > installer_unattended=true > installer_image_uri=/install/isar-image-ci-debian-bookworm- > qemuamd64.wic.zst > installer_target_dev=/dev/sda > installer_target_overwrite=OVERWRITE > bmaptool: ERROR: An error occurred, here is the traceback: > Traceback (most recent call last): > File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line 116, in > open_block_device > descriptor = os.open(path, os.O_WRONLY | os.O_EXCL) > > bmaptool: ERROR: cannot open block device '/dev/sda' in exclusive mode: > [Errno 16] Device or resource busy: '/dev/sda' > > Rebooting in 60 s > ``` > > But either machine is not booted, or there are some "ext4" errors > happen. > > I couldn't understand few things: > - why this happens (unable to get exclusive access to block device)$ > - why device is written even if there is such error in logs$ > - why sometimes we don't see this error in logs$ > - why this doesn't happen when "unattended mode" is off. > > The reason is that we have these two "getty" instances overriden by > custom one having the race. While "serial-getty@ttyS0.service.d" does > the job, "getty@tty1.service.d" produces this error visible in logs and > reboot machine in 60 seconds. > > In some cases 60sec is enough for first service to finish writing to > the disc, but on slower machines we result in incomplete copy and > broken system. > > Before this patch, when "reboot" calls were in the script, "failed" > instance simply exited and didn't try to reboot the board. > > As a quick fix, I would reverted this patch. That won't be a fix, just papering over. The race is apparently around the target device becoming ready for bmaptool, or our script racing with some other service accessing the device for a while. I suspect the service is lacking some dependency here. So, let's identify the race partner first: How to reproduce locally? Jan > > Also, I think this behaviour may impact on bmap-tools "psplash" patches > currently we have on maillist. I would expect that user won't see the > progress bar on screen because it's somewhere on serial console. > > >> StandardInput=tty >> StandardOutput=tty >> -- >> 2.43.0 >> >
On Fri, 2025-03-14 at 06:54 +0100, Jan Kiszka wrote: > On 13.03.25 14:10, Uladzimir Bely wrote: > > On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users > > wrote: > > > From: Jan Kiszka <jan.kiszka@siemens.com> > > > > > > Pull the reboot out of the script. This allows for cleaner > > > integration > > > with different calling environment, may they be a systemd unit, > > > an > > > initramfs script or simply a shell for testing purposes. > > > > > > And if the script exits with an error, wait a minute before > > > rebooting > > > the system, rather than just trying to re-execute it. This > > > permits to > > > inspect potential error as well. > > > > > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > > > --- > > > .../recipes-installer/deploy-image/files/deploy-image-wic.sh > > > | 2 > > > +- > > > .../recipes-installer/deploy-image/files/install.override.conf > > > | 2 > > > +- > > > 2 files changed, 2 insertions(+), 2 deletions(-) > > > > > > diff --git a/meta-isar/recipes-installer/deploy- > > > image/files/deploy- > > > image-wic.sh b/meta-isar/recipes-installer/deploy- > > > image/files/deploy- > > > image-wic.sh > > > index 8043aff1..12c1eea2 100644 > > > --- a/meta-isar/recipes-installer/deploy-image/files/deploy- > > > image- > > > wic.sh > > > +++ b/meta-isar/recipes-installer/deploy-image/files/deploy- > > > image- > > > wic.sh > > > @@ -105,4 +105,4 @@ fi > > > umount "$installdata" > > > sync > > > dialog --title "Reboot" --msgbox "Installation is successful. > > > System > > > will be rebooted. Please remove the USB stick." 7 60 > > > -reboot > > > +exit 0 > > > diff --git a/meta-isar/recipes-installer/deploy- > > > image/files/install.override.conf b/meta-isar/recipes- > > > installer/deploy-image/files/install.override.conf > > > index 73874caa..357d8662 100644 > > > --- a/meta-isar/recipes-installer/deploy- > > > image/files/install.override.conf > > > +++ b/meta-isar/recipes-installer/deploy- > > > image/files/install.override.conf > > > @@ -1,5 +1,5 @@ > > > [Service] > > > ExecStart= > > > -ExecStart=/usr/bin/deploy-image-wic.sh > > > +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting in > > > 60 > > > s'; sleep 60); reboot" > > > > Hello Jan. > > > > This change seems to cause a sort of race issue. > > > > Few days ago "[PATCH v3 0/2] Cover installer image with tests" was > > merged. Tests were OK (and "next" is still OK) on two jenkins and > > one > > gitlab instances. But on the third jenkins instance, they were > > failing. > > > > Boot log: > > > > ``` > > Got config: > > installer_unattended=true > > installer_image_uri=/install/isar-image-ci-debian-bookworm- > > qemuamd64.wic.zst > > installer_target_dev=/dev/sda > > installer_target_overwrite=OVERWRITE > > bmaptool: ERROR: An error occurred, here is the traceback: > > Traceback (most recent call last): > > File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line 116, > > in > > open_block_device > > descriptor = os.open(path, os.O_WRONLY | os.O_EXCL) > > > > bmaptool: ERROR: cannot open block device '/dev/sda' in exclusive > > mode: > > [Errno 16] Device or resource busy: '/dev/sda' > > > > Rebooting in 60 s > > ``` > > > > But either machine is not booted, or there are some "ext4" errors > > happen. > > > > I couldn't understand few things: > > - why this happens (unable to get exclusive access to block > > device)$ > > - why device is written even if there is such error in logs$ > > - why sometimes we don't see this error in logs$ > > - why this doesn't happen when "unattended mode" is off. > > > > The reason is that we have these two "getty" instances overriden by > > custom one having the race. While "serial-getty@ttyS0.service.d" > > does > > the job, "getty@tty1.service.d" produces this error visible in logs > > and > > reboot machine in 60 seconds. > > > > In some cases 60sec is enough for first service to finish writing > > to > > the disc, but on slower machines we result in incomplete copy and > > broken system. > > > > Before this patch, when "reboot" calls were in the script, "failed" > > instance simply exited and didn't try to reboot the board. > > > > As a quick fix, I would reverted this patch. > > That won't be a fix, just papering over. > > The race is apparently around the target device becoming ready for > bmaptool, or our script racing with some other service accessing the > device for a while. I suspect the service is lacking some dependency > here. So, let's identify the race partner first: How to reproduce > locally? > > Jan > No, my finding is that getty@tty1.service.d conflicts with serial-getty@ttyS0.service.d one. Every of them tries to run bmaptool when unattended mode is active. meta-isar/recipes-installer/deploy-image-service/deploy-image- service.bb: ``` do_install() { install -m 0600 ${WORKDIR}/install.override.conf ${D}/usr/lib/systemd/system/getty@tty1.service.d/override.conf install -m 0600 ${WORKDIR}/install.override.conf ${D}/usr/lib/systemd/system/serial-getty@ttyS0.service.d/override.conf } ``` It can be easy reproduced if we decrease timeout from 60s to e.g. 10s, to simulate slower system: ``` isar $ sed -i -e 's/60/10/g' meta-isar/recipes-installer/deploy-image- service/files/install.override.conf isar $ ./kas/kas-container shell kas/isar.yaml builder@b21a175d3b78:/build$ rm -rf conf/ builder@b21a175d3b78:/build$ /repo/scripts/ci_build.sh -T installer ``` Failed instance of overriden getty reboots the system while the target is being populated with the not failed instance. Reboot hangs at something like ``` [ 9.323801] JBD2: no valid journal superblock found [ 9.325117] EXT4-fs (sda2): error loading journal ``` > > > > Also, I think this behaviour may impact on bmap-tools "psplash" > > patches > > currently we have on maillist. I would expect that user won't see > > the > > progress bar on screen because it's somewhere on serial console. > > > > > > > StandardInput=tty > > > StandardOutput=tty > > > -- > > > 2.43.0 > > > > > >
On Fri, 2025-03-14 at 09:25 +0300, Uladzimir Bely wrote: > On Fri, 2025-03-14 at 06:54 +0100, Jan Kiszka wrote: > > On 13.03.25 14:10, Uladzimir Bely wrote: > > > On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users > > > wrote: > > > > From: Jan Kiszka <jan.kiszka@siemens.com> > > > > > > > > Pull the reboot out of the script. This allows for cleaner > > > > integration > > > > with different calling environment, may they be a systemd unit, > > > > an > > > > initramfs script or simply a shell for testing purposes. > > > > > > > > And if the script exits with an error, wait a minute before > > > > rebooting > > > > the system, rather than just trying to re-execute it. This > > > > permits to > > > > inspect potential error as well. > > > > > > > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > > > > --- > > > > .../recipes-installer/deploy-image/files/deploy-image- > > > > wic.sh > > > > > 2 > > > > +- > > > > .../recipes-installer/deploy- > > > > image/files/install.override.conf > > > > > 2 > > > > +- > > > > 2 files changed, 2 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/meta-isar/recipes-installer/deploy- > > > > image/files/deploy- > > > > image-wic.sh b/meta-isar/recipes-installer/deploy- > > > > image/files/deploy- > > > > image-wic.sh > > > > index 8043aff1..12c1eea2 100644 > > > > --- a/meta-isar/recipes-installer/deploy-image/files/deploy- > > > > image- > > > > wic.sh > > > > +++ b/meta-isar/recipes-installer/deploy-image/files/deploy- > > > > image- > > > > wic.sh > > > > @@ -105,4 +105,4 @@ fi > > > > umount "$installdata" > > > > sync > > > > dialog --title "Reboot" --msgbox "Installation is successful. > > > > System > > > > will be rebooted. Please remove the USB stick." 7 60 > > > > -reboot > > > > +exit 0 > > > > diff --git a/meta-isar/recipes-installer/deploy- > > > > image/files/install.override.conf b/meta-isar/recipes- > > > > installer/deploy-image/files/install.override.conf > > > > index 73874caa..357d8662 100644 > > > > --- a/meta-isar/recipes-installer/deploy- > > > > image/files/install.override.conf > > > > +++ b/meta-isar/recipes-installer/deploy- > > > > image/files/install.override.conf > > > > @@ -1,5 +1,5 @@ > > > > [Service] > > > > ExecStart= > > > > -ExecStart=/usr/bin/deploy-image-wic.sh > > > > +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting > > > > in > > > > 60 > > > > s'; sleep 60); reboot" > > > > > > Hello Jan. > > > > > > This change seems to cause a sort of race issue. > > > > > > Few days ago "[PATCH v3 0/2] Cover installer image with tests" > > > was > > > merged. Tests were OK (and "next" is still OK) on two jenkins and > > > one > > > gitlab instances. But on the third jenkins instance, they were > > > failing. > > > > > > Boot log: > > > > > > ``` > > > Got config: > > > installer_unattended=true > > > installer_image_uri=/install/isar-image-ci-debian-bookworm- > > > qemuamd64.wic.zst > > > installer_target_dev=/dev/sda > > > installer_target_overwrite=OVERWRITE > > > bmaptool: ERROR: An error occurred, here is the traceback: > > > Traceback (most recent call last): > > > File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line > > > 116, > > > in > > > open_block_device > > > descriptor = os.open(path, os.O_WRONLY | os.O_EXCL) > > > > > > bmaptool: ERROR: cannot open block device '/dev/sda' in exclusive > > > mode: > > > [Errno 16] Device or resource busy: '/dev/sda' > > > > > > Rebooting in 60 s > > > ``` > > > > > > But either machine is not booted, or there are some "ext4" errors > > > happen. > > > > > > I couldn't understand few things: > > > - why this happens (unable to get exclusive access to block > > > device)$ > > > - why device is written even if there is such error in logs$ > > > - why sometimes we don't see this error in logs$ > > > - why this doesn't happen when "unattended mode" is off. It does not happen when "unattended mode" is off, because the race is interrupted by the script waiting for user input at the very beginning. This is a race condition between multiple instances of the install script executed in parallel, dependent on how many gettys are active. Since the number of gettys and further, which getty get's connected in the end, via terminal, via serial, ... is kind of hard to predict, imo the only option to resolve that race is by adding a dedicated (singleton) service which is executed on startup. I am currently preparing a patch set where a fix for that is included, similar to this: ``` [Unit] Description=Target-bootstrapper service [Service] User=root ExecStart=/bin/sh -c "/usr/bin/target-bootstrapper.sh || (echo 'Rebooting in 60 s'; sleep 60); reboot" StandardOutput=journal+console TTYPath=/dev/console [Install] WantedBy=multi-user.target `` BR Alexander > > > > > > The reason is that we have these two "getty" instances overriden > > > by > > > custom one having the race. While "serial-getty@ttyS0.service.d" > > > does > > > the job, "getty@tty1.service.d" produces this error visible in > > > logs > > > and > > > reboot machine in 60 seconds. > > > > > > In some cases 60sec is enough for first service to finish writing > > > to > > > the disc, but on slower machines we result in incomplete copy and > > > broken system. > > > > > > Before this patch, when "reboot" calls were in the script, > > > "failed" > > > instance simply exited and didn't try to reboot the board. > > > > > > As a quick fix, I would reverted this patch. > > > > That won't be a fix, just papering over. > > > > The race is apparently around the target device becoming ready for > > bmaptool, or our script racing with some other service accessing > > the > > device for a while. I suspect the service is lacking some > > dependency > > here. So, let's identify the race partner first: How to reproduce > > locally? > > > > Jan > > > > No, my finding is that getty@tty1.service.d conflicts with > serial-getty@ttyS0.service.d one. Every of them tries to run bmaptool > when unattended mode is active. > > meta-isar/recipes-installer/deploy-image-service/deploy-image- > service.bb: > > ``` > do_install() { > install -m 0600 ${WORKDIR}/install.override.conf > ${D}/usr/lib/systemd/system/getty@tty1.service.d/override.conf > install -m 0600 ${WORKDIR}/install.override.conf > ${D}/usr/lib/systemd/system/serial-getty@ttyS0.service.d/override.con > f > } > ``` > > It can be easy reproduced if we decrease timeout from 60s to e.g. > 10s, > to simulate slower system: > > ``` > isar $ sed -i -e 's/60/10/g' meta-isar/recipes-installer/deploy- > image- > service/files/install.override.conf > isar $ ./kas/kas-container shell kas/isar.yaml > builder@b21a175d3b78:/build$ rm -rf conf/ > builder@b21a175d3b78:/build$ /repo/scripts/ci_build.sh -T installer > ``` > > Failed instance of overriden getty reboots the system while the > target > is being populated with the not failed instance. > > Reboot hangs at something like > > ``` > [ 9.323801] JBD2: no valid journal superblock found > [ 9.325117] EXT4-fs (sda2): error loading journal > ``` > > > > > > > > Also, I think this behaviour may impact on bmap-tools "psplash" > > > patches > > > currently we have on maillist. I would expect that user won't see > > > the > > > progress bar on screen because it's somewhere on serial console. > > > > > > > > > > StandardInput=tty > > > > StandardOutput=tty > > > > -- > > > > 2.43.0 > > > > > > > > > > > -- > Best regards, > Uladzimir. > > >
On Fri, 2025-03-14 at 14:25 +0000, Heinisch, Alexander wrote: > On Fri, 2025-03-14 at 09:25 +0300, Uladzimir Bely wrote: > > On Fri, 2025-03-14 at 06:54 +0100, Jan Kiszka wrote: > > > On 13.03.25 14:10, Uladzimir Bely wrote: > > > > On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users > > > > wrote: > > > > > From: Jan Kiszka <jan.kiszka@siemens.com> > > > > > > > > > > Pull the reboot out of the script. This allows for cleaner > > > > > integration > > > > > with different calling environment, may they be a systemd > > > > > unit, > > > > > an > > > > > initramfs script or simply a shell for testing purposes. > > > > > > > > > > And if the script exits with an error, wait a minute before > > > > > rebooting > > > > > the system, rather than just trying to re-execute it. This > > > > > permits to > > > > > inspect potential error as well. > > > > > > > > > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> > > > > > --- > > > > > .../recipes-installer/deploy-image/files/deploy-image- > > > > > wic.sh > > > > > > 2 > > > > > +- > > > > > .../recipes-installer/deploy- > > > > > image/files/install.override.conf > > > > > > 2 > > > > > +- > > > > > 2 files changed, 2 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/meta-isar/recipes-installer/deploy- > > > > > image/files/deploy- > > > > > image-wic.sh b/meta-isar/recipes-installer/deploy- > > > > > image/files/deploy- > > > > > image-wic.sh > > > > > index 8043aff1..12c1eea2 100644 > > > > > --- a/meta-isar/recipes-installer/deploy-image/files/deploy- > > > > > image- > > > > > wic.sh > > > > > +++ b/meta-isar/recipes-installer/deploy-image/files/deploy- > > > > > image- > > > > > wic.sh > > > > > @@ -105,4 +105,4 @@ fi > > > > > umount "$installdata" > > > > > sync > > > > > dialog --title "Reboot" --msgbox "Installation is > > > > > successful. > > > > > System > > > > > will be rebooted. Please remove the USB stick." 7 60 > > > > > -reboot > > > > > +exit 0 > > > > > diff --git a/meta-isar/recipes-installer/deploy- > > > > > image/files/install.override.conf b/meta-isar/recipes- > > > > > installer/deploy-image/files/install.override.conf > > > > > index 73874caa..357d8662 100644 > > > > > --- a/meta-isar/recipes-installer/deploy- > > > > > image/files/install.override.conf > > > > > +++ b/meta-isar/recipes-installer/deploy- > > > > > image/files/install.override.conf > > > > > @@ -1,5 +1,5 @@ > > > > > [Service] > > > > > ExecStart= > > > > > -ExecStart=/usr/bin/deploy-image-wic.sh > > > > > +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo > > > > > 'Rebooting > > > > > in > > > > > 60 > > > > > s'; sleep 60); reboot" > > > > > > > > Hello Jan. > > > > > > > > This change seems to cause a sort of race issue. > > > > > > > > Few days ago "[PATCH v3 0/2] Cover installer image with tests" > > > > was > > > > merged. Tests were OK (and "next" is still OK) on two jenkins > > > > and > > > > one > > > > gitlab instances. But on the third jenkins instance, they were > > > > failing. > > > > > > > > Boot log: > > > > > > > > ``` > > > > Got config: > > > > installer_unattended=true > > > > installer_image_uri=/install/isar-image-ci-debian-bookworm- > > > > qemuamd64.wic.zst > > > > installer_target_dev=/dev/sda > > > > installer_target_overwrite=OVERWRITE > > > > bmaptool: ERROR: An error occurred, here is the traceback: > > > > Traceback (most recent call last): > > > > File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line > > > > 116, > > > > in > > > > open_block_device > > > > descriptor = os.open(path, os.O_WRONLY | os.O_EXCL) > > > > > > > > bmaptool: ERROR: cannot open block device '/dev/sda' in > > > > exclusive > > > > mode: > > > > [Errno 16] Device or resource busy: '/dev/sda' > > > > > > > > Rebooting in 60 s > > > > ``` > > > > > > > > But either machine is not booted, or there are some "ext4" > > > > errors > > > > happen. > > > > > > > > I couldn't understand few things: > > > > - why this happens (unable to get exclusive access to block > > > > device)$ > > > > - why device is written even if there is such error in logs$ > > > > - why sometimes we don't see this error in logs$ > > > > - why this doesn't happen when "unattended mode" is off. > > > It does not happen when "unattended mode" is off, because the race is > interrupted by the script waiting for user input at the very > beginning. > > This is a race condition between multiple instances of the install > script executed in parallel, dependent on how many gettys are active. > Since the number of gettys and further, which getty get's connected > in > the end, via terminal, via serial, ... is kind of hard to predict, > imo > the only option to resolve that race is by adding a dedicated > (singleton) service which is executed on startup. > > I am currently preparing a patch set where a fix for that is > included, > similar to this: Yes, there were rhetorical questions that get ansered after later debugging. I wanted (but didn't have time to try) to do something like in the script: ``` if ! $installer_unattended; then if [ "$(tty)" != "/dev/ttyS0" ]; then echo "Disable unattended mode on $(tty), it's active on /dev/ttyS0" installer_unattended=0 fi fi ``` Futher, "/dev/ttyS0" could be not hardcoded but also passed from installer configuration during build. > > ``` > [Unit] > Description=Target-bootstrapper service > > [Service] > User=root > ExecStart=/bin/sh -c "/usr/bin/target-bootstrapper.sh || (echo > 'Rebooting in 60 s'; sleep 60); reboot" > > StandardOutput=journal+console > TTYPath=/dev/console > > [Install] > WantedBy=multi-user.target > `` > > BR Alexander > > > > > > > > > The reason is that we have these two "getty" instances > > > > overriden > > > > by > > > > custom one having the race. While > > > > "serial-getty@ttyS0.service.d" > > > > does > > > > the job, "getty@tty1.service.d" produces this error visible in > > > > logs > > > > and > > > > reboot machine in 60 seconds. > > > > > > > > In some cases 60sec is enough for first service to finish > > > > writing > > > > to > > > > the disc, but on slower machines we result in incomplete copy > > > > and > > > > broken system. > > > > > > > > Before this patch, when "reboot" calls were in the script, > > > > "failed" > > > > instance simply exited and didn't try to reboot the board. > > > > > > > > As a quick fix, I would reverted this patch. > > > > > > That won't be a fix, just papering over. > > > > > > The race is apparently around the target device becoming ready > > > for > > > bmaptool, or our script racing with some other service accessing > > > the > > > device for a while. I suspect the service is lacking some > > > dependency > > > here. So, let's identify the race partner first: How to reproduce > > > locally? > > > > > > Jan > > > > > > > No, my finding is that getty@tty1.service.d conflicts with > > serial-getty@ttyS0.service.d one. Every of them tries to run > > bmaptool > > when unattended mode is active. > > > > meta-isar/recipes-installer/deploy-image-service/deploy-image- > > service.bb: > > > > ``` > > do_install() { > > install -m 0600 ${WORKDIR}/install.override.conf > > ${D}/usr/lib/systemd/system/getty@tty1.service.d/override.conf > > install -m 0600 ${WORKDIR}/install.override.conf > > ${D}/usr/lib/systemd/system/serial-getty@ttyS0.service.d/override.c > > on > > f > > } > > ``` > > > > It can be easy reproduced if we decrease timeout from 60s to e.g. > > 10s, > > to simulate slower system: > > > > ``` > > isar $ sed -i -e 's/60/10/g' meta-isar/recipes-installer/deploy- > > image- > > service/files/install.override.conf > > isar $ ./kas/kas-container shell kas/isar.yaml > > builder@b21a175d3b78:/build$ rm -rf conf/ > > builder@b21a175d3b78:/build$ /repo/scripts/ci_build.sh -T installer > > ``` > > > > Failed instance of overriden getty reboots the system while the > > target > > is being populated with the not failed instance. > > > > Reboot hangs at something like > > > > ``` > > [ 9.323801] JBD2: no valid journal superblock found > > [ 9.325117] EXT4-fs (sda2): error loading journal > > ``` > > > > > > > > > > > > Also, I think this behaviour may impact on bmap-tools "psplash" > > > > patches > > > > currently we have on maillist. I would expect that user won't > > > > see > > > > the > > > > progress bar on screen because it's somewhere on serial > > > > console. > > > > > > > > > > > > > StandardInput=tty > > > > > StandardOutput=tty > > > > > -- > > > > > 2.43.0 > > > > > > > > > > > > > > > > -- > > Best regards, > > Uladzimir. > > > > > > >
diff --git a/meta-isar/recipes-installer/deploy-image/files/deploy-image-wic.sh b/meta-isar/recipes-installer/deploy-image/files/deploy-image-wic.sh index 8043aff1..12c1eea2 100644 --- a/meta-isar/recipes-installer/deploy-image/files/deploy-image-wic.sh +++ b/meta-isar/recipes-installer/deploy-image/files/deploy-image-wic.sh @@ -105,4 +105,4 @@ fi umount "$installdata" sync dialog --title "Reboot" --msgbox "Installation is successful. System will be rebooted. Please remove the USB stick." 7 60 -reboot +exit 0 diff --git a/meta-isar/recipes-installer/deploy-image/files/install.override.conf b/meta-isar/recipes-installer/deploy-image/files/install.override.conf index 73874caa..357d8662 100644 --- a/meta-isar/recipes-installer/deploy-image/files/install.override.conf +++ b/meta-isar/recipes-installer/deploy-image/files/install.override.conf @@ -1,5 +1,5 @@ [Service] ExecStart= -ExecStart=/usr/bin/deploy-image-wic.sh +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting in 60 s'; sleep 60); reboot" StandardInput=tty StandardOutput=tty