[03/17] meta-isar: deploy-image: Change reboot logic

Message ID 6e97e882f080e09a9f7d777cc84e9cd089d92d11.1719927511.git.jan.kiszka@siemens.com
State Accepted, archived
Headers show
Series Reworks, fixes and unattended mode for image installer | expand

Commit Message

Jan Kiszka July 2, 2024, 1:38 p.m. UTC
From: Jan Kiszka <jan.kiszka@siemens.com>

Pull the reboot out of the script. This allows for cleaner integration
with different calling environment, may they be a systemd unit, an
initramfs script or simply a shell for testing purposes.

And if the script exits with an error, wait a minute before rebooting
the system, rather than just trying to re-execute it. This permits to
inspect potential error as well.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 .../recipes-installer/deploy-image/files/deploy-image-wic.sh    | 2 +-
 .../recipes-installer/deploy-image/files/install.override.conf  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

Comments

Uladzimir Bely March 13, 2025, 1:10 p.m. UTC | #1
On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users wrote:
> From: Jan Kiszka <jan.kiszka@siemens.com>
> 
> Pull the reboot out of the script. This allows for cleaner
> integration
> with different calling environment, may they be a systemd unit, an
> initramfs script or simply a shell for testing purposes.
> 
> And if the script exits with an error, wait a minute before rebooting
> the system, rather than just trying to re-execute it. This permits to
> inspect potential error as well.
> 
> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> ---
>  .../recipes-installer/deploy-image/files/deploy-image-wic.sh    | 2
> +-
>  .../recipes-installer/deploy-image/files/install.override.conf  | 2
> +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/meta-isar/recipes-installer/deploy-image/files/deploy-
> image-wic.sh b/meta-isar/recipes-installer/deploy-image/files/deploy-
> image-wic.sh
> index 8043aff1..12c1eea2 100644
> --- a/meta-isar/recipes-installer/deploy-image/files/deploy-image-
> wic.sh
> +++ b/meta-isar/recipes-installer/deploy-image/files/deploy-image-
> wic.sh
> @@ -105,4 +105,4 @@ fi
>  umount "$installdata"
>  sync
>  dialog --title "Reboot" --msgbox "Installation is successful. System
> will be rebooted. Please remove the USB stick." 7 60
> -reboot
> +exit 0
> diff --git a/meta-isar/recipes-installer/deploy-
> image/files/install.override.conf b/meta-isar/recipes-
> installer/deploy-image/files/install.override.conf
> index 73874caa..357d8662 100644
> --- a/meta-isar/recipes-installer/deploy-
> image/files/install.override.conf
> +++ b/meta-isar/recipes-installer/deploy-
> image/files/install.override.conf
> @@ -1,5 +1,5 @@
>  [Service]
>  ExecStart=
> -ExecStart=/usr/bin/deploy-image-wic.sh
> +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting in 60
> s'; sleep 60); reboot"

Hello Jan.

This change seems to cause a sort of race issue.

Few days ago "[PATCH v3 0/2] Cover installer image with tests" was
merged. Tests were OK (and "next" is still OK) on two jenkins and one
gitlab instances. But on the third jenkins instance, they were failing.

Boot log:

```
Got config:
  installer_unattended=true
  installer_image_uri=/install/isar-image-ci-debian-bookworm-
qemuamd64.wic.zst
  installer_target_dev=/dev/sda
  installer_target_overwrite=OVERWRITE
bmaptool: ERROR: An error occurred, here is the traceback:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line 116, in
open_block_device
    descriptor = os.open(path, os.O_WRONLY | os.O_EXCL)

bmaptool: ERROR: cannot open block device '/dev/sda' in exclusive mode:
[Errno 16] Device or resource busy: '/dev/sda'

Rebooting in 60 s
```

But either machine is not booted, or there are some "ext4" errors
happen.

I couldn't understand few things:
 - why this happens (unable to get exclusive access to block device)$
 - why device is written even if there is such error in logs$
 - why sometimes we don't see this error in logs$
 - why this doesn't happen when "unattended mode" is off.

The reason is that we have these two "getty" instances overriden by
custom one having the race. While "serial-getty@ttyS0.service.d" does
the job, "getty@tty1.service.d" produces this error visible in logs and
reboot machine in 60 seconds.

In some cases 60sec is enough for first service to finish writing to
the disc, but on slower machines we result in incomplete copy and
broken system.

Before this patch, when "reboot" calls were in the script, "failed"
instance simply exited and didn't try to reboot the board.

As a quick fix, I would reverted this patch.

Also, I think this behaviour may impact on bmap-tools "psplash" patches
currently we have on maillist. I would expect that user won't see the
progress bar on screen because it's somewhere on serial console.


>  StandardInput=tty
>  StandardOutput=tty
> -- 
> 2.43.0
>
Jan Kiszka March 14, 2025, 5:54 a.m. UTC | #2
On 13.03.25 14:10, Uladzimir Bely wrote:
> On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users wrote:
>> From: Jan Kiszka <jan.kiszka@siemens.com>
>>
>> Pull the reboot out of the script. This allows for cleaner
>> integration
>> with different calling environment, may they be a systemd unit, an
>> initramfs script or simply a shell for testing purposes.
>>
>> And if the script exits with an error, wait a minute before rebooting
>> the system, rather than just trying to re-execute it. This permits to
>> inspect potential error as well.
>>
>> Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
>> ---
>>  .../recipes-installer/deploy-image/files/deploy-image-wic.sh    | 2
>> +-
>>  .../recipes-installer/deploy-image/files/install.override.conf  | 2
>> +-
>>  2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/meta-isar/recipes-installer/deploy-image/files/deploy-
>> image-wic.sh b/meta-isar/recipes-installer/deploy-image/files/deploy-
>> image-wic.sh
>> index 8043aff1..12c1eea2 100644
>> --- a/meta-isar/recipes-installer/deploy-image/files/deploy-image-
>> wic.sh
>> +++ b/meta-isar/recipes-installer/deploy-image/files/deploy-image-
>> wic.sh
>> @@ -105,4 +105,4 @@ fi
>>  umount "$installdata"
>>  sync
>>  dialog --title "Reboot" --msgbox "Installation is successful. System
>> will be rebooted. Please remove the USB stick." 7 60
>> -reboot
>> +exit 0
>> diff --git a/meta-isar/recipes-installer/deploy-
>> image/files/install.override.conf b/meta-isar/recipes-
>> installer/deploy-image/files/install.override.conf
>> index 73874caa..357d8662 100644
>> --- a/meta-isar/recipes-installer/deploy-
>> image/files/install.override.conf
>> +++ b/meta-isar/recipes-installer/deploy-
>> image/files/install.override.conf
>> @@ -1,5 +1,5 @@
>>  [Service]
>>  ExecStart=
>> -ExecStart=/usr/bin/deploy-image-wic.sh
>> +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting in 60
>> s'; sleep 60); reboot"
> 
> Hello Jan.
> 
> This change seems to cause a sort of race issue.
> 
> Few days ago "[PATCH v3 0/2] Cover installer image with tests" was
> merged. Tests were OK (and "next" is still OK) on two jenkins and one
> gitlab instances. But on the third jenkins instance, they were failing.
> 
> Boot log:
> 
> ```
> Got config:
>   installer_unattended=true
>   installer_image_uri=/install/isar-image-ci-debian-bookworm-
> qemuamd64.wic.zst
>   installer_target_dev=/dev/sda
>   installer_target_overwrite=OVERWRITE
> bmaptool: ERROR: An error occurred, here is the traceback:
> Traceback (most recent call last):
>   File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line 116, in
> open_block_device
>     descriptor = os.open(path, os.O_WRONLY | os.O_EXCL)
> 
> bmaptool: ERROR: cannot open block device '/dev/sda' in exclusive mode:
> [Errno 16] Device or resource busy: '/dev/sda'
> 
> Rebooting in 60 s
> ```
> 
> But either machine is not booted, or there are some "ext4" errors
> happen.
> 
> I couldn't understand few things:
>  - why this happens (unable to get exclusive access to block device)$
>  - why device is written even if there is such error in logs$
>  - why sometimes we don't see this error in logs$
>  - why this doesn't happen when "unattended mode" is off.
> 
> The reason is that we have these two "getty" instances overriden by
> custom one having the race. While "serial-getty@ttyS0.service.d" does
> the job, "getty@tty1.service.d" produces this error visible in logs and
> reboot machine in 60 seconds.
> 
> In some cases 60sec is enough for first service to finish writing to
> the disc, but on slower machines we result in incomplete copy and
> broken system.
> 
> Before this patch, when "reboot" calls were in the script, "failed"
> instance simply exited and didn't try to reboot the board.
> 
> As a quick fix, I would reverted this patch.

That won't be a fix, just papering over.

The race is apparently around the target device becoming ready for
bmaptool, or our script racing with some other service accessing the
device for a while. I suspect the service is lacking some dependency
here. So, let's identify the race partner first: How to reproduce locally?

Jan

> 
> Also, I think this behaviour may impact on bmap-tools "psplash" patches
> currently we have on maillist. I would expect that user won't see the
> progress bar on screen because it's somewhere on serial console.
> 
> 
>>  StandardInput=tty
>>  StandardOutput=tty
>> -- 
>> 2.43.0
>>
>
Uladzimir Bely March 14, 2025, 6:25 a.m. UTC | #3
On Fri, 2025-03-14 at 06:54 +0100, Jan Kiszka wrote:
> On 13.03.25 14:10, Uladzimir Bely wrote:
> > On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users
> > wrote:
> > > From: Jan Kiszka <jan.kiszka@siemens.com>
> > > 
> > > Pull the reboot out of the script. This allows for cleaner
> > > integration
> > > with different calling environment, may they be a systemd unit,
> > > an
> > > initramfs script or simply a shell for testing purposes.
> > > 
> > > And if the script exits with an error, wait a minute before
> > > rebooting
> > > the system, rather than just trying to re-execute it. This
> > > permits to
> > > inspect potential error as well.
> > > 
> > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> > > ---
> > >  .../recipes-installer/deploy-image/files/deploy-image-wic.sh   
> > > | 2
> > > +-
> > >  .../recipes-installer/deploy-image/files/install.override.conf 
> > > | 2
> > > +-
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/meta-isar/recipes-installer/deploy-
> > > image/files/deploy-
> > > image-wic.sh b/meta-isar/recipes-installer/deploy-
> > > image/files/deploy-
> > > image-wic.sh
> > > index 8043aff1..12c1eea2 100644
> > > --- a/meta-isar/recipes-installer/deploy-image/files/deploy-
> > > image-
> > > wic.sh
> > > +++ b/meta-isar/recipes-installer/deploy-image/files/deploy-
> > > image-
> > > wic.sh
> > > @@ -105,4 +105,4 @@ fi
> > >  umount "$installdata"
> > >  sync
> > >  dialog --title "Reboot" --msgbox "Installation is successful.
> > > System
> > > will be rebooted. Please remove the USB stick." 7 60
> > > -reboot
> > > +exit 0
> > > diff --git a/meta-isar/recipes-installer/deploy-
> > > image/files/install.override.conf b/meta-isar/recipes-
> > > installer/deploy-image/files/install.override.conf
> > > index 73874caa..357d8662 100644
> > > --- a/meta-isar/recipes-installer/deploy-
> > > image/files/install.override.conf
> > > +++ b/meta-isar/recipes-installer/deploy-
> > > image/files/install.override.conf
> > > @@ -1,5 +1,5 @@
> > >  [Service]
> > >  ExecStart=
> > > -ExecStart=/usr/bin/deploy-image-wic.sh
> > > +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting in
> > > 60
> > > s'; sleep 60); reboot"
> > 
> > Hello Jan.
> > 
> > This change seems to cause a sort of race issue.
> > 
> > Few days ago "[PATCH v3 0/2] Cover installer image with tests" was
> > merged. Tests were OK (and "next" is still OK) on two jenkins and
> > one
> > gitlab instances. But on the third jenkins instance, they were
> > failing.
> > 
> > Boot log:
> > 
> > ```
> > Got config:
> >   installer_unattended=true
> >   installer_image_uri=/install/isar-image-ci-debian-bookworm-
> > qemuamd64.wic.zst
> >   installer_target_dev=/dev/sda
> >   installer_target_overwrite=OVERWRITE
> > bmaptool: ERROR: An error occurred, here is the traceback:
> > Traceback (most recent call last):
> >   File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line 116,
> > in
> > open_block_device
> >     descriptor = os.open(path, os.O_WRONLY | os.O_EXCL)
> > 
> > bmaptool: ERROR: cannot open block device '/dev/sda' in exclusive
> > mode:
> > [Errno 16] Device or resource busy: '/dev/sda'
> > 
> > Rebooting in 60 s
> > ```
> > 
> > But either machine is not booted, or there are some "ext4" errors
> > happen.
> > 
> > I couldn't understand few things:
> >  - why this happens (unable to get exclusive access to block
> > device)$
> >  - why device is written even if there is such error in logs$
> >  - why sometimes we don't see this error in logs$
> >  - why this doesn't happen when "unattended mode" is off.
> > 
> > The reason is that we have these two "getty" instances overriden by
> > custom one having the race. While "serial-getty@ttyS0.service.d"
> > does
> > the job, "getty@tty1.service.d" produces this error visible in logs
> > and
> > reboot machine in 60 seconds.
> > 
> > In some cases 60sec is enough for first service to finish writing
> > to
> > the disc, but on slower machines we result in incomplete copy and
> > broken system.
> > 
> > Before this patch, when "reboot" calls were in the script, "failed"
> > instance simply exited and didn't try to reboot the board.
> > 
> > As a quick fix, I would reverted this patch.
> 
> That won't be a fix, just papering over.
> 
> The race is apparently around the target device becoming ready for
> bmaptool, or our script racing with some other service accessing the
> device for a while. I suspect the service is lacking some dependency
> here. So, let's identify the race partner first: How to reproduce
> locally?
> 
> Jan
> 

No, my finding is that getty@tty1.service.d conflicts with
serial-getty@ttyS0.service.d one. Every of them tries to run bmaptool
when unattended mode is active.

meta-isar/recipes-installer/deploy-image-service/deploy-image-
service.bb:

```
do_install() {
  install -m 0600 ${WORKDIR}/install.override.conf
${D}/usr/lib/systemd/system/getty@tty1.service.d/override.conf
  install -m 0600 ${WORKDIR}/install.override.conf
${D}/usr/lib/systemd/system/serial-getty@ttyS0.service.d/override.conf
}
```

It can be easy reproduced if we decrease timeout from 60s to e.g. 10s,
to simulate slower system:

```
isar $ sed -i -e 's/60/10/g' meta-isar/recipes-installer/deploy-image-
service/files/install.override.conf 
isar $ ./kas/kas-container shell kas/isar.yaml 
builder@b21a175d3b78:/build$ rm -rf conf/
builder@b21a175d3b78:/build$ /repo/scripts/ci_build.sh -T installer
```

Failed instance of overriden getty reboots the system while the target
is being populated with the not failed instance.

Reboot hangs at something like

```
[    9.323801] JBD2: no valid journal superblock found
[    9.325117] EXT4-fs (sda2): error loading journal
```


> > 
> > Also, I think this behaviour may impact on bmap-tools "psplash"
> > patches
> > currently we have on maillist. I would expect that user won't see
> > the
> > progress bar on screen because it's somewhere on serial console.
> > 
> > 
> > >  StandardInput=tty
> > >  StandardOutput=tty
> > > -- 
> > > 2.43.0
> > > 
> > 
>
Heinisch, Alexander March 14, 2025, 2:25 p.m. UTC | #4
On Fri, 2025-03-14 at 09:25 +0300, Uladzimir Bely wrote:
> On Fri, 2025-03-14 at 06:54 +0100, Jan Kiszka wrote:
> > On 13.03.25 14:10, Uladzimir Bely wrote:
> > > On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users
> > > wrote:
> > > > From: Jan Kiszka <jan.kiszka@siemens.com>
> > > > 
> > > > Pull the reboot out of the script. This allows for cleaner
> > > > integration
> > > > with different calling environment, may they be a systemd unit,
> > > > an
> > > > initramfs script or simply a shell for testing purposes.
> > > > 
> > > > And if the script exits with an error, wait a minute before
> > > > rebooting
> > > > the system, rather than just trying to re-execute it. This
> > > > permits to
> > > > inspect potential error as well.
> > > > 
> > > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> > > > ---
> > > >  .../recipes-installer/deploy-image/files/deploy-image-
> > > > wic.sh   
> > > > > 2
> > > > +-
> > > >  .../recipes-installer/deploy-
> > > > image/files/install.override.conf 
> > > > > 2
> > > > +-
> > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/meta-isar/recipes-installer/deploy-
> > > > image/files/deploy-
> > > > image-wic.sh b/meta-isar/recipes-installer/deploy-
> > > > image/files/deploy-
> > > > image-wic.sh
> > > > index 8043aff1..12c1eea2 100644
> > > > --- a/meta-isar/recipes-installer/deploy-image/files/deploy-
> > > > image-
> > > > wic.sh
> > > > +++ b/meta-isar/recipes-installer/deploy-image/files/deploy-
> > > > image-
> > > > wic.sh
> > > > @@ -105,4 +105,4 @@ fi
> > > >  umount "$installdata"
> > > >  sync
> > > >  dialog --title "Reboot" --msgbox "Installation is successful.
> > > > System
> > > > will be rebooted. Please remove the USB stick." 7 60
> > > > -reboot
> > > > +exit 0
> > > > diff --git a/meta-isar/recipes-installer/deploy-
> > > > image/files/install.override.conf b/meta-isar/recipes-
> > > > installer/deploy-image/files/install.override.conf
> > > > index 73874caa..357d8662 100644
> > > > --- a/meta-isar/recipes-installer/deploy-
> > > > image/files/install.override.conf
> > > > +++ b/meta-isar/recipes-installer/deploy-
> > > > image/files/install.override.conf
> > > > @@ -1,5 +1,5 @@
> > > >  [Service]
> > > >  ExecStart=
> > > > -ExecStart=/usr/bin/deploy-image-wic.sh
> > > > +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting
> > > > in
> > > > 60
> > > > s'; sleep 60); reboot"
> > > 
> > > Hello Jan.
> > > 
> > > This change seems to cause a sort of race issue.
> > > 
> > > Few days ago "[PATCH v3 0/2] Cover installer image with tests"
> > > was
> > > merged. Tests were OK (and "next" is still OK) on two jenkins and
> > > one
> > > gitlab instances. But on the third jenkins instance, they were
> > > failing.
> > > 
> > > Boot log:
> > > 
> > > ```
> > > Got config:
> > >   installer_unattended=true
> > >   installer_image_uri=/install/isar-image-ci-debian-bookworm-
> > > qemuamd64.wic.zst
> > >   installer_target_dev=/dev/sda
> > >   installer_target_overwrite=OVERWRITE
> > > bmaptool: ERROR: An error occurred, here is the traceback:
> > > Traceback (most recent call last):
> > >   File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line
> > > 116,
> > > in
> > > open_block_device
> > >     descriptor = os.open(path, os.O_WRONLY | os.O_EXCL)
> > > 
> > > bmaptool: ERROR: cannot open block device '/dev/sda' in exclusive
> > > mode:
> > > [Errno 16] Device or resource busy: '/dev/sda'
> > > 
> > > Rebooting in 60 s
> > > ```
> > > 
> > > But either machine is not booted, or there are some "ext4" errors
> > > happen.
> > > 
> > > I couldn't understand few things:
> > >  - why this happens (unable to get exclusive access to block
> > > device)$
> > >  - why device is written even if there is such error in logs$
> > >  - why sometimes we don't see this error in logs$
> > >  - why this doesn't happen when "unattended mode" is off.


It does not happen when "unattended mode" is off, because the race is
interrupted by the script waiting for user input at the very beginning.

This is a race condition between multiple instances of the install
script executed in parallel, dependent on how many gettys are active.
Since the number of gettys and further, which getty get's connected in
the end, via terminal, via serial, ... is kind of hard to predict, imo
the only option to resolve that race is by adding a dedicated
(singleton) service which is executed on startup.

I am currently preparing a patch set where a fix for that is included,
similar to this:

```
[Unit]
Description=Target-bootstrapper service

[Service]
User=root
ExecStart=/bin/sh -c "/usr/bin/target-bootstrapper.sh || (echo
'Rebooting in 60 s'; sleep 60); reboot"

StandardOutput=journal+console
TTYPath=/dev/console

[Install]
WantedBy=multi-user.target
``

BR Alexander

> > > 
> > > The reason is that we have these two "getty" instances overriden
> > > by
> > > custom one having the race. While "serial-getty@ttyS0.service.d"
> > > does
> > > the job, "getty@tty1.service.d" produces this error visible in
> > > logs
> > > and
> > > reboot machine in 60 seconds.
> > > 
> > > In some cases 60sec is enough for first service to finish writing
> > > to
> > > the disc, but on slower machines we result in incomplete copy and
> > > broken system.
> > > 
> > > Before this patch, when "reboot" calls were in the script,
> > > "failed"
> > > instance simply exited and didn't try to reboot the board.
> > > 
> > > As a quick fix, I would reverted this patch.
> > 
> > That won't be a fix, just papering over.
> > 
> > The race is apparently around the target device becoming ready for
> > bmaptool, or our script racing with some other service accessing
> > the
> > device for a while. I suspect the service is lacking some
> > dependency
> > here. So, let's identify the race partner first: How to reproduce
> > locally?
> > 
> > Jan
> > 
> 
> No, my finding is that getty@tty1.service.d conflicts with
> serial-getty@ttyS0.service.d one. Every of them tries to run bmaptool
> when unattended mode is active.
> 
> meta-isar/recipes-installer/deploy-image-service/deploy-image-
> service.bb:
> 
> ```
> do_install() {
>   install -m 0600 ${WORKDIR}/install.override.conf
> ${D}/usr/lib/systemd/system/getty@tty1.service.d/override.conf
>   install -m 0600 ${WORKDIR}/install.override.conf
> ${D}/usr/lib/systemd/system/serial-getty@ttyS0.service.d/override.con
> f
> }
> ```
> 
> It can be easy reproduced if we decrease timeout from 60s to e.g.
> 10s,
> to simulate slower system:
> 
> ```
> isar $ sed -i -e 's/60/10/g' meta-isar/recipes-installer/deploy-
> image-
> service/files/install.override.conf 
> isar $ ./kas/kas-container shell kas/isar.yaml 
> builder@b21a175d3b78:/build$ rm -rf conf/
> builder@b21a175d3b78:/build$ /repo/scripts/ci_build.sh -T installer
> ```
> 
> Failed instance of overriden getty reboots the system while the
> target
> is being populated with the not failed instance.
> 
> Reboot hangs at something like
> 
> ```
> [    9.323801] JBD2: no valid journal superblock found
> [    9.325117] EXT4-fs (sda2): error loading journal
> ```
> 
> 
> > > 
> > > Also, I think this behaviour may impact on bmap-tools "psplash"
> > > patches
> > > currently we have on maillist. I would expect that user won't see
> > > the
> > > progress bar on screen because it's somewhere on serial console.
> > > 
> > > 
> > > >  StandardInput=tty
> > > >  StandardOutput=tty
> > > > -- 
> > > > 2.43.0
> > > > 
> > > 
> > 
> 
> -- 
> Best regards,
> Uladzimir.
> 
> 
>
Uladzimir Bely March 14, 2025, 3:04 p.m. UTC | #5
On Fri, 2025-03-14 at 14:25 +0000, Heinisch, Alexander wrote:
> On Fri, 2025-03-14 at 09:25 +0300, Uladzimir Bely wrote:
> > On Fri, 2025-03-14 at 06:54 +0100, Jan Kiszka wrote:
> > > On 13.03.25 14:10, Uladzimir Bely wrote:
> > > > On Tue, 2024-07-02 at 15:38 +0200, 'Jan Kiszka' via isar-users
> > > > wrote:
> > > > > From: Jan Kiszka <jan.kiszka@siemens.com>
> > > > > 
> > > > > Pull the reboot out of the script. This allows for cleaner
> > > > > integration
> > > > > with different calling environment, may they be a systemd
> > > > > unit,
> > > > > an
> > > > > initramfs script or simply a shell for testing purposes.
> > > > > 
> > > > > And if the script exits with an error, wait a minute before
> > > > > rebooting
> > > > > the system, rather than just trying to re-execute it. This
> > > > > permits to
> > > > > inspect potential error as well.
> > > > > 
> > > > > Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
> > > > > ---
> > > > >  .../recipes-installer/deploy-image/files/deploy-image-
> > > > > wic.sh   
> > > > > > 2
> > > > > +-
> > > > >  .../recipes-installer/deploy-
> > > > > image/files/install.override.conf 
> > > > > > 2
> > > > > +-
> > > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/meta-isar/recipes-installer/deploy-
> > > > > image/files/deploy-
> > > > > image-wic.sh b/meta-isar/recipes-installer/deploy-
> > > > > image/files/deploy-
> > > > > image-wic.sh
> > > > > index 8043aff1..12c1eea2 100644
> > > > > --- a/meta-isar/recipes-installer/deploy-image/files/deploy-
> > > > > image-
> > > > > wic.sh
> > > > > +++ b/meta-isar/recipes-installer/deploy-image/files/deploy-
> > > > > image-
> > > > > wic.sh
> > > > > @@ -105,4 +105,4 @@ fi
> > > > >  umount "$installdata"
> > > > >  sync
> > > > >  dialog --title "Reboot" --msgbox "Installation is
> > > > > successful.
> > > > > System
> > > > > will be rebooted. Please remove the USB stick." 7 60
> > > > > -reboot
> > > > > +exit 0
> > > > > diff --git a/meta-isar/recipes-installer/deploy-
> > > > > image/files/install.override.conf b/meta-isar/recipes-
> > > > > installer/deploy-image/files/install.override.conf
> > > > > index 73874caa..357d8662 100644
> > > > > --- a/meta-isar/recipes-installer/deploy-
> > > > > image/files/install.override.conf
> > > > > +++ b/meta-isar/recipes-installer/deploy-
> > > > > image/files/install.override.conf
> > > > > @@ -1,5 +1,5 @@
> > > > >  [Service]
> > > > >  ExecStart=
> > > > > -ExecStart=/usr/bin/deploy-image-wic.sh
> > > > > +ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo
> > > > > 'Rebooting
> > > > > in
> > > > > 60
> > > > > s'; sleep 60); reboot"
> > > > 
> > > > Hello Jan.
> > > > 
> > > > This change seems to cause a sort of race issue.
> > > > 
> > > > Few days ago "[PATCH v3 0/2] Cover installer image with tests"
> > > > was
> > > > merged. Tests were OK (and "next" is still OK) on two jenkins
> > > > and
> > > > one
> > > > gitlab instances. But on the third jenkins instance, they were
> > > > failing.
> > > > 
> > > > Boot log:
> > > > 
> > > > ```
> > > > Got config:
> > > >   installer_unattended=true
> > > >   installer_image_uri=/install/isar-image-ci-debian-bookworm-
> > > > qemuamd64.wic.zst
> > > >   installer_target_dev=/dev/sda
> > > >   installer_target_overwrite=OVERWRITE
> > > > bmaptool: ERROR: An error occurred, here is the traceback:
> > > > Traceback (most recent call last):
> > > >   File "/usr/lib/python3/dist-packages/bmaptools/CLI.py", line
> > > > 116,
> > > > in
> > > > open_block_device
> > > >     descriptor = os.open(path, os.O_WRONLY | os.O_EXCL)
> > > > 
> > > > bmaptool: ERROR: cannot open block device '/dev/sda' in
> > > > exclusive
> > > > mode:
> > > > [Errno 16] Device or resource busy: '/dev/sda'
> > > > 
> > > > Rebooting in 60 s
> > > > ```
> > > > 
> > > > But either machine is not booted, or there are some "ext4"
> > > > errors
> > > > happen.
> > > > 
> > > > I couldn't understand few things:
> > > >  - why this happens (unable to get exclusive access to block
> > > > device)$
> > > >  - why device is written even if there is such error in logs$
> > > >  - why sometimes we don't see this error in logs$
> > > >  - why this doesn't happen when "unattended mode" is off.
> 
> 
> It does not happen when "unattended mode" is off, because the race is
> interrupted by the script waiting for user input at the very
> beginning.
> 
> This is a race condition between multiple instances of the install
> script executed in parallel, dependent on how many gettys are active.
> Since the number of gettys and further, which getty get's connected
> in
> the end, via terminal, via serial, ... is kind of hard to predict,
> imo
> the only option to resolve that race is by adding a dedicated
> (singleton) service which is executed on startup.
> 
> I am currently preparing a patch set where a fix for that is
> included,
> similar to this:

Yes, there were rhetorical questions that get ansered after later
debugging.

I wanted (but didn't have time to try) to do something like in the
script:

```
if ! $installer_unattended; then
    if [ "$(tty)" != "/dev/ttyS0" ]; then
        echo "Disable unattended mode on $(tty), it's active on
/dev/ttyS0"
        installer_unattended=0
    fi
fi
```

Futher, "/dev/ttyS0" could be not hardcoded but also passed from
installer configuration during build.

> 
> ```
> [Unit]
> Description=Target-bootstrapper service
> 
> [Service]
> User=root
> ExecStart=/bin/sh -c "/usr/bin/target-bootstrapper.sh || (echo
> 'Rebooting in 60 s'; sleep 60); reboot"
> 
> StandardOutput=journal+console
> TTYPath=/dev/console
> 
> [Install]
> WantedBy=multi-user.target
> ``
> 
> BR Alexander
> 
> > > > 
> > > > The reason is that we have these two "getty" instances
> > > > overriden
> > > > by
> > > > custom one having the race. While
> > > > "serial-getty@ttyS0.service.d"
> > > > does
> > > > the job, "getty@tty1.service.d" produces this error visible in
> > > > logs
> > > > and
> > > > reboot machine in 60 seconds.
> > > > 
> > > > In some cases 60sec is enough for first service to finish
> > > > writing
> > > > to
> > > > the disc, but on slower machines we result in incomplete copy
> > > > and
> > > > broken system.
> > > > 
> > > > Before this patch, when "reboot" calls were in the script,
> > > > "failed"
> > > > instance simply exited and didn't try to reboot the board.
> > > > 
> > > > As a quick fix, I would reverted this patch.
> > > 
> > > That won't be a fix, just papering over.
> > > 
> > > The race is apparently around the target device becoming ready
> > > for
> > > bmaptool, or our script racing with some other service accessing
> > > the
> > > device for a while. I suspect the service is lacking some
> > > dependency
> > > here. So, let's identify the race partner first: How to reproduce
> > > locally?
> > > 
> > > Jan
> > > 
> > 
> > No, my finding is that getty@tty1.service.d conflicts with
> > serial-getty@ttyS0.service.d one. Every of them tries to run
> > bmaptool
> > when unattended mode is active.
> > 
> > meta-isar/recipes-installer/deploy-image-service/deploy-image-
> > service.bb:
> > 
> > ```
> > do_install() {
> >   install -m 0600 ${WORKDIR}/install.override.conf
> > ${D}/usr/lib/systemd/system/getty@tty1.service.d/override.conf
> >   install -m 0600 ${WORKDIR}/install.override.conf
> > ${D}/usr/lib/systemd/system/serial-getty@ttyS0.service.d/override.c
> > on
> > f
> > }
> > ```
> > 
> > It can be easy reproduced if we decrease timeout from 60s to e.g.
> > 10s,
> > to simulate slower system:
> > 
> > ```
> > isar $ sed -i -e 's/60/10/g' meta-isar/recipes-installer/deploy-
> > image-
> > service/files/install.override.conf 
> > isar $ ./kas/kas-container shell kas/isar.yaml 
> > builder@b21a175d3b78:/build$ rm -rf conf/
> > builder@b21a175d3b78:/build$ /repo/scripts/ci_build.sh -T installer
> > ```
> > 
> > Failed instance of overriden getty reboots the system while the
> > target
> > is being populated with the not failed instance.
> > 
> > Reboot hangs at something like
> > 
> > ```
> > [    9.323801] JBD2: no valid journal superblock found
> > [    9.325117] EXT4-fs (sda2): error loading journal
> > ```
> > 
> > 
> > > > 
> > > > Also, I think this behaviour may impact on bmap-tools "psplash"
> > > > patches
> > > > currently we have on maillist. I would expect that user won't
> > > > see
> > > > the
> > > > progress bar on screen because it's somewhere on serial
> > > > console.
> > > > 
> > > > 
> > > > >  StandardInput=tty
> > > > >  StandardOutput=tty
> > > > > -- 
> > > > > 2.43.0
> > > > > 
> > > > 
> > > 
> > 
> > -- 
> > Best regards,
> > Uladzimir.
> > 
> > 
> > 
>

Patch

diff --git a/meta-isar/recipes-installer/deploy-image/files/deploy-image-wic.sh b/meta-isar/recipes-installer/deploy-image/files/deploy-image-wic.sh
index 8043aff1..12c1eea2 100644
--- a/meta-isar/recipes-installer/deploy-image/files/deploy-image-wic.sh
+++ b/meta-isar/recipes-installer/deploy-image/files/deploy-image-wic.sh
@@ -105,4 +105,4 @@  fi
 umount "$installdata"
 sync
 dialog --title "Reboot" --msgbox "Installation is successful. System will be rebooted. Please remove the USB stick." 7 60
-reboot
+exit 0
diff --git a/meta-isar/recipes-installer/deploy-image/files/install.override.conf b/meta-isar/recipes-installer/deploy-image/files/install.override.conf
index 73874caa..357d8662 100644
--- a/meta-isar/recipes-installer/deploy-image/files/install.override.conf
+++ b/meta-isar/recipes-installer/deploy-image/files/install.override.conf
@@ -1,5 +1,5 @@ 
 [Service]
 ExecStart=
-ExecStart=/usr/bin/deploy-image-wic.sh
+ExecStart=/bin/sh -c "deploy-image-wic.sh || (echo 'Rebooting in 60 s'; sleep 60); reboot"
 StandardInput=tty
 StandardOutput=tty