• Edit- I set the machine to work last night running memtester and badblocks (read-only). Both tests came back clean, so I assumed I was in the clear. Today, wanting to be extra sure, I ran a read-write badblocks test and watched dmesg while it worked. I got the same errors, this time on ata3.00. Given that the memory test came back clean, and smartctl came back clean as well, I can only assume the problem is with the ata module, or somewhere between the CPU and the ata bus. I’ll be doing a BIOS update this morning and then trying again, but it seems to me like this machine was a bad purchase. I’ll see what options I have for a replacement.

  • Edit-2- I retract my last statement. It appears that only one of the drives is still having issues: the SSD from the original build. All write interactions with the SSD produce I/O errors (including re-partitioning the drive), while there appear to be no errors reading or writing to the HDD. I’m still unsure what caused the issue on the HDD, and I’m still testing (running badblocks rw on the HDD, and I might try to reproduce the issue under heavy load). Safe to say the SSD needs repair or to be pitched. I’m curious if the SSD got damaged, which would explain why the issue remains after being zeroed out and re-written and why the HDD now seems fine. Or maybe multiple SATA ports have failed now?


I have no idea if this is the forum to ask these types of questions, but it felt a little like a murder mystery that would be a little fun to solve. Please let me know if this type of post is unwelcome and I will immediately take it down and return to lurking.

Background:

I am very new to Linux. Last week I purchased a cheap refurbished headless desktop so I could build a home media server, as well as play around with VMs and programming projects. This is my first ever exposure to Linux, but I consider myself otherwise pretty tech-savvy (I dabble in Python scripting in my spare time, but not much beyond that).

This week, I finally got around to getting the server software installed and operating (see details of the build below). Plex was successfully pulling from my media storage and streaming with no problems. As I was getting the docker containers up, I started getting “not enough storage” errors for new installs. I tried purging docker a couple of times and still couldn’t proceed, so I attempted to expand the virtual storage in the VM. I definitely messed this up: Plex immediately stopped working, and no files were visible on the share anymore. To me, it looked as if it had tried to take storage from the SMB share to add to the system files partition. I/O errors on the OMV virtual machine for days.

Take two.

I got a new HDD (so I could keep working while I tried recovery on the SSD). I got everything back up (created a whole new VM for docker and OMV). I gave the docker VM more storage this time (I think I was just reckless with my package downloads anyway) and made sure that the SMB share was properly mounted. As I got the download client running (it made a few downloads), I noticed the OMV virtual machine redlining on memory in the Proxmox window. I thought, uh oh, I should fix that. I tried taking everything down so I could reboot the OMV VM with more memory allocated, but the shutdown process hung on the OMV VM. I made sure all my devices on the network were disconnected, then stopped the VM from the Proxmox window.

On OMV reboot, I noticed all kinds of I/O errors on both the virtual boot drive and the mounted SSD. I could still see files in the share on my LAN devices, but any attempt to interact with the folder stalled and would error out.

I powered down all the VMs and now I’m trying to figure out where I went wrong. I’m tempted to just abandon the VMs and install it all on an Ubuntu OS, but I like the flexibility of having the VMs to spin up new OSes and try things out. The added complexity is obviously over my head, but if I can understand it better I’ll give it another go.

Here’s the build info:

Build:

  • HP ProDesk 600 G1
  • Intel i5
  • upgraded 32 GB after-market DDR3 1600 MHz Patriot RAM
  • KingFlash 250 GB SSD
  • WD 4 TB SSD (originally an NTFS drive from my Windows PC with ~2 TB of existing data)
  • WD 4 TB HDD (bought this after the SSD corrupted, so I could get the server back up while I dealt with the SSD)
  • 500 Mbps Ethernet connection

Hypervisor

  • Proxmox (latest), Ubuntu kernel
  • VM110: Ubuntu-22.04.3-live server amd64, OpenMediaVault 6.5.0
  • VM130: Ubuntu-22.04.3-live, docker engine, portainer
    • Containers: Gluetun, qBittorrent, Sonarr, Radarr, Prowlarr
  • LXC101: Ubuntu-22.04.3, Plex Server
  • Allocations
    • VM110: 4 GB memory, 2 cores (ballooning and swap ON)
    • VM130: 30 GB memory, 4 cores (ballooning and swap ON)

Shared Media Architecture (attempt 1)

  • Direct-mounted the WD SSD to VM110. Partitioned and formatted the file system inside the GUI, created a folder share, and set permissions for my share user. Shared it as an SMB/CIFS share.
  • bind-mounted the shared folder to a local folder in VM130 (/media/data)
  • passed the mounted folder to the necessary docker containers as volumes in the docker-compose file (e.g. - volumes: /media/data:/data, etc.)

No shame in being told I did something incredibly dumb; I’m here to learn, anyway. Maybe just not learn in a way that destroys 6 months of DVD rips in the process.

  • lemmyvore@feddit.nl
    1 year ago

    Yes, this should be the first thing. Run smartctl -a /dev/sda (replace with your actual HDD device) and look at the attributes. You can copy the output here so we can advise. Typical failure indicators are:

    • Attribute 5 (reallocated sector count)
    • 10 (spin retry count)
    • 184 (end to end error)
    • 187 (reported uncorrectable errors)
    • 188 (command timeout)
    • 197 (current pending sector count)
    • 198 (offline uncorrectable sector count)
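
A minimal sketch of checking those attributes programmatically, assuming smartmontools 7.0 or newer (which can emit JSON with `smartctl -a -j`) and a drive that reports the ATA attribute table. The names `WATCHED`, `flag_attributes` and `read_smart` are made up for this example. A nonzero raw value is only a prompt to look closer, since vendors encode raw values differently.

```python
import json
import subprocess

# Attribute IDs from the list above, with short labels.
WATCHED = {
    5: "reallocated sector count",
    10: "spin retry count",
    184: "end-to-end error",
    187: "reported uncorrectable errors",
    188: "command timeout",
    197: "current pending sector count",
    198: "offline uncorrectable sector count",
}


def flag_attributes(report):
    """Return {id: (label, raw_value)} for watched attributes with a nonzero raw value.

    `report` is the dict parsed from `smartctl -a -j` output.
    """
    table = report.get("ata_smart_attributes", {}).get("table", [])
    flagged = {}
    for attr in table:
        attr_id = attr.get("id")
        raw = attr.get("raw", {}).get("value", 0)
        if attr_id in WATCHED and raw != 0:
            flagged[attr_id] = (WATCHED[attr_id], raw)
    return flagged


def read_smart(device):
    """Run smartctl on `device` (usually needs root) and parse its JSON output."""
    # smartctl's exit status is a bitmask, so a nonzero code is not treated as fatal.
    result = subprocess.run(
        ["smartctl", "-a", "-j", device],
        capture_output=True, text=True, check=False,
    )
    return json.loads(result.stdout)
```

Something like `flag_attributes(read_smart("/dev/sdb"))` would then list only the watched attributes worth a second look, with the device path swapped for the drive in question.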
    • archomrade [he/him] OP
      1 year ago

      This came back clean, though the drives did not have SMART reporting enabled. Looks like the ATA controller or some component between the CPU and SATA bus is fucked.

    • archomrade [he/him] OP
      1 year ago

      Actually, I think I’ve identified an issue with the original SSD. Here’s the readout for sdb, which I was just having more issues with:

      ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
        5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
        9 Power_On_Hours          0x0032   100   100   ---    Old_age   Always       -       2050
       12 Power_Cycle_Count       0x0032   100   100   ---    Old_age   Always       -       11
      165 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       4194345
      166 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
      167 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       159
      168 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       1
      169 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       1859
      170 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
      171 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
      172 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
      173 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       0
      174 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       6
      184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
      187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       105
      188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
      194 Temperature_Celsius     0x0022   074   049   ---    Old_age   Always       -       26 (Min/Max 22/49)
      199 UDMA_CRC_Error_Count    0x0032   100   100   ---    Old_age   Always       -       0
      230 Unknown_SSD_Attribute   0x0032   001   001   ---    Old_age   Always       -       34359738376
      232 Available_Reservd_Space 0x0033   100   100   004    Pre-fail  Always       -       100
      233 Media_Wearout_Indicator 0x0032   100   100   ---    Old_age   Always       -       1773
      234 Unknown_Attribute       0x0032   100   100   ---    Old_age   Always       -       1852
      241 Total_LBAs_Written      0x0030   253   253   ---    Old_age   Offline      -       1787
      242 Total_LBAs_Read         0x0030   253   253   ---    Old_age   Offline      -       9876
      244 Unknown_Attribute       0x0032   000   100   ---    Old_age   Always       -       0
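
Reading that table against the failure indicators listed earlier in the thread can be sketched as below. It is a rough parser for the text layout shown in the readout, fed with a few rows copied from it. The names `READOUT`, `WATCHED_IDS` and `parse_attributes` are invented for this example, and vendor-specific raw values on SSDs don't always mean what the generic attribute name suggests.

```python
# A few rows copied from the sdb readout above.
READOUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   ---    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   ---    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   ---    Old_age   Always       -       105
188 Command_Timeout         0x0032   100   100   ---    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   074   049   ---    Old_age   Always       -       26 (Min/Max 22/49)
"""

# IDs named as typical failure indicators earlier in the thread.
WATCHED_IDS = {5, 10, 184, 187, 188, 197, 198}


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for each data row in the table."""
    rows = {}
    for line in text.splitlines():
        # Ten whitespace-separated columns; the tenth keeps any trailing notes.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            continue  # header row, blank line, or anything unexpected
        raw_first = fields[9].split()[0]
        if raw_first.isdigit():
            rows[int(fields[0])] = (fields[1], int(raw_first))
    return rows


suspicious = {
    attr_id: entry
    for attr_id, entry in parse_attributes(READOUT).items()
    if attr_id in WATCHED_IDS and entry[1] != 0
}
print(suspicious)  # {187: ('Reported_Uncorrect', 105)}
```

On these rows only attribute 187 is flagged, with a raw value of 105, which would fit the write errors described for this SSD.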