How NASA Built Artemis II’s Fault-Tolerant Computer

(cacm.acm.org)

90 points | by speckx 11 hours ago

7 comments

  • dmk 2 hours ago
    The quote from the CMU guy about modern Agile and DevOps approaches challenging architectural discipline is a nice way of saying most of us have completely forgotten how to build deterministic systems. Time-triggered Ethernet with strict frame scheduling feels like it's from a parallel universe compared to how we ship software now.
    • mvkel 7 minutes ago
      If you look at code as art, where its value is a measure of the effort it takes to make, sure. But then there's Banksy.
    • ramraj07 1 hour ago
      I take the opposite message from that line - out of touch teams working on something so over budget and so overdue, and so bureaucratic, and with such an insanely poor history of success, and they talk as if they have cured cancer.

      This is the equivalent of Altavista touting how amazing their custom server racks are when Google just starts up on a rack of naked motherboards and eats their lunch and then the world.

      Let's at least wait till the capsule comes back safely before touting how much better they are than "DevOps" teams running websites, apparently a comparison that's somehow relevant here to stoke egos.

      • danhon 1 hour ago
        You mean like this?

        "With limited funds, Google founders Larry Page and Sergey Brin initially deployed this system of inexpensive, interconnected PCs to process many thousands of search requests per second from Google users. This hardware system reflected the Google search algorithm itself, which is based on tolerating multiple computer failures and optimizing around them. This production server was one of about thirty such racks in the first Google data center. Even though many of the installed PCs never worked and were difficult to repair, these racks provided Google with its first large-scale computing system and allowed the company to grow quickly and at minimal cost."

        https://blog.codinghorror.com/building-a-computer-the-google...

        • ramraj07 16 minutes ago
          The problem they solved isn't easy. But it's not some insane technical breakthrough either. Literally add redundancy, that's the ask. They didn't invent quantum computing to solve the issue, did they? Why dunk on sprints?
        • 1970-01-01 27 minutes ago
          Google later came to regret not doing this with ECC RAM: https://news.ycombinator.com/item?id=14206811
          • ramraj07 18 minutes ago
            It got them to where they needed to be to then worry about ECC. This is like the dudes who deploy their blog on Kubernetes just in case it hits the front page of the New York Times or something.
      • bfung 17 minutes ago
        One simply does not [“provision” more hardware|(reboot systems)|(redeploy software)] in space.
      • bluegatty 53 minutes ago
        No, space is just hard.

        Everything is bespoke.

        You need 10x cost to get every extra '9' in reliability and manned flight needs a lot of nines.

        People died on the Apollo missions.

        It just costs that much.
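
        To put rough numbers on the "10x per nine" rule of thumb above, here is a minimal sketch. The 10x factor is the commenter's heuristic, not a measured figure, and the two-nines baseline and the function names are assumptions made purely for illustration.

```python
def failure_probability(nines: int) -> float:
    """Chance of failure for a system with the given number of nines.

    Example: 3 nines means 99.9% reliable, so about a 0.001 chance of failure.
    """
    return 10.0 ** (-nines)


def relative_cost(nines: int, base_nines: int = 2, factor: float = 10.0) -> float:
    """Cost multiplier relative to a baseline system.

    Assumes every extra nine multiplies cost by `factor` (the commenter's
    rule of thumb). The baseline of 2 nines is an arbitrary illustration.
    """
    return factor ** (nines - base_nines)


if __name__ == "__main__":
    for n in range(2, 7):
        print(f"{n} nines: failure chance {failure_probability(n):.6f}, "
              f"relative cost {relative_cost(n):,.0f}x")
```

        Under these assumptions, going from two nines to five nines costs three factors of ten, which is the shape of the argument being made here.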

        • ramraj07 15 minutes ago
          Yep, spend 100 billion on what should have cost 1/50th that cost, and send people up to the moon with rockets that we are still keeping our fingers crossed won't kill them tomorrow, and we have to congratulate them for dunking on some irrelevant career?
        • arduanika 42 minutes ago
          Please, this is hacker news. Nothing else is hard outside of our generic software jobs, and we could totally solve any other industry in an afternoon.
          • geerlingguy 38 minutes ago
            I mean I can just replace Dropbox with a shell script.
            • bluegatty 30 minutes ago
              That's funny because you could! Dropbox started as a shell script :)

              Funny though I would assume HN people would respect how hard real-time stuff and 'hardened' stuff is.

      • HNisCIS 46 minutes ago
        What would you suggest? Vibe coding a react app that runs on a Mac mini to control trajectory? What happens when that Mac mini gets hit with an SEU or even a SEGR? Guess everyone just dies?
        • ramraj07 9 minutes ago
          All I'm suggesting is to be humble about your mediocre solutions. This is not the only solution and not necessarily that ingenious. Why do you need to bring up vibecoding here? Because people who criticize arrogant NASA engineers are also AI idiots by default?
      • simoncion 1 hour ago
        > ...they talk as if they have cured cancer.

        I'd chalk that up to the author of the article writing for a relatively nontechnical audience and asking for quotes at that level.

    • arduanika 36 minutes ago
      You could even say that part of the value of Artemis is that we're remembering how to do some very hard things, including the software side. This is something that you can't fake. In a world where one of the more plausible threats of AI is the atrophy of real human skills -- the goose that lays the golden eggs that trains the models -- this is a software feat where I'd claim you couldn't rely on vibe code, at least not fully.

      That alone is worth my tax dollars.

  • y1n0 46 minutes ago
    NASA didn't build this, Lockheed Martin and their subcontractors did. Articles and headlines like this make people think that NASA does a lot more than they actually do. This is like a CEO claiming credit for everything a company does.
    • voodoo_child 26 minutes ago
      Nice “well, actually”. I’m sure Lockheed were building this quad-redundant, radiation-hardened PowerPC that costs millions of dollars and communicates via Time-Triggered Ethernet anyway, whether NASA needed one or not.
  • jbritton 1 hour ago
    I wonder how often problems happen that the redundancy solves. Is radiation actually flipping bits, and at what frequency? Can a solar flare cause all the computers to go haywire?
    • EdNutting 31 minutes ago
      Not a direct answer but probably as good information as you can get: https://static.googleusercontent.com/media/research.google.c...

      Basically, yes, radiation does cause bit flips, more often than you might expect (still rare in the grand scheme of things, but often enough to matter).

      And radiation in space is much “worse” (in quotes because that word glosses over a huge number of different problems, not just intensity).
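
      As a rough illustration of how redundancy can mask a single bit flip, here is a minimal sketch of bitwise majority voting across three copies of a value (generic triple modular redundancy). This is a textbook technique and not a description of the actual voting scheme used on Artemis II. The function names and the example word are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three redundant copies.

    Each output bit is 1 only if at least two of the three inputs have
    that bit set, so a flip in any single copy is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Simulate a single-event upset by toggling one bit of a word."""
    return word ^ (1 << bit)


if __name__ == "__main__":
    original = 0b1011_0010
    corrupted = flip_bit(original, 4)  # one copy takes a hit
    recovered = majority_vote(original, corrupted, original)
    print(f"original  {original:08b}")
    print(f"corrupted {corrupted:08b}")
    print(f"recovered {recovered:08b}")
```

      The limit of the scheme is visible in the same code: if two copies suffer the same flip, the vote returns the wrong value, which is one reason real systems add further copies and independent checks.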

  • object-a 1 hour ago
    How big of a challenge are hardware faults and radiation for orbital data centers? It seems like you’d eat a lot of capacity if you need 4x redundancy for everything
    • totetsu 1 hour ago
      They don't go into it here, but I thought that NASA also used something like 250nm chips in space for radiation resistance. Are there even any radiation-resistant GPUs out there?
      • kersplody 21 minutes ago
        NOPE, RAD hardened space parts basically froze on mid 2000s tech: https://www.baesystems.com/en-us/product/radiation-hardened-...
      • pclmulqdq 1 hour ago
        Absolutely not, although the latest fabs with rad-tolerant processors are at ~20 nm. There are FDSOI processes in that generation that I assume can be made radiation-tolerant.
      • linzhangrun 1 hour ago
        It seems not; radiation resistance primarily relies on using older manufacturing processes, including for military equipment, and then adding a shielded casing or hardware redundancy and error correction similar to ECC.
  • starkparker 10 hours ago
    Headline needs its how-dectomy reverted to make sense
    • arduanika 44 minutes ago
      (Off-topic:) Great word. Is that the usual word for it? Totally apt, and it should be the standard.