A toy RTOS inside Super Mario Bros. using emulator save states

(prettygoodblog.com)

246 points | by notorious_pgb 1 day ago

20 comments

Sohcahtoa82 1 day ago
That's pretty neat.
The metaphor I usually go for when it comes to for threads is having several widgets that need to be assembled. To begin working on Widget B, you have to pause the work on Widget A. Adding SMP is like adding a second person to help assemble widgets, but you're still using the same toolbox, so if you both need the same screwdriver, there's no performance benefit to having a second person. Multicore is having multiple workbenches each with their own toolbox so they can operate completely in parallel.
A mutex is like having a specialized tool that can only be used by one widget assembly at a time. Like, each workbench might have a full set of screwdrivers, hammers, wrenches, sockets, etc., but you only have 1 welder.
A semaphore is a tool that can be used by a limited number of widgets, like an oven that can fit up to 4 widgets.
[-]
- notorious_pgb 1 day ago
  I think this is a pretty damn good metaphor for the introduction of these concepts, especially in a context where the learner(s) might need to imminently apply this knowledge.
  What drove me to write this post (as someone who didn't attend college where I might've learned exactly what I'm about to say -- big grain of salt) is that it took me being forced by circumstance to become a firmware engineer for a time to actually understand not just how to multithread well, but what the hell this all even was.
  And I guess I saw one too many Reddit comments (mistake #1) which downplayed the idea that there's any benefit to knowing the answer to such things.
  That all said, this blog post / metaphor is definitely aimed at someone who's already been using threads effectively for some time -- I don't think it'd fare very well as a first introduction.
- eru 1 day ago
  As a special challenge: explain (Software) Transactional Memory, as eg implemented in Haskell, in your metaphor.
  STM is pretty similar to how some databases implement transactions.
  [-]
  - Sohcahtoa82 23 hours ago
    [dead]
- iwontberude 21 hours ago
  SMP is a second processor, even more redundant than a second core. Its own cache, memory, Northbridge access, etc.
  [-]
  - Sohcahtoa82 14 hours ago
    Yeah, I messed up SMT vs SMP. Can't edit it now, though.
    [-]
    - notorious_pgb 14 hours ago
      Ah well -- the article itself has its own fair share of accuracy issues -- and I learned something from your conversation, which wouldn't have happened without said mistake.
- ImPostingOnHN 1 day ago
  > Adding SMP is like adding a second person to help assemble widgets, but you're still using the same toolbox, so if you both need the same screwdriver, there's no performance benefit to having a second person.
  This sounds like hyperthreading – or, maybe I'm missing the screwdriver in the metaphor :)
  [-]
  - ranger207 1 day ago
    Hyperthreading is Intel's implementation of SMP
    [-]
    - _ihaque 23 hours ago
      I think you mean SMT (simultaneous multithreading), as opposed to SMP (symmetric multiprocessing)?
      [-]
      - Sohcahtoa82 23 hours ago
        Ah hell, that was MY mistake. It should be SMT in my original comment and it's way too late to edit it now.
    - Sohcahtoa82 1 day ago
      Funny thing is, when I was first typing that comment, I started with "Hyperthreading" and deleted it in favor of SMP to be vendor agnostic.
      EDIT: And it should have been SMT anyways!
ignormies 1 day ago
This is a super cool visual demonstration of RTOS/scheduling! I love the region-based critical sections!
I took a real-time operating systems course in university as an elective. One of the hardest courses I took the whole four years, but also one of the most interesting. Had a great professor, who gave really demanding, but very instructive, project-based assignments.
I need to find a toy project to play around with this domain again.
[-]
- notorious_pgb 1 day ago
  Thank you!
  I'm curious how effective you feel this specific example might've been if it were delivered during your course. I suspect I've stumbled across a really helpful teaching tool, but having not gone to university, I don't actually know how this stuff is being taught :v
  [-]
  - throwanem 1 day ago
    Nor I, but I've learned from the metaphor, for what that's worth. And the demystification of primitives has considerable value these days.
    [-]
    - notorious_pgb 1 day ago
      > Nor I, but I've learned from the metaphor, for what that's worth.
      It's the sickest possible outcome for me, so: worth a lot!
noone_youknow 1 day ago
This is simply amazing work, I take my hat off to you. And not just the threading in an emulator (which is a genius concept all by itself) but for the article itself - I have spent years trying to find the right way to articulate exactly your points in hopes of engaging people to learn more about the systems they’re building on top of.
From now on, I’ll simply send out a link to this. Thank you!
[-]
- notorious_pgb 1 day ago
  My person!
  I worried that I'd come across as a high-horse dickhead in the mystification section, when that's the fundamental opposite of my intention and attitude here. I'm glad to hear that it resonated with another.
  Although I guess we could both be high-horse dickheads.
  [-]
  - peterkelly 1 day ago
    The mystification section was my favorite part - you make some excellent points there.
  - noone_youknow 23 hours ago
    IMO the mystification section is right on the money, I couldn’t have said it better myself (and believe me, I’ve tried).
  - upghost 1 day ago
    Not only that but now, thanks to you, we can all be high-horse dickheads concurrently.
Ericson2314 23 hours ago
Sorry to be an pedant.
From your previous post
> The first thing to understand about threads is that processors do not support them.
Yes, but I would say by changing the emulator (augmenting the emulator software with your other software, sure) in the outside world, not doing something in the inside world, you have are implementing something closer akin to hardware concurrency support or processes.
> It doesn't map 1:1 to true multithreading (e.g., emulator save states represent the state of the entire machine including RAM, whereas thread contexts represent a much more minimal slice
Right, but I think if we're going to make a threads vs process distinction (vs concurrency in general) whether we have only a small opt-in set of communication channels / synchronization primitives (the mutex and barrier), or a wide open one (shared memory, be careful!) is the operative distinction.
gwbas1c 1 day ago
Last night I read an article about eliminating lag in emulators. It's done with a similar concept. Basically, for each frame, the emulator calculates the state for different button combinations. Then, based on the button you actually push, the state shown moves to the precalculated one.
[-]
- jchw 1 day ago
  Not 100% familiar with exactly what this is but I am familiar with run-ahead, which is basically also the same idea as GGPO, but for eliminating lag:
  - Run the emulator one frame normally, using real polled input. Somehow snapshot the state.
  - Then, run the emulator n frames (usually just one) with the same input. Present the video + audio from the last frame.
  - Synchronize. (Some emulators can get very fancy; instead of just waiting for vsync, they'll also delay until the end of the window minus estimated processing time, to poll input at the last possible moment.)
  - Roll back to the saved snapshot. (I believe you can also optimize if you know the inputs really didn't change, avoiding a lot of rollbacks at the cost of less predictable frame times.)
  The main reason this is even a good idea is because most games will have some of their own processing latency by design, so jumping a frame or two ahead usually doesn't have any noticeable side-effects. This is a pretty cool idea since obviously modern computers with LCD screens have a lot more latency basically everywhere versus older simpler machines connected to CRTs.
  Unfortunately, this sort of approach only works when your emulator's state is small enough and fast enough to restore.
  I actually have been dying to experiment with designing an emulator to have fast incremental snapshots from the ground up to see if you could manage to make this feasible for more modern consoles. You could, for example, track dirty memory pages with userfaultfd/MEM_WRITE_WATCH, and design structures like JIT caches to be able to handle rewinding without having to drop the entire cache. I'm actually not sure that all emulators clear their caches upon loading state, but you know, more generally, I would like to know how fast and small you could get save states to be if you were designing for that from the ground up.
  [-]
  - achierius 1 day ago
    What do you mean by JIT cache in this case -- like an inline data cache, or like a cache for the jitted binary code?
    [-]
    - jchw 1 day ago
      The latter, though I'm just using it as an example really. I think for most emulator core designs today you wind up dumping a lot of ephemeral state when you rewind or load state (and sometimes even saving state has some overhead due to synchronization.)
- cortesoft 1 day ago
  i think this is kind of the inverse of how some emulators that add remote play deal with network lag... they will predict what control the opponent will give for the current frame and display it, but then when they actually receive the real opponent control some number of frames later, they will go back to that frame and re-calculate based on knowing the actual control input for that frame, and then catch up to the current frame. That way the game isn't waiting for remote inputs to refresh the screen, but can then reconcile once all information is known.
- notorious_pgb 1 day ago
  Interesting; which emulator(s) was this for? I have to imagine this strategy is mostly effective below a certain total complexity level -- however you want to define that -- and the more modern you get, the more you're lucky to even render a frame at all.
- potato3732842 1 day ago
  Sounds a lot like speculative execution for hardware.
  [-]
  - sparky_ 17 hours ago
    Was just thinking the same, this is effectively branch prediction/precalculation.
- cevn 1 day ago
  Umm, that's brilliant. Is this technique employed in all modern emulators or a recent thing?
aftbit 1 day ago
https://web.archive.org/web/20250528203416/https://prettygoo...
https://archive.ph/msHkO
Cloudef 1 day ago
Isn't this more like coroutines / green threads? I guess it doesn't matter as they are essentially same concept apart from cooperative scheduling vs. preemptive scheduling.
[-]
- notorious_pgb 1 day ago
  I suppose the actual thing it is is closer to what you're describing, but the thing that it makes viscerally understandable is threads itself -- I hope.
mort96 20 hours ago
This is probably a stupid question, but why real-time? Isn't it enough to "just" make an OS in SMB, doesn't adding real-time guarantees make it even more difficult, by restricting what allocator and scheduling algorithms you can use?
In any case, this looks like a really cool project.
abyesilyurt 1 day ago
Tsoding recently built coroutines for C from scratch. The concept of coroutines is related to threads, so I bet you’d like the video if you liked the article.
https://youtu.be/sYSP_elDdZw
CGamesPlay 1 day ago
Would be neat if the subworld memory was shared between the save states, to really drive home the mutex concept.
[-]
- notorious_pgb 1 day ago
  This is one of the core weaknesses of the analogy that I'm trying to figure out how to resolve: there's no shared memory, unlike actual threads.
  I think the solution requires that the game code is what's executing threading syscalls, not the Lua around the game code.
  It'd definitely be a super sick evolution to add true shared in-game memory.
  [-]
  - kelnos 1 day ago
    I guess in a way the three Mario threads are more like processes? And they're using OS-level or shared-memory mutexes and condvars that can be shared between processes?
    [-]
    - notorious_pgb 1 day ago
      I see what you're getting at -- though then it raises the question: what would it mean for those processes to have multiple threads within them?
      The way I've been looking at it, if I want to take this further, processes would maybe be _NES ROMs_, and each process can have multiple threads. Swapping between processes would inherently be more expensive than between threads within a process, which I think would be a really instructive aspect. Also, the whole idea of "sharing memory" between entirely different games is obviously ridiculous, which could be yet another teaching tool.
      Then if we want to expand it even further, maybe multicore support would look like multiple emulator instances, connected via sockets to support cross-core IPC. You'd probably want to institute an "all the threads of a process have to be on the same core" rule (so they can locally share primitives in the same Lua script), which would be a great way to demonstrate the principle of building systems which adapt to their realistic constraints, as opposed to building systems which exactly model a textbook.
      I've gotten ahead of myself, though.
      [-]
      - Ericson2314 23 hours ago
        Easier to leave out the parallelism and just look at concurrency.
        For a game like Mario, you can split up the memory and decide what you want to be shared and what do you want to be per-thread. E.g. starting small you can rig up just a few fixed variables like scores or whatever to be persisted from one game to the next. trying to push that further without causing endless corruption should be fun :)
  - leni536 1 day ago
    You have critical section in the bonus section behind the pipe. Maybe shared state protected by that critical section could be your shared memory, like the state of the coins there.
liamwire 1 day ago
Brilliant visual demonstration, really helps build an intuition for the concepts at play.
Refreeze5224 1 day ago
This is a fantastic way of demonstrating threading and scheduling, well done!
dsfdsfsfew 1 day ago
Kind of pushing the definition of RTOS here, I was expecting something like using ACE to run a concurrent program, still an interesting and informative post.
[-]
- notorious_pgb 1 day ago
  You're goddamn right it's pushing it.
  I struggled with how to convey this; the ultimate goal is to viscerally demonstrate what a thread IS, so I'm comfortable with it being... something strange... but I do want it to resemble actual concurrent systems to a useful degree.
  Like, the thread scheduler code is userland code -- similar to, say, FreeRTOS -- but the "threads" themselves aren't exactly threads in an exact sense, since their context packages are entire save states for a console, not just its processor.
  Also, the game code (the true userland?) has no ability to access the threading API whatsoever, which really harms the analogy.
  So I went with RTOS because it's a preemptive scheduler with synchronous reschedule-on-block; where the scheduler is implemented in the same codebase as the code using it. But to be honest, nothing I know of fits.
loas 1 day ago
I really enjoyed this, thank you.
anthk 1 day ago
Forth does that too. https://www.bradrodriguez.com/papers/mtasking.html
hei-lima 1 day ago
Wow! So cool!
bitwize 1 day ago
"...But first we need to talk about parallel universes."
shakna 1 day ago
If you're going to make this your DooM thing, clearly the next thing to do is grab a DooM engine and get to work. (PsyDoom has Lua support even).
But congrats. This is worth being included in a uni course.
[-]
- notorious_pgb 1 day ago
  It'd be really interesting to push this to that level -- the big challenge likely being that DooM's state (including video memory) is certainly much larger and expensive to swap than an NES save state.
  One would have to get really clever...
  [-]
  - noone_youknow 16 hours ago
    … and introduce something like virtual memory?
curtisszmania 1 day ago
[dead]