mirror of
https://github.com/servo/servo
synced 2026-04-25 17:15:48 +02:00
Transcription rust runtime
yichoi edited this page 2013-11-13 15:48:03 -08:00
Brian's talk on the Rust Runtime
- Like to talk about concurrency and the runtime, and specifically some patterns for concurrency. Also talk about the I/O system.
- Also want to know what your concerns are and what other people would like to know. I have a plan, but it would help if you ask questions.
- Have an agenda and will stop at the end of each section to let people ask questions. To start, want to talk about what the runtime is and what it covers. Then talk about features.
- To talk about the runtime, we need to cover some background issues because some Rust design issues shape the features of the runtime.
- ryan: We want to know how many threads are best for performance. Servo created 2*#cores. Is more better?
- So servo does what?
- ryan: For selector matching, servo created 2*#cores in tasks. Is that best?
- This is using the task pool? We ask for the number of cores and create exactly that number of tasks. Was this your code or Simon's code? Did you measure?
- jack: hyperthreading?
- We take HT into account. There were probably problems with naive scheduling in the old scheduler that made it better to create more. Since tasks couldn't migrate in the past, we needed to create a lot of tasks to keep the threads busy. Under the current scheduler, you should not need to overcommit. If there are tasks available to work on, I would expect all of the cores to be running them. But maybe the workload was preventing it? Or maybe having more tasks forces the layout task out. I'd be interested in what's different between task-per-core and 2-tasks-per-core.
- lbergstrom: What's the expected number?
- One task per core. Scheduler design time! Let's talk about that.
- The scheduler has a thread per core. If a machine has two cores. Then we have two threads. Each thread is a scheduler. Each scheduler has a queue of work (called the WorkQueue). The WorkQueue is filled with tasks. Every scheduler prefers to do work out of its own WorkQueue if there is work available. If one scheduler has an empty WorkQueue, then it will go look for work in all of its peers' queues.
- When a scheduler finds out that there is no work in its queue, it becomes a thief. It then looks at all the other schedulers' WorkQueues and steals tasks from the other queues.
- So, all cores should be occupied as long as there is work to do on each core. If you have 4 cores and 4 tasks, they should all be occupied.
- If you have a TaskPool and create a task per core, assuming there is no other work in the system to do, you would expect the tasks to be balanced across all of those cores. In any TaskPool that you create, you should only need one task per core, and you can expect them to distribute across the entire set of schedulers.
- Do you remember what you saw with only 1 task per core? Did it use all of the processors with only 4 tasks on 4 cores instead of 8 tasks on 4 cores?
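The one-task-per-core advice above can be sketched with today's `std::thread` API (a hypothetical illustration; the 2013 green-thread task API no longer exists, and `run_workers` is an invented helper, not Servo code):

```rust
use std::thread;

// Spawn exactly one worker task per requested slot and sum their
// partial results; with one task per core, the scheduler keeps every
// core busy without overcommitting.
fn run_workers(n: u64) -> u64 {
    let handles: Vec<_> = (0..n)
        .map(|i| thread::spawn(move || (0..1_000u64).map(|x| x + i).sum::<u64>()))
        .collect();
    // Joining collects each task's partial result.
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // One task per hardware thread, as the talk recommends.
    let cores = thread::available_parallelism()
        .map(|c| c.get() as u64)
        .unwrap_or(4);
    println!("sum across {} tasks: {}", cores, run_workers(cores));
}
```

With a work-stealing scheduler underneath, these tasks should spread across all cores without any manual placement.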
- QUESTION: how does ownership work?
- Tasks in Rust are owned objects (~Task). Like any other object you pass between tasks, ownership is transferred; the item is just moved like any other item.
- QUESTION: but if I spawn two tasks, then are those tasks in the work queue?
- If you spawn a single task, the task that is currently running is pushed into the work queue. And then the current core starts running the task that you just spawned.
- QUESTION: So we have a main thread and spawn two tasks. Do the two end up in the work queue?
- Yes, two will end up in the work queue. Assume you have an empty queue for the current scheduler. The running task is called MAIN. The first thing that MAIN does is call `do_spawn`. What spawn actually does is call `queue.push(MAIN)`; MAIN goes into the WorkQueue and there is no running task at that point (we are just running the scheduler). We then call `scheduler.run()` with the new task, and now the running task becomes `newtask1`, which was the body of the spawn. So the MAIN task is no longer running and is in the WorkQueue.
- So in a multi-core system, if another core has nothing to do, it will come to the WorkQueue, pop MAIN off of it, and that task will begin to run on the second core.
- QUESTION: So if we have two cores and spawn 4 tasks, then the WorkQueue contains the other tasks and they are not running. If we finish running a task, do we grab another?
- If newtask1 terminates, the next thing that the scheduler will do is try to find something on the current WorkQueue. Otherwise, it will try to steal.
- QUESTION: So if I have N cores and 2N tasks, those tasks are not running and just stay there. So I can only run N tasks concurrently. If there are more than that, they are not running?
- Yes, but this is not supposed to be a problem, because a queued task is not wasted capacity: if you have N cores and at least N tasks in the system ready to work, the entire system should be using the maximum CPU. In theory you should only need one task per core to maximize CPU usage.
- QUESTION: What should we do about hyperthreading?
- We treat HT threads as additional cores in Rust. There will be an additional scheduler.
- pcwalton: maybe talk about OS threads vs. green threads.
OS threads vs. green threads
- One of the key design decisions in Rust is to use user-space thread scheduling. This approach is called green threads. Threads/processes are the unit of processing in an OS, and most runtimes rely on OS threads, but we use green threads. In this model, we do not let the OS manage thread scheduling but instead have Rust manage the threads. This approach has some trade-offs in how the runtime is designed, so it is important to understand why we did this.
- The benefits are primarily that they are cheaper. Creating an OS thread requires lots of resources; constructing and destroying a thread takes a lot of time. OS threads are also expensive because of the cost of context switching between user space and kernel space, especially on x86. When an OS thread is scheduled, the OS has to first switch into the kernel and then back into user mode to start it.
- With green threads, there is still a context switch, but instead of switching into the kernel, all we need to do is switch the registers, which is considerably cheaper. Fast context switching is something that we will rely on heavily in Servo, because we want to do round trips between the script task and layout task in order to make layout queries. In Gecko (and any other web browser), that is just a function call; since we have separated them, we have more work to do. With OS scheduling it would be very expensive. With green threads it is still more expensive than a function call, but we believe that it is fine.
- pcwalton: Our numbers show that the cost of a green-thread switch is still dominated by the cost of performing layout.
- Green threads require much less state, which means you can create many more of them concurrently. In Erlang, you can create millions of tasks because they are all green threads. In Rust, at one point we were counting on the scalability of green threads and millions of small tasks, but that benefit is not going to pay off quite as much. We have great scalability, but hundreds of thousands of concurrent tasks is a more reasonable figure. The challenge is that each task requires at least 4KB of stack space.
- So any time you create a task, you are creating 4KB of memory. On a 32-bit OS, divide your maximum virtual memory by 4KB.
- pcwalton: Go uses 8KB for their default stack size, so we have more space, but only by a factor of two.
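The stack-size trade-off above can be illustrated with modern `std::thread` (a sketch, not the 2013 runtime: OS threads today default to a multi-megabyte stack, but `Builder::stack_size` lets you shrink it toward the small per-task footprint green threads relied on):

```rust
use std::thread;

// Spawn a thread with a deliberately tiny stack and return its result.
fn spawn_small_stack() -> u64 {
    thread::Builder::new()
        .stack_size(64 * 1024) // 64 KB: tiny by OS-thread standards
        .spawn(|| (1u64..=100).sum::<u64>())
        .expect("spawn failed")
        .join()
        .unwrap()
}

fn main() {
    println!("sum = {}", spawn_small_stack());
}
```

A smaller stack means more concurrent threads fit in the same address space, which is exactly the 32-bit virtual-memory arithmetic mentioned above.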
- The performance of threads across different operating systems is very diverse. On Linux, people recommend that you use threads. On Windows, threads are very expensive and do not scale nearly as well. OSX is in the middle. But if the entire threading system is under the control of the runtime, you can count on nearly identical scaling and performance on any platform. That is probably the single best argument in favor of green threads.
- Green threads have some problems. The biggest is that scheduling is very hard and we use heuristics. OSes have been tuned over many years and do scheduling very well, and Rust currently does not have the same amount of tuning as an OS. We will probably not get the same level of fairness, because Rust tasks are not preemptible: they execute until they yield control, whereas the OS can preempt an OS thread at any time.
- jack: when does a rust task give up control?
- Currently ad hoc. Points in the message-passing system where we need to prevent deadlock, so mainly `send()` and `recv()` calls. There are occasionally cases in other concurrency primitives where they need to yield.
- pcwalton: I/O, too. And timers.
- Yes, I/O as well. Networking, files, timers, and printing messages to the screen will all yield.
- jdm: So when you yield, the task goes on the work queue. Where?
- pcwalton: If you yield due to I/O, no.
- If you are blocking the task, it goes to a synchronization object where it can be resumed later, e.g., an I/O handle. The task will stay with that I/O handle until the I/O is completed. The WorkQueue is double-ended; the scheduler owns it and pushes/pops from the front, so locally it acts like a stack.
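The cooperative, non-preemptible scheduling described above can be sketched with a toy round-robin scheduler (a hypothetical illustration, not the real Rust scheduler): each task runs one step, then voluntarily yields control back to the scheduler, which requeues it.

```rust
// What a "step" of a cooperative task reports back to the scheduler.
enum Step {
    Yielded, // task gave up control voluntarily; requeue it
    Done,    // task finished; drop it
}

// Round-robin over cooperative tasks: each runs until it yields,
// like the non-preemptible tasks described in the talk.
fn run_round_robin() -> Vec<&'static str> {
    let mut a = 0;
    let mut b = 0;
    let mut tasks: Vec<(&'static str, Box<dyn FnMut() -> Step>)> = vec![
        ("a", Box::new(move || { a += 1; if a < 3 { Step::Yielded } else { Step::Done } })),
        ("b", Box::new(move || { b += 1; if b < 2 { Step::Yielded } else { Step::Done } })),
    ];
    let mut trace = Vec::new();
    while !tasks.is_empty() {
        let (name, mut f) = tasks.remove(0);
        trace.push(name); // this task now holds the "core"
        if let Step::Yielded = f() {
            tasks.push((name, f)); // yielded: back to the end of the queue
        }
    }
    trace
}

fn main() {
    println!("{:?}", run_round_robin()); // ["a", "b", "a", "b", "a"]
}
```

A task that never yields would starve its peers, which is exactly the fairness risk raised above.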
- jack: If you send a message to another task, is it the case that if you send, you will run the other task immediately?
- Yes, if you send and the other task has called `recv()`, you will pull that task out and run it immediately.
- Drawing this, imagine that you have a sending task. It will call `chan.send(msg)`. The way that script/layout works (and messages in general) is that when the receiver is waiting, the channel itself owns the receiving task (not a WorkQueue). So if the sender sees that the channel has a waiting receiver, the sending task will push itself onto the WorkQueue and begin running the receiver task immediately.
- QUESTION: when do we run the sending task again?
- Another scheduler may steal the sending task while it is on the queue. Otherwise, that sending task on the WorkQueue behaves like any other task on the queue.
- This strategy of stealing from each other's queues is called work-stealing. Any questions?
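The send/recv handoff described above can be sketched with today's `std::sync::mpsc` channels (a modern stand-in; the 2013 chan/port API and the run-the-receiver-immediately optimization are gone): `recv()` blocks the receiving task until a `send()` makes it runnable again.

```rust
use std::sync::mpsc;
use std::thread;

// One sender, one receiver, like the script/layout channel: the
// receiver blocks in recv() until the sender hands it a message.
fn rendezvous(msg: i32) -> i32 {
    let (tx, rx) = mpsc::channel();
    let receiver = thread::spawn(move || rx.recv().unwrap());
    tx.send(msg).unwrap(); // wakes the blocked receiver
    receiver.join().unwrap()
}

fn main() {
    println!("received {}", rendezvous(42)); // received 42
}
```

In the green-thread runtime the sender would also donate its core to the newly runnable receiver, which is what made script/layout round trips cheap.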
- jack: Have we benchmarked the native thread performance? Like, what would the performance be if we didn't have green threads?
- We have a scheduler mode where we can dedicate an entire thread to a task. There was some workload where I was switching back and forth between them. The thread per task mode was surprisingly fast.
- jack: I would like to test servo performance on Linux to see if it is faster with native scheduling.
- Spawn could take a different function for that, and I/O should still work. Right now, we don't have the infrastructure to do raw threads. Also, the single-threaded scheduler mode has a bug that causes a lot of work to be serialized, so it would probably not be a fair test.
Define the runtime
- Sometimes it is hard to know the difference between the runtime and the standard library. I think of the runtime as the code required to support the Rust memory model (especially pointer ownership) as well as the task system. The runtime lives in `std::rt`; it is part of the standard library, but could be separated one day. It provides two layers of important services:
- The global heap (used by `~` "twiddle" pointers): `std::rt::global_heap`. Rust cannot function without the global heap. The whole runtime is written using no runtime services itself except for the global heap (malloc and free). If you can create `~` pointers on the heap, you can create almost anything.
- All of the services that support the task model. This includes the local heap (`@`-boxes, "at boxes"), failure and unwinding the stack on failure, logging (which is task-local), context switching, and task management. All in `std::rt::task`.
- QUESTION: without the global heap, just like C?
- With the global heap, Rust is basically C, and that's what we use to implement the rest of the runtime. Without the global heap, we cannot implement the rest.
- QUESTION: but you have the local heap, right?
- The local heap is very complicated and requires a lot of services. Without tasks you do not have a notion of local.
- kmc: You can always replace the global_heap stuff pretty easily on different platforms.
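For readers of modern Rust: the `~T` owned pointer discussed above corresponds to today's `Box<T>`, an allocation on the global heap (backed by malloc/free through the default global allocator on most platforms) with a single owner and no reference counting. A minimal sketch:

```rust
// 2013's `~T` owned pointer is today's `Box<T>`: one owned global-heap
// allocation, freed automatically when its unique owner drops it.
fn double_boxed(x: u32) -> u32 {
    let owned: Box<u32> = Box::new(x); // was `let owned = ~x;`
    let moved = owned; // ownership transfer, like sending a ~Task between tasks
    *moved * 2 // allocation is freed when `moved` goes out of scope
}

fn main() {
    println!("{}", double_boxed(7)); // 14
}
```

This is why the global heap is the one service the runtime itself can assume: owned allocations need nothing task-local, unlike `@`-boxes.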
- Lots of the runtime is related to scheduling: lots of concurrent data structures, to make sure that tasks behave the way the language specification requires (if such a specification existed). The runtime is all written in Rust (self-hosted).
- QUESTION: memory model?
- Yes, we skipped it
Memory model
- QUESTION: if we have two tasks and one task has @-pointers, and we refer to them from the other one. @-pointers have reference counts. If I use them from another thread using unsafe code, does it still increase the reference counts? The problem was, if I use the @-pointers from the other thread, then when the task finishes, it doesn't seem to free them.
- This is a horrible thing to do :-)
- QUESTION: yes, but that's what the code is doing.
- In this situation, if you are really trying to read the managed pointer from another thread, it would be best to transmute the managed pointer to an unsafe pointer in the original thread and only refer to it as an unsafe pointer in the other thread; then you don't have to worry about the reference counting.
- As to what actually happens with the RC, it's not clear where all the refcount bumps are.
- pcwalton: No, it shouldn't refcount. We should follow up on the layout data.
- jack: This is a common problem.
- Only in servo where people do terrible things.
- jack: Can use `forget` to tell it not to decrement the refcount.
- pcwalton: We never implemented sending the data back to the layout task to be destroyed. The script task is trying to destroy layout data, which is bogus, and the layout task owns @-boxes.
- jack: Why?
- pcwalton: The DOM belongs to the script task.
- jack: Layout has an item, the JS destructor has a problem.
- pcwalton: Should have a channel to send them back to the layout task.
- How does this work at all?
- kmc: We leak them from script (I'm not sure about this)
- pcwalton: Should be a channel to send them back on so they clean up.
- jack: We will work around it for now, but this is broken.
- pcwalton: I leak it, but then we crash on shutdown.
- QUESTION: can you explain why the rust runtime relies on libuv?
- Has to do with the nature of green threads and blocking. In a normal OS thread, when you go to read some data from the network, the operating system will handle the scheduling of the thread and will block it and run some other thread. While you are reading, your thread is no longer executing, it is just waiting for data. Another thread will be scheduled and will start running.
- In a green threads system, if you read from the network, your entire scheduler gets blocked. When that happens no other green threads can run. If that happens on every scheduler, then your tasks are no longer making any progress.
- Fundamentally, traditional stream-based I/O in C, Java, etc. is just incompatible with green threads. What we do instead, when we read from a socket, is not block the entire scheduler: we take the current task, stop running it, and let the scheduler run another task. But you cannot build this on top of normal sockets because they rely on the OS to schedule threads. So instead of handling the I/O inside tasks, we handle it in the scheduler and rely on the scheduler to negotiate both the I/O and the scheduling of tasks. This requires an I/O system that is different from traditional I/O. To explain how this works, I can talk about the structure of the scheduler and event loops.
- So, the scheduler just loops, looking at its queue for tasks to run and running them. It is a simple loop. There is also another I/O model, called evented I/O, exemplified by epoll (kqueue on BSD, I/O completion ports on Windows), where all you do is make a system call in a loop and the system tells you what the next event is. `epoll`, for example, will tell you the next file handle that is available, and you read from it when that happens. That is an I/O event loop as implemented by epoll, and it has parallels to our scheduler event loop. If you are familiar with node.js, it made evented I/O popular: everything in the system responds to I/O events and handles I/O asynchronously with callbacks.
- What we do to avoid blocking the scheduler is build it on top of an I/O event loop. Instead of only looking in the work queue for tasks to run, the scheduler also queries epoll for I/O to handle; it interleaves I/O responses with task execution. When a task reads I/O, the scheduler stops running the task, adds an event to the I/O event loop, and then goes back and starts running another task. That way the I/O is queued up and we do not block the scheduler.
- This is why we use libuv, since it has evented I/O. Critically, it works across all platforms.
- jack: Key feature is that it uses something other than select on Windows.
- dherman: Are there concerns about libuv on other platforms?
- yichoi: Just asking for explanation for others.
- libuv is our biggest runtime dependency. It is also a little immature. We think that it is good, but we are wary of being committed to it forever, because it is not clear that it will always be the best solution. We can run the standard library without linking to libuv, so if libuv ever becomes a problem, we have lots of options for plugging in different event loops.
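The scheduler-plus-event-loop structure described above can be sketched as a toy in modern Rust (a hypothetical illustration, not libuv: an `mpsc` channel stands in for epoll, and a background thread plays the role of the OS completing an I/O request):

```rust
use std::collections::VecDeque;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Toy scheduler loop: interleave running ready tasks with polling for
// completed "I/O" events, instead of ever blocking on a read.
fn event_loop() -> Vec<String> {
    let (io_tx, io_rx) = mpsc::channel();

    // Background thread simulates the OS finishing an I/O request later.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        io_tx.send("read complete".to_string()).unwrap();
    });

    let mut ready: VecDeque<&str> = VecDeque::from(["task-a", "task-b"]);
    let mut log = Vec::new();
    loop {
        // 1. Run a ready task from the local queue, if any.
        if let Some(task) = ready.pop_front() {
            log.push(task.to_string());
        }
        // 2. Poll for I/O completions (epoll's role in the real loop);
        //    a completion would make the blocked task runnable again.
        match io_rx.try_recv() {
            Ok(event) => log.push(event),                   // I/O done
            Err(mpsc::TryRecvError::Empty) => {}            // nothing yet; keep looping
            Err(mpsc::TryRecvError::Disconnected) => break, // no more I/O: shut down
        }
    }
    log
}

fn main() {
    println!("{:?}", event_loop()); // ["task-a", "task-b", "read complete"]
}
```

The key property is the same one the talk attributes to the real runtime: while the "read" is pending, the loop keeps executing other tasks instead of blocking the whole scheduler.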