mirror of
https://github.com/servo/servo
synced 2026-04-25 17:15:48 +02:00
Page:
Transcription parallelism
Pages
Adding a new WebIDL binding
Alternative Logo Proposals and Related Swag
Asynchronous WebAssembly compilation project
Austin Oxidation
Autogeneration of style structs
Basic SVG support project
Beginner's guide to rebasing and squashing
Benchmarking
Benchmarks
Bots
Browser Engine Research
Build Errors FAQ
Buildbot administration
Building for Android
Building for Magic Leap
Building for UWP
Building on ARM desktop Linux
Building
CI Services we use
CSS parse error reporting
CSSOM student project
Canvas rendering project
Cargo upgrade service project
Code rust concurrency
Code Review
Code of Conduct
Coding standards
Compiler upgrade recipes
Compositor Layer Design
Contributing
Control Servo using WebDriver
Creating and viewing WARC web archives in Servo
Creating new OpenSSL Windows binary distributions
Cross compiling from linux to mac
Crowbot
Css selector matching meeting 2013 07 19
DOM Design
DOM documentation
DOM missing pieces
Debugging JS web compat issues
Debugging and editing tools
Debugging
Design
Developer tools student project
Devtools CSS errors
Devtools plans
Devtools
Diagnosing SpiderMonkey JIT issues
Eric Atkinson visit 2013 09 10
Events and sundry
Expand HTTP request response monitoring
Fetch improvement project
Firefox Reality release notes
FirefoxReality build
Firewall setup for servo master1
Focus student project
Form validation student project
GSoC project brainstorming
Garbage collected DOM
Getting started with layout
GitHub Labels
Github & Critic PR handling 101
Github workflow
Glossary
Governance
Graphics toolkit integration
HTML parser improvement project
HTMLElement binding conversion
HTTP archive support project
HTTP library requirements
Hawaii Rooting
High priority content for layout
Highfive
HoloLens 2 test plan
Home
How to generate GStreamer binaries for CI
Image load conformance student project
Image maps project
Implement HTML charset parsing project
Implement ImageBitmap project
Implement missing WebAudio automation student project
Implement support for missing XMLHttpRequest APIs
Implement worker modules
Implementing a web standard (RGSoC)
Improve specification conformance of unicode bidi library
Incremental flow tree construction
Infrastructure
Integrate xml5ever
Intern project brainstorming
Intern projects
JS objects, wrappers, and cross origin concerns 2013 08 07
Layout 2020
Layout Overview
Layout resources
Layout revamp ideas
Leo meyerovich visit 2013 07 22
Linux sandboxing
London Oxidation
London Security
Meeting 2014 10 27
Meeting 2014 12 08
Meeting 2012 02 08
Meeting 2012 02 16
Meeting 2012 07 20
Meeting 2013 04 01
Meeting 2013 04 15
Meeting 2013 04 22
Meeting 2013 04 29
Meeting 2013 05 06
Meeting 2013 05 13
Meeting 2013 05 20
Meeting 2013 06 03
Meeting 2013 06 10
Meeting 2013 06 14
Meeting 2013 06 17
Meeting 2013 06 24
Meeting 2013 07 01
Meeting 2013 07 15
Meeting 2013 07 22
Meeting 2013 07 29
Meeting 2013 08 05
Meeting 2013 08 12
Meeting 2013 08 19
Meeting 2013 09 09
Meeting 2013 09 16
Meeting 2013 09 23
Meeting 2013 09 30
Meeting 2013 10 14
Meeting 2013 10 21
Meeting 2013 10 28
Meeting 2013 11 04
Meeting 2013 11 18
Meeting 2013 11 25
Meeting 2013 12 02
Meeting 2013 12 09
Meeting 2013 12 16
Meeting 2014 01 06
Meeting 2014 01 13
Meeting 2014 01 21
Meeting 2014 01 27
Meeting 2014 02 03
Meeting 2014 02 10
Meeting 2014 02 24
Meeting 2014 03 10
Meeting 2014 03 17
Meeting 2014 03 24
Meeting 2014 03 31
Meeting 2014 04 07
Meeting 2014 04 14
Meeting 2014 04 21
Meeting 2014 04 28
Meeting 2014 05 05
Meeting 2014 05 13
Meeting 2014 05 19
Meeting 2014 06 09
Meeting 2014 06 17
Meeting 2014 06 23
Meeting 2014 06 30
Meeting 2014 07 07
Meeting 2014 07 14
Meeting 2014 07 21
Meeting 2014 07 29
Meeting 2014 08 04
Meeting 2014 08 11
Meeting 2014 08 12
Meeting 2014 08 18
Meeting 2014 08 25
Meeting 2014 09 08
Meeting 2014 09 15
Meeting 2014 09 22
Meeting 2014 09 29
Meeting 2014 10 06
Meeting 2014 10 13
Meeting 2014 10 20
Meeting 2014 11 10
Meeting 2014 11 17
Meeting 2014 11 24
Meeting 2014 12 15
Meeting 2015 01 05
Meeting 2015 01 12
Meeting 2015 01 26
Meeting 2015 02 09
Meeting 2015 02 23
Meeting 2015 03 02
Meeting 2015 03 16
Meeting 2015 03 30
Meeting 2015 04 06
Meeting 2015 04 13
Meeting 2015 04 27
Meeting 2015 05 04
Meeting 2015 05 11
Meeting 2015 05 18
Meeting 2015 06 01
Meeting 2015 06 08
Meeting 2015 06 15
Meeting 2015 07 06
Meeting 2015 07 13
Meeting 2015 07 27
Meeting 2015 08 10
Meeting 2015 08 17
Meeting 2015 08 24
Meeting 2015 08 31
Meeting 2015 09 14
Meeting 2015 09 21
Meeting 2015 09 28
Meeting 2015 10 05
Meeting 2015 10 12
Meeting 2015 10 19
Meeting 2015 10 26
Meeting 2015 11 02
Meeting 2015 11 09
Meeting 2015 11 16
Meeting 2015 11 30
Meeting 2016 01 04
Meeting 2016 01 11
Meeting 2016 01 25
Meeting 2016 02 01
Meeting 2016 02 08
Meeting 2016 02 22
Meeting 2016 03 07
Meeting 2016 03 21
Meeting Devtools Servo 2
Meetings
Microdata project
Minutes Hackathon 2012 03 27
Missing DOM features project
More ServiceWorker support project
More developer tools student project
Mozlandia Automation
Mozlandia B2S
Mozlandia JS
Mozlandia Rust In Gecko
Mozlandia WPT
Mozlandia gfx
Mozlando Devtools Servo
Mozlando Oxidation
Mozlando SM Servo
Mozlando Servo Bluetooth
Mozlando Servo MagicDOM
Mozlando Servo SMStrings
Mutation observer project
Mutation testing project
NCSU student projects
Network security project
Off main thread HTML parsing project
Offscreen canvas improvements project
Offscreen canvas project
Orlando Oxidation 2018
Oxidation 2015 11 05
Persistent sessions student project
Preparing ARM libraries for CI
Priority of CSS properties
Priority of DOM implementation
Priority of dom bindings
Private browsing student project
Profiling
Project proposal deadlines
Prototype JS form controls student project
Prototype ways of splitting the script crate
Publishing a new ANGLE NuGet version
Publishing a new app store release
Push vs Pull for caching
Random web content project
Refactor GLES2 student project
Refactor bluetooth support student project
Remaining work
Removing push notifications from IRC hooks
Replace C libraries student project
Report new contributors project
Representation of computed style
Research
Reviewer
Roadmap
Running Web Platform Tests on Servo
Rust HTML parser
Rust SpiderMonkey debugger API
Rust cssparser code walk 2013 08 02
SaltStack Administration
San Francisco Oxidation
Servo Benchmarking Report (December 2024)
Servo Benchmarking Report (November 2024)
Servo Benchmarking Report (October 2024)
Servo Layout Engines Report
Servo and SpiderMonkey Report
Servo for Gecko Developers
Specification Links
SpiderMonkey related tasks
SpiderMonkey infodump
SpiderMonkey upgrade details
Storage student project
Streaming webassembly student project
Strings
Student project brainstorm
Student projects
Styling overview
Stylo hacking guide
Summer of Code 2014: Implement XMLHttpRequest
Summer of Code 2016: Fetch API
Summer of Code 2016: File support
Summer of Code 2016: ServiceWorker infrastructure
Summer of Code projects
Summit meeting 2013 09 09
Support WebDriver based tests project
Syncing web platform tests (WPT)
TaskCluster
Testing
Tools
Tracking intermittent failures over time project
Transcription Notes from Servo Architecture talk in Suwon
Transcription notes from rust patterns talk in suwon
Transcription parallelism
Transcription rust concurrency
Transcription rust runtime
Transription layout and acid2
Trinity College Dublin student projects
UPenn student projects
Updating the Rust compiler used by Servo
Upgrading non taskcluster linux CI machines
Upgrading the UWP gstreamer binaries
Upgrading the windows LLVM binaries
Upgrading wptrunner
Using DOM types
Using Rust Spidermonkey Prototype
Using WebWorker Prototype
Version 0.1
Videos and presentations
WebAudio JS interfaces student project
WebAudio nodes student project
WebCompatBug
WebSocket student project
Webdriver student project
Webdriver tests student project
Webrender Overview
Whistler 2019 notes
Whistler Bugzilla
Whistler FFOS
Whistler GFX
Whistler Houdini1
Whistler Houdini2
Whistler Necko
Whistler Oxidation 2019
Work items for new contributors
Workweek COW DOM
Workweek alt js
Workweek android arm
Workweek boot 2 servo
Workweek compiler lints
Workweek displaylist
Workweek dogfooding
Workweek encoding
Workweek generated content
Workweek governance
Workweek graphics stack
Workweek graphics toolkit
Workweek incremental layout
Workweek js bindings status
Workweek layers
Workweek layers2
Workweek pixels
Workweek rasterization
Workweek reftests
Workweek roadmap
Workweek script crate
Workweek security
Workweek string interning
Workweek tables
Workweek writing modes
XML parser student project
infra triage notes
jQuery status
webxr.today support
Clone
1
Transcription parallelism
Lars Bergstrom edited this page 2013-11-13 20:17:49 -08:00
String interning
- We have a string interning version with a linked list of arrays with a has function. Atom has raw ptr to string and lwoercases. In Servo, when it compares strings, it converts to lower, which is really slow. Instead, pointer comparisons, which is a huge speedup.
- Comment from bz is that it's not thread-safe. I just use the global hash table. Does it need to be thread-safe?
- pcwalton: We have also been discussing. I think we have a good idea about how to do this.
- I am trying to implement a lock-free interning. I am using AtomicPtr. Is that the right way?
- brson: I would prefer AtomicOption
- I just need to read the value. Only has take, no read. So I can't read the value unless I remove it from the hash table.
- jack: We should fix that in the std library
- brson: May be intentional. If you want multiple threads, you may want an UnsafeAtomicPtr.
- Currently, just using AtomicPtr and compare-and-swap. This work is in progress. Servo does not use string interning, but we will get good perf improvements with it. In October, we had our internal evaluation with showing numbers. That first (sequential) version had a huge contribution.
Parallel CSS selector matching
- I have not implemented this one; another guy did. Similar to the one in the paper.
- We start with the DOM tree, do a preorder traversal. Split the vector into 8 pieces (for 4 cores). For each piece, we send it to a task, do CSS selector matching independently. After we get the results, we do one more traversal to calculate the CSS values for each node. Gave us about 1.2x perf improvement on our test file (5k nodes, with ~1000 CSS selectors). Then one more pass to calculate the CSS values, which we tried to parallelize with a task. In 10ms by default, but with two tasks, it went to about 200ms.
- Originally, matching was 80ms. The original attempt was 400ms. After the rust fixes, 50-60ms. Works because CSS selector matching currently reads the DOm tree nodes and traverses upwards, so it can be done independently. Sibling selector matching would require us to do it differently.
- jack: So with 8 tasks, 1.2x speedup?
- Yes, better than 4 tasks.
- pcwalton: Probably lost a lot of performance by creating a vector hurt. If we directly followed the links using unsafe code, it would probably be a lot faster. That vector creation probably costs a lot of code.
- How do we do that?
- pcwalton: Traverse the nodes. Use a taskpool with 8 tasks. We probably want to do something like work-stealing. If you want to do a pre-order traversal, Once you process a node, enqueue its children.
- They're independent.
- pcwalton: Then whenever a task picks off a node, it should push the children into the queue before running anything. The other tasks should be looking for work to steal. Sounds like what Leo did in his paper. The moment the parent pushes their children.
- brson: Can we just round-robin the noes?
- jack: Just traverse the tree, create a task for selector matching...
- pcwalton: Too slow to create those tasks.
- brson: Use a pool, maybe, that's persistent. Traverse by sending a message for each node to the tasks in the task pool. But that may be too slow. So maybe tile and send 4-8 at a time.
- pcwalton: Won't the tiling be a sequential bottleneck? I would have an atomic counter so that the next child goes into the work queue for the task.
- kmc: Need a concurrent queue.
- pcwalton: Need for work-stealing anyway.
- jack: Can we use what Rust already uses?
- kmc: Do we need it for layout anyway?
- brson: Traversing n cores is going to be very fast. Can send all the messages to the cores quickly.
- pcwalton: Why not just let the tasks discover their children? Seems like it saves one tree traversal sequentially.
- brson: I'm just thinking it's easier and faster.
- kmc: Maybe need work-stealing now.
- The reason we did this is because we needed to split/chunk the work so we could pass it to other tasks. That gave us good improvement. Maybe another way.
- pcwalton: Maybe batch up your kids and send them to another task?
- Maybe batch up 4 or 8 at a time?
- pcwalton: Ultimately, sad. Work-stealing is going to be much better.
- kmc: When you built the vector, did you make sure the vectors are cache line-aligned? So that multiple processors are not touching the same thing.
- Just have 8 vectors separately allocated. Need to send it unsafely.
- pcwalton: There's a par function in the stdlib that used to be able to chunk and perform work on vectors in parallel. That's probably what you would want.
- kmc: Takes a slice and returns two subslices that can be borrowed independently? parMap?
- pcwalton: Consumes the vector & returns slices that can be sent?
- brson: Traversing the entire DOM seems like a huge bottleneck.
- kmc: ad-hoc traversal
- pcwalton: just work-stealing with a spinlock?
- brson: pthread_mutex in rust is better than that... Maybe we should start using it everywhere and talk about it.
- brson: we need a deque. Let's take one and write it in Rust.
- kmc: jason was going to do it. I should just follow up.
- lbergstrom: I'll take the CSS selector matching and port work stealing to it.
- Task spawn takes a lot of time.
- yichoi: Was early stages, maybe we could test it again.
- brson: task spawning has not been tuned yet.
- kmc: But green threads are our cheap abstraction.
- jack: What does a task that does nothing take? 4kb?
- brson: a little more than that.
- jack: in Erlang, they preallocate the processes.
- brson: We cache nothing associated with tasks. It is a fairly big structure. They're not inlined, etc.
- kmc: Tricks used in kernel.
- lbergstrom: Can you provide the test case?
- Yes.
Succint(compact) representation
- Worked on this in graduate school. Idea is to flatten the DOM in an array using balanced paren scheme. Nodes have ancestor/decendant relationship. Imagine parenthesizing the tree. Allows fast traversal and perform DOM operations by reading the open/close parens and counting. This optimizes the memory space by using bit-per-paren and create an array of corresponding DOM nodes.
- So preorder traversal is just an array walk. Can read the bits from the index to traverse the tree. Might be simple for implementing traversals.
- Counting the number of bits in the byte is required (or we can hash the values into the table) to try and make traversals fast. getParent may be slower (if you start from the back and have to read the numbers) -- no back ptr. But we can build a reverse index.
- kmc: Like succint dictionary representations?
- lbergstrom: how do handle add/removeChild
- We leave empty spaces in the array. Update might be slower because we might have to shift the bits. Instead, we leave spaces.
- jack: So bitmask with empty vs. full, too, then?
- Index on top of them counts the number of bits stored. Forward/backward access allows the traversals. Size of the index is small.
- jack: Do the nodes still contain pointers to their parents? Or only contain non-pointer. first/last/next_sibling/prev_sibling are only in the index now? So, if we wanted to do the COW DOM, do we need two indexes?
- Yes.
- pcwalton: Maybe better for the flow tree. That one does not have to interact with the GC (that's difficult for this one). The flow tree we have complete control over. The flow tree also uses a lot of memory. The flow tree does a lot of traversals and its allocation needs to be extremely fast. It's also what we do all the reflow on. So, we might be able to get some interesting optimizations on it. And its shape does not change once it's built.
- jack: Might work very well for COW DOM, since those pointers are annoying. We were going to have 5 extra pointers. In this representation, we get extra savings.
- pcwalton: Don't we just need one? For the dirty node?
- jack: One plus 5 for each that changed. Because the dirty nodes are interior...
- pcwalton: let's talk later.
- yichoi: Can save memory usage. Not sure, but maybe we can exploit SIMD with this?
- kmc: Many database companies use complicated data structures of this kind.
- Can do; will send it to you.
- dherman: Are you thinking about all nodes everywhere or just some certain kinds of nodes?
- Just wanted to see if this technique seems interesting.
- dherman: Might be 2-3 types of nodes where there are huge wins. For simple nodes with perfect children. Probably try just doing it where we expect it to wind. Like where it's not deep and not a lot of bi-directional/sibling traversals on them. Should see if there's a place where it works.
- I have a paper on this work.
- kmc: If modifications are more expensive, maybe store it on the side.
- Yeah, maybe use more compact representation.
- brson: Send us papers.
Parallelization ideas
- Prescanner to scan the stream and read the external images and download them.
- May be able to reduce the number of traversals if we only have inherited & synthesized attributes. Ones that don't depend on descendants.
- jack: prescanning seems like an obvious win.
- dherman: Don't we do that?
- pcwalton: Nobody as agrressively as the zoom project did. That was a huge win. It's also because their success measure was initial page load (ignoring caches). That's primarily network anyway, so it was a huge victory.
- dherman: Matters for modules. JS will have modules with the name of external modules. That has to implemented in the JS engine side. We could have a prescan_for_requested_modules API. Not immediately necessary, but should fit into the same infrastructure.
Rule hash for CSS selector matching
- Is in other browsers and enables other optimizations from Leo's paper. We take every node and go through all the rules matching each one to see which match the node. Obviously slow.
- Idea: reduce the number of rules we attempt to check against. We partition the rules into sets. If the last part is a selector with an ID (with a #), then we know the node should have id="sidebar". We can put it in a hash table that matches id->that rule. Can do for IDs, classes, and elements. That allows us to reduce the number of rules we need to check.
- (See example on the slide for how it reduces them)
With rule hashing
- 3x speedup on standard web pages with ~1k CSS rules. There is a PR for that. Using this we can implement more optimizations.
- Redundant selector elimination is possible. We can group rules that imply other rules so that we don't have to worry about multiple matching.
- ryan: We combined string interning with the rule hash and got a lot of perf improvement.
- Since not all the hash tables fit in cache, start with ID rules for all nodes, then all class, then element, then the mandatory ones to check and you'll avoid cache misses.
- dherman: No tuning?
- Yes, just do it one at a time. one more optimization is to preallocate a vector for the rules that match a node. Then avoid allocating on the heap. Can have a vector of size 15-ish and only allocate on the heap if you need it.
- I will be working on this over the coming weeks.
- pcwalton: I'll r+ it now.
- jack: Have we started timing how fast we are compared to gecko and webit?
- pcwalton: 10x slower, missing string interning. And the 50x allocations really hurt us. Once we get those close to zero, we will be very competitive, except for where we need the Bloom filter. But that is not all pages. A lot of pages just use ID and CLASS, and then we're fine because it won't help.
- kmc: why that?
- pcwalton: remember your parents. Descendant selector matching.
- jack: Match the right hand side. Then match the parents. The Bloom filter will tell you if the parent is in your tree.
- pcwalton: people mistakenly write it as "a div div" instead of "a > div > div", which casues an explosion. Web developers don't understand that the first one is the wrong and expensive to do the match.
- dherman: I just want a list of tips for how to parallelize for servo.
- pcwalton: We have that for firefox, particularly on selectors.
- kmc: developer UI could show you hints, and people will go nuts for it.
- jdm: This is 2x faster b/c of servo's parallel speedups
- pcwalton: a clear helper would help. e.g., if you can detect when the floats do not impact later, we can tell people to add a clear.
- dherman: at Northeastern, they wrote something that would tell you if you failed
- jack: Like google search, we should show the parallel speedups we got in the UI. Right in the title bar.
- lbergstrom: Can't do speedup without a baseline...
- jack: Just CPU utilization