How efficient can Meteor be while sharing a huge collection among many clients?

- QUESTION -
Imagine the following case:
  • 1,000 clients are connected to a Meteor page displaying the content of the "Somestuff" collection.
  • "Somestuff" is a collection holding 1,000 items.
  • Someone inserts a new item into the "Somestuff" collection
What will happen:
  • All Meteor.Collections on clients will be updated i.e. the insertion forwarded to all of them (which means one insertion message sent to 1,000 clients)
What is the cost in terms of CPU for the server to determine which clients need to be updated?
Is it accurate that only the inserted value will be forwarded to the clients, and not the whole list?
How does this work in real life? Are there any benchmarks or experiments of such scale available?

- ANSWER -

The short answer is that only new data gets sent down the wire. Here's how it works.
There are three important parts of the Meteor server that manage subscriptions: the publish function, which defines the logic for what data the subscription provides; the Mongo driver, which watches the database for changes; and the merge box, which combines all of a client's active subscriptions and sends them out over the network to the client.

Publish functions

Each time a Meteor client subscribes to a collection, the server runs a publish function. The publish function's job is to figure out the set of documents that its client should have and send each document property into the merge box. It runs once for each new subscribing client. You can put any JavaScript you want in the publish function, such as arbitrarily complex access control using this.userId. The publish function sends data into the merge box by calling this.added, this.changed and this.removed. See the full publish documentation for more details.
Most publish functions don't have to muck around with the low-level added, changed and removed API, though. If a publish function returns a Mongo cursor, the Meteor server automatically connects the output of the Mongo driver (insert, update, and removed callbacks) to the input of the merge box (this.added, this.changed and this.removed). It's pretty neat that you can do all the permission checks up front in a publish function and then directly connect the database driver to the merge box without any user code in the way. And when autopublish is turned on, even this little bit is hidden: the server automatically sets up a query for all documents in each collection and pushes them into the merge box.
On the other hand, you aren't limited to publishing database queries. For example, you can write a publish function that reads a GPS position from a device inside a Meteor.setInterval, or polls a legacy REST API from another web service. In those cases, you'd emit changes to the merge box by calling the low-level added, changed and removed DDP API.
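To make the low-level API more concrete, here is a minimal sketch of a publish-style function that feeds non-database data into the merge box. It runs outside Meteor against a stand-in object: makeFakeSubscription, publishPosition, readPosition, and the 'positions' and 'device1' names are all hypothetical, chosen only for illustration. Only the added, changed, removed and ready method names mirror what a publish function sees on this.

```javascript
// Hypothetical stand-in for the object Meteor binds to `this` inside a
// publish function. It only records the calls it receives.
function makeFakeSubscription() {
  const calls = [];
  return {
    calls,
    added(collection, id, fields) { calls.push(['added', collection, id, fields]); },
    changed(collection, id, fields) { calls.push(['changed', collection, id, fields]); },
    removed(collection, id) { calls.push(['removed', collection, id]); },
    ready() { calls.push(['ready']); },
  };
}

// A publish-style function that emits a device position instead of a cursor.
// `readPosition` is a hypothetical callback standing in for the GPS device.
function publishPosition(readPosition) {
  this.added('positions', 'device1', readPosition());
  this.ready();
  // Return a function to invoke on each tick; in a real app this would be
  // driven by a timer such as Meteor.setInterval.
  return () => this.changed('positions', 'device1', readPosition());
}

const sub = makeFakeSubscription();
const readings = [{ lat: 1, lng: 2 }, { lat: 1, lng: 3 }];
let nextReading = 0;
const tick = publishPosition.call(sub, () => readings[nextReading++]);
tick(); // simulate one timer tick
console.log(sub.calls.map((call) => call[0]));
```

In a real app the same body would be registered with Meteor.publish, and the merge box would receive these calls instead of the recording stub.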

The Mongo driver

The Mongo driver's job is to watch the Mongo database for changes to live queries. These queries run continuously and return updates as the results change by calling added, removed, and changed callbacks.
Mongo is not a real time database. So the driver polls. It keeps an in-memory copy of the last query result for each active live query. On each polling cycle, it compares the new result with the previous saved result, computing the minimum set of added, removed, and changed events that describe the difference. If multiple callers register callbacks for the same live query, the driver only watches one copy of the query, calling each registered callback with the same result.
Each time the server updates a collection, the driver recalculates each live query on that collection. (Future versions of Meteor will expose a scaling API for limiting which live queries recalculate on update.) The driver also polls each live query on a 10 second timer to catch out-of-band database updates that bypassed the Meteor server.
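The poll-and-diff step can be sketched in a few lines of plain JavaScript. This is an illustration of the idea, not the driver's actual code: diffQueryResults is a hypothetical name, documents are assumed to carry an _id, and equality is checked crudely with JSON.stringify.

```javascript
// Compare two query results (arrays of documents with an `_id`) and return
// the events that turn `oldDocs` into `newDocs`.
function diffQueryResults(oldDocs, newDocs) {
  const oldById = new Map(oldDocs.map((doc) => [doc._id, doc]));
  const newIds = new Set(newDocs.map((doc) => doc._id));
  const events = [];

  for (const doc of newDocs) {
    const previous = oldById.get(doc._id);
    if (previous === undefined) {
      events.push({ type: 'added', doc });
    } else if (JSON.stringify(previous) !== JSON.stringify(doc)) {
      // Crude equality check, good enough for a sketch; key order matters.
      events.push({ type: 'changed', doc });
    }
  }
  for (const doc of oldDocs) {
    if (!newIds.has(doc._id)) events.push({ type: 'removed', id: doc._id });
  }
  return events;
}

const before = [{ _id: 'a', n: 1 }, { _id: 'b', n: 2 }];
const after = [{ _id: 'a', n: 1 }, { _id: 'b', n: 3 }, { _id: 'c', n: 4 }];
const events = diffQueryResults(before, after);
console.log(events); // one 'changed' event for b, one 'added' event for c
```

The cost of this step grows with the size of the query result, not with the number of subscribers, which is why sharing one live query among many callers matters.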

The merge box

The job of the merge box is to combine the results (added, changed and removed calls) of all of a client's active publish functions into a single data stream. There is one merge box for each connected client. It holds a complete copy of the client's minimongo cache.
In your example with just a single subscription, the merge box is essentially a pass-through. But a more complex app can have multiple subscriptions which might overlap. If two subscriptions both set the same attribute on the same document, the merge box decides which value takes priority and only sends that to the client. We haven't exposed the API for setting subscription priority yet. For now, priority is determined by the order the client subscribes to data sets. The first subscription a client makes has the highest priority, the second subscription is next highest, and so on.
Because the merge box holds the client's state, it can send the minimum amount of data to keep each client up to date, no matter what a publish function feeds it.
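As a rough illustration of that bookkeeping, here is a simplified merge box for one client. It is a sketch under assumptions, not Meteor's implementation: the MergeBox class, the numeric priority argument (0 is the earliest subscription) and the message shape are hypothetical, and only the added path is modelled.

```javascript
// Hypothetical merge box for a single client. `send` stands in for whatever
// writes a message to that client's connection.
class MergeBox {
  constructor(send) {
    this.send = send;
    this.docs = new Map(); // doc id -> Map(field -> [{ priority, value }])
  }

  // Record that the subscription with this priority provides `fields` for
  // document `id`; notify the client only if what it should see changes.
  added(priority, id, fields) {
    const isNewDoc = !this.docs.has(id);
    if (isNewDoc) this.docs.set(id, new Map());
    const doc = this.docs.get(id);
    const visible = {};

    for (const [field, value] of Object.entries(fields)) {
      const stack = doc.get(field) || [];
      const oldWinner = stack[0];
      stack.push({ priority, value });
      stack.sort((a, b) => a.priority - b.priority); // lowest number wins
      doc.set(field, stack);
      const winner = stack[0];
      if (!oldWinner || oldWinner.value !== winner.value) {
        visible[field] = winner.value;
      }
    }

    if (isNewDoc) {
      this.send({ msg: 'added', id, fields: visible });
    } else if (Object.keys(visible).length > 0) {
      this.send({ msg: 'changed', id, fields: visible });
    }
  }
}

const sent = [];
const box = new MergeBox((message) => sent.push(message));
box.added(0, 'doc1', { title: 'from first sub' });  // new to the client
box.added(1, 'doc1', { title: 'from second sub' }); // shadowed by first sub
box.added(1, 'doc1', { body: 'extra field' });      // new field
console.log(sent);
```

Only the first and third calls produce a message: the second subscription's title is shadowed by the higher-priority one, so the client's view does not change and nothing is sent.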

What happens on an update

So now we've set the stage for your scenario.
We have 1,000 connected clients. Each is subscribed to the same live Mongo query (Somestuff.find({})). Since the query is the same for each client, the driver is only running one live query. There are 1,000 active merge boxes. And each client's publish function registered added, changed, and removed callbacks on that live query, feeding into one of the merge boxes. Nothing else is connected to the merge boxes.
First the Mongo driver. When one of the clients inserts a new document into Somestuff, it triggers a recomputation. The Mongo driver reruns the query for all documents in Somestuff, compares the result to the previous result in memory, finds that there is one new document, and calls each of the 1,000 registered insert callbacks.
Next, the publish functions. There's very little happening here: each of the 1,000 insert callbacks pushes data into the merge box by calling added.
Finally, each merge box checks these new attributes against its in-memory copy of its client's cache. In each case, it finds that the values aren't yet on the client and don't shadow an existing value. So the merge box emits a DDP DATA message on the SockJS connection to its client and updates its server-side in-memory copy.
Total CPU cost is the cost to diff one Mongo query, plus the cost of 1,000 merge boxes checking their clients' state and constructing a new DDP message payload. The only data that flows over the wire is a single JSON object sent to each of the 1,000 clients, corresponding to the new document in the database, plus one RPC message to the server from the client that made the original insert.
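The cost accounting can be checked with a toy simulation in plain JavaScript. Everything here (the counters object, liveQuery, recompute) is a hypothetical model of the scenario rather than Meteor code; it only counts how often each stage runs.

```javascript
const CLIENT_COUNT = 1000;
const counters = { queryReruns: 0, callbackCalls: 0, messagesSent: 0 };

// One shared live query: a cached result plus one callback per subscriber.
const liveQuery = { lastIds: new Set(), callbacks: [] };

// One merge box per client, modelled as the set of ids the client holds.
for (let i = 0; i < CLIENT_COUNT; i++) {
  const clientIds = new Set();
  liveQuery.callbacks.push((doc) => {
    counters.callbackCalls++;
    if (!clientIds.has(doc._id)) {
      clientIds.add(doc._id);
      counters.messagesSent++; // one message carrying only the new document
    }
  });
}

// Rerun the query once, diff against the cached result, fan out additions.
function recompute(collection) {
  counters.queryReruns++;
  for (const doc of collection) {
    if (!liveQuery.lastIds.has(doc._id)) {
      liveQuery.lastIds.add(doc._id);
      liveQuery.callbacks.forEach((callback) => callback(doc));
    }
  }
}

const somestuff = [];
for (let i = 0; i < 1000; i++) somestuff.push({ _id: 'item' + i });
recompute(somestuff); // initial load: every client receives 1,000 documents

// Reset the counters, then perform the single insert from the scenario.
counters.queryReruns = 0;
counters.callbackCalls = 0;
counters.messagesSent = 0;
somestuff.push({ _id: 'item1000' });
recompute(somestuff);
console.log(counters);
```

After the insert the counters show one query rerun, 1,000 callback calls and 1,000 messages, matching the breakdown above: the diff is paid once, and the per-client work is one check and one small message each.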

Optimizations

Here's what we definitely have planned.
  • More efficient Mongo driver. We optimized the driver in 0.5.1 to only run a single observer per distinct query.
  • Not every DB change should trigger a recomputation of a query. We can make some automated improvements, but the best approach is an API that lets the developer specify which queries need to rerun. For example, it's obvious to a developer that inserting a message into one chatroom should not invalidate a live query for the messages in a second room.
  • The Mongo driver, publish function, and merge box don't need to run in the same process, or even on the same machine. Some applications run complex live queries and need more CPU to watch the database. Others have only a few distinct queries (imagine a blog engine), but possibly many connected clients -- these need more CPU for merge boxes. Separating these components will let us scale each piece independently.
  • Many databases support triggers that fire when a row is updated and provide the old and new rows. With that feature, a database driver could register a trigger instead of polling for changes.
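The chatroom example in the second item can be pictured with a small predicate. This is purely a hypothetical sketch of what a developer-supplied invalidation check might look like, not a Meteor API: insertAffectsQuery is an invented name, and it only handles selectors made of plain equality fields.

```javascript
// Hypothetical helper, not a Meteor API: decide whether inserting `doc`
// could change the result of a live query whose selector contains only
// simple equality fields.
function insertAffectsQuery(selector, doc) {
  return Object.entries(selector).every(
    ([field, value]) => doc[field] === value
  );
}

const newMessage = { room: 'lobby', text: 'hi' };
const lobbyNeedsRerun = insertAffectsQuery({ room: 'lobby' }, newMessage);
const devNeedsRerun = insertAffectsQuery({ room: 'dev' }, newMessage);
const allNeedsRerun = insertAffectsQuery({}, newMessage);
console.log(lobbyNeedsRerun, devNeedsRerun, allNeedsRerun);
```

With a check like this, a message inserted into the lobby would rerun the lobby query and the unfiltered query, but leave the live query for the dev room alone.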
