Monday, December 8, 2014

Some questions on Node's threading and scaling.

I'm new to Node. It looks very wonderful so far, I must say. 

Is there any official documentation on Node Internals, addressing questions such as the following:

1. How is the priority of the main thread (which initiates the asynchronous I/O, creates servers, etc.) handled vis-a-vis the event-handler thread that calls the callbacks? Do I have to know about this aspect of things as a Node programmer?

2. To use multi-cores, I can have multiple Node processes, sure. But how do I serialize/synchronize these processes when doing a set of (non-atomic) operations in an atomic manner over, say, a filesystem or a database? 

3. How big is Node's worker thread-pool? Is its size configurable, or does it automatically scale up/down with load? When this thread pool is full, say, due to OS limitations or due to configuration parameters (provided that is even possible), does Node block, throw an exception, or crash when yet another worker thread needs to be engaged?

4. Since I/O happens asynchronously in worker threads, it is possible for a single Node process to quickly/efficiently accept 1000s of incoming requests compared to something like Apache. But surely the outgoing responses for each of those requests will take their own time, won't they? For example, if an isolated request takes a minimum of 3 seconds to get serviced (with no other load on the system), then if concurrently hit with 5000 such requests, won't Node take A LOT of time to service them all? If this 3-second I/O task happens to involve exclusive access to certain resources, then it would take 5000 x 3 sec = 15000 sec, or over 4 hours of waiting to see the response to the last request come out of the app. In such scenarios, would it be correct to say that a single-process Node configuration can handle 1000s of requests per second, especially compared to Apache (akin to the nginx-Apache benchmark), when all that Node may be doing is putting the requests on hold till they get serviced? I'm asking this because as I'm reading up on Node I'm often hearing how Node can address the C10K problem without any co-mention of any special application setup.

5. What about the context-switching overhead of the workers in the thread pool? If C10K requests hit a Node-based application, won't the workers in the thread pool end up context-switching just as much as Apache's regular thread pool? Because all that would have happened in Node's main thread would be a quick request-parsing and -routing, with the remainder (or, the bulk) of the processing still happening in some thread somewhere. That is, does it really matter (as far as minimization of thread context-switching is concerned) whether a request/response is handled from start to finish in a single thread (in the manner of Apache), or happens in a Node-managed worker thread with only minimal work (the request parsing and routing) subtracted from it?



> Is there any official documentation on Node Internals, addressing questions such as the following:
>
> 1. How is the priority of the main thread (which initiates the asynchronous I/O, creates servers, etc.) handled vis-a-vis the event-handler thread that calls the callbacks? Do I have to know about this aspect of things as a Node programmer?
That's the same thread: the main thread runs in a loop, spawning IO (which ends up in a background thread pool if it's to the filesystem, because Unix has broken async IO APIs for non-network IO) and looping until there's an event ready, at which point it calls the callback.

The background thread pool priority isn't terribly interesting because it's IO-bound, so there's never CPU scheduling going on for it in any useful sense. You can model the whole thing as single threaded execution with parallel IO. You really don't have to know about threads unless you're diving under the hood.
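
To make that concrete, here's a minimal sketch (core fs only) of what that single-threaded model looks like from JavaScript:

    // fs.readFile hands the read to libuv's background thread pool; the
    // main thread keeps running and invokes the callback on a later turn
    // of the event loop, once the IO has completed.
    var fs = require('fs');

    fs.readFile('/etc/hosts', 'utf8', function (err, data) {
      if (err) throw err;
      console.log('read ' + data.length + ' chars'); // runs second, same main thread
    });

    console.log('read scheduled; the main thread never blocked'); // runs first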

> 2. To use multi-cores, I can have multiple Node processes, sure. But how do I serialize/synchronize these processes when doing a set of (non-atomic) operations in an atomic manner over, say, a filesystem or a database?
Locking! Filesystem locks provided by your OS, or a distributed lock service. A proper distributed lock service lets you scale across machines, not just CPUs, of course.

And then if you can design things to be stateless or use data types like CRDTs, you may be able to avoid locking for some tasks.
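
For the single-machine case, here's a sketch of what that looks like, assuming a lock-file module with a lock(path, opts, cb) / unlock(path, cb) API (the lockfile package on npm is one such):

    var fs = require('fs');
    var lockfile = require('lockfile'); // API assumed as described above

    // Only one process at a time gets past lock(); others wait up to 5s.
    lockfile.lock('counter.lock', { wait: 5000 }, function (err) {
      if (err) throw err; // couldn't acquire the lock in time

      // This read-modify-write is non-atomic on its own;
      // the lock is what makes it safe across processes.
      var n = parseInt(fs.readFileSync('counter.txt', 'utf8'), 10);
      fs.writeFileSync('counter.txt', String(n + 1));

      lockfile.unlock('counter.lock', function (err) {
        if (err) throw err;
      });
    });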

> 3. How big is Node's worker thread-pool? Is its size configurable, or does it automatically scale up/down with load? When this thread pool is full, say, due to OS limitations or due to configuration parameters (provided that is even possible), does Node block, throw an exception, or crash when yet another worker thread needs to be engaged?
Nope: It's fixed-size, not configurable. If you overflow it, operations are queued. There's a pending IO queue, and a queue of events to process on the way back. It's really pretty invisible, just a shim to do asynchronous IO with synchronous filesystem IO primitives. Thanks, Unix!

> 4. Since I/O happens asynchronously in worker threads, it is possible for a single Node process to quickly/efficiently accept 1000s of incoming requests compared to something like Apache. But surely the outgoing responses for each of those requests will take their own time, won't they? For example, if an isolated request takes a minimum of 3 seconds to get serviced (with no other load on the system), then if concurrently hit with 5000 such requests, won't Node take A LOT of time to service them all? If this 3-second I/O task happens to involve exclusive access to certain resources, then it would take 5000 x 3 sec = 15000 sec, or over 4 hours of waiting to see the response to the last request come out of the app. In such scenarios, would it be correct to say that a single-process Node configuration can handle 1000s of requests per second, especially compared to Apache (akin to the nginx-Apache benchmark), when all that Node may be doing is putting the requests on hold till they get serviced? I'm asking this because as I'm reading up on Node I'm often hearing how Node can address the C10K problem without any co-mention of any special application setup.
That depends: Is that 3-second IO able to run concurrently? Your question stipulates "exclusive access to certain resources", so yes, that sounds like you'd have to serialize. With that constraint, nothing could be quick about it. It's an artificial restriction you don't run into very often. Node would accept all the requests, but you'd respond to them serially.

Node's remarkably efficient. Without that restriction, compare it to nginx, which is also single-threaded, rather than Apache, which is complicated. If you're doing little on the CPU for each request, it can accept a great many connections very quickly; it will queue the IO, and it'll shovel data to the connections as fast as possible.
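
A toy illustration of why: 5000 independent 3-second waits (timers standing in for concurrent IO) all overlap, so the batch finishes in roughly 3 seconds of wall-clock time, not 5000 x 3 seconds:

    var pending = 5000;
    var start = Date.now();

    for (var i = 0; i < pending; i++) {
      // Each "request" waits 3 seconds without occupying the main thread.
      setTimeout(function () {
        if (--pending === 0) {
          console.log('all 5000 done in ' + (Date.now() - start) + ' ms');
        }
      }, 3000);
    }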

> 5. What about the context-switching overhead of the workers in the thread pool? If C10K requests hit a Node-based application, won't the workers in the thread pool end up context-switching just as much as Apache's regular thread pool? Because all that would have happened in Node's main thread would be a quick request-parsing and -routing, with the remainder (or, the bulk) of the processing still happening in some thread somewhere. That is, does it really matter (as far as minimization of thread context-switching is concerned) whether a request/response is handled from start to finish in a single thread (in the manner of Apache), or happens in a Node-managed worker thread with only minimal work (the request parsing and routing) subtracted from it?
Ignore the thread pool: network IO is non-blocking. It all runs in a single thread, start to finish. Maybe node will move a few expensive things into threads and handle them as events eventually -- like parsing HTTP headers -- but those are small tweaks at C10K scale. You really can model node as single threaded and be quite accurate for all but the most esoteric use cases or details.

C10K is an interesting problem for some servers, because event-notification mechanisms that cost O(n) in the number of connections start dominating the time. The Unix select() call fails here, and you need something like epoll. libuv under the hood uses epoll or equivalents, not select. C10K is not generally about having 10,000 requests come in at once, but 10,000 concurrently connected clients.

The next barrier is C100K. At that scale the CPU time to handle requests starts to dominate the time, OS socket limits start hampering you, the ephemeral ports to assign to connections start running out so you need multiple IP addresses, and the data structures at every level start becoming really important. People have pushed node that far with success.

Give it a try -- run a tool like apachebench (ab) or wrk against an HTTP server using node, and you can start seeing about how it responds.
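
A server about as small as they come is enough to start with (the file name is just an example):

    // hello.js -- a minimal HTTP server to benchmark.
    var http = require('http');

    http.createServer(function (req, res) {
      res.writeHead(200, { 'Content-Type': 'text/plain' });
      res.end('hello\n');
    }).listen(8000);

Then something like "ab -n 10000 -c 100 http://127.0.0.1:8000/" or "wrk -c 100 -d 10s http://127.0.0.1:8000/" gives you a first feel for its throughput and latency under concurrency.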



Aria already sent a very thorough and detailed response (thank you!), but I'd just like to comment on a couple of small details.


> On 6 Dec 2014, at 03:03, Harry Simons <simonsharry@gmail.com> wrote:
>
> I'm new to Node. It looks very wonderful so far, I must say.

Welcome! Node is indeed a pleasure to play and work with!
> Is there any official documentation on Node Internals, addressing questions such as the following:

As far as I know, there's no central documentation on how Node works under the hood.

However, there are several resources that can help you get the big picture. One of the fundamental components of Node.js is libuv, and you can find an introduction here: http://nikhilm.github.io/uvbook/introduction.html. V8 is another base component, and you can find more information about it here: https://code.google.com/p/v8/.

As for Node.js itself, there is no documentation on its implementation that I know of. But since the code is rather small, you should be able to find answers to your questions by reading it. If you have any questions, we'll be more than happy to help answer them.
> 3. How big is Node's worker thread-pool? Is its size configurable, or does it automatically scale up/down with load? When this thread pool is full, say, due to OS limitations or due to configuration parameters (provided that is even possible), does Node block, throw an exception, or crash when yet another worker thread needs to be engaged?

> Nope: It's fixed-size, not configurable. If you overflow it, operations are queued. There's a pending IO queue, and a queue of events to process on the way back. It's really pretty invisible, just a shim to do asynchronous IO with synchronous filesystem IO primitives. Thanks, Unix!

The thread pool is actually configurable with the "UV_THREADPOOL_SIZE" environment variable. You can find more information about how it's implemented here: https://github.com/libuv/libuv/blob/v1.x/src/threadpool.c#L142. However, I have never changed its value, and I'm not sure this is a recommended solution when the thread pool is overloaded.
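
For example, to run a (hypothetical) app.js with a 16-thread pool instead of libuv's default of 4:

    $ UV_THREADPOOL_SIZE=16 node app.js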

In addition to filesystem I/O, the thread pool is used for name resolution done with calls to dns.lookup (which uses getaddrinfo on POSIX OSes and GetAddrInfoW on Windows, both synchronous blocking calls). Some Node users run into issues when dns.lookup calls take a long time or time out, because they can prevent other operations (like filesystem operations) from running on the fixed-size thread pool. A good example of such an issue can be found here: https://github.com/joyent/node/issues/2868.

It would be tempting in this case to increase the value of the "UV_THREADPOOL_SIZE" environment variable, but I believe a better alternative is first to investigate why dns.lookup calls take a long time or fail, and/or to use dns.resolve. dns.resolve uses asynchronous I/O and thus doesn't run on the thread pool. Other alternatives can be found in the above-mentioned issue, and elsewhere. Keep in mind, though, that dns.resolve doesn't behave like most other applications running on the same system, since it doesn't use getaddrinfo (or GetAddrInfoW on Windows). It also always does a DNS query over the network, which may or may not be desirable.
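
To see the two side by side (both are core dns APIs; the hostname is just an example):

    var dns = require('dns');

    // dns.lookup calls getaddrinfo on a libuv thread-pool thread, so a
    // slow resolver here can starve filesystem work queued on the pool.
    dns.lookup('nodejs.org', function (err, address) {
      if (err) throw err;
      console.log('lookup: ' + address);
    });

    // dns.resolve4 does a DNS query over the network with non-blocking
    // IO (via c-ares) and never occupies a thread-pool slot.
    dns.resolve4('nodejs.org', function (err, addresses) {
      if (err) throw err;
      console.log('resolve4: ' + addresses.join(', '));
    });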


I hope this helps, and please feel free to ask further questions!



> Aria already sent a very thorough and detailed response (thank you!), but I'd just like to comment on a couple of small details.
Awesome! I stand corrected about being able to configure the size of the thread pool. Good to know!

And thank you for taking the time to chime in -- some great info and references there!



Aria, Julien - those were very enlightening answers! Looks like I'll be busy for a few months now digging into this area. I'm incidentally new to web application development, too. Though I'll be making Node/Express my first web application development platform/framework, I won't have the benefit of any rich, first-hand experience of the history leading up to Node :-(

Maybe a last question, a very basic one?

> C10K is not generally about having 10,000 requests come
> in at once, but 10,000 concurrently connected clients.
Node or not, is there any point in getting (and remaining) connected to a web app with extremely large latencies/response times, versus waiting in a queue to get connected (with the attempt maybe even timing out in the process)? Basically, if I've understood it correctly, I see no merit in claiming to my users that my app can support C10K clients when, at the end of the day, the app becomes too slow to be usable by those 10K users, forcing me to look for standard alternative platforms and architectures outside Node to scale up and out, whatever they happen to be. (I don't know at this point what these scale-friendly platforms/architectures are, but I assume they exist and people are already using them.)



> Aria, Julien - those were very enlightening answers! Looks like I'll be busy for a few months now digging into this area. I'm incidentally new to web application development, too. Though I'll be making Node/Express my first web application development platform/framework, I won't have the benefit of any rich, first-hand experience of the history leading up to Node :-(
>
Yeah, it's a lot to dig into -- but largely, to write applications on node, you don't need to know it. The model of node as a single-threaded non-blocking-IO process holds up remarkably deep, though there are details.

> Maybe a last question, a very basic one?
>
> > C10K is not generally about having 10,000 requests come
> > in at once, but 10,000 concurrently connected clients.
>
> Node or not, is there any point in getting (and remaining) connected to a web app with extremely large latencies/response times, versus waiting in a queue to get connected (with the attempt maybe even timing out in the process)? Basically, if I've understood it correctly, I see no merit in claiming to my users that my app can support C10K clients when, at the end of the day, the app becomes too slow to be usable by those 10K users, forcing me to look for standard alternative platforms and architectures outside Node to scale up and out, whatever they happen to be. (I don't know at this point what these scale-friendly platforms/architectures are, but I assume they exist and people are already using them.)
Yeah. If you do much processing for each connection, you can totally spend enough time on CPU that the latency is high. However, people routinely do 10,000-client apps with a single process using node, and I've certainly seen a demo of a 500,000-client m:n message-passing app using multiple processes (perhaps multiple machines -- I don't recall at this point).

Node may be that platform for you.




> > 2. To use multi-cores, I can have multiple Node processes, sure. But how do I serialize/synchronize these processes when doing a set of (non-atomic) operations in an atomic manner over, say, a filesystem or a database?

> Locking! Filesystem locks provided by your OS, or a distributed lock service. A proper distributed lock service lets you scale across machines, not just CPUs, of course.

> And then if you can design things to be stateless or use data types like CRDTs, you may be able to avoid locking for some tasks.


Afterthought: It would be nice if Node could support a distributed lock service out of the box, thereby making its cluster feature "more complete". Meaning: if I can spawn, fork, etc. from within Node, then I should also be able to coordinate the work of these new processes using the Node JS API itself.



> Afterthought: It would be nice if Node could support a distributed lock service out of the box, thereby making its cluster feature "more complete". Meaning: if I can spawn, fork, etc. from within Node, then I should also be able to coordinate the work of these new processes using the Node JS API itself.
No need to put it in the box: We have npm and a great registry full of things to use. Maybe the lockserver module?
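
That said, the coordination primitive node does ship with -- fork plus message passing in the cluster module -- covers a lot without any lock service. A minimal sketch:

    var cluster = require('cluster');

    if (cluster.isMaster) {
      // The master hands out work and hears back over the built-in IPC channel.
      var worker = cluster.fork();
      worker.on('message', function (msg) {
        console.log('worker finished job ' + msg.job);
        worker.kill();
      });
      worker.send({ job: 1 });
    } else {
      process.on('message', function (msg) {
        // ...do the actual work here, then report back to the master.
        process.send({ job: msg.job });
      });
    }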




