Thursday, February 12, 2009

why cache and virtual directories???

I know I have mentioned this before, but since there’s an ongoing ping-pong match about cache — particularly “persistent cache” — playing out on the identity blogs lately, I thought I’d return the volley. Ashraf Motiwala covered the topic again in his blog yesterday, so here are my two cents.

First off, while it's true there are times when cache makes no sense, there are other times when it really does. Cache is used everywhere, in your PC, in software, on servers, EVERYWHERE, so arguing against cache seems completely strange to me.

Second, I always find it funny to hear the arguments against having more options. Why argue against choice and options? I can cite many projects that have been deployed with caching using virtual directories, and yes, this includes "persistent" cache. 

Third, cache is necessary because when you merge (join) multiple tables across different databases (or directories, for that matter), the results are just not fast enough for any type of security application. Anyone familiar with databases will understand this quickly: once you join several objects or tables, the response rate of the source drops dramatically. The joins needed to create views are sometimes too complex to do on the fly for most directory-enabled applications, which is exactly the common situation in IdM/security. This, in my mind, is a key function of virtual directories beyond aggregating sources behind a common protocol.
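
To make that concrete, here is a minimal sketch (in Python, with made-up attribute names and values, not any particular product's API) of what a joined, or "virtual", entry looks like once a directory entry and a database row are merged on a shared key:

    # Illustrative only: a virtual directory presents one joined entry built
    # from two sources keyed on the same identifier.
    directory_entry = {"uid": "jsmith", "cn": "John Smith", "mail": "jsmith@example.com"}
    database_row = {"employee_id": "jsmith", "department": "Finance", "clearance": "L2"}

    def join_entry(ldap_entry, db_row):
        """Merge attributes from both sources into a single virtual entry."""
        virtual_entry = dict(ldap_entry)
        virtual_entry.update({k: v for k, v in db_row.items() if k != "employee_id"})
        return virtual_entry

    print(join_entry(directory_entry, database_row))
    # one entry carrying attributes from both sources

Merging one entry like this is cheap; doing it per search, against live backends, across many entries and sources is where it gets expensive.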

Fourth, 2 to 5 milliseconds can be a big deal, and cache is essential to eliminating that lag. Think about it: if I have to search for a member in a directory, and then search a database table for additional attributes to join to that object, do you really think it will perform at close to the same speed? And that is with just two sources... imagine the performance hit you'd take by adding additional sources and multiple join operators.

When your directory is expected to perform at 8000 queries a second, adding 2 to 5 milliseconds can be a VERY big deal. OK, let's keep the math simple and take a closer look at what the problem is…

  • I have a directory that performs at 5000 q/sec
  • That translates into 0.2 milliseconds per query (1/5000 of a second)
  • Your "overhead" is 2 milliseconds (the best performance cited)
  • My queries now take 2.2 milliseconds (11 times slower)
  • Now, instead of 5000 q/sec, when I access my directory I get only about 455 q/sec
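
If you want to check that arithmetic, here it is spelled out in a few lines of Python. It uses the same simple serial model as the bullets above, with assumed numbers rather than measurements:

    # Simple serial model: every query pays the full lookup plus the join overhead.
    baseline_qps = 5000                      # directory alone
    per_query_ms = 1000.0 / baseline_qps     # 0.2 ms per query
    overhead_ms = 2.0                        # best-case join overhead cited above

    effective_ms = per_query_ms + overhead_ms    # 2.2 ms per query
    effective_qps = 1000.0 / effective_ms        # ~455 queries per second

    print(round(effective_ms, 1), "ms per query")        # 2.2
    print(round(effective_qps), "queries per second")    # 455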

Some people might argue that this is not a "minimal" performance hit, and there are many initiatives where this kind of speed would be totally unacceptable. This is actually a perfect case where a persistent cache would be helpful: it could easily bring the query rate back up to 5000 q/sec (or higher), even for more complex operations involving more than two sources and more than one join.
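
As a rough sketch of the persistent-cache idea (illustrative Python only, not any particular product's implementation): the expensive multi-source join is paid when the cache is built or refreshed, and every query after that is a single local lookup.

    # Hypothetical persistent cache of pre-joined entries.
    persistent_cache = {}   # in practice a local, disk-backed store

    def fetch_from_directory(uid):
        # Stand-in for an LDAP search against the backend directory.
        return {"uid": uid, "cn": "John Smith", "mail": uid + "@example.com"}

    def fetch_from_database(uid):
        # Stand-in for a SQL lookup of additional attributes.
        return {"department": "Finance", "clearance": "L2"}

    def refresh_entry(uid):
        """Slow path: rebuild one joined entry from the backend sources."""
        persistent_cache[uid] = {**fetch_from_directory(uid), **fetch_from_database(uid)}

    def lookup(uid):
        """Fast path: answer from the cache, rebuilding only on a miss."""
        if uid not in persistent_cache:
            refresh_entry(uid)
        return persistent_cache[uid]

    print(lookup("jsmith"))   # the first call pays the join; later calls stay local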

Fifth, the idea that you compromise the "freshness" of data for the sake of speed misses the point about what sort of information we're dealing with here. We're dealing mostly with identity data in directories (people and other objects), and the identities themselves do not change very often compared to other data, such as transactions, where updates and writes are more common than searches/queries. For example, in your bank account, your "identity" information (name, address, phone, PIN, passwords, etc.) changes far less often than your balance and activity.

The idea is correct: cache will create a lag before updates are available to client applications. BUT virtual directory implementations that use persistent cache with event-detection cache-refresh mechanisms offer (near) real-time incremental updates of information.
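
Here is a small sketch of that refresh pattern (the event format is made up; in practice the trigger would be a directory changelog, a database trigger, or similar). Each change event updates only the affected cached entry instead of rebuilding the whole cache:

    # Event-driven incremental refresh of a persistent cache (illustrative only).
    persistent_cache = {"jsmith": {"uid": "jsmith", "status": "active"}}

    def apply_change_event(event):
        """Apply one change event to the cached copy of the affected entry."""
        uid = event["uid"]
        if event["type"] == "delete":
            persistent_cache.pop(uid, None)
        else:  # "add" or "modify"
            persistent_cache.setdefault(uid, {"uid": uid}).update(event["changes"])

    # For example, the account-disable case mentioned below:
    apply_change_event({"type": "modify", "uid": "jsmith",
                        "changes": {"status": "disabled"}})
    print(persistent_cache["jsmith"])   # {'uid': 'jsmith', 'status': 'disabled'}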

Furthermore, if an account is disabled and it takes 1 second to push that change to all cache instances, that would be an improvement. Many organizations currently take several minutes, or even a full 24 hours, for this type of update to propagate.

Well, those are my 5 cents worth of comments... I know I promised only 2, but who's counting pennies, just milliseconds, right? :)

Oh, btw, no one has mentioned the distributed remote persistent cache story, which I have seen implemented with virtual directories — now we’re talking about some serious advantages... If anyone is interested, I would participate in such a discussion... 

4 comments:

Phil Hunt said...

You are forgetting the role of a virtual directory proxy server as a load-balancer and multiplexor.

In practice, performance doesn't simply add. 5000 q/sec does not degrade to 500 q/sec. 2 ms per transaction does not impact the collective throughput of the server.

Different vendors have different internal architectures. So a persistent cache will always be limited by the performance of that single data store.

A load-balancer is limited only by the ideal maximum load of the aggregate of all servers it proxies. This is why caching within a proxy (of either kind) holds little value in practice.

Anonymous said...

I believe that your math is somewhat misleading with regard to the performance hit resulting from the use of a VDS. I have addressed this in a blog posting of my own.

Anonymous said...
This comment has been removed by a blog administrator.
TPS said...

response to comments: http://identityinfrastructure.blogspot.com/2009/02/why-cache-comments.html