Wednesday, February 18, 2009

why cache comments...

Just some quick notes about comments made recently… They are great, and I can see that more clarification of my postings is in order…

(1) Yes, the math is very simple, as was the comment that adding 2-50 ms is no big deal for performance. After seeing that comment, I wanted to make sure people understood the possible impact. Yes, that impact is relative to your environment, architecture, and deployment (that's always true).

The previous blog postings refer to using a virtual directory primarily as a proxy tool, meaning you keep the underlying data structure relatively intact. In that case, I agree that a persistent cache would be an almost bizarre approach.

So, allow me to make the point I should have made to begin with… Sometimes you want to represent information in a way that is significantly different from the way it is currently stored. This means creating new views of existing data, for example across multiple database sources and tables. That can involve multiple joins, which are costly in terms of processing. As Mark Wilcox mentioned, there are other tools available for solving these problems; some databases support materialized hierarchical views of data for exactly this reason. It is also possible to solve this type of problem with some virtual directories, but you need a persistent cache; doing the joins dynamically will be too slow for many applications.
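
To make that concrete, here is a minimal Python sketch of the idea (all names, schemas, and connections are hypothetical, invented purely for illustration): the expensive multi-source join runs once at refresh time, and the merged view is persisted locally, so individual lookups never touch the joins.

    import sqlite3

    def rebuild_person_view(hr_conn, phone_conn, cache_path="person_view.db"):
        """Precompute a joined view of two (hypothetical) sources into a persistent cache."""
        cache = sqlite3.connect(cache_path)
        cache.execute("DROP TABLE IF EXISTS person_view")
        cache.execute("CREATE TABLE person_view (uid TEXT PRIMARY KEY, name TEXT, phone TEXT)")
        # The costly cross-source join happens here, once, at refresh time.
        names = dict(hr_conn.execute("SELECT uid, name FROM employees"))
        phones = dict(phone_conn.execute("SELECT uid, number FROM phones"))
        rows = [(uid, name, phones.get(uid)) for uid, name in names.items()]
        cache.executemany("INSERT INTO person_view VALUES (?, ?, ?)", rows)
        cache.commit()
        return cache

    def lookup(cache, uid):
        """Per-query path: a single-table read against the cache, no joins at all."""
        return cache.execute("SELECT uid, name, phone FROM person_view WHERE uid = ?", (uid,)).fetchone()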

(2) I never argued that it was OK to wait hours for updated information. The fact is that many organizations CURRENTLY have situations where updates can take hours or even a day to synchronize. I was proposing a solution that would do the same thing within one second… Thanks for letting me clarify that point!

Hopefully this clarifies things a bit, and thanks for the dialog!

Thursday, February 12, 2009

why cache and virtual directories???

I know I have mentioned this before, but since there’s an ongoing ping-pong match about cache — particularly “persistent cache” — playing out on the identity blogs lately, I thought I’d return the volley. Ashraf Motiwala covered the topic again in his blog yesterday, so here are my two cents.

First off, while it’s true there are times when cache makes no sense, there are other times when it really does. Cache is used everywhere: in your PC, in software, in servers, EVERYWHERE. So arguing against cache seems completely strange to me.

Second, I always find it funny to hear arguments against having more options. Why argue against choice and options? I can cite many projects that have been deployed using virtual directories with caching, and yes, this includes "persistent" cache.

Third, cache is necessary because when you merge (join) multiple tables across different databases (or directories, for that matter), the results are just not fast enough for any type of security application. Anyone familiar with databases will understand this quickly: once you join several objects or tables, the response rate of the source drops dramatically. The joins necessary to create views are sometimes too complex to perform on the fly for most directory-enabled applications, such as those common in IdM/security. This, in my mind, is a key function of virtual directories, after aggregating sources under a common protocol.

Fourth, 2 to 5 milliseconds can be a big deal, and cache is essential to eliminating that lag. Think about it: if I have to search for a member in a directory, and then search a database table for additional attributes to join to this object, do you really think it will perform at close to the same speed? And that is with just two sources... imagine the performance hit you’d take by adding additional sources and multiple join operators.
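
As a rough illustration (the timings and source names below are made up, not measurements), each extra source adds its own round trip to every single query, so the per-query cost is roughly additive:

    import time

    def fetch_from_directory(uid):
        time.sleep(0.0002)  # pretend the directory search takes 0.2 ms
        return {"uid": uid, "cn": "Example User"}

    def fetch_from_database(uid):
        time.sleep(0.002)   # pretend the database lookup adds 2 ms (the "overhead")
        return {"dept": "Finance"}

    def joined_lookup(uid):
        entry = fetch_from_directory(uid)       # first source: one wait
        entry.update(fetch_from_database(uid))  # second source: a second wait
        return entry

    start = time.perf_counter()
    joined_lookup("jdoe")
    print(f"one joined query: {(time.perf_counter() - start) * 1000:.2f} ms")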

When your directory is expected to perform at 8000 queries a second, adding 2 to 5 milliseconds can be a VERY big deal. OK, let's keep the math simple (call it 5000 queries a second) and take a closer look at what the problem is…

  • I have a directory that performs at 5000 q/sec
  • That translates into 0.2 milliseconds per query (1 second ÷ 5000)
  • Your "overhead" is 2 milliseconds (the best performance cited)
  • My queries now take 2.2 milliseconds (11 times slower)
  • Now, instead of 5000 q/sec, when I access my directory I get only about 455 q/sec (see the quick check below)
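
A quick check of that arithmetic (assuming queries are handled strictly one after another on a single connection):

    # Throughput math from the bullets above, assuming serial queries.
    base_qps = 5000
    base_ms = 1000 / base_qps            # 0.2 ms per query
    overhead_ms = 2                      # best-case join overhead cited
    total_ms = base_ms + overhead_ms     # 2.2 ms per query
    print(round(total_ms / base_ms, 1))  # 11.0 -> 11 times slower
    print(round(1000 / total_ms))        # 455  -> queries per second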

It would be hard to call this a "minimal" performance hit, and there are many initiatives where this kind of slowdown would be totally unacceptable. This is actually a perfect case where a persistent cache would be helpful: it could easily bring the query rate back up to 5000 q/sec (or higher), even for more complex operations involving more than two sources and more than one join.

Fifth, the idea that you compromise the "freshness" of data for the sake of speed misses the point about what sort of information we’re dealing with here. We’re dealing mostly with identity data in directories (people and other objects), and identities themselves do not change very often compared to other kinds of data, such as transactions, where updates and writes are more common than searches and queries. For example, in your bank account, your “identity” information (name, address, phone, PIN, passwords, etc.) changes far less often than your balance and activity.

The underlying idea is correct: a cache will create a lag before updates become available to client applications. BUT virtual directory implementations that use a persistent cache with event-detection cache-refresh mechanisms offer (near) real-time incremental updates of information.
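
A minimal sketch of that refresh pattern (everything here is hypothetical; a real deployment would feed the queue from something like a database trigger, a changelog reader, or an LDAP persistent search):

    import queue
    import threading

    cache = {}               # uid -> entry: the persistent cache contents
    changes = queue.Queue()  # stands in for an event-detection feed

    def apply_changes():
        """Consume change events and patch the cache incrementally."""
        while True:
            event = changes.get()
            if event["op"] == "delete":
                cache.pop(event["uid"], None)
            else:  # add or modify
                cache.setdefault(event["uid"], {}).update(event["attrs"])
            changes.task_done()

    threading.Thread(target=apply_changes, daemon=True).start()

    # A source-side hook would enqueue events as changes happen:
    changes.put({"op": "modify", "uid": "jdoe", "attrs": {"status": "disabled"}})
    changes.join()           # the update is visible well under a second later
    print(cache["jdoe"])     # {'status': 'disabled'}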

Furthermore, if an account is disabled and it takes one second to propagate that change to every cache instance, that is an improvement: many organizations are currently taking several minutes, or even a full 24 hours, for this type of update.

Well, those are my 5 cents' worth of comments... I know I promised only 2, but who's counting pennies? It's the milliseconds that count, right? :)

Oh, btw, no one has mentioned the distributed remote persistent cache story, which I have seen implemented with virtual directories; now we’re talking about some serious advantages... If anyone is interested, I would be happy to participate in such a discussion...

Metadata and Integration

Wow, nice to see someone articulate the problem. Too many vendors and architects use too diverse a language for solutions to converge right now... David puts his finger right on it here... We have to start understanding the metadata and semantic relationships in our systems (especially for security), in a way that is scalable...

Check out David Linthicum's posting...