Monday, October 29, 2007

Ruby on.... jetty?

So I'm playing around with JRuby, and I think it's really cool. The Java community has come a long way and created some great tools for enterprise deployments, so why should the Ruby community spend time backtracking and re-inventing the wheel? The JVM has become a proven execution environment, and with an open bytecode standard it's trivial to compile to it (well, not trivial, but doable).

So with JRuby you can get the syntactic sugar of Ruby with the heavy lifting of such Java services as JMS, Hibernate, and even servlet containers. Currently Ruby on Rails uses either WEBrick or Mongrel as its web server. Both are written in Ruby and, well, slow. If there is one thing the Java community has done well the past few years, it's building fast and reliable servlet containers (just look at Tomcat and Jetty).

What I'm wondering is: how hard would it be to replace Mongrel/WEBrick with Jetty in a JRuby on Rails app? Just think of the performance gain with Jetty's NIO engine.

Sequel to the rescue

No sooner had I pushed the "post" button on my last post than I found Sequel. It's a lightweight ORM in Ruby that looks promising. It does all the association fetching and even has second-level cache support.


http://code.google.com/p/ruby-sequel/

Active Record, the sludge in Ruby

So I've been learning Ruby lately, and being a big ORM guy I decided to check out the source for Active Record and see how it works. There are some very serious issues here. Active Record uses Ruby's dynamic mixin ability to do its work, but that same flexibility kills its performance. To explain further, we should compare the Active Record model with JPA.

JPA
The Java Persistence API delegates all ORM work to the persistence manager, an external library that manages DB entities. If I want an entity, I ask the persistence manager for it; if I want to save, I do it through the persistence manager. When my app starts up, the persistence manager looks at all my entities and my schema and creates the necessary object structure to do its ORM magic. The mapping/querying logic is built once and used by everyone.

ActiveRecord
Active Record delegates all ORM work to the entity itself. Entities must know how to query/save themselves and whether they are transient or persistent. They do this by mixing methods into the Ruby-defined entities at runtime; then, when an instance of an entity is instantiated, they create the necessary links to do the ORM work.

The disadvantage of Active Record is that the ORM responsibilities are spread across however many entities you have. It becomes hard to do things like locking, transactional caching, and even transactional enlistment. If I have 100 entities, each one has to be concerned with cache invalidation and managing its relationships.
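The architectural difference between the two models can be sketched in a few lines of plain Java. These are toy classes, not the real JPA or Active Record APIs; the point is only where the persistence logic lives: in one external manager, or duplicated inside every entity.

```java
import java.util.HashMap;
import java.util.Map;

// JPA style: one external manager owns persistence for every entity,
// so locking/caching logic has a single home.
class PersistenceManager {
    private final Map<Integer, String> store = new HashMap<>();
    void save(int id, String entity) { store.put(id, entity); }
    String find(int id)              { return store.get(id); }
}

// Active Record style: each entity class carries its own persistence
// logic, so cross-cutting concerns are repeated per entity.
class ActiveRecordPerson {
    static final Map<Integer, ActiveRecordPerson> TABLE = new HashMap<>();
    final int id;
    ActiveRecordPerson(int id) { this.id = id; }
    void save() { TABLE.put(id, this); }                  // entity persists itself
    static ActiveRecordPerson find(int id) { return TABLE.get(id); }
}

public class OrmStyles {
    public static void main(String[] args) {
        PersistenceManager pm = new PersistenceManager();
        pm.save(1, "alice");                   // entity is passive; manager does the work
        System.out.println(pm.find(1));

        new ActiveRecordPerson(2).save();      // entity does the work itself
        System.out.println(ActiveRecordPerson.find(2).id);
    }
}
```

With one entity the difference looks cosmetic; with a hundred, the Active Record style means a hundred copies of the same concerns.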


Until the Ruby community either fixes this in Active Record or comes out with a JPA/Hibernate-like ORM, it will never be in the spotlight. If you talk to any Ruby evangelist, they always bring up that Twitter uses RoR. And that is true, but if you read any of their posts, to scale, Twitter had to remove ActiveRecord from their deployment because it was just too slow.

So I think Ruby is cool, but I really don't like Active Record. Anyone interested in porting Hibernate to Ruby?

Friday, October 26, 2007

Netbeans and Ruby

I was working on a Ruby project I'll be revealing later, and I was looking for an IDE. I thought I could get by using vi, but after a few years of using Java with Eclipse I've been spoiled. After a little looking I found the NetBeans 6 Beta. It has a Ruby package that is fantastic! It contains auto-completion, tooltips, and helpers. Make sure you download the version with the Ruby package installed.

http://dlc.sun.com/netbeans/download/6.0/milestones/latest/


If you are writing Ruby code, it's well worth a look.

Sunday, October 21, 2007

C++ Garbage

So I've been having conversations with some of my friends about smart/auto pointers and C++. We all love Java and other more modern languages that have garbage collection, but there are times when you need to move back into C/C++ land, and rightfully so. I'm not one of those "Java/Ruby/Python is the language to replace all other languages" people; I use the best tool for the job.

So anyway, one of the most painful things about C++ is its absence of garbage collection. Sure, you can always create things on the stack and pass by value, but that has its own issues (both in performance and logic). Smart and auto pointers help but don't always give the performance that Java's GC does. One of the nice things about the GC in Java is that it is not tied to the owning thread: when an object no longer has any references to it, that thread doesn't block while the object is destroyed. The garbage collector runs in the background, even in parallel, smartly collecting objects that can be destroyed and only doing the work it thinks it can do without consuming too many resources. If I suddenly un-reference a million objects, or they go out of scope, the VM won't necessarily destroy all million right then. It will, over time, destroy them when it thinks it won't hurt performance.


So back to C++. Using C++ templates, it isn't difficult to do reference counting on your objects. So what we really need to implement is a singleton garbage collector/reference manager that keeps track of all managed objects, marks them for removal when they are no longer referenced, and then removes them when it thinks it's best.

Part 1. Singleton Reference Manager
So first we would need to create a singleton C++ class:

class ReferenceManager
{
public:
    static ReferenceManager* Instance();

protected:
    // Constructors and assignment are protected so clients
    // cannot create or copy instances themselves.
    ReferenceManager();
    ReferenceManager(const ReferenceManager&);
    ReferenceManager& operator=(const ReferenceManager&);
};


Now, if you have never done singletons in C++, the way we ensure that one and only one gets created/destroyed is by using a local static:

ReferenceManager* ReferenceManager::Instance()
{
    static ReferenceManager inst;
    return &inst;
}

This way, a local static instance is created the first time this method is called, and after that the same instance is used. It's also important to note that since inst is a local static, it will automatically be destroyed when the application terminates.

References can be obtained like this:

ReferenceManager *p1 = ReferenceManager::Instance();
ReferenceManager *p2 = p1->Instance();
ReferenceManager &ref = *ReferenceManager::Instance();

Part 2. Reference Counting:
In my next post I'll create my own smart pointer to do reference counting and communicate its state with this reference manager. I would do that now, but it's beautiful outside and the lawn needs mowing.

Thursday, October 18, 2007

Avoid locking with JMS

One of the hardest things to do in a high-load system is reporting. Sometimes the amount of data you have to keep track of is just too big.

Example:
I have a page that is going to be displayed, and I need to:
  1. Log every view of the page individually.
  2. Keep a running total of every time the page is viewed.
  3. Have a max number of times the page can be viewed, and stop showing the page when that has been reached.
Now let's assume that we need to keep all these values in a database so we can scale up our app server. For requirements 2 and 3 we need to keep an aggregate table with a column that represents a "counter". Without this table we would have to run an aggregation query against the log table (1) to see how many times the page has been viewed, and as our data grows that will get slow.

So we create an aggregate table that contains the page name and a counter of how many times that page has been displayed. Every time the page is viewed, we insert into the log table and update the counter in the aggregate table.

Locking is a b$$ch

There is a fundamental flaw with the above solution: if I get 100 concurrent page views, the database will make all those page views lock on that row in the aggregate table so it can provide ACID guarantees. This will significantly decrease our ability to scale. You could lower your transaction isolation level in your DB, but your aggregate's consistency may suffer.

JMS to the rescue

A good solution is to use JMS. When a page request comes in, we still insert a row into the log table, but within the same XA transaction we also publish a message onto a durable JMS destination. No row locking will occur. Now a consumer (either a message-driven bean or a plain JMS message consumer) consumes those messages with a durable subscription and updates the aggregate asynchronously. Who cares if those updates lock? They run on a completely different thread that isn't affecting the performance of serving the page.
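The shape of this pattern can be sketched in plain Java, with a BlockingQueue standing in for the durable JMS destination and a background thread playing the consumer. This is a toy model only; a real deployment would use an XA-enlisted JMS producer and a message-driven bean, not an in-process queue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncAggregator {
    // Stand-in for the durable JMS destination.
    static final BlockingQueue<String> destination = new LinkedBlockingQueue<>();
    // Stand-in for the counter row in the aggregate table.
    static final AtomicInteger pageViews = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        // Consumer: drains messages and updates the aggregate off the request path.
        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    String msg = destination.take();
                    if (msg.equals("STOP")) return;
                    pageViews.incrementAndGet();  // the only place the "row" is touched
                }
            } catch (InterruptedException e) { /* shutting down */ }
        });
        consumer.start();

        // Request path: log the view, publish a message, move on.
        // No contention on the counter here.
        for (int i = 0; i < 100; i++) {
            destination.put("page-viewed");
        }
        destination.put("STOP");
        consumer.join();
        System.out.println(pageViews.get());  // 100
    }
}
```

Because the queue is FIFO and durable (in the real JMS case), the aggregate eventually reflects every view without the request threads ever contending on it.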

Obviously it's possible to go over your max-views limit (there is a race: a page can be served before the message consumer has a chance to process and update the aggregate), but when you get to high-yield systems you have to start making trade-offs.

So the moral of the story: JMS, with its durability and async processing, can play a big role in high-yield systems. Next time you have a locking issue or find you are doing too many things during a request, try moving those things behind a durable JMS topic; it may save you some time!

Thursday, October 11, 2007

Python and SQLAlchemy

So if you read this blog, it's no secret I'm a fan of Hibernate. Not because I'm a Java evangelist, but because I have yet to see many of Hibernate's features in other ORMs. Hibernate has gone above and beyond to make an ORM that is focused not only on POJO-based database access but also on scalability and durability.

You can tune Hibernate's association fetching and second-level caching to meet your load needs and reduce how much of a bottleneck your database becomes. You can use projections to limit query scope. You can also enlist Hibernate in a JEE global transaction to make it durable and ACID compliant. Many other languages and ORMs seem to focus more on developer convenience than on these things.

In the Python world, my boss pointed me to SQLAlchemy. It's a Python ORM that borrows heavily from the design of Hibernate. I like what I see; the association mapping and querying are very similar and powerful.

One thing I see missing is the powerful second-level caching concept. Most people's response to this is, "Oh, use memcache." Well, sure, but then you lose the ACID properties of your database and ORM. Developers are tasked with first checking memcache for their entities, then going to the DB, and the same when saving; neither of these steps participates in the database transaction they are working in. However, after looking at SQLAlchemy a little more, I wonder how hard it would be to build transparent transactional enlistment of it with memcache.

Just think: you decorate your SQLAlchemy object with some sort of marker saying that you want it to participate in second-level caching. Then, along with setting up your persistence context (DB access, etc.), you also set up a memcache resource. Now when you access and save your SQLAlchemy entities, they are also pushed to and pulled from memcache behind the scenes in a way that honors the database's ACID transactions. This could do wonders for the scalability of Python.
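As a rough illustration of what "honoring the transaction" could mean, here is a toy sketch in Java of a transaction-aware cache: writes are buffered per transaction and only become visible in the shared cache on commit, so a rollback never leaks uncommitted state to other readers. None of this is SQLAlchemy or memcache API; it is just the buffering idea.

```java
import java.util.HashMap;
import java.util.Map;

// Toy transaction-aware second-level cache: puts are buffered until commit,
// so the shared cache never exposes state from a transaction that rolls back.
public class TxCache {
    private final Map<String, String> cache = new HashMap<>();   // stand-in for memcache
    private final Map<String, String> pending = new HashMap<>(); // uncommitted writes

    public void put(String key, String value) { pending.put(key, value); }

    public String get(String key) {
        // A transaction sees its own uncommitted writes first.
        return pending.containsKey(key) ? pending.get(key) : cache.get(key);
    }

    public void commit()   { cache.putAll(pending); pending.clear(); }
    public void rollback() { pending.clear(); }

    public static void main(String[] args) {
        TxCache c = new TxCache();
        c.put("person:1", "alice");
        System.out.println(c.cache.get("person:1"));  // null: not visible until commit
        c.commit();
        System.out.println(c.cache.get("person:1"));  // alice
    }
}
```

A real implementation would hook commit/rollback into the ORM's transaction lifecycle instead of calling them by hand, which is exactly the "transparent enlistment" being proposed.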

This smells like an open source project to me! Comment with ideas and directions.

Sunday, October 7, 2007

JEE isn't just about ORM

I had a conversation with a friend about JEE and how he thought it can be overkill. I should preface this by saying I do believe there are situations where JEE is overkill, which I won't go into. His argument was that if you don't need to access a database with an ORM, then you don't need JEE.

I strongly disagree with this, mostly because I do this all the time. I love Hibernate and EJB3 entity beans, but they don't work for all schemas. If you have a highly normalized relational schema where you can take advantage of second-level caching, then entity beans are a blessing. But if you are recording stats and doing more OLAP-style work, then you may want to stick with straight JDBC.

However, JEE works great with JDBC too. In fact most deployments I've done, including one I'm currently responsible for, use both. Take the following example:


public @Stateless class PersonStatsBean implements PersonStats {

    @Resource(mappedName="java:/LoginStatsDS")
    DataSource loginStats;

    @PersistenceContext
    EntityManager em;

    public void recordLogin(int personId) throws SQLException
    {
        Person person = em.find(Person.class, personId);
        Connection con = loginStats.getConnection();
        try {
            PreparedStatement stmt = con.prepareStatement(
                "insert into login_stats(person_id) values (?)");
            stmt.setInt(1, person.getId());
            stmt.executeUpdate();
        } finally {
            con.close();
        }
    }
}




In this example I am saving a row that keeps track of every time a user logs into my system. I have an EJB3 entity called Person that is managed by the entity manager, but I also inject a regular JDBC data source into my bean to record login stats. The login_stats table may be millions of rows long and denormalized for reporting efficiency, so it really doesn't map well to an ORM. The entity manager and the data source could be pointing to the same database or different ones; I don't care.

The cool thing is that even though I'm using both entity beans and JDBC, the application server makes sure they are both enlisted in the same XA transaction. So even though I ask the container for a regular JDBC data source, by the time I get it an XA transaction has already started and is being managed. I could save/update some EJB3 entities and execute some JDBC statements, and together they will be fully ACID compliant.

So yes, Matilda, you can have your EJB3 and JDBC too. By using JEE we are not tying ourselves to a predetermined set of abstractions. If I want an ORM, I get that... If I want to do straight JDBC, I get that also. And if I want to do both, I can have that too, AND they all play by the same ACID rules.


P.S. You'll notice that this method has to throw a checked exception. I wish JDBC would follow the path of EJB3 persistence and make its exception hierarchy extend RuntimeException rather than Exception. But that's just a picky observation, not a framework killer.