Archive for June, 2005

Come on DoubleClick, let’s get real

Thursday, June 23rd, 2005

Bennie Smith, privacy chief (whatever that is) of DoubleClick, seems to be a little out of touch with reality. In an interview, he warns browser makers against ad-blocking. He seems to think that browser plugins like Adblock will cause publishers to start charging for content.

His prognosis is a little off the mark if you ask me. There is still fantastic opportunity for Web-based ads. Yes, people are going to block banner ads. They are annoying and images are easy to block. No matter how convenient it is to provide banner ads, and no matter how convenient they have been in the past, people are quickly becoming less tolerant of them. Banner ads are becoming déclassé. But that doesn’t mean people are totally intolerant of ads. Web users don’t like in-your-face banner ads, but they seem to be fine with less invasive forms of advertisement. Take Google and Yahoo’s advertising strategy for example. Their ads are non-invasive, easy on the eyes, text-based, and cannot be blocked using Adblock. Take a lesson, DoubleClick; the last time I checked, Google and Yahoo were still free and were making a ton of money off of advertising. There is a change in what people are willing to accept, and if DoubleClick doesn’t adapt to the market demands, it will lose in the long run.

As I wrote in another blog entry (Banner-Ad-Free Syndication), companies like FeedBurner can learn something from this too. I suggested that advertisements in feeds should be in the form of individual feed entries and should remain text-based. A reader is more likely to read a text-based ad that appears as a blog entry than they are a banner ad. The typical user auto-filters banner ads in their brain.

Dick Costolo from FeedBurner commented on my blog entry, saying that they choose banner ads over text ads because they don’t want to affect search engine results. He argues that the ad text would mess up the search engine index. I think this is an admirable concern, but I’m not so sure that it is business smart. If I were FeedBurner, I would focus on pleasing my customers and their readers and not worry so much about how I may affect search engine indexing. These are decorated feeds, so they wouldn’t affect indexing of the Web page itself. It wouldn’t even affect indexing of the original feed. It would only affect the FeedBurner-modified feed. So the only search engines that you would affect are those that index your FeedBurner feed. I’m no expert, but I would assume that this would be limited to blog-specific feed search engines.

Regardless, search engines are designed to filter through real-world content and find relevant pages based on your keywords, but the order of the search results is based on criteria other than just keywords. The fact of the matter is that advertisements are part of the real world. The more that users see FeedBurner banners in feeds, the more likely that FeedBurner will write itself a one-way ticket into a user’s blocked URL list. Moreover, the more users that see the banners, the more likely that their URL will be included in default blacklists that are shipped with blocking software.

If we (as content providers) are really concerned about text-based ads messing with search engine indexing, we need to propose an HTML standard tag that would prevent search engines from indexing the content within the tag.

Example:

<html>
<body>

Here is some indexable content.

<noindex>
My Text based ad goes here and will not be indexed
</noindex>

</body>
</html>

I know that there are strategies for telling search robots to ignore entire pages, but it would really be nice to have a strategy for telling a search engine robot to selectively ignore content within a page. A so-called “line-item veto” for HTML indexing. Or… am I ignorant, and is there something like this available already?
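To make the idea concrete, here is a minimal sketch of how an indexer might honor such a tag. The `<noindex>` element is the hypothetical tag proposed above, not part of any HTML standard, and `strip_noindex` is a made-up helper name. A real crawler would presumably use a proper HTML parser rather than a regular expression.

```python
import re

# Matches the hypothetical <noindex>...</noindex> element proposed above.
# DOTALL lets the ad text span several lines; the non-greedy match keeps
# two separate blocks from being merged into one.
NOINDEX_BLOCK = re.compile(
    r"<noindex\b[^>]*>.*?</noindex\s*>", re.DOTALL | re.IGNORECASE
)


def strip_noindex(html: str) -> str:
    """Return the page text an indexer would see, with noindex blocks removed."""
    return NOINDEX_BLOCK.sub("", html)


page = """<html>
<body>
Here is some indexable content.
<noindex>
My text-based ad goes here and will not be indexed
</noindex>
</body>
</html>"""

indexable = strip_noindex(page)
```

The rest of the page is left untouched, so the "line-item veto" only removes what the content provider explicitly marked.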

Site Feed

Thursday, June 23rd, 2005

I didn’t realize that my site feed was no longer available on my sidebar. It must have been removed when I was editing my blog template. I’m not sure how long it’s been gone. Anyways, for those who are interested in subscribing to my blog, the link to the feed is available again. I use an Atom feed because blogger.com generates the Atom feed for me automatically. Atom works well with bloglines.com, my aggregator of choice.

Graphing Throughput

Tuesday, June 21st, 2005

In my post on calculating throughput and response time I discussed how to measure the average throughput and response time for a given amount of time. I think it’s important to understand how to calculate these averages, but providing a single average causes you to mask other important trends and information about the system behavior.

This is a given in any statistical analysis. We don’t want to publish a large set of data because there is too much information, so we try to extract important information. But if we shrink the derived data set too much then it is hard to use this information to extrapolate any other trends in the data.

This is why I think it is valuable to capture all the data you can, derive averages, and create graphs.

It is valuable to capture and keep as much data as possible from a test run. This allows you to go back and derive new trends, averages, and graphs. I like to capture a message count for each second of a test run. Let’s say I’m sending messages into an asynchronous application and receiving messages out of the system after they have been processed. A table of this type of data might look like this:

Time Input Output

This table gives me a lot of information. For time 0 seconds to time 1 second, I sent 5 transactions into my system and received no messages out of my system. From time 3 seconds to 4 seconds, I sent 10 messages into my system and received 5 messages out of my system… and so on.
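As a rough sketch of how such a per-second table could be built from raw event timestamps: the function name `per_second_counts` and the timestamps below are invented for illustration and are not the data from my test run.

```python
from collections import Counter


def per_second_counts(sent_times, received_times):
    """Bucket event timestamps (in seconds) into whole-second intervals.

    Returns rows of (second, input_count, output_count), where the row for
    second t covers events in the half-open interval [t, t + 1).
    """
    sent = Counter(int(t) for t in sent_times)
    received = Counter(int(t) for t in received_times)
    if not sent and not received:
        return []
    last = max(list(sent) + list(received))
    # A Counter returns 0 for seconds in which nothing happened.
    return [(t, sent[t], received[t]) for t in range(last + 1)]


# Hypothetical timestamps, invented for illustration only.
sent_times = [0.1, 0.4, 0.9, 1.2, 1.5, 2.7]
received_times = [2.1, 2.8, 3.3]

rows = per_second_counts(sent_times, received_times)
for second, n_in, n_out in rows:
    print(f"{second:>4} {n_in:>5} {n_out:>6}")
```

Keeping the raw timestamps around, rather than only the bucketed counts, means you can re-bucket later at a different granularity.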

In order to convert this raw data into something more useful, I may derive some averages. Providing averages is important because it allows someone to quickly compare one set of data to another. If I run a test script 10 times, and 1 run has an average throughput that is 40% lower than all the others, then I immediately know that something went wrong. But of course, it is hard to extract anything further from a single number.

For the sample data above, I might extract the following averages:

Input rate = M / Ts = 92 messages / (11 - 0) seconds = 8.36 messages / second
Output rate = M / Tr = 92 messages / (12 - 2) seconds = 9.2 messages / second
Throughput = M / Tp = 92 messages / (12 - 0) seconds = 7.67 messages / second

Where:

Tss = Send Start Time
Tse = Send End Time
Trs = Receive Start Time
Tre = Receive End Time

Ts = Send Time = Tse - Tss
Tr = Receive Time = Tre - Trs
Tp = Processing Time = Tre - Tss

M = # of messages
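The same arithmetic as a small sketch. The function name `rates` is my own. The start and end times are the ones implied by the formulas above (Tss = 0, Tse = 11, Trs = 2, Tre = 12).

```python
def rates(m, tss, tse, trs, tre):
    """Return (input rate, output rate, throughput) in messages per second."""
    ts = tse - tss   # send time
    tr = tre - trs   # receive time
    tp = tre - tss   # processing time: first send to last receive
    return m / ts, m / tr, m / tp


input_rate, output_rate, throughput = rates(m=92, tss=0, tse=11, trs=2, tre=12)
print(f"Input rate  = {input_rate:.2f} messages / second")
print(f"Output rate = {output_rate:.2f} messages / second")
print(f"Throughput  = {throughput:.2f} messages / second")
```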

Note: Yes, the output rate for a system can be greater than the input rate to a system. Think of a line to a roller coaster ride. People can trickle in to line for a ride at a fairly slow rate. But when the gates open, people quickly get out of line to board the roller coaster. The output rate is greater than the input rate.

Graphs are also important because it is easier to visualize trends. Graphs are also pretty and can score bonus points with managers and business-types. I think it is useful to use stacked graphs that show the behavior of one trend compared to another. For the data above, we may create a stacked graph:

[Image: stacked graph of the input and output data above]

This link is very useful if you want to create stacked charts with vertical separation in Excel. I used this as a guide in creating the graph you see above.

It is equally useful to create graphs of your system response time as a function of time. This can show you how and when your system starts to degrade.

Calculating Throughput and Response Time

Monday, June 20th, 2005

In software, response time measures a client’s perspective of the total time that a system takes to process a request (including latency). The response time of a single request is not always representative of a system’s typical response time. In order to get a good measure of response time, one will usually calculate the average response time of many requests. Response time is usually measured in units of “seconds / request” or “seconds / transaction”. (Note: Don’t confuse response time and latency.)

Throughput is the measure of the number of messages that a system can process in a given amount of time. In software, throughput is usually measured in “requests / second” or “transactions / second”.

When I first started doing performance analysis, I naively assumed that throughput and response time were always reciprocals of one another. Though there are conditions that might allow these two system measurements to be inversely proportional, it is definitely not a given.

Let’s look at a real-life example; consider a checkout lane in a grocery store. Let’s assume that the cashier always takes 2 minutes to check out a customer. Let’s also assume that there is no line and that a new customer walks up to the cashier at the exact moment that another customer is done checking out, with absolutely no delay between customer checkouts. If we have 10 such customers, we would calculate response time and throughput as follows.

To calculate response time, we sum up the total checkout time for all customers and divide by the number of customers:

Response time = 20 minutes / 10 checkouts = 2 minutes / checkout

To calculate latency, we calculate the average wait time in line:

Latency = 0 minutes / 10 checkouts = 0 minutes / checkout

We can also measure the rate at which things occur:

People that got in line / minute

  • This is the queue input rate
  • The first person got in line at time 0, the last person got in line at time 18 min
  • 10 people / 18 minutes = .56 people got in line / minute

People that got to the register / minute

  • This is the queue output rate and it is also the system input rate
  • The first person got to the register at time 0, the last person got to the register at time 18 min
  • 10 people / 18 minutes = .56 people started checking out / minute

People that finished checking out / minute

  • This is the system output rate
  • The first person finished checking out at time 2 min, the last person finished at time 20 min
  • 10 people / 18 minutes = .56 completed checkouts / minute

People that the cashier checked out / minute

  • This is the processing rate
  • The first person started checking out at time 0 min, the last person finished checking out at 20 min
  • 10 people / 20 minutes = .5 checkouts / minute
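The four rates above can be computed from the event times directly. This is only a sketch: the helper `rate` and the list names are mine, and the timestamps are generated from the assumptions stated above (2 minutes per checkout, each customer arriving exactly as the previous one finishes).

```python
def rate(count, first, last):
    """Events per minute between the first and last timestamps (in minutes)."""
    return count / (last - first)


n = 10
service = 2  # minutes per checkout

# Each customer arrives exactly as the previous one finishes, so nobody waits.
got_in_line = [i * service for i in range(n)]        # 0, 2, ..., 18
reached_register = got_in_line[:]                    # no queueing delay
finished = [t + service for t in reached_register]   # 2, 4, ..., 20

queue_input_rate = rate(n, got_in_line[0], got_in_line[-1])
system_input_rate = rate(n, reached_register[0], reached_register[-1])
system_output_rate = rate(n, finished[0], finished[-1])
# The processing rate spans the first start through the last finish.
processing_rate = rate(n, reached_register[0], finished[-1])
```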

As you can see, there are many different rates that we can measure. People use the word throughput to refer to all of these different rates, but generally when we talk about throughput in software we are referring to the processing rate (people that the cashier checked out / minute). Depending on how we are measuring, either the system input rate or the queue input rate is also known as the system load. Accordingly, the term load testing is used to describe a test where we send many requests into a system and observe its non-functional behavior.

Based on this input, it looks like throughput and response time are inversely proportional:

Throughput = .5 checkouts / minute
Response Time = 2 minutes / checkout

or…

Throughput = 1 / Response Time [NOT ALWAYS TRUE]

This is because we have no latency and because our system was provided with exact conditions that allow it to have a load without customer wait time or cashier idle time.

Let’s vary our example a little. What if 10 people used this same checkout lane, but each person arrived in line 1 minute after the last person was done checking out? The cashier is just twiddling his thumbs, waiting for a customer for 1 minute. The cashier is still capable of checking out a customer in 2 minutes, so the average response time is still 2 minutes / customer, but the throughput of people coming out of the checkout lane is not the same.

Response time = 20 minutes / 10 checkouts = 2 minutes / checkout
Latency = 0 minutes / 10 checkouts = 0 minutes / checkout
Throughput = 10 checkouts / 29 minutes = .34 checkouts / minute

Let’s also consider the opposite. What if 10 people used the checkout line at nearly the same time? In other words, what if there was a line of 9 people behind a customer who is being checked out? From a customer perspective, the checkout time (or response time) is the amount of time from when they get in line until they are done checking out, and the latency is how long it takes them to get to the cashier from the time they get in line. The first person to get to the checkout lane wouldn’t wait at all. The first person in line (not the one being checked out currently) would wait 2 minutes to start checking out, the second person would wait 4 minutes, and so on until the last person, who would wait 18 minutes to start being checked out.

Response time = 110 minutes / 10 checkouts = 11 minutes / checkout
Latency = 90 minutes / 10 checkouts = 9 minutes / checkout
Throughput = 10 checkouts / 20 minutes = .5 checkouts / minute
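All three single-lane scenarios can be reproduced with a tiny simulation. This is a sketch under the assumptions already stated (one cashier, first come first served, 2 minutes per checkout). The function name `simulate_lane` is mine, and throughput here means checkouts divided by the time from the first arrival to the last finish.

```python
def simulate_lane(arrivals, service_time=2):
    """Simulate one first-come, first-served checkout lane.

    arrivals: times (in minutes) at which each customer gets in line, sorted.
    Returns (average response time, average latency, throughput).
    """
    lane_free_at = 0
    total_wait = 0
    total_response = 0
    for arrived in arrivals:
        start = max(arrived, lane_free_at)   # wait only if the cashier is busy
        finish = start + service_time
        total_wait += start - arrived
        total_response += finish - arrived
        lane_free_at = finish
    n = len(arrivals)
    elapsed = lane_free_at - arrivals[0]
    return total_response / n, total_wait / n, n / elapsed


back_to_back = simulate_lane([2 * i for i in range(10)])  # no waiting, no idle time
idle_cashier = simulate_lane([3 * i for i in range(10)])  # 1 idle minute between customers
all_at_once = simulate_lane([0] * 10)                     # a line of 9 behind the first

for name, result in [("back to back", back_to_back),
                     ("idle cashier", idle_cashier),
                     ("all at once", all_at_once)]:
    response, latency, throughput = result
    print(f"{name}: response={response:.2f} latency={latency:.2f} "
          f"throughput={throughput:.2f}")
```

Only the arrival times change between the three runs, which is the point: the cashier works at the same speed in every case.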

From a customer’s perspective, the average customer checkout time is greater, even though the clerk is still working at the same speed and is able to push 10 people through line in 20 minutes. The checkout lane is saturated at the point when the queue input rate exceeds the queue output rate. As you can see, the rate at which customers are getting in line makes all the difference. The term degradation is often used to describe a system whose response time increases when the load is increased. In our grocery example, our system starts degrading when we have more than one customer get in line every two minutes.

In this grocery example, people can form a line. In a software system, the line (or queue) is either going to be on the sender side or the receiver side, depending on whether the system is synchronous or asynchronous. If a receiver blocks all messages until it is done executing its current request, then the system is synchronous and the queue is on the sender’s side. If the receiver accepts messages as fast as possible, and uses a separate execution thread to execute requests, then the receiver must have a queue and the system is said to be asynchronous. You could have a queue on both the sender and receiver, but this is usually superfluous. See: Synchronous vs. Asynchronous Systems.
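A minimal sketch of the asynchronous case, with the queue on the receiver’s side and a separate worker thread doing the work. The names `inbox`, `worker`, and `processed` are mine, and upper-casing a string is just a stand-in for real request processing.

```python
import queue
import threading

inbox = queue.Queue()   # the receiver-side queue
processed = []


def worker():
    """Execute requests one at a time on a separate thread."""
    while True:
        request = inbox.get()
        if request is None:                 # sentinel: no more requests
            break
        processed.append(request.upper())   # stand-in for real work


thread = threading.Thread(target=worker)
thread.start()

# The sender hands off each message and moves on without waiting for it
# to be processed.
for message in ["a", "b", "c"]:
    inbox.put(message)

inbox.put(None)   # tell the worker to stop
thread.join()
```

In the synchronous case there would be no `inbox`: the sender would call the processing code directly and block until it returned.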

Software load is usually measured in requests per second. For example, you may describe the load on a system as “10 requests per second”. In a real-world scenario, the load will change as a function of time. In a grocery store, more customers will try to check out at peak shopping hours. In the stock market, the most volume is traded in the first and last 15 minutes the market is open. A Web page will have different load depending on the day of the week and the time. Thus, if you are designing a test of your system, you want to determine the behavior under different types of load.

In order to improve the performance of our grocery store, we can make it multi-threaded by adding more lanes. This concurrency helps in two ways:

  • It can minimize the response time that each customer experiences by reducing wait times in line
  • It can increase throughput

Let’s say we have 10 lanes and 10 customers who get to the checkout area at the exact same time. Each customer goes to a different lane, and each checkout takes 2 minutes.

Response time = 20 minutes / 10 checkouts = 2 minutes / checkout
Latency = 0 minutes / 10 checkouts = 0 minutes / checkout
Throughput = 10 checkouts / 2 minutes = 5 checkouts / minute

With a single lane, our response time for this same load was 11 minutes / checkout, but with multiple lanes, our response time is 2 minutes / checkout, the best that our system can provide. Increasing the number of lanes (or threads) increased our throughput and allowed us to maintain optimum response time.
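Here is a sketch that compares one lane against ten for the same burst of 10 customers. The function name `simulate_lanes` is mine, and it assumes each customer goes to whichever lane frees up first, which for this example is the same as each customer picking a different lane.

```python
import heapq


def simulate_lanes(arrivals, lanes, service_time=2):
    """Simulate several checkout lanes fed in arrival order.

    arrivals: times (in minutes) at which customers show up, sorted.
    Returns (average response time, average latency, throughput).
    """
    free_at = [0] * lanes   # min-heap of the times at which each lane frees up
    heapq.heapify(free_at)
    total_wait = 0
    total_response = 0
    last_finish = 0
    for arrived in arrivals:
        lane_free = heapq.heappop(free_at)   # the lane that frees up first
        start = max(arrived, lane_free)
        finish = start + service_time
        heapq.heappush(free_at, finish)
        total_wait += start - arrived
        total_response += finish - arrived
        last_finish = max(last_finish, finish)
    n = len(arrivals)
    return total_response / n, total_wait / n, n / (last_finish - arrivals[0])


one_lane = simulate_lanes([0] * 10, lanes=1)
ten_lanes = simulate_lanes([0] * 10, lanes=10)
print("1 lane:  ", one_lane)
print("10 lanes:", ten_lanes)
```

Trying values of `lanes` between 1 and 10 shows how response time falls as lanes are added for this particular load.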

But, of course, nothing is free. In this case, we’ve increased the number of active employee resources to optimize our performance, but we must pay for those resources. In software, we have to worry about system resources. We can spawn off multiple threads, but we have to be careful how much CPU and memory each execution thread is utilizing.

Face Up to Web Application Design Using JSF and MyFaces

Friday, June 17th, 2005

My new DevX article introduces JavaServer Faces (JSF). Here’s an excerpt:

“JavaServer Faces provides an alternative to Struts or Spring MVC for those who want a Web application framework that manages UI events in a Java Web application. JSF is now a standard part of the J2EE specification and provides a viable alternative to non-standard Web frameworks.”

“If you’ve worked on more than one Web application with different teams, you’ve probably worked with more than one Web application framework. J2EE always provided Web technologies, but never defined a Web application framework for managing page navigation, request lifecycle, request validation, etc. Developers had to develop these features themselves or utilize one of many open-source frameworks such as Struts and Spring MVC. Enter JavaServer Faces.

JSF is specified in JSR 127 (see Related Resources section in the left column) and “defines an architecture and APIs that simplify the creation and maintenance of Java Server application GUIs.” This article will discuss the basic ideas behind JavaServer Faces and then show you an example using the Apache MyFaces JSF implementation.”

Banner-Ad-Free Syndication

Wednesday, June 15th, 2005

I have a suggestion for syndicated blogs and news sources: don’t use banner ads in your feeds, because people will likely just block or ignore them.

It is fairly simple to block images from a particular site or URL pattern, and I know many people that do this, including myself. I use Bloglines, a Web-based feed reader, and Firefox as my Web browser. I block images using a Firefox plugin called Adblock. I block banner ads on all of the blogs that I subscribe to fairly easily using Adblock. For example, I block every image that comes from FeedBurner.

If you want to have ads in your feed, I would suggest using either text-based ads that are tacked on to the end of an entry or using separate (banner-free) syndicated items to display ads. I think that “feed decorating services” such as FeedBurner should offer the ability to do this for you automatically (which they may already do).

I would wager that people are more likely to read banner-free ads that appear as separate items in a feed. This is because people wouldn’t instinctively ignore them like banner ads. They are also more likely to read them because there is no way to distinguish an “ad entry” from a “real entry” until you read the headline and/or scan the actual entry.

Agile Tools

Friday, June 3rd, 2005

It’s important to realize that Agile is not a single methodology or process. Agile is an umbrella term that describes a group of processes that share a common set of ideals. These processes include eXtreme Programming, Model Driven Architecture (MDA), Scrum, etc.

This site gives you a summary of what Agile is, has links to all the methodologies that fall under the Agile umbrella, and lists several tools that are used in the Agile suite of methodologies.

Each methodology has a set of tools that it uses or favors. Some are more comprehensive than others. For example, Extreme Programming projects typically use a minimalist approach to project management tools, abandoning Gantt charts in favor of index cards and favoring Wikis for capturing requirements. Another example is MDA, which is a methodology that is pervasively dependent on tools that generate code from visual models described in UML or other modeling languages.

I’m definitely no RUP expert, but I listened to Martin Fowler do a talk on RUP and he described it as a large set of best practices and a suite of tools which you can use to customize your own methodology, picking and choosing which practices you want to utilize. For example, there are very high-ceremony waterfall projects that use RUP and there are XP projects that use RUP (check out Robert Martin’s dX [pdf]).

Deleting methods off of an interface

Friday, June 3rd, 2005

When you go to delete a method signature off of an interface, you have to be careful. If you delete the method signature, the compiler will notify you of any references to the method on the interface. After resolving those, the compiler will be happy. But, you may still have implementations of the method on classes that implemented the interface. This is, of course, because classes can provide methods in addition to what is on their interface. You may also have references to the methods by code that had references directly to the implementation class, and not the interface.

One strategy for handling this is to deprecate the method on the interface and all the implementing classes, remove all references to the deprecated methods, then delete the methods.