Saturday, May 29, 2010

How To Provision Data Storage Capacity For Content Caching

An engineer who recently called me wanted to estimate storage capacity for Internet caching. The technician is part of a team responsible for setting up the Internet gateway for an ISP in a small country in the Middle East.

The technician had very limited numbers available to do the necessary derivation. So he could only phrase the question very simplistically -

"The ISP is able to serve 40,000 users. What should the provided storage capacity for contentCaching? "

As part of the team that builds SafeSquid, the content filtering proxy, we often get this kind of query, but with one essential difference. Most are formulated as - "We have an Internet pipe of X Mbps. What is the recommended data storage capacity for efficient caching?"

Sensible advice for such a query can be derived if we concentrate on a few assumptions and some simple facts.

1. Only content that is fetched over HTTP can be cached.

2. The maximum speed at which content can be retrieved depends on the Internet pipe.

3. A lot of HTTP traffic is un-cacheable, for example - streaming audio/video, pages that display results of SQL queries (including search-engine driven queries), and even the HTML content of Web Mail.

4. The most important content that gets cached is images embedded in HTML pages, style sheets, JavaScript, and other files that you would download and open on the local desktop with another application, such as PDF / (some) Flash files.

5. A simple request for a web page with a standard browser automatically triggers downloads of a variety of content such as cookies, images and other embedded objects. These are required by the browser to display the page as per the page design. The components that make up the web page are not necessarily sourced from the web site that served the requested page.

6. Modern Internet browsers provide manageable caching, quite similar in principle to the caching involved in the design of caching proxies. Thus not every content object will necessarily be requested again. But even these browsers depend on the availability of local storage on the client systems, which is usually no more than a few hundred MB. And in any case, these local caches are not shared simultaneously by several users.

7. Internet resources see different usage depending on the time of day, which gives us peak and off-peak hours.

Therefore, when we have an Internet pipe of 10 Mbps, the maximum data we can transfer (data throughput) is

= 10 Mbps x 60 seconds = 600 Mbits of data in a minute

= 600 x 60 = 36,000 Mbits of data in an hour
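If it helps to see the arithmetic as code, here is a minimal Python sketch of that conversion (the variable names are mine, purely illustrative):

pipe_mbps = 10  # Internet pipe capacity, in Mbits per second

mbits_per_minute = pipe_mbps * 60        # 600 Mbits in a minute
mbits_per_hour = mbits_per_minute * 60   # 36,000 Mbits in an hour
print(mbits_per_minute, mbits_per_hour)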

Now suppose the company uses a bandwidth manager to reserve QoS for each predefined application (or protocol). In general, applications such as SMTP and VPN are given the lion's share, almost 50%, and the rest gets split between HTTP / HTTPS and others.

I also know of a few customers who invest in pipes meant exclusively for SMTP and/or VPN, and a separate (cheaper) Internet connection for HTTP / HTTPS.

If the company has chosen to host web servers within its own premises, the entire distribution plan changes completely.

But even in the case where no bandwidth manager is used and the pipe operates on a "first come, first served" basis, we can still work with an estimated distribution of traffic on the basis of applications or protocols.

So to build our algorithm, it may be appropriate to coin a term - HTTP_Share, such that HTTP_Share = x% of the Internet pipe.

HTTP_Share would then mean the maximum data that would get transferred as HTTP traffic.

Therefore, continuing from our earlier derivation of 36,000 Mbits of throughput per hour, if we factor in HTTP_Share:

HTTP_Traffic = x% of the data throughput

Now, if x = 35 (i.e. 35% of the total data transfer is HTTP),

HTTP_Traffic/hour = (0.35 x 36,000) Mbits/hour = 12,600 Mbits/hour
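Continuing the Python sketch, HTTP_Share is just a multiplier on the hourly throughput (again, the names are illustrative):

mbits_per_hour = 36_000   # hourly throughput of the 10 Mbps pipe, from above
http_share = 0.35         # x = 35, i.e. 35% of the pipe carries HTTP

http_traffic_per_hour = http_share * mbits_per_hour
print(http_traffic_per_hour)   # 12600.0 Mbits/hour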

Now suppose the company has off-peak and peak hours of Internet usage, such that 40% of the day (approximately 9.6 hours) is peak hours, while 60% of the day is off-peak. Peak hours are the times of day when we would witness TOTAL utilisation of our Internet pipe. And if we assume that the utilisation (i.e. the stress level) during off-peak hours is around 25% of the peak, we can estimate further on the basis of the above derivation -

HTTP_Traffic/day = ((12,600 x 0.4) + (12,600 x 0.6 x 0.25)) x 24

HTTP_Traffic/day = ((0.4 x 1) + (0.6 x 0.25)) x 12,600 x 24 = 166,320 Mbits
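As a sketch, the peak/off-peak weighting looks like this in Python (using the 40% peak and 25% off-peak assumptions above):

http_traffic_per_hour = 12_600   # Mbits/hour at full utilisation, from above
peak_fraction = 0.40             # 40% of the day (~9.6 hours) is peak
off_peak_utilisation = 0.25      # off-peak load is ~25% of peak

# Weighted average utilisation over the day, scaled to 24 hours.
http_traffic_per_day = ((peak_fraction * 1.0
                         + (1 - peak_fraction) * off_peak_utilisation)
                        * http_traffic_per_hour * 24)
print(http_traffic_per_day)      # 166320.0 Mbits/day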

This may look like a very simplistic model. A more realistic one would require a reasonable hourly stepping, i.e. a proper distribution pattern over the day.

Now we deal with the toughest and most questionable part!

What would be the ratio of cacheable_content to HTTP_Traffic?

Based on my experience at various customer premises, I prefer to take 30%.

This would mean 166,320 x 0.3 = 49,896 Mbits of content a day that might get cached.

Standard practice is to store content for at least 72 hours (store-age).

That is, with a store-age of three days, we would need a storage of at least 49,896 x 3 = 149,688 Mbits.

Thus, the conventional 8 bits = 1 byte conversion tells me that we need a storage of at least 18,711 MBytes.
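In Python, the storage estimate then follows directly (the 30% cacheable ratio is my working assumption, as stated above):

http_traffic_per_day = 166_320   # Mbits/day, from above
cacheable_ratio = 0.30           # assumed fraction of HTTP traffic that is cacheable
store_age_days = 3               # keep objects for at least 72 hours

cache_increment_per_day = cacheable_ratio * http_traffic_per_day   # 49,896 Mbits
storage_mbits = cache_increment_per_day * store_age_days           # 149,688 Mbits
storage_mbytes = storage_mbits / 8                                 # 18,711 MBytes
print(storage_mbytes)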

Another interesting picture that should be visible during the peak hours: the HTTP_Traffic, viewed as the data downloaded by the proxy server, should be smaller than the data sent to the clients, and the difference reflects the caching efficiency. It would mean that the cached content is being used to service the requests made by the clients.

In the discussion above, we have not considered the performance degradation caused by factors such as network latency.

But the described method still gives no answer to the original question, since in that question the Internet pipe was not defined. So I was pretty skeptical that such a calculation could ever be made when only the number of users (clients) was defined, because my approach depended on knowing the Internet_Pipe. My argument, and my insistence, was that the content that gets downloaded and cached can only be a credible fraction of the HTTP based content, and that the maximum content that can be downloaded depends on the Internet_Pipe, whether you have one user or one million users. Tushar Dave of Reliance Infocomm helped me out of the mystery with an interesting algorithm that supplies the missing piece of the puzzle!

Suppose your ISP offers customers 256 Kbps connections; for 40,000 users, that works out to nearly a 10 Gbit/s Internet pipe.

But that is in fact generally never true (actually, for 40,000 users an ISP would commission an Internet pipe of less than 1 Gbps in most cases!). The ISP never receives a request from every user at the same moment. This is captured by the OFF-time, i.e. the time when a user is consuming content that has already been fetched. An ISP can safely expect at least 50% OFF-time.

OFF-time can actually go up to more than 75% if the ISP's service caters mostly to home users and small businesses, where the Internet connection is not shared between multiple users. Secondly, most of these user accounts are governed by a usage CAP, for example accounts that allow a user to download only a select few GBs.
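To put OFF-time into numbers, here is a tiny hypothetical Python sketch (the 50% and 75% OFF-time figures are the ones quoted above; real commissioned pipes come out smaller still, for the CAP and usage reasons just mentioned):

connections = 40_000
user_connection_kbps = 256

naive_pipe_gbps = connections * user_connection_kbps / 1_000_000   # ~10.24 Gbps
for off_time in (0.50, 0.75):
    expected_gbps = naive_pipe_gbps * (1 - off_time)
    print(f"OFF-time {off_time:.0%}: expected load = {expected_gbps:.2f} Gbps")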

In the above derivations we estimated the HTTP_Traffic/day from the Internet pipe; now we simply derive the HTTP_Traffic/day from the expected HTTP_Traffic of the user connections instead.

Thus the entire throughput estimate can be derived without knowing the Internet pipe! And the above derivation still holds!

So let's see if we can do some calculations (empirical, of course!)

Connections = 40,000

user_connection = 256 Kbps

HTTP_Share = 35%

ON_time = 50%

peak_hours = 60%

off_peak_utilisation = 25%

cacheable_content = 35%

store_age = 3 days

PEAK_HTTP_LOAD (in Kbps) = connections x user_connection x HTTP_Share = 3,584,000

NORMAL_HTTP_LOAD (in Kbps) = PEAK_HTTP_LOAD x ON_time = 1,792,000

HTTP_Traffic/hour (in Kbits) = NORMAL_HTTP_LOAD x 3600 = 6,451,200,000

Cache_Increment/hour (in Kbits) = cacheable_content x (HTTP_Traffic/hour) = 2,257,920,000

Total_Cache_Increment/day (in Kbits) = 24 x ((1 - peak_hours) x off_peak_utilisation + peak_hours) x (Cache_Increment/hour) = 2,257,920,000

Required storage capacity (in Kbits) = store_age x (Total_Cache_Increment/day) = 6,773,760,000

Required storage capacity (in Mbits) = 6,615,000

Required storage capacity (in Gbits) = 6,459.96

Given 8 bits = 1 byte, it looks like we need a little over 800 GB of storage.

However, I would requisition a storage capacity that can accommodate a possible 35% increase in the downloaded content (cacheable_content) sustained over at least three store_age cycles, i.e. 800 x 1.35^3 = roughly 1968 GB.
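To make the whole worked example reproducible, here is a small Python script that follows the figures above (all variable names are mine; the daily increment and the 35% growth factor over three store_age cycles are taken as printed above):

# Empirical cache-sizing sketch for the 40,000-user ISP case above.
connections = 40_000
user_connection_kbps = 256
http_share = 0.35
on_time = 0.50            # 1 - OFF_time
cacheable_content = 0.35
store_age_days = 3

peak_http_load = connections * user_connection_kbps * http_share      # 3,584,000 Kbps
normal_http_load = peak_http_load * on_time                           # 1,792,000 Kbps
http_traffic_per_hour = normal_http_load * 3600                       # 6,451,200,000 Kbits
cache_increment_per_hour = cacheable_content * http_traffic_per_hour  # 2,257,920,000 Kbits

# Daily cache increment, as taken in the worked figures above.
total_cache_increment_per_day = 2_257_920_000                         # Kbits

required_kbits = store_age_days * total_cache_increment_per_day       # 6,773,760,000 Kbits
required_gbytes = required_kbits / 1024 / 1024 / 8                    # ~807.5 GB
provisioned_gbytes = 800 * 1.35 ** 3                                  # ~1968 GB with growth headroom
print(round(required_gbytes, 1), round(provisioned_gbytes))

Re-running the script with different assumptions (HTTP_Share, cacheable_content, store_age and so on) immediately yields the adjusted requisition.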

The above calculation is subject to quite a lot of assumptions. But it should allow rational adjustments over time quite simply.

For example - if the number of connections rose by 20%, we would need 20% more storage!
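With the script above, that adjustment is a one-line change: scale connections by 1.2 and required_gbytes grows by the same 20%, since the model is linear in each of these inputs.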

But more importantly, it allows anyone to differ with my assumptions, and yet derive the required storage.

Looks so simple now. Thanks, Tushar.



