Quick Tech Notes: May 2010

Monday, May 31, 2010

Ship It! - 2.8 Choosing Tools

Be sure your tools use an open format like XML or plain text. It makes integration and reporting easier.

Tip 11: Use the best tool for the job

Tip 12: Use open formats to integrate tools

Never have a vital part of your product cycle (such as the build system) written in a niche or non-core technology, especially if only one developer knows it. Use a technology that anyone in the shop can configure and maintain.

Never let a critical technology (like your build system) be created as a technology experiment. Use a tool designed for builds to create your builds, not the cool new technology that a team member wants to learn. There are plenty of non-critical areas for technology learning to take place. Never create automated tools that run on only one machine. Never hard-code dependencies, such as network drives. Put everything you need in your SCM system, and the network drives become unimportant.

Tip 13: Keep critical path technologies familiar

I must admit I'm a tinkerer and need to be aware of introducing shiny new things at the wrong times. I've also been on the other end of the stick where I had to use a sub-system written in a technology that I wasn't comfortable with. Only the author knew the technology and we ended up using something that more people understood.

Saturday, May 29, 2010

Ship It! - 2.6 Track Features

A new feature in your product refers to added functionality. It’s making your product do something that it didn’t do before. Keep a unified list of your feature requests. Prioritize them, and keep a basic estimate of the time involved to investigate or add the feature. You may also want to keep a list of the top items on your white board for better visibility.

How Do I Get Started?
Same list as with Issue Tracking.

Am I Doing This Right?

do you use this system as a first stop when it's time to generate the next release's feature list?
do you routinely record new product ideas in the system?
do you reject many of the submitted feature requests?
can you generate the last product version's "new feature" list by running a report?
can your stakeholder easily check on a feature's status?

This is an interesting idea and I don't know if it has ever been used on my past projects. Typically, somebody would send a document that said "build this" and we would, no questions asked. I'm wondering what the risks are, if any, of letting your customers have access to the feature list? Do you want to put up your ideas for a super secret project for your customers, and possibly competitors, to see?

Friday, May 28, 2010

Ship It! - 2.5 Track Issues

Common questions that an issue tracking system should be able to answer:

what version of the product has the issue?
which customer encountered the issue?
how severe is it?
has the problem been reproduced in-house and by whom?
what was the environment?
what version of the product first exhibited the issue?
in what version was it fixed?
who fixed it?
who verified the fix?

Tip 8: Avoid collective memory loss

How Do I Get Started?

pick an issue tracking system
set up a test system for yourself and learn how to use it
generate a one-page quick-start guide for internal users
start keeping all new issues in this system
move pre-existing issues over to the new system as time permits

Am I Doing This Right?

can you generate a list of top-priority, unaddressed issues? How about second-tier issues?
can you generate a list of last week's fixes?
can your system reference the code that fixed the issue?
do tech leads use the system to generate to-do lists for development?
do your tech support guys know how to get information out of the system?
can your system notify interested parties so others can see when an issue is fixed?

The issue tracking systems I've used have been, at best, horrible. I'm waiting for the project that uses a reasonable set of tools which are integrated, like the stack from Atlassian. I think you need integrated tools to answer questions about who fixed defect A, what was the fix and what build did it show up in. If you have to repeat yourself between tools, you are going to get lazy or forget and the information won't be available to those who might need it.

Thursday, May 27, 2010

Ship It! - 2.4 Build Automatically

Rebuilding each time code is committed keeps your code base clean by catching compile errors as soon as they occur.

Tip 6: Build continuously

Tip 7: Test continuously

How Do I Get Started?

select a build system, don't write your own
obtain a clean machine to run on
install your automatic build system, configure it for the environment and document every step of the install

Am I Using This Right?

do you have tests in the system?
is anyone paying attention to the system?
does the build get fixed quickly?
does your build finish in a reasonable time?

My favorite idea from this section is that a build kicks off as soon as somebody commits code. Most projects I've worked on used scheduled builds but, as the book suggests, that allows many changes to pile up between builds, which increases the time needed to figure out who broke the build. One small project, however, did triggered builds and I really liked it. We had to go in and modify a Subversion script to do it but it worked as advertised. If you forgot to build everything prior to check in, the CI server told the whole world.

Wednesday, May 26, 2010

Ship It! - 2.3 Script Your Build

You have a problem if you do anything by hand in your build or packaging process. Humans forget steps and lose focus - machines don't.

Tip 4: Script builds on day one

Tip 5: Any machine can be a build machine

How Do I Get Started?

have a team member manually build the system while you take notes
define the individual steps
pick a build tool
incrementally script each step eliminating manual operations one by one
run the script on another workstation
have another team member use the script without your help

You should be able to build your entire product:

with one command
from SCM
on any team member's machine
with no external environmental requirements, such as network drives

In Java, Ant is the de facto build tool but it can get messy pretty quickly. Crafting if-then-else logic using XML constructs isn't pretty. I'm glad to see tools like Gradle gaining some traction.

Tuesday, May 25, 2010

Ship It! - 2.2 Manage Assets

You should have everything you need to build the entire product; if you don’t, then perhaps you aren’t using the tool properly.

Tip 3: If you need it, check it in.

If you can generate it, then don't check it in.

How Do I Get Started?

pick an SCM
learn how to use it
generate a single-page quick-start guide that shows how to use the system for common operations
show the system to your team and make sure everyone is comfortable with it
import your code and supporting files
start keeping all your files in SCM

Am I Doing This Right?

Are you actively using the system?
How long would it take to get a new machine up and running?
Can you perform SCM operations quickly?
Are you backing up the SCM's repository?
Can you check out the entire project?
Can you look at the differences between local edits and the code in SCM?
Can you view the history for a specific file -- who changed this file and when did they do it?
Can you update your local copy with other developers' changes?
Can you commit your changes to the SCM?
Can you remove the last changes you pushed into SCM?
Can you retrieve a copy of the code tree as it existed last Tuesday?

I agree that an SCM is extremely important to a project and it is probably software suicide not to use one. One area of tension I've run into regarding the "if you need it, check it in" rule was around libraries. One project I worked on stored any dependent libraries right next to the source which was great. The build was easy to reproduce, partly due to this fact. As the project grew and many releases went out the door, our disks started filling up -- fast. For each branch you checked out you got a new copy of log4j plus the 800 other libraries. Developers checking out over slow WANs really complained because it could take hours to move files around. I suggest thinking about what happens when you start branching your code as you implement the "if you need it, check it in" rule.

Monday, May 24, 2010

Ship It! - 2.1 Develop in a Sandbox

Isolate others from the effect of your work until you are ready. Code is shared via the repository. The build machine is an unattended server that simply gets all of the latest source code from the repository, builds, and tests it, over and over again. The result of this build is the product release.

Tip 2: Stay in the sandbox

Luckily, I've been in environments that used an SCM and some even ran CI servers. None of them, however, consistently ran tests as part of the build but I can see the benefit of doing so.

Sunday, May 23, 2010

Ship It! - Chapter 1

Ship It!: A Practical Guide to Successful Software Projects by Jared Richardson and Will Gwaltney, Jr. is a book that presents some ideas on software practices that can help keep developers, managers and customers happy about the software being produced. The practices are based on real-world experience and are not part of an overall methodology -- you pick and choose the practices that make sense to you and your team. I plan on summarizing chapters as I read them. I feel it helps me to better understand the material because I have to take the time to think about the messages the book is trying to convey and pick out the highlights. Let's get started.

Habitual Excellence
Extraordinary products are merely side effects of good habits. Purposely seek out good habits, and add them to your daily routine. I agree with the idea that it is the things you do routinely that determine how effective you are in your daily situations so it makes sense to try and cultivate "good" habits.

Tip 1: Choose your habits.

A Pragmatic Point of View
The book is a collection of good habits and it is never too late to install good habits. Hopefully, the habits won't be too painful or expensive to implement.

Road Map

Infrastructure
Techniques
Process
Common Problems and How To Fix Them

Potentially, the last section might be most interesting. I'm sure I'll have seen some of the practices that will be suggested but haven't been able to implement for one reason or another. I'll be curious to see what some possible solutions are.

Moving On
Try out the ideas and discard the ones that don't work for you. Make sure you understand the benefits before proposing a technique. It is nice to see that it isn't an all-or-nothing mentality. I'm guessing that the more practices you adopt the more benefit your project will see but every team is different so you've got to see what works for you.

Saturday, May 22, 2010

Release It! - Closing Thoughts

I really enjoyed this book. It gave me the feeling that the advice given came from real-world experience rather than some theoretical model that some "wizard" says we should all follow. I've seen a lot of the anti-patterns on the projects I've worked on and it was nice to see that they happen to other projects as well -- misery loves company. Most of the solutions, however, I haven't seen, at least in the form that is presented in the book. The Circuit Breaker, for example, is a great idea that I have never thought of. In the past, I've muscled my way through stack traces, log files and application server consoles trying to figure out "the bottleneck" or "the memory leak". Now, I have a better insight as to what can be done to prevent late night conference calls and getting on planes. I think Michael has done a good job of explaining his ideas and I'm glad to see that he is refining his ideas through the NFJS circuit. Overall, I give this book two thumbs up and say it is worth the time and money. Go grab yourself a copy.

One of the things I'm going to put onto my TODO list is to see how these ideas are expressed in Java, my current language of choice. The notion of a Test Harness that can be used to kick your application in the head and see how it reacts seems like a great idea but I wonder how much code you have to write to create one? Does one already exist? Should I write a basic framework and release it into the wild? Is a Test Harness so tightly coupled to an application that you can't really share it between projects? Can I create a Circuit Breaker abstraction that can take advantage of Spring and expose itself as a JMX MBean? If I do start releasing code, I'll be sure to mention it in this blog.

Friday, May 21, 2010

Release It! - 18.4 Releases Shouldn't Hurt

This might be my favorite chapter but executing what he proposes will take lots of practice.

Releases shouldn't be a big deal. Frequent releases force you to get good at deployments. Reduce the effort needed to release by automating the process as much as possible.

Zero Downtime Deployments
The key is to break up the deployment into phases. Instead of adding, changing, and removing stuff (such as database columns and tables, constraints, services) all at once, add the new items early, with ways to ensure forward compatibility for the old version of the code. Later, after the release is rolled out, remove stuff that is no longer referenced, and add any new constraints that would have broken the old version.

Expansion
The first step is to add new “stuff.” The stuff consists of URL-based assets, web service endpoints, database tables and columns, and so on. All the stuff can be added without breaking the old version of the software, under certain conditions. URL-based resources, such as style sheets, images, animations, or JavaScript files, should be given a new URL for each new revision. For web services, each revision of the interface should be given a new endpoint name. Similarly, for remote object interfaces, defining a new interface name (for example, with a numeral after the interface name) for each version ensures that the old version of the software gets the interface it wants while the new version gets the interface it wants. For socket-based protocols, the protocol itself should contain a version identifier. This definitely requires that the receiving applications must be updated before the senders. It also implies that the receiving application must support multiple versions of the protocol. If it’s simply impractical to support multiple protocol revisions, another option is to define multiple service pools in the load balancer on different ports. By far, the most conflicts—and the most troublesome conflicts—will arise in the database. Schema changes are rarely forward compatible, and they’re never by accident. Still, it is possible to break schema changes into phases. In the “expansion” phase, tables and columns get added. Any columns that will eventually be NOT NULL are added as nullable, because the old version doesn’t know how to fill these in. Later, in the cleanup phase, constraints will be added. This goes for referential integrity rules, too. They cannot be added during expansion because the old version would immediately violate the relationships. You can use database triggers to bridge data meant for the old schema to fill in columns in the new schema. It also works in the opposite direction.

Rollout
With the preparations from the “expansion” phase in place, the actual rollout of the new software on the application servers should be trivial. This could take a few hours to a few days, depending on how cautiously you want to approach it.

Cleanup
After the new release has baked long enough to be accepted, it is time to clean up. This includes removing the bridging triggers and extra service pools. Any columns or tables that are no longer being used can be removed. Old versions of static files can be removed, too. At this point, all the application servers are running on the new version of the code. This is the time to convert columns to NOT NULL that need it, as well as to add referential integrity relations (though constraints enforced in the database can cause large problems for the ORM layer). This is also the time to drop any columns and tables that are no longer needed.

I wonder if something like dbDeploy can be used in this scenario? Maybe not since it doesn't understand the notion of a cleanup phase.

Thursday, May 20, 2010

Ship It! - 2.7 Use a Test Harness

A testing harness is the tool or software toolkit you use to create and run your tests.

Tip 9: Exercise your product - automate your tests

Try to use a common testing framework. Make sure it can be driven from the command-line. Take a peek at MetaCheck to see if it would help your team.

Tip 10: Use a common, flexible test harness

Unit Tests are designed to test your individual class or object. They are stand-alone, and generally require no other classes or objects to run. Their sole purpose in life is to validate the proper operation of the logic within a single unit of code.

Functional Tests are written to test your entire product’s proper operation (or function). These tests can address your entire product or a major subsystem within a product. They test many objects within the system.

Performance Tests measure how fast your product (or a critical subsystem) can run. Without these tests, you can’t tell whether a code change has improved or degraded your product’s response time (unless you are really good with a stopwatch!).

Load Tests simulate how your product would perform with a large load on it, either from a large number of clients or from a set of power users (or both!). Again, without this type of test, you can’t objectively tell if your code base has been improved or degraded.

Smoke Tests are lightweight tests and must be carefully written to exercise key portions of your product. You would use smoke tests because they run fast but still exercise a relevant portion of your product. The basic idea is to run your product to see if it “smokes,” i.e., fails when you invoke basic functions.

Integration Tests look at how the various pieces of your product lines work together. They can span many products: sometimes your products and sometimes the third-party products you use.

Mock Client Testing is used to create tests from your client’s point of view. A mock client test tries to reproduce common usage scenarios for your product, ensuring that the product meets minimum functional specifications. This type of testing can be very effective for getting essential testing coverage in place to cover the most commonly used code paths.

How Do I Get Started?

select a testing tool or toolkit
start adding tests to problem areas
ensure your tests are being run as part of the build system

Am I Doing This Right?

are your tests effective? Are you catching bugs?
What are your code coverage numbers? Are they increasing over time?
is your product testable?
do your tests tell you whether they pass or fail?
does everyone in the shop have the ability to add tests?

Testing is important and it seems like everyone has their own definition for the different types of tests. I'm comfortable with the definitions the book presents. I've seen discussions around testing your system via Archetypes. Test your system like you are a power user who uses more features than the normal user. Test your system like you are a first time visitor just browsing the landscape. I think Mock Client Testing is along those lines of thinking.

Release It! - 18.3 Adaptable Enterprise Architecture

Architectures that are inspired by biology and ecology might be less efficient but very robust. The most useful criterion for evaluating architectures is this: “Does it make IT better at responding to its users’ needs?” Most enterprise architectures are not constructed with this goal in mind. Rather, they are constructed with the needs of the IT group in mind.

Dependencies Within a System
Systems should exhibit loose clustering. In a loose cluster, the loss of an individual is no more significant to the larger entity than the loss of a single tree in a forest. The members of a loose cluster can be brought up or down independently of each other. There should be no time-ordering requirements for the activation of the members of the cluster. The members of one cluster or tier should have no specific dependencies—or knowledge of—the individual members of another tier. The dependencies should be on a virtual IP address or service name that represents the cluster as a whole. Direct member-to-member dependencies create hard linkages that prevent the endpoints from changing independently. The members of a cluster should never need to know the identities of every other member in the cluster. Broadcast notifications, such as cache invalidation messages, should go through a publish/subscribe topic or command queue.

Dependencies Between Systems: Protocols
No matter the protocol, both ends of the interface must both speak and understand the same language. Sooner or later, the language will inevitably need to change. Using protocol versioning can help. For a time, a system can speak multiple versions of the protocol in order to give collaborating systems time to migrate to the newer version. Use your Test Harness to ensure multi-version compatibility. Placing version information in file formats is also a good idea.
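Here is a minimal Java sketch of a receiver that speaks two protocol versions at once. The wire format ("<version>|<payload>"), the class name and the replies are all my own assumptions for illustration, not something the book specifies.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical wire format: "<version>|<payload>", for example "2|ping".
final class VersionedReceiver {
    // One handler per supported protocol version.
    private final Map<Integer, Function<String, String>> handlers = new HashMap<>();

    VersionedReceiver() {
        // Old and new versions are served side by side while senders migrate.
        handlers.put(1, payload -> "ACK " + payload);
        handlers.put(2, payload -> "ACK v2 " + payload);
    }

    String handle(String message) {
        int separator = message.indexOf('|');
        if (separator < 0) {
            throw new IllegalArgumentException("missing version identifier: " + message);
        }
        // A non-numeric version throws NumberFormatException, which is an IllegalArgumentException.
        int version = Integer.parseInt(message.substring(0, separator));
        Function<String, String> handler = handlers.get(version);
        if (handler == null) {
            throw new IllegalArgumentException("unsupported protocol version: " + version);
        }
        return handler.apply(message.substring(separator + 1));
    }
}
```

Dropping version 1 later is just a matter of removing its handler once every sender has moved on.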

Dependencies Between Systems: Databases
Do not use databases just for integrating between systems. Use a higher level abstraction, such as a web service, instead. Using a db for integration violates encapsulation and is too highly coupled. Most use cases do not require up-to-the-second accurate data so using an hour-old snapshot might be ok. Use ETL tools to pull data out of production so that reports can be done without impacting production.

Wednesday, May 19, 2010

Release It! - 18.2 Adaptable Software Design

Dependency Injection
Encourages loose coupling and makes testing easier. Using interfaces is key so that things can be easily swapped out.
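A small sketch of what that looks like in plain Java, without any container. The Notifier and OrderService names are made up for the example; the point is only that the service sees an interface and the caller decides which implementation to hand in.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; none of this comes from the book.
interface Notifier {
    void send(String message);
}

// Stand-in for a production implementation (prints instead of emailing).
final class ConsoleNotifier implements Notifier {
    @Override
    public void send(String message) {
        System.out.println("NOTIFY: " + message);
    }
}

// Test double that records messages instead of delivering them.
final class RecordingNotifier implements Notifier {
    final List<String> sent = new ArrayList<>();

    @Override
    public void send(String message) {
        sent.add(message);
    }
}

// The service depends only on the interface; the caller injects the implementation.
final class OrderService {
    private final Notifier notifier;

    OrderService(Notifier notifier) {
        this.notifier = notifier;
    }

    void placeOrder(String orderId) {
        notifier.send("order placed: " + orderId);
    }
}
```

In a test you pass a RecordingNotifier and inspect its list; in production you pass the real thing. A framework like Spring does the same wiring from configuration.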

Object Design
Strive for loose coupling and tight cohesion. Relying on another object's behavior is coupling. If an object is cohesive, then its public methods touch much of its state. If subsets of methods only touch a subset of state, then maybe another object is hiding inside and the object is not considered cohesive. Coupling affects adaption more than cohesion. Try to avoid APIs that require too much external context -- accept the minimum information required to get the job done and the object is more likely to be adaptable in the future.

XP Coding Practices

refactoring
unit testing

Agile Databases
Database schemas will change. How can you make those changes as painlessly as possible? Consider having a table that indicates the current revision of the schema and have the system check that revision at startup to make sure both the object side and the data side of the house are in agreement. Fail Fast and refuse to start up if they aren't. Bump up the revision number even if the schema itself did not change but the interpretation of the data did.
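The startup check could be as small as the sketch below. The class name and the revision number are assumptions of mine, and I pass the database's revision in as a plain int so the example stays self-contained; a real system would read it from the revision table first.

```java
// Hypothetical sketch: the revision would normally be read from a one-row table.
final class SchemaRevisionCheck {
    // Revision the compiled code expects; bump it with every schema or interpretation change.
    static final int EXPECTED_REVISION = 7;

    private SchemaRevisionCheck() {
    }

    // Throws if code and database disagree, so the application refuses to start (Fail Fast).
    static void verify(int revisionInDatabase) {
        if (revisionInDatabase != EXPECTED_REVISION) {
            throw new IllegalStateException(
                "Schema revision mismatch: code expects " + EXPECTED_REVISION
                + " but database reports " + revisionInDatabase);
        }
    }
}
```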

Release It! - 18.1 Adaptation Over Time

Any action to change the system has a cost: design, development, and testing effort, plus the cost of release. If the cost of making these changes exceeds the value returned by filling a gap or removing a bump, then the rational choice is to not make the change.

Monday, May 17, 2010

Release It! - 17.6 Standards, De Jure and De Facto

SNMP
MIBs are tough to write and tougher to get operations to import. Could try to use the Java JMX-to-SNMP connector.

CIM
Replaces SNMP but not widely supported.

JMX
Great way for Java applications to provide visibility. Spring makes it easy to expose POJOs as MBeans. JMX makes scripting possible, which is great for the operations folk. Scriptable administration is gold to an operations person.
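For reference, here is a rough sketch of exposing a counter through plain JMX using only the JDK (no Spring). The class names, the attribute and the ObjectName string passed in by the caller are all invented for the example.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

// Hypothetical example; the class, attribute and ObjectName are made up for illustration.
final class TrafficStatsJmx {
    // Management interface: each getter becomes a read-only JMX attribute.
    public interface TrafficStats {
        long getPageRequests();
    }

    static final class Counter implements TrafficStats {
        private final AtomicLong pageRequests = new AtomicLong();

        void recordPageRequest() {
            pageRequests.incrementAndGet();
        }

        @Override
        public long getPageRequests() {
            return pageRequests.get();
        }
    }

    // Registers the counter with the platform MBeanServer, replacing any earlier registration.
    static ObjectName register(Counter counter, String name) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName objectName = new ObjectName(name);
            if (server.isRegistered(objectName)) {
                server.unregisterMBean(objectName);
            }
            server.registerMBean(new StandardMBean(counter, TrafficStats.class), objectName);
            return objectName;
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads an attribute the way a JMX client or script would.
    static Object read(ObjectName objectName, String attribute) {
        try {
            return ManagementFactory.getPlatformMBeanServer().getAttribute(objectName, attribute);
        } catch (JMException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Once registered, the PageRequests attribute should show up in a JMX console such as JConsole under whatever ObjectName you chose.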

What to Expose
Ideally, you should expose every state variable in the application but that isn't practical. Try these for starters:

traffic indicators - Page requests total, page requests, transaction counts, concurrent sessions
resource pool health - Enabled state, total resources, resources checked out, high-water mark, number of resources created, number of resources destroyed, number of times checked out, number of threads blocked waiting for a resource, number of times a thread has blocked waiting.
database connection health - Number of SQLExceptions thrown, number of queries, average response time to queries
integration point health - State of circuit breaker, number of timeouts, number of requests, average response time, number of good responses, number of network errors, number of protocol errors, number of application errors, actual IP address of the remote endpoint, current number of concurrent requests, concurrent request high-water mark
cache health - Items in cache, memory used by cache, cache hit rate, items flushed by garbage collector, configured upper limit, time spent creating items
All counters have a time component, such as "within the last 10 minutes".
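One way to give a counter that time component is a sliding window, sketched below. This is my own guess at an implementation, not the book's; the class name is hypothetical and the caller supplies the clock value so the behavior is easy to test.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a counter that answers "how many within the last N milliseconds".
final class WindowedCounter {
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    WindowedCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Records one event at the given time; callers are assumed to pass non-decreasing times.
    synchronized void record(long nowMillis) {
        timestamps.addLast(nowMillis);
    }

    // Drops events that have aged out of the window, then returns how many remain.
    synchronized int countAt(long nowMillis) {
        long cutoff = nowMillis - windowMillis;
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= cutoff) {
            timestamps.removeFirst();
        }
        return timestamps.size();
    }
}
```

Keeping every timestamp is fine for modest traffic; a busy system would probably bucket events per second or per minute instead.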

JMX and SNMP Together
You can bridge the JMX and SNMP worlds and AdventNet appears to be a leader in this area. SNMP is tree based while JMX is object based, but it is possible to make them work together.

Operations Database
An Operations Database accumulates status and metrics from all the servers, applications, batch jobs, and feeds that make up the extended system. The OpsDB contains the data you will need to look for correlations and for capacity planning. You can see what "normal" looks like for your system. A suggested OpsDB object model is presented:

Feature - unit of business significant functionality -- same features mentioned in the SLA
Node - one of the "pieces" that comprise a feature, such as a web server, firewall, database, etc. Don't model everything, just what you think is relevant.
Observation Type - name and sub-type of the Observation, useful in reporting.
Observation - a single data point obtained from a Node.
Measurement - a performance statistic from a Node, such as resource pool high water mark, available memory, etc. It is a type of Observation.
Event - a type of Observation.
Status - an important state change, such as a Circuit Breaker going from closed to open. It is a type of Observation.

Feeding the Database
Provide an API into the OpsDB but keep it simple. It is not an important system component so don't stress your system by waiting around for an underperforming OpsDB. Adjust any of your system's scripts or batch files so that they write to the OpsDB. You can also write a JMX MBean to feed the db with data samples.

Using the Operations Database
In this section, the book adds new objects to the model.

ExpectationType - paired with an ObservationType. It defines the characteristics of an Expectation.
Expectation - an allowed range, time frame, acceptable status, deadline, etc. Anything that is "normal" and expected for a particular metric. A violation will trigger an alert. You can use historical data to fine tune these values.
Make sure to keep the OpsDB in top shape. Letting cruft build up will slow it down and stress your system.
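To make the Observation and Expectation pairing concrete, here is a rough Java sketch. The book describes the model, not this code, so the field names and the simple min/max range are my own assumptions.

```java
// Hypothetical shapes for two of the OpsDB model objects.
final class Observation {
    final String node;   // which Node produced the data point
    final String type;   // the ObservationType name
    final double value;  // the measured value

    Observation(String node, String type, double value) {
        this.node = node;
        this.type = type;
        this.value = value;
    }
}

// An Expectation here is just an allowed range for one observation type.
final class Expectation {
    final String observationType;
    final double min;
    final double max;

    Expectation(String observationType, double min, double max) {
        this.observationType = observationType;
        this.min = min;
        this.max = max;
    }

    // True when the observation is of this type and falls outside the allowed range.
    boolean isViolatedBy(Observation observation) {
        return observationType.equals(observation.type)
            && (observation.value < min || observation.value > max);
    }
}
```

A violation would be the trigger for an alert, and the min and max values are what you would tune from historical data.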

Supporting Processes
You need an effective feedback mechanism or collecting and reporting this data is a waste of money.

Keys to Observation:

Every week, review the past week’s problem tickets. Look for recurring problems and those that consume the most time. Look for particular subsystems that cause a lot of problems or a development team (if there is more than one). Look for problems related to a particular third party or integration point.
Every month, look at the total volume of problems. Consider the distribution of problem types. The overall trend should be a decrease in severity as serious problems are corrected. There should also be an overall decrease in volume. (There will be a sawtooth pattern as new code releases introduce new problems.)
Either daily or weekly, look for exceptions and stack traces in log files. Correlate these to find the most common sources of exceptions. Consider whether these indicate serious problems or just gaps in the code’s error handling.
Review help desk calls for common issues. They can point toward user interface improvements as well as places the system needs to be more robust.
If there are too many tickets and help desk calls to review thoroughly, look for the top categories. Also sample tickets randomly to find the things that make you go “hmmm.”
Every four to six months, recheck that old correlations still hold true.
At least monthly, look at data volumes and query statistics.
Check the database server for the most expensive queries. Have the query plans changed for any of these? Has a new query hit the most expensive list? Either of these changes could indicate an accumulation of data somewhere. Do any of the most common queries cause a table scan? That probably indicates a missing index.
Look at the daily and weekly envelope of demand (driving variables) and system metrics. Are traffic patterns changing? If you suddenly see that a popular time is dropping in popularity, it probably indicates that the system is too slow at those times. Is there a plateau in the driving variables? That indicates some limiting factor, probably responsiveness of the system.
If a metric stops being useful, stop tracking it. The system keeps changing and so must your view into it.

For each metric being reviewed, consider each of the following. How does it compare to the historical norms? (This is easy if the OpsDB has enough data to start forming expectations.) If the metric continues its recent trend, what happens to other correlated metrics? How long could the trend continue -- what limiting factor will kick in? What will result from that limiting factor?

I wonder if somebody has already written the code to implement this model?

Sunday, May 16, 2010

Release It! - 17.5 Monitoring Systems

A Monitoring System is some entity outside the process itself that watches it: a black-box tool monitoring the health and well-being of the application and its host. In almost every case, the selection of a monitoring system will be done for you. The monitoring system becomes part of the environment for which you design. Try to decouple your system from the monitoring API as much as possible. I haven't had any practical experience using a monitoring system but I've got Hyperic HQ on my todo list.

Saturday, May 15, 2010

Release It! - 17.4 Logging

If you want to avoid tight coupling to a particular monitoring tool or framework, then log files are the way to go. Nothing is more loosely coupled than log files; every framework or tool that exists can scrape log files. Log files, however, are badly abused.

Configuration
Make the location of your log files configurable. Operations is going to want to specify where they live.

Logging Levels
Logging should be targeted to operations, not development -- they'll spend way more time with them than you will. Anything WARN or above should be seen by operations and warrant a phone call or at least some poking around into the system's current status. ERROR should be reserved for really bad stuff, such as Circuit Breaker tripping. An NPE might not be worthy of an ERROR level message, depending on the context in which it was thrown.

Catalog of Messages
Externalize your log messages into a properties file and abstract your logging sub-system to make use of it. This lets you catalog every error that operations might see and lets them look each one up in a knowledge base.
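A bare-bones version of such a catalog might look like the sketch below. The class name and the CB-001 code are invented, and the properties text is passed in as a string only to keep the example self-contained; normally it would live in a .properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.Properties;

// Hypothetical message catalog keyed by error code.
final class MessageCatalog {
    private final Properties messages = new Properties();

    // In a real system the text would come from a .properties file on the classpath.
    MessageCatalog(String propertiesText) {
        try {
            messages.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns "<code> <formatted message>", or flags the code if it is not in the catalog.
    String render(String code, Object... args) {
        String pattern = messages.getProperty(code);
        if (pattern == null) {
            return code + " (uncataloged message)";
        }
        return code + " " + MessageFormat.format(pattern, args);
    }
}
```

Because every message carries its code, operations can search the knowledge base for the code rather than guessing from the wording.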

Human Factors
Log files are for humans, not machines. Humans are good at visual pattern matching so make the format of your logs uniform and simple so that operations can quickly spot something important and act on it. Time stamp, error code, message level, component and message details are all useful things to present. Messages should include an identifier that can be used to trace the steps of a transaction. This might be a user’s ID, a session ID, a transaction ID, or even an arbitrary number assigned when the request comes in. When it’s time to read 10,000 lines of a log file (after an outage, for example), having a string to grep will save tons of time. Interesting state transitions should be logged, even if you plan to use SNMP traps or JMX notifications to inform monitoring about them. Logging the state transitions takes a few seconds of additional coding, but it leaves options open downstream. Besides, the record of state transitions will be important during post-mortem investigations.
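As a sketch of a uniform line layout carrying the fields listed above, something like the following could work. The field order, the txn= prefix and the class name are my own choices, not a format the book prescribes.

```java
import java.time.Instant;

// Hypothetical formatter producing one fixed layout for every log line.
final class OpsLogLine {
    private OpsLogLine() {
    }

    // Layout: timestamp, level padded to 5 characters, error code, [component], txn=<id>, details.
    static String format(Instant timestamp, String level, String errorCode,
                         String component, String transactionId, String details) {
        return String.format("%s %-5s %s [%s] txn=%s %s",
            timestamp, level, errorCode, component, transactionId, details);
    }
}
```

With the transaction identifier always in the same place, grepping for txn=abc123 pulls out every step of one request.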

Logging is a place where I've struggled over the years. What is too much? What is too little? Is it on by default or only when things start acting weird? I think this advice is sound, specifically the part about only logging what is intended for operations. I've been experimenting with a logging implementation that tries to enforce these recommendations and I'm liking what I've seen thus far.

Friday, May 14, 2010

Release It! - 17.3 Enabling Technologies

A black-box technology sits outside the process, examining it through externally observable things. Black-box technologies can be implemented after the system is delivered, usually by operations. White-box technology runs inside the thing being observed, either a process or a whole system. The system deliberately exposes itself through these tools. These must be integrated during development. White-box technologies necessarily have tighter coupling to the system than black-box technologies. JMX is an example of a white-box technology.

Thursday, May 13, 2010

Release It! - 17.2 Designing for Transparency

Transparency arises from deliberate design and architecture. “Adding transparency” late in development is about as effective as “adding quality.” Visibility inside one application or server is not enough. Strictly local visibility leads to strictly local optimization. Visibility into one application at a time can also mask problems with scaling effects. In designing for transparency, keep a close eye on coupling. It’s relatively easy for the monitoring framework to intrude on the internals of the system. The monitoring and reporting systems should be like an exoskeleton built around your system, not woven into it. In particular, decisions about what metrics should trigger alerts, where to set the thresholds, and how to “roll up” state variables into an overall system health status should all be left outside of the application itself. These are policy decisions that will change at a very different rate than the application code itself will.

Wednesday, May 12, 2010

Release It! - 17.1 Perspectives

Transparency refers to the qualities that allow operators, developers, and business sponsors to gain understanding of the system’s historical trends, present conditions, instantaneous state, and future projections. Transparent systems communicate, and in communicating, they train their attendant humans. Without transparency, the system will drift into decay, functioning a bit worse with each release. Systems can mature well if, and only if, they have some degree of transparency.

Historical Trending
Historical records have to be stored somewhere for a period of time. The historical perspective is best served by a database: the OpsDB. The OpsDB can be used to investigate anomalies or trends. Because it contains system- and business-level metrics, it can be used to identify correlations in time and across layers. Because it can be used to discover new and interesting relationships, the historical data should be broadly available through tools such as Microsoft Access and Microsoft Excel.

Predicting the Future
Good predictive models are expensive to build. It’s possible to develop “good enough” models by finding correlations in past data, which can then be used—within a certain domain of applicability—to make predictions. These correlative models can be built into spreadsheets to allow less technical users to perform “what if” scenarios. Remember, an application release can alter or invalidate the correlations on which the projections are built.

Present Status
“Present status” describes the overall state of the system. This is not so much about what it is doing as what it has done. This should include the state of each piece of hardware and every application server, application, and batch job. Events are point-in-time occurrences. Some indicate normal, or even required, occurrences, while others indicate abnormalities of concern. Parameters are continuous metrics or discrete states that can be observed about the system. This is where transparency is most vital. Applications that reveal more of their internal state provide more accurate, actionable parameters. For continuous metrics, a handy rule-of-thumb definition for nominal would be “the mean value for this time period plus or minus two standard deviations.”
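
The rule of thumb for "nominal" is easy to compute. This is a minimal sketch using the sample standard deviation from Python's `statistics` module; the function names and the sample data are made up for illustration.

```python
from statistics import mean, stdev


def nominal_band(samples, width=2.0):
    """Return (low, high): the mean plus or minus `width` standard deviations."""
    center = mean(samples)
    spread = stdev(samples)  # sample standard deviation
    return center - width * spread, center + width * spread


def is_nominal(value, samples, width=2.0):
    """True when `value` falls inside the nominal band for `samples`."""
    low, high = nominal_band(samples, width)
    return low <= value <= high


# Hypothetical response times (ms) seen during this time period on earlier days.
history = [100, 102, 98, 101, 99]
low, high = nominal_band(history)
```

For this sample data the band is roughly 96.8 to 103.2, so a reading of 103 counts as nominal and a reading of 110 does not.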

The Infamous Dashboard
The present status of the system is obviously amenable to a dashboard presentation. (It practically defines a dashboard.) The dashboard should be broadly visible; projecting it on a wall in the lunchroom isn’t out of the question. The more people who understand the normal daily behavior of the system, the better. Most systems have a daily rhythm of expected events. Their execution falls in the category of “required expected events.” The dashboard should be able to represent those expected events, whether or not they’ve occurred, and whether they succeeded or not.

Green - all must be true:
* All expected events have occurred.
* No abnormal events have occurred.
* All metrics are nominal.
* All states are fully operational.

Yellow - at least one must be true:
* An expected event has not occurred.
* At least one abnormal event, with a medium severity, has occurred.
* One or more parameters is above or below nominal.
* A noncritical state is not fully operational. (For example, a circuit breaker has cut off a noncritical feature.)

Red - at least one is true:
* A required event has not occurred.
* At least one abnormal event, with high severity, has occurred.
* One or more parameters is far above or below nominal.
* A critical state is not at its expected value. (For example, “accepting requests” is false when it should be true.)
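
The three rule sets above can be rolled up into a single color. The sketch below is one possible encoding; the `Status` fields and the level names are hypothetical. In keeping with section 17.2, logic like this belongs in the monitoring layer rather than inside the application.

```python
from dataclasses import dataclass, field


@dataclass
class Status:
    """Hypothetical snapshot of what the dashboard knows about the system."""
    missing_expected_events: int = 0   # expected, but not required
    missing_required_events: int = 0
    abnormal_severities: list = field(default_factory=list)  # "medium" or "high"
    parameter_levels: list = field(default_factory=list)     # "nominal", "off", "far_off"
    noncritical_states_down: int = 0
    critical_states_down: int = 0


def roll_up(s):
    """Return "red", "yellow", or "green" using the rules listed above."""
    # Red: any single red condition is enough.
    if (s.missing_required_events
            or "high" in s.abnormal_severities
            or "far_off" in s.parameter_levels
            or s.critical_states_down):
        return "red"
    # Yellow: green requires no abnormal events at all, so any remaining
    # abnormal event, off-nominal parameter, or degraded state is yellow.
    if (s.missing_expected_events
            or s.abnormal_severities
            or any(level != "nominal" for level in s.parameter_levels)
            or s.noncritical_states_down):
        return "yellow"
    return "green"
```

Checking red first matters: a system with both a yellow and a red condition should show red.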

Instantaneous Behavior
Instantaneous behavior is the realm of monitoring systems. This is also the realm of thread dumps. Frameworks such as JMX also enable a view into instantaneous behavior, because they allow administrators to view the internals of a running application.

Tuesday, May 11, 2010

Release It! - 16 Case Study: Phenomenal Cosmic Powers, Itty-Bitty Living Space

Long story short: web site goes dark on Black Friday because a downstream integration can't handle the load. Using Perl scripts, they were able to script the resizing of resource pools to get the system back online, albeit not at full throughput. The moral of the story seems to be this: "The ability to restart components, instead of entire servers, is a key concept of recovery-oriented computing. Although we did not have the level of automation that ROC proposes, we were able to recover service without rebooting the world. If we had needed to change the configuration files and restart all the servers, it would have taken more than six hours under that level of load. Dynamically reconfiguring and restarting just the connection pool took less than five minutes (once we knew what to do)."

Monday, May 10, 2010

How can I install Lucid Lynx (Ubuntu 10.04) on my HP EliteBook 8440p?

I finally figured out how to install Lucid on my work laptop, an HP EliteBook 8440p. The primary issue is that the nVidia NVS 3100M video card doesn't play well with Lucid out of the box. You have to take a few extra steps to get it working. I followed the steps found on Linux Laptop Wiki and summarize them here.

use the alternate install disk to install Lucid using the text installer
after installation, hit shift during boot to get to the grub screen
change 'quiet splash' to 'nosplash nomodeset' on the line that starts with 'linux'
ctrl-x to continue booting
log into your account and activate the hardware drivers
edit /etc/default/grub so that GRUB_CMDLINE_LINUX_DEFAULT is set to 'nosplash nomodeset'
sudo update-grub

Reboot and you should be all set.

Release It! - 15 Design Summary

This chapter recaps the information that has been presented thus far:

"It can be hard to draw attention to these topics during the hustle and rush of a development project, especially once crunch mode begins. There’s good and bad news here; you can choose not to deal with these issues during development. If so, you will deal with them in production...time and time again. Dealing with these issues in development does not necessarily cost much, in time or effort, and what it does cost is far outweighed by the long-term cost of ignoring them.

Remember that your application will run on a server with multiple network interfaces. Be sure it binds to the correct address for any sockets it listens to, and be sure that any special routing requirements are set up and documented. Administrative functions should be exposed on the administration and monitoring network, not the production network.

Be sure to use virtual IP addresses to access clustered services, such as database servers or web services provided by other systems. Using the VIP allows the service provider to fail over—whether planned or unplanned—without necessitating the reconfiguration of your system. Applications should be able to run as application users; they should not require root or Administrator permissions. Sensitive configuration parameters, such as database passwords or encryption keys, should be kept in their own configuration files.

Not every system requires five nines of availability. The cost of greater availability increases radically at each level. Considering the availability requirements as a cost/benefit trade-off (well, a cost/cost trade-off, really) with the sponsors helps move the discussion forward. Rather than defining the availability of the entire system as a whole, I prefer to define the availability of specific features or functions performed by the system. Be sure to write exclusions for loss of availability caused by external systems.

Load balancing and clustering are two prerequisites for high availability. You can employ a variety of techniques, with a wide range of costs. Armed with your availability requirements, you can apply various load-balancing and clustering solutions as needed to meet the requirements at efficient cost. Each of these solutions has its own unique set of considerations, so defining the high-availability architecture early makes development and deployment much easier.

Your application’s administrators will never know as much about its internals as you will. You can help reduce the likelihood of operator error by making your application obvious to configure. This means separating essential plumbing, such as Spring’s beans.xml files, from environment-specific configuration. Mixing them is the equivalent of putting the ejection seat button next to the radio tuner. Sooner or later, something bad will happen.

Spend some time making your application simple to operate. Start-up and shutdown should be nondisruptive to users, and any administration duty must be scriptable. Pretty Java desktop administration GUIs help the novice learn his way around, but nobody wants to click through the pretty GUI for the thousandth time."

Sunday, May 9, 2010

Release It! - 14.3 Start-up and Shutdown

Try to get the system to verify that every resource it needs is available before opening the doors for business. There should be a way to communicate to operations if there is a problem. A shutdown process should disallow new transactions but complete the ones already being worked on. Use Timeouts or the application may never shut down. JavaEE servers do some of this but you'll probably have to fill in the missing pieces with your own code.
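
A rough sketch of both halves, start-up verification and a draining shutdown with a timeout. The `Service` class and its method names are invented for illustration; a real server would hook these into its own lifecycle callbacks.

```python
import threading


def verify_resources(checks):
    """Run each named check; return the names of the resources that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    return failures


class Service:
    def __init__(self):
        self.accepting = False
        self._in_flight = 0
        self._idle = threading.Condition()

    def start(self, checks):
        failures = verify_resources(checks)
        if failures:
            # Tell operations exactly what is missing instead of half-starting.
            raise RuntimeError("startup aborted, unavailable: " + ", ".join(failures))
        self.accepting = True

    def begin(self):
        """Admit a transaction, or refuse it if we are not accepting work."""
        with self._idle:
            if not self.accepting:
                return False
            self._in_flight += 1
            return True

    def end(self):
        """Mark one transaction as finished."""
        with self._idle:
            self._in_flight -= 1
            self._idle.notify_all()

    def shutdown(self, timeout):
        """Stop admitting work, then wait up to `timeout` seconds to drain."""
        with self._idle:
            self.accepting = False
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout)
```

`shutdown` returns False when the timeout expires with work still in flight, which is the caller's cue to force the exit rather than hang forever.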

Saturday, May 8, 2010

Release It! - 14.2 Configuration Files

Property files suffer from hidden linkages and high complexity, two of the biggest factors leading to operator error. A common error in designing a configuration scheme is mixing production configuration with basic plumbing. Whenever possible, keep production configuration properties separate from the basic wiring and plumbing of the application. They should be in separate files so the administrators do not accidentally edit internals. The production configuration files should not be anywhere underneath the installation directory of the software itself because the installation directory is likely to be overwritten on the next upgrade. Using a version control system in operations to manage configuration files is a good idea. If you have configuration settings that are shared between machines, then keep those separate from those that are specific to a single machine. You want operators to immediately know what the scope of a setting is. It might be smart to have tools that verify that files that should be synchronized in a horizontally scaled deployment are actually in sync -- trust but verify. Consider configuration properties to be part of the operator's user interface, so select clear and precise names. Try naming the property by function, such as authentication_server.
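
The "trust but verify" check can be as simple as comparing content hashes across hosts. In this sketch the file contents are passed in as strings; how they are fetched from each machine is left out, and the host names and property values are hypothetical.

```python
import hashlib
from collections import Counter


def fingerprint(text):
    """Hash the file content so hosts can be compared without diffing."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def out_of_sync(contents_by_host):
    """Return the hosts whose copy differs from the most common version."""
    if not contents_by_host:
        return []
    digests = {host: fingerprint(text) for host, text in contents_by_host.items()}
    majority, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(host for host, digest in digests.items() if digest != majority)


# Hypothetical shared configuration as read from three pool members.
shared = "authentication_server=auth.example.internal\n"
hosts = {
    "app1": shared,
    "app2": shared,
    "app3": "authentication_server=auth-old.example.internal\n",
}
stale = out_of_sync(hosts)
```

Treating the most common version as correct is itself an assumption; comparing against the copy held in version control would be stricter.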

Friday, May 7, 2010

Release It! - 14.1 Does QA Match Production?

If your system is easy to administer, it will have good uptime. What’s more, you’ll find it easy to get help and resources from operations. On the other hand, if your system is difficult or annoying to administer, it will be neglected, deprecated, and probably implemented incorrectly. It might even get sabotaged.

It is usually impractical to have QA's environment match production's exactly -- otherwise it would be production. Typically, the cause of a testing failure is a mismatch in topology between QA and production. Topology is the number and connectivity of the servers and applications. If you consider each server and application instance to be a node and each connection or dependency to be an arc, you can define a graph that represents the system topology. It usually costs too much to make QA's topology match production's.
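
Treating the topology as a graph makes the mismatch something you can compute. This is a small sketch in which each arc is a (source, target) pair; the node names are hypothetical.

```python
def missing_arcs(production, qa):
    """Arcs (source, target) that exist in production but not in QA."""
    return sorted(set(production) - set(qa))


# Hypothetical topologies, one arc per connection or dependency.
production = [
    ("client", "firewall"),
    ("firewall", "load_balancer"),
    ("load_balancer", "app"),
    ("app", "database"),
]
qa = [
    ("client", "app"),
    ("app", "database"),
]
untested = missing_arcs(production, qa)
```

Every arc in `untested` is a connection that production traffic crosses but QA traffic never does.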

Keep Them Separated
Avoid sharing hosts between applications. Certain classes of bugs get hidden when applications run co-located on the same server. Use a virtualization solution to give each application its own host. It is a cheap way to emulate the production topology.

Zero, One, Many
If you are going to run a dozen instances in production, you probably don’t need to run a full dozen in QA. You should definitely run more than one, however. Virtualization is, again, a great tool here.

Just Buy the Gear
Hours of downtime caused by firewalls and load balancers that didn't exist in QA cost more than purchasing that gear for QA. Don't be penny wise and pound foolish.

Thursday, May 6, 2010

Release It! - Chapter 13.4 Clustering

Load balancing does not require collaboration between the separate servers. When the servers are aware of each other and actively participate in distributing load, then they form a cluster. Clusters can be used for load balancing, in the case of active/active clusters. They can also be used for redundancy in the case of failure. These are called active/passive clusters, meaning that one server handles all the load until it fails and then the passive one takes over and becomes active. There is overhead in the implementation of clustering so the scaling factor is not linear.

For applications that do not have their own native clustering, it is possible to run them under the control of a cluster server, such as Veritas Cluster Server. The author has an interesting opinion on cluster servers: "I am ambivalent about cluster servers. At the same time, they are marvelous and kludgy. They can add redundancy and failover to applications that weren’t designed for it. Configuring the cluster server itself is finicky, though, and applications usually have a few small glitches when failing over. The biggest drawback is probably that these run in active/passive mode. So, redundancy is achieved, but scalability is not. I consider cluster servers a Band-Aid for applications that don’t do it themselves." Cluster servers are also expensive. I've used clustering solutions baked into JavaEE servers, but you need to test to see exactly how the failover works.

Wednesday, May 5, 2010

Release It! - 13.3 Load Balancing

Scaling Horizontally involves balancing the load amongst a pool of identically configured servers. There are multiple load balancing techniques that can be used.

DNS Round-Robin
This technique simply associates several IP addresses with the service name. Each IP address points to one of the servers in the pool. Over time, DNS should have "pointed" the clients evenly across the servers in the pool. This technique usually implies that the servers are directly reachable from the client -- no hiding behind firewalls. Once a client connects to a server, there is no way to migrate it to another server. Also, there is no guarantee that the load is evenly spread across the cluster because each client's processing requirement is different. All we can guarantee with this technique is that the number of connections has been evenly distributed. DNS round-robin load balancing is inappropriate whenever the calling system is another long-running enterprise system. Anything built on Java will cache the first IP address received from DNS, guaranteeing that every future connection targets the same host and completely defeating load balancing. Ouch!

Reverse Proxy
A Reverse Proxy, such as Squid, routes each application request to one of the servers in the pool. The web and application servers in the pool need to be reconfigured to generate URLs in terms of the proxy and not the server itself, since the proxy is the gateway into the pool. You can also configure the proxy to be a cache for static web content. Since the proxy handles each request into the system, it can quickly become overtaxed and turn into a bottleneck. The commonly used proxies are not aware of the health of the individual servers and will happily route a request to a sick or down server. Not good.

Hardware Load Balancer
Hardware load balancers are specialized network devices that serve a similar role to the reverse proxy server. Because they operate closer to the network, hardware load balancers frequently provide better administration and redundancy features. The load balancer removes dead servers from its service pool, directing connections to the healthy ones. The big drawback to these machines is—of course—their price.

Tuesday, May 4, 2010

Release It! - Chapter 13.2 Documenting Availability Requirements

Vagueness in SLAs isn't good for anybody, especially you. Don't offer SLAs for the entire system; rather, offer SLAs for discrete pieces of functionality. Some are more important to the customer than others and are probably worth the expense of making them more available. Remember that SLA inversion says you can't offer an SLA any higher than the worst SLA of the systems you depend on. Availability needs to be defined, specifically how it will be measured. An automated program generating synthetic transactions is better than humans randomly clicking around a UI. Define the mechanisms used for checking availability and how the monitoring tools will report problems. Consider defining the following for each SLA:

How often will the monitoring device execute its synthetic transaction?
What is the maximum acceptable response time for each step of the transaction?
What response codes or text patterns indicate success?
What response codes or text patterns indicate failure?
How frequently should the synthetic transaction be executed?
From how many locations?
Where will the data be recorded?
What formula will be used to compute the percentage availability? Based on time or number of samples?

It might also make sense to see if the customer requires integration with their own monitoring tools.
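
The last question in the list, time-based versus sample-based availability, comes down to two different formulas. This sketch shows both; the function names are made up for illustration.

```python
def availability_by_samples(results):
    """Percentage of synthetic transactions that succeeded."""
    return 100.0 * sum(1 for ok in results if ok) / len(results)


def availability_by_time(downtime_minutes, period_minutes):
    """Percentage of the reporting period during which the feature was up."""
    return 100.0 * (period_minutes - downtime_minutes) / period_minutes


# Hypothetical month: 100 probes with 2 failures, versus 432 minutes of
# measured downtime in a 30-day (43,200 minute) period.
probes = [True] * 98 + [False] * 2
by_samples = availability_by_samples(probes)
by_time = availability_by_time(432, 30 * 24 * 60)
```

The two numbers can disagree for the same month, which is why the SLA should say which formula applies.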

Monday, May 3, 2010

Release It! - Chapter 13.1 Gathering Availability Requirements

The proper way to frame the availability decision is in straightforward financial terms: actual cost vs. avoided losses. If 99% uptime results in roughly 500 minutes of downtime per month, how much lost revenue does that translate into? If your system earns $1000 an hour, then you could lose about $8000 per month to downtime. Given that math, does it make sense to pay for making the system 99.50% available? In short, do the math or you might be paying for something that doesn't give enough in return. Hopefully somebody in the company can tell you what the system earns per hour; otherwise the calculation is difficult.
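
The arithmetic is worth writing down. This sketch assumes a 30-day month, and the function names are made up for illustration. With that assumption the exact figures come out a little lower than the rounded ones quoted above: 432 minutes and $7200 at 99%.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day month


def downtime_minutes(availability_pct, period_minutes=MINUTES_PER_MONTH):
    """Minutes of allowed downtime per period at a given availability."""
    return period_minutes * (100.0 - availability_pct) / 100.0


def lost_revenue(availability_pct, revenue_per_hour):
    """Revenue at risk per month if all allowed downtime is used."""
    return downtime_minutes(availability_pct) / 60.0 * revenue_per_hour


at_99 = lost_revenue(99.0, 1000)
at_99_5 = lost_revenue(99.5, 1000)
worth_paying = at_99 - at_99_5  # upper bound on what the upgrade is worth per month
```

Under these assumptions, moving from 99% to 99.5% avoids at most $3600 a month in losses, so paying more than that for the extra availability does not pay off.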

Sunday, May 2, 2010

Release It! - Chapter 12.2 Configured Passwords

Passwords are the Achilles heel of application security. Any password that grants access to a database with customer information is worth thousands of dollars to an attacker and could cost the company thousands in bad publicity or extortion. These passwords must be protected with the highest level of security achievable. At the absolute minimum, passwords to production databases should be kept separate from any other configuration files. They should especially be kept out of the installation directory for the software. Files containing passwords should be made readable only to the owner, which should be the application user. Password vaulting keeps passwords in encrypted files, which reduces the security problem to that of securing the single encryption key rather than securing multiple text files. Consider using Tripwire to monitor secured files. Keeping passwords safe is a pain, but I suppose it is better than letting the bad guys at your customers' data.
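
The "readable only to the owner" rule can be checked automatically, for example at start-up. This is a sketch for POSIX systems only; the `is_owner_only` helper is hypothetical, and the demo uses a temporary file in place of a real password file.

```python
import os
import stat
import tempfile


def is_owner_only(path):
    """True when group and others have no permissions on the file (POSIX)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demo with a throwaway file standing in for a password file.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
os.chmod(demo_path, 0o600)
locked_down = is_owner_only(demo_path)   # owner read/write only
os.chmod(demo_path, 0o644)
world_readable = not is_owner_only(demo_path)
os.remove(demo_path)
```

An application could refuse to start when its password file fails this check, which turns a silent misconfiguration into an obvious one.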

Saturday, May 1, 2010

Release It! - Chapter 12.1 The Principle of Least Privilege

The principle of “least privilege” mandates that a process should have the lowest level of privilege needed to accomplish its task -- which usually means running as a non-admin user. Using privileged accounts gives crackers a place to attack your systems. To further contain vulnerabilities, each major application should have its own user, such as "tomcat" for an Apache Tomcat process.
