Tilley's "Realistic Evaluation"
By Alan Crowe (Sat Mar 08, 2008 at 08:49:25 AM EST) evaluation, testing, bungee-boss, experts
TheophileEscargot posted a link to a PDF of a talk by Nick Tilley about realistic evaluation of public policy initiatives. My response is too long for a comment, so I am promoting it to a diary entry.

I hit worries on page one about piecemeal, small scale evaluation.

Gun control has a small scale, with nutters shooting dozens, and a large scale, with governments going bad and massacring their citizens in their millions. Evaluation had better not drop the latter issue on the floor.

Page 2 is also scary. It is traditional in education to evaluate a new teaching method, find it works, deploy it widely, and be surprised when it fails. This tradition is based on the originators of new methods being charismatic individuals who can teach really well. What works with average teachers in average classrooms across the land is a different matter.

Worse, incentives matter. One tries hard and makes it work to get through the evaluation. Then one slacks off and rot sets in. Methods that were originally deployed with sensitivity to their goals degenerate into mere formalism and stop working.

The world depicted on page 2 is a ruined world in which things don't work and cannot be fixed because they worked in the evaluation so that's all right then.

The domestic violence anecdote on pages 3 and 4 sucked. My maternal grandmother had to put up with my grandfather's violent temper. If he had been prosecuted and convicted he would no longer have been of good character and would have lost the army pension on which the family finances depended. Well, that is the story I've been told, though I have my doubts.

The wider point is that getting repeat call-out rates down is a rotten measure. When the cops arrest the perp, how is that understood? One possible understanding is that the cops will make trouble for the perp so that he loses his job and there is no money to buy food for the children. The woman takes the hint and doesn't bother the cops again.

The anecdote wasn't about this kind of problem, but naively accepted that if the police were not called violence had not re-occurred.

Page 7 bullet 2

Context: what conditions are needed for a measure to trigger mechanisms to produce particular outcome patterns?
What I'm seeing is a use of language to conceal the problems of the model.

The classic issue in welfare reform is that welfare programs both relieve poverty and create a system of incentives that mis-shape their recipients creating dis-satisfaction; the eventual outcome is some way from the good intentions that motivated the initiators of the welfare program.

So reforms change the welfare program. Typically the change directly addresses the perceived shortcomings of the existing system. While the reform may work initially, it also changes the system of incentives, setting up a dynamic that may spend a generation working itself out, and leading once again to an unhappy outcome.

Tilley's language fights against these insights. He talks of a mechanism that /triggers/ a change from one /regularity/ to another. Officialdom takes action. Next year it measures the new regularity and pronounces it good or bad.

Tilley's language is obscuring the idea that officialdom takes an action that changes society's trajectory, causing things to unfold through time in new ways.

Before chaos theory, few people questioned the comforting myth of stability. You measure your data x0, x1, x2, ... and if you can keep out external disturbances it will settle down to a fixed value: x123 = x124 = x125 = ... = x.

A sociologist like Tilley looks at societies. He sees both change and stability. He attributes the change to outside disturbance and views the stability as inherent.

In fact both change and stability are dynamic processes. What is at issue is the direction of the response to perturbation. If forces that arise in response to external perturbation are directed towards the existing attractor, they are restoring forces and we see stability. Other directions lead to trajectories, perhaps in surprising directions.

Worse still are limit cycles. Society can go round in circles. We talk about the swing of the political pendulum, projecting a circular trajectory in political space onto one dimension (perhaps left-to-right), and notice the cyclical swing, back and forth.
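To make the difference between settling down and going round in circles concrete, here is a minimal sketch using the logistic map. The map is my own choice of a standard textbook toy, not anything from Tilley's talk, and the parameter values 2.5 and 3.2, the starting value 0.1, and the function names are all just illustrative assumptions.

```python
def logistic_orbit(r, x0=0.1, steps=1000):
    """Iterate x -> r*x*(1-x) from x0 and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def tail_period(xs, max_period=8, tol=1e-6):
    """Smallest p for which the end of the trajectory repeats every p steps.

    Returns None if no period up to max_period fits within the tolerance.
    """
    for p in range(1, max_period + 1):
        if all(abs(xs[-1 - k] - xs[-1 - k - p]) < tol for k in range(p)):
            return p
    return None


settled = logistic_orbit(2.5)  # the comforting case: a single fixed value
cycling = logistic_orbit(3.2)  # same rule, stronger feedback: a two-step cycle

print("r=2.5 repeats with period", tail_period(settled))
print("r=3.2 repeats with period", tail_period(cycling))
```

With no outside disturbance at all, the first run settles to one value and the second alternates between two values for ever. Nothing here says society follows this particular rule; the point is only that "keep out disturbances and it settles" is a property of some dynamics, not of dynamics in general.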

To return to Tilley's domestic violence example, Tilley interprets the variation in results as being due to different contexts. However the variation can arise within a single context if domestic violence is a cyclical phenomenon within communities. He may well be seeing that your intervention has one result in one phase of the cycle and a different result in another phase.
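A toy calculation shows how the phase of a cycle can masquerade as a difference in context. Everything in it is a made-up assumption for illustration: a sinusoidal incident rate with a ten-year period, an intervention whose true effect is the same everywhere, and a naive comparison of the year before with the year after.

```python
import math

PERIOD = 10.0        # hypothetical length of the community's cycle, in years
TRUE_EFFECT = -5.0   # hypothetical effect of the intervention, identical everywhere


def baseline(t):
    """Incident rate in year t if nobody intervenes (invented numbers)."""
    return 100.0 + 20.0 * math.sin(2.0 * math.pi * t / PERIOD)


def measured_change(start):
    """Naive evaluation: the year after the intervention minus the year before."""
    before = baseline(start - 1)
    after = baseline(start + 1) + TRUE_EFFECT
    return after - before


on_the_upswing = measured_change(0)    # intervene while the rate is rising
on_the_downswing = measured_change(5)  # intervene while the rate is falling

print("measured change on the upswing:  ", round(on_the_upswing, 1))
print("measured change on the downswing:", round(on_the_downswing, 1))
```

The same intervention looks harmful in one community and very successful in the other, although both share one context and one true effect. An evaluator committed to regularities and triggers would go looking for a contextual difference that is not there.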

That is a rather abstract criticism. What makes it sting is that Tilley is eliminating this possibility linguistically rather than empirically. Talk of regularities and triggers blinds him to possibilities that must be looked for and seen or not seen.

Another concern with the domestic violence example is the crudeness of the intervention. Cops are called to domestic violence incidents and bring with them knowledge of rows within their own marriages, their parents' marriages, and those of friends and relations. The police response is likely to be reasonably sophisticated and solidly within the local cultural mainstream. It is the second part that is the problem. If the community has a problem with wife-beating, it is likely that the local police, who grew up in that community, wear the local blinkers. The problem is likely to be that the local police have a fairly sophisticated model of how to handle domestic violence, but one that is unwise rather than over-simple.

In come Tilley's friends with a surprisingly crude intervention: always arrest the perp. The first big problem is the fatuous crudity of the intervention. Officers will try to work around it to achieve good outcomes, by various means and with varying degrees of success. Tilley's example is of stirring up the mud and then being surprised by the turbidity. It doesn't help his case for emphasising context.

The second big problem is that the policy creates a weapon. Ordinarily the police are called to a domestic and exercise their judgement, for better or worse. (Do they do a poor job, or is domestic violence a hard problem?)

With automatic arrest the woman has a weapon to wield against her lover/oppressor. Ordinary people resist the bureaucratic weaponisation of everyday life. They seem to intuit that it will screw them over in the end. Yet incentives matter. Offer people a weapon, keep offering it, don't offer alternatives, eventually people will use it and be shaped by the use of it.

Tilley is proposing a framework of regularities and triggers that rejects such troubling dynamics a priori.

Page 9 discusses car parks, but skips the additional complications due to the importance of value for money. If the council has one hundred thousand pounds to spend on cutting car park crime, how should the council spend it? CCTV is in competition with other mechanisms, such as foot patrols and fencing to limit access. We are not asking how much protection a fixed number of cameras buys. We are asking how much protection a fixed expenditure on cameras buys. Administrative skill at controlling the cost to the public purse is an important factor.
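The difference between a fixed number of cameras and a fixed expenditure is easy to put in numbers. In this sketch only the £100,000 budget comes from the paragraph above; the unit costs, the crimes prevented per camera, and the assumption that protection scales linearly with camera count are all invented for illustration.

```python
BUDGET = 100_000  # pounds; the council's budget from the example above


def cameras_affordable(budget, unit_cost):
    """Whole cameras that the budget buys at a given installed cost each."""
    return budget // unit_cost


def crimes_prevented(budget, unit_cost, prevented_per_camera):
    """Crude linear model: every camera prevents the same number of crimes."""
    return cameras_affordable(budget, unit_cost) * prevented_per_camera


# Same technology, same effect per camera, different skill at controlling cost.
# All three figures below are hypothetical.
careful_council = crimes_prevented(BUDGET, unit_cost=4_000, prevented_per_camera=3)
careless_council = crimes_prevented(BUDGET, unit_cost=8_000, prevented_per_camera=3)

print("careful council prevents ", careful_council, "crimes")
print("careless council prevents", careless_council, "crimes")
```

An evaluation that reports only protection per camera would rate the two councils identically, which is exactly the complication that page 9 skips.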

Tilley's overview potters along; I've already finished page 9 without feeling that I've reached the point. However, at the bottom of page 10 I hit pay dirt, and I think it's worth copy-typing the paragraph:

With regard to formative evaluation I have been involved over the past 18 months with the Home Office Crime Reduction Programme. This involves the expenditure of £250m over three years, 10 per cent of which has been set aside for evaluation purposes. The programme is divided into a number of separate themes including domestic violence, burglary reduction, targeted policing, sentencing, schools work etc. Bidders have sent in proposals for initiatives they would like to have funded. A number of academics have then gone to look at these bids and to discuss them with those who have submitted them with a view to making suggestions for ways in which the bids might be refined. It has turned out that much of the time bidders have put in proposals which have not been thought through in realist terms. That is, they have identified a problem and then proposed a set of standard measures drawn from an orthodox repertoire, often with little consideration about how they are expected to work through in practise in circumstances in which the initiative is being introduced. The academics have found themselves involved in realist theory construction in relation to the bids that have been submitted. This has involved, in effect, critical discussion of the expected ways in which measure that might be introduced will produce their impact. It has also involved drawing bidders' attention to the findings of previous evaluation and other research about offending patterns. This experience has led me to think that there is a strong role to be played by realist evaluation in programme assessment and development. Some of this is anticipated in the 'theories of change' approach. The difference is that a realist caste would stress the need to attend specifically to the contexts and mechanisms for the particular programme.

Looking back to his account of his car park work, he asks how closed circuit television might affect rates of car park crime. The council is bidding for money for cameras and Tilley says that you need realists asking: how is this actually going to help?

One mechanism is that criminals might be caught in the act. So someone must be watching the pictures and call the police, who must turn up promptly. You can easily see that the devil is in the details. Will the council watchman have the police telephone number to hand? Will the police know which car park is which? The police will turn up while the council watchman sees the criminals walk away. Can he say to the police "they went that-a-way"? How? Phone? Radio?

The impression that Tilley gives is, well, how to put this, the bidders don't park their own cars in those car parks. They are not committed to the success of the scheme. So they go through the motions. Apply for money, spend money, apply for money, spend money,... How does it actually work? "Not my job gov, sorry."

Tilley's vision seems to be to legitimise this. There will be a separate caste of realists, who also don't park their cars in those car parks, but somehow manage to keep their eye on the ball.

Tilley's "Realistic Evaluation" | 8 comments (8 topical, 0 hidden) | Trackback
Lee? Check your cookies. by Rogerborg (4.00 / 1) #1 Sat Mar 08, 2008 at 12:55:05 PM EST
Had anyone heard of this Tilley character before nuLabor got their feet under the table? He seems perfectly to embody their metrics über alles philosophy.

Metus amatores matrum compescit, non clementia.
So what do you recommend instead? by TheophileEscargot (2.00 / 0) #2 Sun Mar 09, 2008 at 01:34:27 AM EST
If you were asked to evaluate whether CCTV should be introduced into car parks, how would you do it better?
It is unlikely that the good of a snail should reside in its shell: so is it likely that the good of a man should?
What's the point? by Alan Crowe (4.00 / 1) #3 Sun Mar 09, 2008 at 04:38:11 AM EST

One must be wary of lost purposes

Start at the beginning. What do we want? When we return to our car we want to find that the window hasn't been broken and our bag hasn't been stolen.

How is that going to work? Well, there need to be security measures, cameras and the like. We can start to construct a little story here. Fred from Somewhere District Council is in charge and has cameras installed. We picture Fred taking an interest. Crime is down in one car park but not in another. Fred goes to look. The camera is poorly sited, so Fred has it moved. Then crime falls. Well, maybe.

Debugging isn't just for computer programmers; it is at the heart of all successful, non-trivial, goal-directed activity. Fred decides what he wants to achieve, Fred decides how he wants to do it, and he gives it a bash. Fred cannot just stop there and leave the follow-up to the clever guys from the Home Office. Fred has to check on how well it has worked, diagnose problems, make adjustments.

The key issue in public policy is whether the system has workably short feedback loops. If I were parachuted in by the Home Office to look at CCTV and car parks I would want to trace the feedback mechanisms. Is anyone keeping score? What happens when a member of the public writes in to complain about a blind spot? What happens if the police turn up to the wrong car park? Is it something that people moan about in the pub after work, or does something get done? (Perhaps the police say what they call the car parks and the council use those names.)

A while ago I saw two lads trying car doors and I phoned the police. The police picked me up and we had a ride round in the car. The policemen were surprised when I told them about the lane down to the river. The police lacked the local knowledge to be effective.

Imagine if programmers didn't debug, but just handed first drafts to the evaluation department, who passed the code on to the customers, augmented with warnings about bugs. You can see how the bureaucratic infighting would develop. There would be those who opposed evaluation (it doesn't actually help) and those in favour of evaluation (it is obviously essential), even if there are further issues of organisational structure that also require attention.

Is having a separate evaluation function part of the solution or part of the problem?

Conflicts of interest, economies of scale by TheophileEscargot (2.00 / 0) #4 Sun Mar 09, 2008 at 07:29:41 AM EST
If you have the same people evaluating what they're implementing, that creates a conflict of interest. They have an incentive to misreport whatever makes their lives easiest, or their empire biggest, as the most successful policy.

Real life is generally nothing like computer programming. But even there, successful companies have an independent group of testers, QA, or user-acceptance testers to make sure the developers aren't just serving their own interests.

Also, there are significant economies of scale and benefits of standardization in these cases. If you can bulk buy a large number of CCTV cameras, you can negotiate a much bigger discount than if you have individual car park managers ordering them themselves. Maintenance costs will also be much lower if you've standardized.
It is unlikely that the good of a snail should reside in its shell: so is it likely that the good of a man should?

True at all levels by Alan Crowe (2.00 / 0) #5 Sun Mar 09, 2008 at 08:06:23 AM EST
If you have the same people evaluating what they're implementing, that creates a conflict of interest.
Is this an argument for Tilley's favourable evaluation of Realistic Evaluation or against it?

Cheating the performance metrics by Alan Crowe (2.00 / 0) #7 Mon Mar 10, 2008 at 10:44:37 AM EST
Fiddling the figures is the way that middle managers fight senior managers or senior managers fight owners. Ordinary working people want a job they can believe in. Give them a performance metric and they will compute it honestly and try to improve their score.

That is perhaps a little rose-tinted. Various things go wrong. First is the coercive incentive scheme. The "Punished by Rewards" argument matters. If there is real money riding on it, some of the workers will do whatever is necessary to get "their" money. The other workers will, at first, disapprove.

How do management respond? A coercive incentive scheme with a metric you can fiddle is a good marker for management that doesn't care. We can fill in the rest of the story. Workers try to earn their bonus honestly by suggesting improvements. Management get in the way. Morale plummets. Workers become cynical and embrace fiddling the figures.

Two from Four thinking by Alan Crowe (2.00 / 0) #8 Mon Mar 10, 2008 at 11:11:17 AM EST

I'm intrigued by the idea of two from four thinking. I think it is an error of thought that comes so naturally that I'm having trouble spotting it and building my collection of examples.

I think I've found an example here. It feels ever so natural to discuss whether evaluation should take place locally, or whether it should take place nationally, instead. This is two-from-four thinking. Whether or not to do local evaluation is one binary choice and whether or not to do national evaluation is another binary choice, so there are 2×2=4 possibilities.
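The four possibilities are quick to spell out. This is only an enumeration of the two binary choices named above; the variable names and the descriptive labels are mine.

```python
from itertools import product

# Two independent yes/no choices give 2 x 2 = 4 combinations.
options = [
    {"local": local, "national": national}
    for local, national in product([False, True], repeat=2)
]

# My own shorthand for each combination.
labels = {
    (False, False): "no evaluation at all",
    (True, False): "local evaluation only",
    (False, True): "national evaluation only",
    (True, True): "both local and national evaluation",
}

for option in options:
    print(option, "->", labels[(option["local"], option["national"])])
```

Two-from-four thinking argues over the two middle rows and never looks at the first or the last.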

When you have a strong theory you can rely on actions having their intended effects. I like the story about the accuracy of Vietnam era strategic bombing. It was 100% accurate. The bombs hit the ground every time. If hitting the ground is good enough you do not have to evaluate, you can depend on gravity.

If you have any ambition towards Kaizen you must have local evaluation. That doesn't let you get away without national evaluation. How would you get warning that your Kaizen strategy is failing?

What happens about the duplication? This seems familiar. If you manage a company, you are not waiting for the auditors to tell you your profit. Your own accountants tell you that. The auditors come and they confirm the numbers. Interestingly, they don't do a straight duplication. They look to see if the company has procedures in place capable of coming up with the profit figure. They do some sampling to check that the procedures are being followed. Some columns of numbers get added up a second time. An important part of the audit function is checking that the management accountants are not missing the wood for the trees. The owners need to know if the management accountants have had their noses too close to the grindstone and produced accurate figures that lack relevance.

Don't work like that here. by ammoniacal (2.00 / 0) #6 Sun Mar 09, 2008 at 08:19:26 PM EST
If wifey calls Plod, and hubby's bearing the wounds, she's nicked.

"To this day that was the most bullshit caesar salad I have every experienced..." - triggerfinger
