Wednesday, October 30, 2013

Adding Appsec to Agile: Security Stories, Evil User Stories and Abuse(r) Stories

Because Agile development teams work from a backlog of stories, one way to inject application security into software development is by writing up application security risks and activities as stories, making them explicit and adding them to the backlog so that application security work can be managed, estimated, prioritized and done like everything else that the team has to do.

Security Stories

SAFECode has tried to do this by writing a set of common, non-functional Security Stories following the well-known

“As a [type of user] I want {something} so that {reason}”
template.

These stories are not customer- or user-focused: not the kind that a Product Owner would understand or care about. Instead, they are meant for the development team (architects, developers and testers). Example:

As a(n) architect/developer, I want to ensure AND as QA, I want to verify that sensitive data is kept restricted to actors authorized to access it.

There are stories to prevent/check for the common security vulnerabilities in applications: XSS, path traversal, remote execution, CSRF, OS command injection, SQL injection, password brute forcing. Checks for information exposure through error messages, proper use of encryption, authentication and session management, transport layer security, restricted uploads and URL redirection to un-trusted sites; and basic code quality issues: NULL pointer checking, boundary checking, numeric conversion, initialization, thread/process synchronization, exception handling, use of unsafe/restricted functions.

SAFECode also includes a list of secure development practices (operational tasks) for the team that includes making sure that you’re using the latest compiler, patching the run-time and libraries, static analysis, vulnerability scanning, code reviews of high-risk code, tracking and fixing security bugs; and more advanced practices that require help from security experts like fuzzing, threat modeling, pen tests, environmental hardening.

Altogether this is a good list of problems that need to be watched out for and things that should be done on most projects. But although SAFECode’s stories look like stories, they can’t be used as stories by the team.

These Security Stories are non-functional requirements (NFRs) and technical constraints that (like requirements for scalability and maintainability and supportability) need to be considered in the design of the system, and may need to be included as part of the definition of done and conditions of acceptance for every user story that the team works on.

Security Stories can’t be pulled from the backlog and delivered like other stories and removed from the backlog when they are done, because they are never “done”. The team has to keep worrying about them throughout the life of the project and of the system.

As Rohit Sethi points out, asking developers to juggle long lists of technical constraints like this is not practical:

If you start adding in other NFR constraints, such as accessibility, the list of constraints can quickly grow overwhelming to developers. Once the list grows unwieldy, our experience is that developers tend to ignore the list entirely. They instead rely on their own memories to apply NFR constraints. Since the number of NFRs continues to grow in increasingly specialized domains such as application security, the cognitive burden on developers’ memories is substantial.

OWASP Evil User Stories – Hacking the Backlog

Someone at OWASP has suggested an alternative, much smaller set of non-functional Evil User Stories that can be "hacked" into the backlog:

A way for a security guy to get security on the agenda of the development team is by “hacking the backlog”. The way to do this is by crafting Evil User Stories, a few general negative cases that the team needs to consider when they implement other stories.

Example #1. "As a hacker, I can send bad data in URLs, so I can access data and functions for which I'm not authorized."

Example #2. "As a hacker, I can send bad data in the content of requests, so I can access data and functions for which I'm not authorized."

Example #3. "As a hacker, I can send bad data in HTTP headers, so I can access data and functions for which I'm not authorized."

Example #4. "As a hacker, I can read and even modify all data that is input and output by your application."

Thinking like a Bad Guy – Abuse Cases and Abuser Stories

Another way to beef up security in software development is to get the team to carefully look at the system they are building from the bad guy's perspective.

In “Misuse and Abuse Cases: Getting Past the Positive”, Dr. Gary McGraw at Cigital talks about the importance of anticipating things going wrong, and thinking about behaviour that the system needs to prevent. Assume that the customer/user is not going to behave, or is actively out to attack the application. Question all of the assumptions in the design (the can’ts and won’ts), especially trust conditions – what if the bad guy can be anywhere along the path of an action (for example, using an attack proxy between the client and the server)?

Abuse Cases are created by security experts working with the team as part of a critical review – either of the design or of an existing application. The goal of a review like this is to understand how the system behaves under attack/failure conditions, and document any weaknesses or gaps that need to be addressed.

At Agile 2013 Judy Neher presented a hands-on workshop on how to write Abuser Stories, a lighter-weight, Agile practice which makes “thinking like a bad guy” part of the team’s job of defining and refining user requirements.

Take a story, and as part of elaborating the story and listing the scenarios, step back and look at the story through a security lens. Don’t just think of what the user wants to do and can do - think about what they don’t want to do and can’t do. Get the same people who are working on the story to “put their black hats on” and think evil for a little while, brainstorm to come up with negative cases.

As {some kind of bad guy} I want to {do some bad thing}…

The {bad guy} doesn’t have to be a hacker. They could be an insider with a grudge or a selfish customer who is willing to take advantage of other users, or an admin user who needs to be protected from making expensive mistakes, or an external system that may not always function correctly.

Ask questions like: How do I know who the user is and that I can trust them? Who is allowed to do what, and where are the authorization checks applied? Look for holes in multi-step workflows – what happens if somebody bypasses a check or tries to skip a step or do something out of sequence? What happens if an action or a check times-out or blocks or fails – what access should be allowed, what kind of information should be shown, what kind shouldn’t be? Are we interacting with children? Are we dealing with money? With dangerous command-and-control/admin functions? With confidential or pirvate data?

Look closer at the data. Where is it coming from? Can I trust it? Is the source authenticated? Where is it validated – do I have to check it myself? Where is it stored (does it have to be stored)? If it has to be stored, should it be encrypted or masked (including in log files)? Who should be able to see it? Who shouldn’t be able to see it? Who can change it, and to the changes need to be audited? Do we need to make sure the data hasn't been tampered with (checksum, HMAC, digital signature)?

Use this exercise to come up with refutation criteria (user can do this, but can’t do that; they can see this but they can’t see that), instead of, or as part of the conditions of acceptance for the story. Prioritize these cases based on risk, add the cases that you agree need to be taken care of as scenarios to the current story, or as new stories to the backlog if they are big enough.

“Thinking like a bad guy” as you are working on a story seems more useful and practical than other story-based approaches.

It doesn’t take a lot of time, and it’s not expensive. You don’t need to write Abuser Stories for every user Story and the more Abuser Stories that you do, the easier it will get – you'll get better at it, and you’ll keep running into the same kinds of problems that can be solved with the same patterns.

You end up with something concrete and functional and actionable, work that has to be done and can be tested. Concrete, actionable cases like this are easier for the team to understand and appreciate – including the Product Owner, which is critical in Scrum, because the Product Owner decides what is important and what gets done. And because Abuser Stories are done in phase, by the people who are working on the stories already (rather than a separate activity that needs to be setup and scheduled) they are more likely to get done.

Simple, quick, informal threat modeling like this isn’t enough to make a system secure – the team won’t be able to find and plug all of the security holes in the system this way, even if the developers are well-trained in secure software development and take their work seriously. Abuser Stories are good for identifying business logic vulnerabilities, reviewing security features (authentication, access control, auditing, password management, licensing) improving error handling and basic validation, and keeping onside of privacy regulations.

Effective software security involves a lot more work than this: choosing a good framework and using it properly, watching out for changes to the system's attack surface, carefully reviewing high-risk code for design and coding errors, writing good defensive code as much as possible, using static analysis to catch common coding mistakes, and regular security testing (pen testing and dynamic analysis). But getting developers and testers to think like a bad guy as they build a system should go a long way to improving the security and robustness of your app.

Thursday, October 24, 2013

Making Devops work outside of Webops

I've spent the last 3 years or so learning more about devops. I went to Velocity and Devopsdays and a bunch of other conferences that included devops stuff (like the last couple of OWASP USA conferences and this year's Agile conference). I've been following the devops forums and news and reading devops books and trying out devops tools and Continuous Delivery, talking to smart people who work at successful devops shops, encouraging people in my organization to adopt some of these ideas and tools where they make sense. Looking for practices and patterns and technology we can take and apply to the work that we do, which is in an enterprise, B2B environment.

A problem for us is that devops today is still mostly where it started, rooted in Web Ops with a focus on building and scaling online shops and communities and Cloud services – except maybe where some enterprise technology vendors have jumped on the devops bandwagon to re-brand their tools.

Is there really that much that a well-run highly-regulated enterprise IT organization hooked into hundreds or thousands of other enterprises can learn from a technology startup trying to launch a new online social community or a multi-player online game, or even from larger, more mature devops shops like Etsy or Netflix? Do the same rules and ideas apply?

The answer is: yes sometimes, but no, not always.

There are some important factors that separate enterprises from most devops shops today.

Platform heterogeneity and the need to support legacy systems and all of the operational inter-dependencies between systems is one – you can’t pretend that you can take care of your configuration management problems by using Puppet or Chef in an enterprise that has been built up over many years through mergers and acquisitions and that has to support thousands of different applications on dozens of different technology platforms. Many of those apps are third party apps that you don’t have control over. Some of those platforms are legacy systems that aren't supported any more. Some of those configs are one-off snow flakes because that’s the only way that somebody could get things to work.

Governance and regulatory compliance (and all the paperwork and hassle that goes with this) is another. Even devops shops don’t handle their highly-regulated core business functions the same as they do the rest of their code (a good example is how Etsy meets PCI compliance).

There are two other important factors that separate many enterprises such as large financial institutions from the way that a devops shop works: the need for speed in change, and the cost of failure.

The Need for Speed

If “How can I change things faster?” is the question, devops looks like the answer.

Devops enables – and emphasizes – rapid, continuous change, through relentless automation, breaking down walls between developers and operations, and through practices like Continuous Deployment, where developers push out code changes to production several times a day.

Being able to move this quickly is important in early stages of iterative design and development, for online startups that need to build a critical mass of customers before they run out of money, and other organizations experiencing hyper growth. Every organization has some systems that need to be changed often, and can be changed with minimal impact: CRM systems, analytics and internal management reporting for example. And as James Urqhuart explains, optimizing for change makes sense when you need to change technology often.

But there are other systems that you don’t need to or you can’t change every day or every week or every month: ERP and accounting systems, payment handling, B2B transactional systems, industrial control. Where you don’t need to and don’t want to run experiments on your customers to try out new features or constantly refine the details of their user experience because the system already works and lots of people are depending on it to work a certain way and what’s really important is keeping the system working properly and keeping operational costs down. Or where change is risky and expensive because of regulatory and compliance requirements and operational interdependencies with other systems and other organizations. And where the cost of failure is high.

Change, even when you optimize for it, always comes with the risk of failure. The 2013 State of Devops Report found that high performing devops shops deploy code 30x more frequently, with “double the change success rate”. By themselves these figures are impressive. Taken together – they aren’t good enough. Changing more often still means failing more often than an organization which moves more slowly and more cautiously, and not every organization can afford to fail more often.

The Cost of Failure

Most online businesses exist in a simpler, more innocent world where change is straightforward – it’s your code and your Cloud so you can make a change and push it out without worrying about dependencies on other systems and the people who use them or how to coordinate a roll-out globally across different companies and business lines – and where the consequences of failure are really not that high.

If you’re not charging anything (Facebook, Twitter) or next to nothing (Netflix) for customers to use your service, and if the cost of failure to customers is not that much (they have to wait a little bit to tell people that their kitty just sneezed or to post a picture of their gold fish or watch a movie) then nobody has the right to expect too much when something goes wrong.

It’s a completely different world for financial services companies, where failure is not always an option – and “fail fast, fail often” works in early stage development, but not once customers are using your technology.

I’ve been told by a tech exec at a bank that Etsy (or any web company) wasn't a “serious” endeavor, that his bank works with “serious money” which means that they can’t “screw around” like web companies do. I've also seen web companies poo-poo the enterprise because they're "spoiled" with their small user base and non-24x7 working environments.

Until there is a shared understanding between those groups, the healthy and mature swapping of ideas and concepts is going to be slow.

John Allspaw, interview, Is the Entrerprise Ready for DevOps?

That bank exec, though he wasn't diplomatic about it, is right.

The cost and risk involved in a failure is several orders of magnitude different between a bank and an online consumer web business, even something as large as Etsy. In a presentation at the end of 2012, Etsy's CTO boasted that they are now handling “real money”, as much as “$1k per minute” at that time. While that’s real money to real customers at Etsy (and I am sure that this number is higher by now), it’s negligible compared to the value of transactions that any major financial institution handles.

There aren’t any mistakes that a company like Etsy or even Facebook could make that could compare with the impact of a big system failure at a major bank, or a credit card processor or a major stock exchange or financial clearing house or brokerage, or some other large financial institution.

This is not just because of the high value of transactions that are moving through these systems. It is also because of the chain reaction that such failures have on other institutions in the national and international system-of-systems that these organizations operate in – the impact on partner and customers’ systems, and on their customers and partners and so on. The costs of these failures can run into the millions or hundreds of millions of dollars, much more if you include lost opportunity costs and the downstream costs of responding to the failure (including IT costs for upgrading or replacing the entire system, which is a common response to a major failure), never mind the follow-on costs of increased regulatory oversight that is often demanded across an entire industry after a high-profile system failure.

Problems of this scale just don’t happen when an online e-business or social network fails, even if it fails spectacularly. It’s an inconvenience to customers, and it is a real cost to the online business, but failures don’t cascade down to other companies and industries, unless maybe Amazon’s AWS infrastructure fails big time and the online companies that depend on it are left hanging, which seems to happen a couple of times each year.

It’s not enough for many enterprises and even smaller B2B platforms to optimize for MTTR and try harder next time or to accept that roll-back is a myth and that “real men only roll forward” – and from the continuing stories of high-profile failures at online organizations this isn't enough for devops organizations once they reach a certain size either.

But you can still learn a lot from Devops

It’s not that devops won’t work in the enterprise. Just not devops as it is mostly described until now. Devops overplays the “everything needs to change faster and more often” card, and oversimplifies some of the other problems that many organizations face if they don’t or can’t run everything in the Cloud. But there is still a lot that to learn from devops leaders, even if their stories and their priorities and constraints and their business situations don’t match up.

We can certainly learn from them about how to run a scalable Web presence and about how to move stuff to the Cloud.

We can learn how to take more of the risk and cost out of change management by simplifying and standardizing configurations where possible and simplifying and automating testing and deployment steps as much as possible – even if we aren't going to change things every day.

But for now, probably the most valuable thing that devops brings to those of us who don't work in online Web shops isn't tools or practices. It’s that devops is creating new reasons and more opportunities for dev and Ops and management to engage with each other, as they try to figure out what this devops thing is and whether and how it makes sense in our organizations.

To get developers, managers and Ops together talking about configuration management and how to improve release and deployment and run-time alerting and application monitoring and run-time health checks and how developers and Ops can learn from failures together. Getting developers and Ops thinking about these problems and trying to solve them together, in collaborative and constructive ways that ITIL certainly never did. Spending more time on problem solving and less time on expensive bureaucracy and buck passing. That’s gotta be a good thing.

Thursday, October 17, 2013

Programming: Thinking or Typing, Thinking and Typing

“If you don’t think carefully, you might think that programming is just typing statements in a programming language.”
Ward Cunningham, in the Forward to The Pragmatic Programmer
Software development – design, solving problems, coming up with optimal new algorithms, learning a new language, refactoring messy code into something tight and elegant – requires hard thinking.

When you’re trying to do something that you've never done before – or nobody has ever done before. Or when you've done it before but you sure as hell aren't going to make the same mistakes again and you need time to think your way to a better way. Or when you’re trying to understand code that somebody else wrote so you can change it, or when you’re hunting down an ugly bug. All of this can take a lot of time, but in the end you won’t have a lot of code to show for it.

Then there’s all the other work in development – work that requires a lot of typing, and not as much thinking. When it’s clear what you need to do and how you need to do it, but you have an awful lot of code to pound out before the job is done. You've done it before and just have to do it again: another script, another screen, another report, and another after that. Or where most of the thinking has already been done for you: somebody has prepared the wireframes and told you exactly how the app will look and feel and flow, or speced out the API in detail, so all you need to do is type it in and try not to make too many mistakes.

Debugging is thinking. Fixing the bug and getting the fix tested and pushed out is mostly typing. Early stage design and development, technical spikes to check out technology and laying out the architecture, is hard thinking. Doing the 3rd or 4th or 100th screen or report is typing. UX design and prototyping: thinking. Pounding out CRUD maintenance and config screens: typing. Coming up with a cool idea for a mobile app is thinking. Getting it to work is typing. Solving common business problems requires a lot of typing. Optimizing business processes through software requires hard thinking.

Someone who is mostly thinking and someone who is just typing are doing very different kinds of work and need to be managed in different ways.

Sometimes Programming is just Typing

“We are typists first, and programmers second.”
Jeff Atwood, Programming Horror
Many business applications are essentially shallow. Lots of database tables and files with lots of data elements and lots of data, and lots of CRUD screens and lots of reports that are a lot like a lot of other screens and reports, and lots of integration work with lots of fields to be mapped between different points and then there are compliance constraints and operational dependencies to take care of. Long lists of functional requirements, lots of questions to ask to make sure that everyone understands the requirements, lots of details to remember and keep track of.

Banking, insurance, government, accounting, financial reporting and billing, inventory management and ERP systems, CRM systems, and back-office applications and other book-keeping and record-keeping systems are like this. So are a lot of web portals and online shops. Some maintenance work – platform upgrades and system integration work and compliance and tax changes – is like this too.

You’re building a house or a bridge or a shopping mall, or maybe renovating one. Big, often sprawling problems that may be expensive to solve. A lot of typing that needs to be done. But it’s something that’s been done many times before, and the work mostly involves familiar problems that you can solve with familiar patterns and proven tools and ways of working.

"I saw the code for your computer program yesterday. It looked easy. It’s just a bunch of typing. And half of the words were spelled wrong. And don’t get me started on your over-use of colons."

The Pointy Haired Boss sees some actual code

Once the design is in place most of the work is in understanding and dealing with all of the details, and managing and coordinating the people to get all of that code out of the door. This is classic project/program management: budgeting and planning, tracking costs and changes and managing hand offs. It’s about logistics and scale and consistency and efficiency, keeping the work from going off of the rails.

Think Think Think

Other problems, like designing a gaming engine or a trading algorithm, or logistics or online risk management or optimizing real-time control systems, require a lot more thinking than typing. These systems have highly-demanding non-technical requirements (scalability, real-time performance, reliability, data integrity and accuracy) and complex logic, but they are focused on solving a tight set of problems. A few smart people can get their head around these problems and figure most of them out. There’s still typing that needs to be done, especially around the outside, the framing and plumbing and wiring, but the core of the work is often done in a surprisingly small amount of code – especially after you throw away the failed experiments and prototypes.

This is where a lot of the magic in software comes from – the proprietary or patented algorithms and the design insight that lies at the heart of a successful system. The kind of work that takes a lot of research and a lot of prototyping, problem solving ability, and real technical chops or deep domain knowledge, or both.

Typing and Thinking are different kinds of work

Whether need to do a lot of typing or mostly thinking dictates how many people you need and how many people you want on the team, and what kind of people you want to do the work. It changes how people work together and how you have to manage them. Typing can be outsourced. Thinking can’t. You need to recognize what problems can be solved with typing and what can’t, and when thinking turns to typing.

Thinking work can and should be done by small teams of specialists working closely together – or by one genius working on their own. You don’t need, or want, a lot of people while you are trying to come up with the design or think through a hard problem, run experiments and iterate. Whoever is working on the problem needs to be immersed in the problem space, with time to explore alternatives, chances to make mistakes and learn and to just stare out the window when they get stuck.

This is where fundamental mistakes can be made: architecture-breaking, project-killing, career-ending errors. Picking the wrong technology platform. Getting real-time tolerances wrong. Taking too long to find (or never finding) nasty reliability problems. Picking the wrong people or trying to solve the wrong problem. Missing the landing spot.

Managing this kind of work involves getting the best people you can find, making sure that they have the right information and tools, keeping them focused, looking out for risks from outside, and keeping problems out of their way.

Thinking isn’t predictable. There’s no copy-and-paste because there’s nothing to copy and paste from. You can’t estimate it, because you don’t know what you don’t know. But you can put limits on it – try to come up with the best solution in the time available.

Typing is predictable. You can estimate it – and you have to. The trick is including all of the things that need to be typed in, and accounting for all of the little mistakes and variances along the way – because they will add up quickly. Sloppiness and short cuts, misunderstanding the requirements, skipping testing, copy-and-paste, the kinds of things which add to costs now and in the future.

Typing is journeyman work. You don’t need experts, just people who are competent, who understand the fundamentals of the language and tools and who will be careful and follow directions and who can patiently pound out all the code that’s needed – although a few senior developers can out-perform a much larger team, at least until they get bored. Managing a bunch of typists requires a different approach and different skills: you need to be a politician and diplomat, a logistician, a standards setter, an administrator and an economist. You’re managing project risks and people risks, not technical risks.

Over time, projects change from thinking to typing – once most of the hard “I am not sure what we need to do or how we’re doing to do it” problems are solved, once the unknowns are mostly known, now it’s about filling in the details and getting things running.

The amount of typing that you need to do expands as you get more customers and have to deal with more interfaces to more places and more customizations and more administrivia and support and compliance issues. The system keeps growing, but most of the problems are familiar and solvable. There’s lots of other code to look at and learn from and copy from. You need people who can pick up what’s going on and who can type fast.

Thinking and Typing

Thinking and typing are both important parts of software development.

In “Programming is Not Just Typing”, Brendan Enrick explains that the reason that Pair Programming works is because it lets people focus on typing and thinking at the same time:

Both guys are thinking, but about different things. One developer has the keyboard at any given time and keeps in his head the path he is on. (This guy is concerned with typing speed.) He types along the path thinking about the code he is currently writing not the structure of the app, but the code he is typing right now. For a short time his typing speed matters.

The programmer in the pair who is not actively typing is spending all of his time thinking. He keeps in his head the path that the typist is taking, not concerned with the syntax of the programming language. That is the other guy thinking about the language syntax. The one sitting back without the keyboard is the guide. He must make sure that the pair stays on the right path using the most efficient route to success.

There’s more to being a good developer than typing - and there's more to typing than just being able to press some keys. It means being good at the fundamentals: knowing the language well enough, knowing your tools and how to use them, knowing how to navigate through code, knowing how to write code – as well as being fast at the keyboard. Mastering the mechanics, knowing the tools and being able to type fast, so that you're fluent and fluid, are all essential for succeeding as a developer. Don’t diminish the importance of typing. And don’t let typing – not being able to type – get in the way of thinking.

Thursday, October 10, 2013

Don't You Know that Support is the Most Important Part of a Developer’s Job?

Agile development – because you are building working software faster and delivering it incrementally – forces development teams to face a common, fundamental problem: how to balance the work of developing new software with the need to support a system that is already being used in production, whether it’s the legacy system that you’re replacing, or the system that you are still building – and sometimes both.

This is especially a problem for Agile teams following Scrum. On the one hand, in order for the team to meet Sprint goals and commitments and to establish a velocity for future planning, the team is not supposed to be interrupted while they are doing their work. On the other hand, the point of working iteratively and incrementally in Scrum is to deliver working software early and frequently to the customer, who will want to use this software as soon as they can, and who will then need support and help using the software – help and support that needs to come from the people who wrote the software.

At some point, often still early in developing a system, these teams have to stop working in a bubble with their pretend Customer, and start working in the real world with real customers who have real demands.

Supporting Customers and Still Building New Software

This means that teams have to find a way to juggle support and maintenance work with design and development, to deal with rapidly changing priorities and interruptions and complaints and questions and the stress of fire fighting when things break, while still trying to deliver good quality software and hit deadlines.

It’s not easy to balance two completely different kinds of work with directly opposed goals and incentives and metrics. As Don Schueler explains in the “The Fragile Balance between Agile Development and Customer Support”, development teams – even Agile teams working closely with their Customer – are mostly inward-looking, internally focused on delivery and velocity and cost and code quality and technical concerns. Support teams are outward-looking, focused on customer relationships and customer experience and completeness and minimizing operational risk.

Development is about being predictable and efficient: deliver to schedule and keep development costs down. Support is about being responsive and effective: listen to the customer, answer questions, fit in unplanned work, figure out problems and fix things right away. Development work is about flow, continuity, predictability, velocity, and, if managed correctly, is mostly under control of the team. Support and maintenance work is interrupt-driven, immediate, inconsistent and unpredictable – a completely different way of working and thinking. Development work requires the team to be drawn together so that they can collaborate on common goals and the design. Most maintenance and support work is disjointed and disconnected, smaller tasks that can be done by people working independently. Development, even in high pressure projects, is measured in weeks or months. Support and maintenance work needs to be done in days or hours or sometimes minutes.

Agile Support Models: Maintenance Victims

One way that teams try to handle support and maintenance is by sacrificing someone from the team: offering up a “maintenance victim” who takes on the support burden for the rest of the team temporarily, letting the others focus on design and development work. This includes taking calls from Ops or directly from customers, looking at logs, solving problems, fixing bugs. This could mean staying after hours to help troubleshoot or repairing a production problem or putting out a fix, and being on call after hours and on weekends.

The rest of the team tries to pretend that this victim doesn’t exist. If the victim isn’t busy working on support issues or fixing bugs found in production, they might work on fixing other bugs or maybe some other low-priority development work, but they are subtracted from the team’s velocity – nobody depends on them to deliver anything important.

Teams generally rotate someone through support and triage responsibilities for one or two Sprints. This way everyone at some point “shares the pain” and gets some familiarity with support problems and operational issues. There are also positive sides to being sacrificed to support. Developers get a chance to learn more about the system and stretch some of their technical skills, and get off of the hamster wheel of Sprint-after-Sprint delivery for a bit. And they get a chance to play the hero, step in and fix something important and make the customer happy.

Kent Beck and Martin Fowler in Planning Extreme Programming extend this idea to larger organizations by creating a small production support team: 2-4 developers who volunteer to focus on fixing bugs and dealing with production problems. Developers spend a couple of Sprints in production support, then rotate back to development work. Beck and Fowler recommend staggering rotations, making sure that at least one developer is in the first rotation and another in the second so that at least one member of the support team always knows about what is going on and what problems are being worked on.

Sacrificing a maintenance victim or a team makes it possible for most of the rest of the team to move forward on development, while still meeting support commitments. This approach assumes that anyone on the team is capable of figuring out and fixing any problem in the system – that everyone is a cross-functional generalist. And this means that whoever is on this support rotation has to be good enough and experienced enough that they can deal with most issues without bringing in the rest of the team - you can’t rotate newbies through support and maintenance work, at least not without someone senior backing them up.

And you also have to be prepared for problems that are too big or too urgent for your maintenance victim to take care of on their own. Even with a dedicated team you may still need to build in some kind of slack or buffer to deal with emergencies and general helping out, so that you don’t keep blowing up Sprints. You can come up with a reasonable allowance based on “yesterday’s weather”, on how much support work the team has had to do over the last few weeks or months. If you can't make this work, if the entire team is spending too much time on support and fire fighting and pushing hot fixes, then you are doing something wrong and you have to get things under control before you build more software any ways.

Kanban instead of – or inside of – Scrum

Rather than trying to shoe horn maintenance and support into time boxes, some teams have found that Kanban is much better structured than Scrum or XP is to balance support, maintenance, and operations with new development work.

Kanban’s queuing model and use of task boards makes it easy to see what work needs to be done, what work is being done, who is doing it, what’s getting in the way, and when anything changes.

Kanban makes it easier to track and manage different kinds of work that requires different kinds of skills and that don’t always fit nicely into a 1-week or 2-week time-box..

Kanban doesn’t pretend that you won’t be or can’t be interrupted – instead it helps you to manage interruptions and minimize their impact on the team. First, in Kanban you set limits on how much of different kinds of work the team can deal with at a time. This lets the team get control over work coming in, and stay focused on getting things done. Kanban’s queue-and-task model allows emergencies to pre-empt whatever work is in progress through escalation/priority lanes. And priorities can keep changing right up until the last minute – team members just pull the highest priority work item from the ready queue when they are free to take on more work, whether this is designing and developing a new feature, or fixing a bug, or dealing with a support issue.

Kanban helps teams focus more on immediate, tactical issues. It’s a better model to follow when you have more maintenance and support work than new design and development, or when you have to assert control over a major problem or manage something with a lot of moving pieces like the launch of a new system.

Devops Changes Everything

Devops, as followed by organizations like Etsy and Facebook and Netflix (where they go so far as to call it NoOps) tries to completely break down the boundaries between development, maintenance, support and operations. Devops engages developers directly and closely into the support, maintenance and operations of the systems that they build. Developers who work in these organizations are not just writing code – they are part of a team running an online service-based business, which means that support work is as important, and sometimes more important, than designing and writing more software.

In these organizations, developers are held personally responsible for their software, for getting it into production and making sure that it works. They are on call for problems with software that they worked on. They are actively involved in operations of the system, providing insight into how the system works and how it is running, in testing and configuring it and tuning it and troubleshooting problems.

Devops changes what developers work on and how they do it. They move away from project work and more towards fast feature development, fixing, tuning and hardening. Availability and reliability and performance and security and other operational factors become as important – or more important – than delivery schedules and velocity. Developers spend more time thinking about how to make the system work, how to simplify deployment and setup and about the information that people need to understand what’s going on inside the system, what metrics and tools might be useful, how to handle bad data and infrastructure failures, what could go wrong when they make a change and who they need to check with and what they need to test for.

Maintenance and Support – Responsibility and Feedback

Whether developers need to – or even should – take first line support calls from users, they at least need to be part of second level and third level support, where problems are investigated and solved.

This is not just because they are usually the only people who can actually figure out and fix many problems.

Putting aside moral hazard arguments about whether it’s ethically acceptable for developers not to take full responsibility for the consequences for their decisions and the quality of their work, there are compelling advantages to developers being directly involved in supporting and maintaining the software that they work on.

The most important is the quality of the feedback that developers get from supporting a real system – feedback that is too valuable for them to ignore.

Real feedback on what you did right in building the system, and what you got wrong. Feedback on what you thought the customer needed vs. what they really need. What features customers really find useful (and what they don`t). Where the design is weak. Where most of your problems are coming from (the 20% of the code where 80% of the bugs are hiding), where the weaknesses are in your testing and reviews, where you need to focus and where you need to improve. Valuable information into what you’re building and how you’re building it and how you plan and prioritize, and how you can get better.

When developers are called into fire fighting production incidents and Root Cause Analysis reviews they can learn enormous amounts about what it takes to build software for the real world. Thinking seriously about how problems happened and how to prevent them can change how you plan, design, build, test and deploy software; and how people work together as a team.

Farming all of this off to someone else, filtering it through a help desk or an offshore maintenance team, breaks these valuable feedback loops, with negative effects for everyone involved.

Peter Gillard-Moss explains how this happens:

In a startup, developers take care of problems themselves, well, because there isn`t anybody else to do it. But at some point things change:

“…managers decided that we were spending far too long investigating users’ problems and not long enough building the new features the business wanted. Developers needed to be more productive, and more productive meant developers developing more new features. To get developers to develop they need to be ‘in the zone’. They need headphones and big screens to glue their eyes to. They did not need petty interruptions like stupid users ringing up because they got a pop up saying their details will be resent when they tried to refresh.”

But by doing this, the development team became disconnected from the results of their work, and from their customers…

“A systems thinker would tell you this is wrong. You’ve gone from a system that connected a user to the team responsible with one degree of separation, to one that has three degrees of separation. Or think of it another way: the team producing the product, and responsible for improvements and fixes used to be one degree away from their end users, who use the product and are feeding back the product’s shortcomings and issues, but are now three degrees. And not even three degrees all of the time. The majority of the time the team won’t ever hear about most of the support issues. And most of the time the team won’t even have that much interaction with the team that does hear about most of the support issues.”

The result: Customers don’t get the support that they need. Developers don’t get the information that they need to understand how to make the system work better. A support team stuck in the middle with people just trying to keep things from getting worse and hoping to find a better job someday. It’s a self-reinforcing, negative spiral.

In our shop, support takes priority over development – always. Our senior developers work with operations to support the system, and are on call when we put new software in and on call if something goes wrong after hours. They can bring in anyone else from any team that they need for help. As a result, we have very few serious problems and these problems get fixed fast and fixed right. The experience that everyone gets from working in support helps them to design and write better, safer code. This has made the system more resilient and easier and less expensive to support and safer to setup and run and easier and safer to change. And it has made our organization better too. It’s brought developers and operations closer together, and closer to what’s important to the business.

Whether you call it “Agile” or not, there’s nothing more agile than a team that is working directly with customers, responding immediately to problems and changing requirements in a live system. While some developers and managers think of this as overhead, sustaining engineering and try to push it off to somebody else so that they can focus on “more strategic" work, others recognize that this is really the leading edge of software development, and the only way to run a successful software organization, and the only way to make software, and developers, better.

Thursday, October 3, 2013

Don’t let Somebody Else’s Technical Debt take you Under

There’s a lot written about technical debt: what technical debt is and the different kinds of technical debt, how to avoid taking on debt unnecessarily when designing and coding and changing code, how much technical debt is costing your organization, and why and how and how much and when to pay these debts off.

But all of this ignores massive amounts of technical debt and risk in your systems that have nothing to do with your developers taking short cuts in design or sloppy coding practices or not writing unit tests. What it doesn’t make clear is that there may be much more technical debt in all the code that you didn’t write, than in the code that you did.

In the same way that financial debt problems in Spain, Greece and Portugal are dragging down the global economy, debt problems in other people’s software can drag you down into the pit even if you've been doing a responsible job of managing debt in your own code.

Problems in software from Oracle or IBM or Microsoft, or all that nice free Open Source software that your developers keep downloading from the Internet, become your problems. If you are using other people’s software (and we’re all doing this), you have to pay for those other people’s mistakes and oversights, the decisions that they made to put time-to-delivery ahead of stability or security. You are at the mercy of their development and testing and security programs – or lack thereof. And you’re at the mercy of the third party and Open Source code that they used, and the people who wrote that software too.

Their bugs become your bugs. Their security holes become your security holes. Their bad decisions become your bad decisions.

The amount of code – and the amount of debt – involved can be huge, much bigger than the code that you’ve written yourself. According to a study by Sonatype, 80% of a typical java app is assembled from open source components and frameworks, and a big system can contain as many as 30 or more different libraries or other components.

But your exposure to third party debt is bigger than frameworks and libraries and controls. There is also all of the other software that your system depends on: the operating system, virtual machines and the rest of the run-time stack, your containers and caching technology and data stores, and all of the tools that you need to deploy and run the system.

In 2010, Gartner guesstimated that the total amount of global “IT Debt” – the cost of bringing all of this software up to date and fully supported in every organization – was around $500 billion, and that this could reach $1 trillion by 2015.

We keep taking more of this debt on all of the time. Aspect Security studied more than 113 million Open Source Software downloads, and found that more than 1/3 of open source libraries downloaded had known vulnerabilities.

Our exposure to problems in other people’s software is a serious enough – and common enough – problem that OWASP is now explicitly calling out organizations that use out of date software components as one of the Top 10 software security risks:

Virtually every application has these issues because most development teams don’t focus on ensuring their components/libraries are up to date. In many cases, the developers don’t even know all the components they are using, never mind their versions. Component dependencies make things even worse…

How much of other people’s debt are you carrying?

This kind of technical debt is much harder problem to understand and manage than debt in your own code. Because it is code that you don’t understand, code that you can’t look at to understand how bad it is, or code that you could look at but don’t. It’s often code that you don’t have control over, especially common platform technology shared across multiple systems, technology that everybody uses but nobody is responsible for. And it can be code that you don’t even know that you have – code that has been downloaded by somebody and has become part of your system without anyone else knowing about it.

To get some idea how big this debt problem could be, you need to audit your code for third party and Open Source packages and dependencies. Big companies can (and probably need to) use something like Black Duck or Palamida or OpenLogic to scan and build an inventory of Open Source and other libraries and components, to check on licensing issues and to keep up to date on reported vulnerabilities in this software.

Smaller companies can do this by hand, or try WhiteSource, a SaaS scanning platform that is free for startups and small companies, and that offers a free source code scanner for license discovery called JNinka; or check out the OWASP Dependency Track Project (an Open Source project to track third party software that is still in an early, pre-release state).

Keeping Up to Keep Debt Down

Once you understand how of other people’s software you have to worry about, it’s your responsibility to assess the risks and problems you have inherited, and to keep up with vulnerabilities and bug reports and with vendor patches and upgrades. This is harder than it sounds.

As OWASP explains:

In theory, it ought to be easy to figure out if you are currently using any vulnerable components or libraries. Unfortunately, vulnerability reports do not always specify exactly which versions of a component are vulnerable in a standard, searchable way. Further, not all libraries use an understandable version numbering system. Worst of all, not all vulnerabilities are reported to a central clearinghouse that is easy to search, although sites like CVE and NVD are becoming easier to search.

Determining if you are vulnerable requires searching these databases, as well as keeping abreast of project mailing lists and announcements for anything that might be a vulnerability. If one of your components does have a vulnerability, you should carefully evaluate whether you are actually vulnerable by checking to see if your code uses the part of the component with the vulnerability and whether the flaw could result in an impact you care about.

There are also some general-purpose services that can help. Consolidated security vulnerability feeds like SANS @Risk or the SecurityFocus Vulnerabilities feed provide frequent notification of security vulnerabilities in common software packages. And Industry-specific information sharing and analysis centers like FS-ISAC and IT-ISAC provide members with notifications of problems with common third party and open source software.

Patching and Patching and Upgrading

Knowing how big the problem is, is one thing. Fixing it is another.

Patching to the latest dot release is a hassle that we all try to put up with. It’s better to be safer than sorry with patching outside-facing technology even if the patches don’t always apply to how you use the software: Apache, Tomcat, web services libraries, client components. Back-end code can be – and usually has to be – managed more conservatively. If you really really have to install every Linux or Oracle RDBMS patch as soon as it comes out, there’s probably something wrong with your architecture.

Upgrades are a much bigger problem. There’s usually little upside, but a lot of costs and risks involved with upgrading to the latest major OS or DBMS release or VM or other critical software. This is the kind of work that you take on because you’re held hostage by your suppliers – because the only way that you are going to keep getting support is by moving to their latest and greatest release.

While you may gain some advantages in scalability or manageability or a feature or two, upgrade projects are often more about managing the potential downsides: functional regressions, compatibility problems, changes to operations procedures and dependencies on other systems. Most software gets bigger, not better – more features, even if they are features that you don’t need or don’t want, still means more work to install, configure and test when upgrading. You need to work through everything that has changed and re-do all of your testing and tuning, re-finding and checking old workarounds and finding new workarounds for new problems or improvements that you don’t want, updating documentation and scripts and procedures. And the testing that you have to do is the worst kind of testing: manual system testing and operational testing and stress testing, expensive while giving you low confidence that you will find all of the problems. Everybody needs to be dragged into this: development, QA, Ops, and other teams who share dependencies.

So after all of this work and cost you still need to be prepared to deal with problems in production. And you haven’t accomplished anything of clear business value. You’ve taken on short-term operational costs and risks in order to minimize other operational costs and risks in the future. It’s a low return on investment at best.

Open Source – Be a Lazy Customer, Join the Community, or Just Fork It

Almost everybody is using Open Source Software in development – it’s too compelling not to.

But most organizations do a bad job of managing this software:

While nearly 80 percent of companies rely on open-source components for their development efforts, more than three-quarters lack any meaningful controls over the usage of such libraries and frameworks, according to the annual Open Source Software Development Survey conducted by Sonatype, a manager of a large repository of open-source components.

Another vendor, WhiteSource Software, has found that 85% of all software projects contain out-of-date open source components.

It’s your responsibility to understand what risks you are exposed to and to minimize them when choosing and using Open Source Software. This means doing some detective work to ensure that a project is active and to see if it stays alive, periodically monitor check-ins and forum activity, watch out for the authors fading away or changing direction, for people forking off down separate paths.

Black Duck’s Ohloh database is one place to start: a public directory that has information on thousands of Open Source projects including the size of the code base, the size of the community, the latest commit, and licensing information. Or there’s Freecode (formerly Freshmeat) for information on Linux, Unix and cross-platform software and mobile applications, although these directories are not always comprehensive or up-to-date.

Another useful source of information is the database that Coverity maintains on bugs found by scanning Open Source projects for security and quality problems.

For every piece of Open Source Software that you decide to use

  1. you can be a user, take advantage of the free software and hope that the community will keep the project alive for you.
  2. you can take an active role, contribute fixes and improvements back to the community and see which changes are accepted. If they don’t accept your changes, then you will have to worry about reconciling them with any future updates that the community makes.
  3. if the software is important enough to you (and if the license permits it), you can choose to fork it and support it yourself. This is not a trivial commitment to make, even if you seem to have the necessary technical skills in-house. A lot of Open Source software is highly specialized and requires highly-specialized skills and knowledge – which is why it’s worth using in the first place. Even smart experienced developers will take time to figure it out well enough to extend it and fix it, and now you have to take this responsibility and these costs on over the long term.

You can bury your head in the sand or…

In the short-term, ignoring the hassles and costs and putting them off into the future sounds good. No payments until 2015. Of course, when the interest payments kick-in, you’ll wish you were somewhere – or somebody – else.

But consciously taking on debt and putting it off into the future can be a useful strategy – as long as you understand the trade-offs. You may be doing the best thing for your business today by consciously taking on debt and sticking with what you know and what works today, keeping everything stable and planning to upgrade later. As long as everyone understands that the longer you put upgrading off, the more that costs and risks could add up.

But this is a path that can lead you into a legacy trap if you put this work off for too long, leaving you with a system that nobody can understand and nobody can support properly.

The Tip of the Iceberg

Like the iceberg that sunk the Titanic, a lot of your technical risk may be hidden or ignored until it is too late. You need to understand how big the risks are and take responsible steps to manage them.

Large companies that have the resources and leverage to influence their vendors can be more proactive. For example, Veracode offers a third party scanning service for enterprises that want to do scanning of third party binary code for vulnerabilities and bugs.

The rest of us can at least check out the software that we are using and the people who wrote it, keep up with vulnerabilities and bugs, and recognize the risks of using somebody else’s software. Because we can’t pretend that the risks and costs aren’t out there somewhere, or that they aren't serious enough to take us down.

Site Meter