I've fuzzed HashiCorp Vault's API. Here are my findings (1)

Setup

I’ll be using CATS to do the API fuzzing. You can get a 1-minute intro to CATS here: the Get Started in 1 minute tutorial. The fuzzing will have 2 parts:

  • part 1 (this article) does blackbox fuzzing i.e. providing just the access token, without additional context
  • part 2 (future article) will do more context-driven fuzzing i.e. I’ll do some data setup beforehand and fuzz with that additional context

Vault is freshly installed and I don’t have any setup or context for it.

I have CATS installed.

I have Vault installed.

I have the Vault OpenAPI file saved as api.json.

Let’s start Vault in dev mode:

vault server -dev

Export the Root Token as an environment variable:

export token=hvs.9eagj2vkhh7VXm40oUux5Dxw

According to the Get Started in 1 minute tutorial, we are now ready to run CATS.

Run

Run CATS in blackbox mode:

cats --contract=api.json --server=http://localhost:8200/v1 -H "X-Vault-Token=$token" -b 

After 11 minutes I get the following results:

  • 26314 successful tests (in blackbox mode, success means any response that is not a 500)
  • 165 errors (i.e. responses with a 500 status)

Findings

Read timeouts

Many requests made to /sys/pprof/profile result in read timeouts. It’s not clear what happens, as there is nothing in the logs. To reproduce this, I can take any of the failing tests for this path and replay it: cats replay Test31142.

Socket hangups resulting in connection failures

Multiple requests are causing connection failures:

  • POST on /sys/leases/lookup with

    {
      "lease_id": null
    }

  • POST on /sys/leases/ with

    {
      "url_lease_id": "mY77mev5F2AZ7Mbd",
      "increment": 2,
      "lease_id": null
    }

  • POST on /sys/leases/revoke with

    {
      "url_lease_id": "uo6etjzzBKOsuN",
      "sync": true,
      "lease_id": null
    }

They all result in connection errors.

The logs show the following:

 [INFO]  http: panic serving 127.0.0.1:57259: interface conversion: interface {} is nil, not string
goroutine 54353 [running]:
net/http.(*conn).serve.func1()
        /opt/hostedtoolcache/go/1.19.3/x64/src/net/http/server.go:1850 +0xbf
panic({0x595efa0, 0xc0019c49f0})
...

The same connection failure, but with a different cause, occurs when doing a DELETE on /identity/mfa/login-enforcement/2haK9t2O. This is the log:

[INFO]  http: panic serving 127.0.0.1:57352: runtime error: invalid memory address or nil pointer dereference
goroutine 54900 [running]:
net/http.(*conn).serve.func1()
        /opt/hostedtoolcache/go/1.19.3/x64/src/net/http/server.go:1850 +0xbf
panic({0x58dcca0, 0xa521930})
        /opt/hostedtoolcache/go/1.19.3/x64/src/runtime/panic.go:890 +0x262
...

Some confusing errors and behaviour

Example 1

The /identity/group/name/{name} endpoint has some strange and inconsistent behaviour with confusing errors:

  • I get 204 on a DELETE even though the name does not exist. I would expect a 404 here.
  • When I do a POST with a full payload, like:

    {
      "member_group_ids": [
        "yEBjQD3UJvZn5ySGXTT",
        "yEBjQD3UJvZn5ySGXTT"
      ],
      "metadata": {
        "key": "value",
        "anotherKey": "anotherValue"
      },
      "policies": [
        "OOOOOOOOOOO",
        "OOOOOOOOOOO"
      ],
      "id": "071",
      "type": "y6R7Vt9",
      "member_entity_ids": [
        "8kGuW9WE2ww4xI8X9bRn",
        "8kGuW9WE2ww4xI8X9bRn"
      ]
    }

I get a 400:

{
    "errors": [
      "group type cannot be changed"
    ]
}

But again, the group doesn’t exist. I would expect a 404. And if the type cannot be changed, why am I able to send it in the request?

And it gets weirder. When the type is removed by the RemoveFieldsFuzzer, I get a 500 error:

{
    "errors": [
      "1 error occurred:\n\t* invalid entity ID \"8kGuW9WE2ww4xI8X9bRn\"\n\n"
    ]
}

I would expect something around 4XX. The group still doesn’t exist.

Example 2

If I do a POST on /sys/capabilities with:

{
  "path": [
    "DnsnmZ",
    "DnsnmZ"
  ],
  "paths": [
    "nxKLCAgp",
    "nxKLCAgp"
  ],
  "token": ""
}

I get a 500 with:

{
    "errors": [
      "1 error occurred:\n\t* no token found\n\n"
    ]
}

But if I send a value within the token field:

{
  "path": [
    "DnsnmZ",
    "DnsnmZ"
  ],
  "paths": [
    "nxKLCAgp",
    "nxKLCAgp"
  ],
  "token": "ZUXO3Mk"
}

I get a 400:

{
    "errors": [
      "1 error occurred:\n\t* invalid token\n\n"
    ]
}

Again, inconsistent. I would still go with a consistent 4XX response for all cases.

Unexpected HTTP response codes

Multiple endpoints will return 500 instead of 4XX (which I would consider more suitable for those cases). 500 should usually be reserved for unexpected behaviour during server-side processing, rather than predictable business errors.

For example, a POST on /sys/config/cors with

{
  "allowed_headers": [
    "AaiFkW79i80SuaXTFT0",
    "AaiFkW79i80SuaXTFT0"
  ],
  "enable": true
}

will return a 500:

{
    "errors": [
      "1 error occurred:\n\t* at least one origin or the wildcard must be provided\n\n"
    ]
}

This is clearly a validation issue, which the API user can correct.

A GET on /internal/counters/activity/export will return a 500:

{
    "errors": [
      "1 error occurred:\n\t* no data to export in provided time range\n\n"
    ]
}

Bypassing authentication for a GET on /sys/internal/ui/namespaces will result in a 500 rather than a 403:

{
    "errors": [
      "client token empty"
    ]
}

For all these cases I would go for 4XX, for consistency and also for better monitoring of the service. If the API returns 500 for validation issues, it will be hard to differentiate between predictable business errors and cases when something goes really wrong and the 500 signals a real processing issue.

A GET on /sys/policies/password/FZhyI/..%20;/ will result in a 500:

{
    "errors": [
      "1 error occurred:\n\t* failed to retrieve password policy\n\n"
    ]
}

This is confusing. Is the 500 because the policy was not found? Was there an issue while retrieving and processing it? Hard to say.

The rest of the errors are caused by the namespace not being found. Some examples: cats replay Test37474, cats replay Test442. Again, I would expect a 404, as the namespace was not found.

Final thoughts

This is the end of part 1. Part 2 will continue with context-driven fuzzing i.e. creating a bunch of data (namespace, token, group, etc.) and using that as context for fuzzing.

The most important issues have been raised on GitHub.

Negative API Testing on Steroids

It’s hard to find good titles for articles. You want to sound interesting enough, but avoid clickbait or sensationalism, and keep it short but also comprehensive. The on Steroids in this one means something like: how to write negative tests for APIs fast, ideally with no development effort, so you can focus on the exploratory part, you know, the one that actually challenges your brain.

As stated in a previous article, my view is that API testing is predictable for more than 50% of the test cases. Independent of the business logic, you want to run the same negative scenarios: boundary values, invalid values, very large values, different types of injections and so on. Instead of starting all over again with the 124th microservice, why not automate this boring and predictable part and focus on the things that are specific to the business context, challenging your brain with creative work?

In the next minutes I’ll show you how easy it is to use CATS to do negative testing. I’ll use the Get Started in 1 minute tutorial and use Vault as the API under test. Vault is freshly installed and I don’t have any setup or context for it.

Following the tutorial:

  • I have CATS installed

  • I have Vault installed

We are now ready to run CATS.

Let’s start Vault in dev mode:

vault server -dev

Export the Root Token as an environment variable:

export token=hvs.9eagj2vkhh7VXm40oUux5Dxw

Run CATS in blackbox mode:

cats --contract=api.json --server=http://localhost:8200/v1 -H "X-Vault-Token=$token" -b 

At the end we get 26,529 tests in almost 11 minutes, out of which:

  • 26,364 are successes (blackbox mode)
  • 165 are potential problems (i.e. the response from the server was a 500)

Let’s now open cats-report/index.html to better understand the errors.

As the report has 26k tests, it will take a few seconds to load. You can run CATS with the --skipReportingForIgnored argument to only report errors.

Some findings from the report:

  • /identity/mfa/login-enforcement/{name} - lots of errors, CATS receives connection timeout exceptions
  • /sys/pprof/profile - lots of errors, CATS receives read timeout exceptions
  • many other errors related to invalid configuration or input data that return 500 instead of 4XX

The next step is to log these issues under the Vault project. I’ll update the article with references once done.

Running CATS in blackbox mode is the simplest and fastest way to get a sense of the level of negative testing coverage and how your API handles unexpected input. The next article will show how context mode helps you uncover deeper issues.

We need better tools for Web and API Software Testing

We need better and smarter tools for Web & API Software Testing, especially when it comes to the functional testing side. This area is mostly dominated by frameworks rather than products, particularly in the open source space. What I mean by this is that you usually get the means to assemble a set of steps, following a given syntax, and execute the result. And you get some value-added services on top, like saving your request/response, pretty formatting, UI interfaces, etc. But you rarely get any intelligence on top of it.

You would expect that after so many years of testing login screens, you could point your tool at the login page and let it do its magic. And when I say magic I’m not only referring to recognizing the fields, putting data in and submitting. It’s also generating the test cases which are typical for such functionality: invalid emails, short passwords, forms of injection, etc., and running those automatically. But no, you must tell the framework every time to open the page, find elements, enter data inside the input fields, submit it, wait and so on. And create all the test cases again to support it. And start from the beginning next time. 90% of the activities and test cases will most likely be the same.

The same goes for API testing. Independent of the logic, business domain or type of API, there is a solid common ground of testing scenarios which will need to be run no matter what: sending invalid values, left/right boundary testing, large values, special characters, injection payloads and so on. So you start creating all these test cases and use your favorite tooling/framework to model them. When the API changes, you need to change all these test cases too. Some changes might be trivial, but we know that’s rarely the case. What you get in the end is that the most time-consuming activities still happen manually, while the tooling does the simplest (oversimplifying a bit) parts: submitting the request and doing some reporting. In the end everything is just another friendlier version of curl with a pretty UI, storage & management capabilities and some nice reporting.

(This is quite similar to ToDo & time management apps. You just get a fancier way of doing the same stuff: keeping a list of items, organizing it and setting some deadlines/reminders. But usually the tool is not the problem, so using X instead of Y won’t make me a time management guru.)

So what does a software engineer do when dealing with repetitive work? Automate it! And this is how CATS started: from the frustration of testing the same stuff over and over again (especially in a microservices platform). The ultimate goal was to have something a bit more intelligent which can do its magic with minimum configuration. When people see intelligent they tend to associate it with some sort of AI and/or machine learning. But I would argue that just having some convention-over-configuration standards in place can give you enough intelligence.

I had 3 goals in mind when building CATS:

  • minimum configuration
  • no coding effort, but still meaningful and comprehensive
  • entire testing process must be automated (create, run and report test cases) - the intelligence part

So that in the end, instead of having QAs do all that repetitive work for each new service or endpoint, you let them focus on creative and exploratory testing scenarios anchored in the context they operate in.

You can find out more about CATS here: https://ludovicianul.github.io/2020/09/09/cats/. But as a summary: by providing only auth data and maybe some context, you can point it at your API endpoint and let it do its magic. As simple as that.

As with all good things, there are some caveats:

  • it only works with OpenAPI endpoints
  • it only works for JSON formats

There is also a limited set of capabilities offered through template payloads.

If you feel the same about repetitive work in testing APIs, take CATS for a spin and feel free to contribute.

Make sure you know which Unicode version is supported by your programming language version

While enhancing CATS, I recently added a feature to send requests that include single and multi code point emojis. This is a single code point emoji: 🥶, which can be represented in Java as the \uD83E\uDD76 string. The test case is simple: inject emojis within strings and expect that the REST endpoint will sanitize the input and remove them entirely (I appreciate this might not be a valid case for all APIs, which is why the behaviour is configurable in CATS, but that’s not the focus of this article).

I usually recommend that any REST endpoint should sanitize input before validating it and remove special characters. A typical regex for this would be [\p{C}\p{Z}\p{So}]+ (although you should enhance it to allow spaces between words), which means:

  • \p{C} - matches Unicode invisible control characters (\u000D - carriage return, for example)
  • \p{Z} - matches Unicode whitespace and invisible separators (\u2028 - line separator, for example)
  • \p{So} - matches various symbols that are not math symbols, currency signs, or combining characters; this also includes emojis

I have a test service I use for testing new CATS fuzzers. The idea was to simply use the String’s replaceAll() method to remove all these characters from the String.

So let’s take the following simple code which aims to sanitize a given input:

public class EmojiSanitizer {

    public static void main(String... args) {
        // \uD83E\uDD76 is the surrogate pair for the 🥶 emoji (U+1F976)
        String input = "this is a great \uD83E\uDD76 article";
        // strip Unicode control characters and "other symbols" (which include emojis)
        String output = input.replaceAll("[\\p{C}\\p{So}]+", "");

        System.out.println("input = " + input);
        System.out.println("output = " + output);
    }
}

While running this with Java 11, I get the following output:

input = this is a great 🥶 article
output = this is a great  article

This works as expected: the 🥶 emoji was removed from the String.

Even though I have CATS compiled for Java 8, I mainly use JDK 11+ for development. At some point I had CATS running in a CD pipeline with JRE 8. The emoji test cases generated by the CATS fuzzers started to fail, even though they were passing successfully on my local box (and on other CD pipelines). I went through the log files: the request payloads were initially constructed and displayed ok, with the emoji properly printed, but after some pattern matching on the string, the result was printed as sometext?andanother. The ? is where the emoji was supposed to be. Further investigation led to the conclusion that the JRE version was causing the mishandling of the emoji (which might be obvious to the 99.999% of Java devs out there). This is actually expected, as Java 8 is compatible with Unicode 6.2, while 🥶 is part of Unicode 11.

Going back to the previous example, if I run it with Java 8, I get the following output:

input = this is a great 🥶 article
output = this is a great ? article

Conclusions:

  • Even though a Java version can receive, write/store and forward the latest Unicode characters, any attempt to manipulate them might result in weird ? symbols if the Unicode char is not part of the Unicode version supported by your JRE
  • Independent of how you compile the code, it’s the JRE that decides how Unicode chars are handled, i.e. a Java program compiled for Java 8 will behave differently on JRE 8 vs JRE 14

An incomplete list of practices to improve security of your (micro)services

Software security is hard and complex. Many people think about it as something aside from the typical development process. It’s usually seen as the responsibility of some security people that only care about security and don’t understand that we need to deliver business value fast in an already complicated microservices-event-driven-api-first-ha-cloud ecosystem. I could add a lot more dashes to microservices-event-driven-api-first-ha-cloud. And I think this is the main reason it might seem overwhelming to think early about security and all the possible ways something or someone can break your system. It’s yet another complex thing in an already complex environment. It’s not just all the technical complexities of modern architectures, it’s also all the additional stuff: the need to go to market fast, hard deadlines, team chemistry issues, underperformers, too many processes, meetings, etc. And it’s a complex thing that will not break your system on day 1. It might take months or years until someone finds a vulnerability. Why focus on this from day 1? Well, you might be right. The chances of something happening from day 1 are low. And it’s very tempting to focus on something with immediate value (actual functional features), rather than mitigating some future possibilities. The thing is, when a security issue happens, it can bring your entire system down. And this will be very bad for you and your users.

I see it as similar to airport security. We do all these checks, we scan people, we forbid them from taking things onboard and so on, although 99.99..9% of people don’t plan to hijack the plane. It’s for the 0.00..1% of cases that we have all these measures in place. Because the consequences are big.

So how do you balance not over-engineering security, and not being paranoid about everything, with still focusing on the business value? You make security a mindset, rather than a separate concern. I’m not saying that everyone needs to become a security expert and know everything. I’m saying that people should develop secure software just like they develop software: in a way that minimizes the probability of introducing vulnerabilities.

The best way to instill this mindset is through a set of standards and practices that will create habits. Going back to airport security, you don’t leave all the decisions to each individual security officer. “This person looks nice. Let them have the scissors, and a knife in their hand luggage”. “You sir look very dehydrated, you can take your big bottle of liquid with you on the plane!” You create a set of rules and procedures (i.e., standards) that apply equally to everyone. And you also create a set of guidelines (i.e., practices) on how to handle specific situations: if you see something suspicious in a hand luggage, you inspect it separately.

In the next sections I’ll detail standards and practices that cover the entire SDLC. They are not meant to be self-sufficient for each area (i.e., you might add a lot more to cover that area from a general good-practices perspective). But they will make you question things and think about cases that are maybe not that obvious.

Where Security is focused

I’ll do a simple split of Security concerns into two main areas:

  • infrastructure security: anything related to how the application is being deployed and operating in production
  • application security: anything related to how the application is being implemented, with the specifics of the business context

There are plenty of resources on how to tackle both:

They are lengthy, comprehensive and include a lot of details and practices on how to tackle security in the SDLC. It would be great if every developer went through all of these periodically in order to keep their knowledge fresh, but in practice this doesn’t happen very often. I’ll try to summarize below the most important things to consider, agnostic of the business domain. It’s not a full list, nor a silver bullet, but it will establish a solid foundation which will minimize the possibility of security issues happening.

Tackling Security

Infrastructure Security is more predictable to address, mainly due to the use of products or cloud services. They already have the security features built in and implemented well, i.e., if you use a Web Application Firewall, you trust the product to do its job; you won’t actually implement its logic. I’m not saying it’s easier, but you have more control.

Application Security is less predictable. You mainly rely on people’s skills to implement things securely. You need to make sure they don’t do stupid things like storing clear-text passwords in source files.

Below is a list of the most important practices which I think will help you build a security mindset. It’s intended for the regular developer. When I say regular, I just mean people actually implementing, rather than all the others focused on designing, planning or managing. They are all focused on Application Security for building (REST) APIs. At first glance, not all of them might seem directly related to security, but in the end they will minimize the probability of introducing security issues.

The majority of examples will use Java.

Standards

As mentioned above, the usage of standards is the main mechanism to build a mindset. All projects should have a set of standards. Not everyone is a fan of standards; some feel they limit people’s choices and creativity. But I think it’s an easy way to get consistency, especially when many teams work on the same platform. It allows easier onboarding for new joiners and limits the possibility of introducing bugs or inconsistencies, or of arguing about stupid things (spaces vs tabs ;)). It gives you more time for meaningful discussions and debates. Standards do not have to be very detailed, at least not in all areas. The majority of the standards should state principles and choices you’ve made based on existing sets of good practices.

Documentation

Key things to consider:

  • document your code interfaces and API contracts
  • define your documentation strategy:
    • what is your overall documentation strategy?
    • what do you put in the README.md file of the project?
    • do you need to update a wider documentation?
    • what tool do you use for diagrams?
    • do you use lightweight architecture decision records?
    • do you store the documentation along with the project in Git? or maybe use a separate tool?
    • if you store it within the project, what is the recommended folder structure?

General (micro)services design guidelines

Key things to consider:

  • use a blueprint/template/archetype as a starting point for all your (micro)services
  • have the blueprint already bundled with all the common libraries, plugins, etc. and aligned to the standards
  • each (micro)service must start with one command
  • (micro)services will process data only through APIs/events; there is no back-door
  • (micro)services are self-contained
  • all (micro)services are 12 factor apps and even more

Code formatting/styling

Just choose one and apply it consistently. Auto-format before commit if possible.

Naming conventions

Just choose one and apply it consistently.

API standards

Key things to consider:

  • follow REST naming practices (nouns, plurals, the usual stuff) - pick one, the internet is full of guidelines, but be consistent
  • be consistent with the naming; this applies for everything, not only the endpoints: payload object naming, properties etc. camelCase, snake_case, kebab-case/hyphen-case etc. Again, just choose something, but be consistent
  • make POST, PUT, PATCH return bodies with meaningful responses
  • use meaningful HTTP status codes, rather than 400 for everything that goes wrong
  • all endpoints must return meaningful error cases
  • use an error catalogue (more details in the Error Handling section)
  • consider something like OpenAPI and consider also doing contract-first development i.e., write the OpenAPI contract initially, socialize it with your (internal) consumers; this also enables better parallel development
  • document your OpenAPI contracts with meaningful descriptions and examples
  • all (internal) APIs must use CorrelationId/TraceId headers
  • all API inputs must be very restrictive by default
  • all APIs (internal or external) must be authenticated and ideally also with authorization in place
  • all APIs must re-use the same common data structures; both generic ones like Address, Person, Country, etc., and business-specific ones you define
  • all APIs (internal or external) are exposed over HTTPS only
  • for the relevant APIs consider returning security headers within the response like: Cache-Control: no-store, Content-Security-Policy: frame-ancestors 'none', Content-Type, Strict-Transport-Security, X-Content-Type-Options: nosniff, X-Frame-Options: DENY
  • internal APIs do not communicate with each other via the internet (unless this is deliberate or required by the architecture)
  • do not expose management endpoints over the internet; if this is something required, use authentication
  • make sure all APIs are enforcing strict validation for the received requests: do not allow undocumented JSON fields, reject malformed JSONs, etc
  • make proper use of data types; don’t have everything as a String
  • use enumerated values whenever possible
  • add length restrictions for strings and min/max for numbers
  • add patterns restricting input for each string
  • for some properties it’s easier to find patterns as they have clear definitions; a country code will always follow [A-Z]+; for others, it’s a bit more difficult; a lastName property needs to be quite loose, considering all names in all languages; the recommendation is at least to prevent strange characters like Unicode control chars, separators or symbols; a recommended pattern is the following: ^[^\p{C}\p{Z}\p{So}]*[^\p{C}\p{So}]+[^\p{C}\p{Z}\p{So}]*$; this doesn’t mean that you are now protected from any type of injection; you still need to have a good understanding of where the data goes and how it is processed, but at least you won’t get an emoji breaking your system (a minimal sketch of such input restrictions follows this list)
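
To make the restrictive-by-default idea concrete, here is a minimal sketch assuming Jakarta Bean Validation (javax.validation on older stacks) is available; the DTO and its fields are hypothetical, not taken from any real API:

import jakarta.validation.constraints.Max;
import jakarta.validation.constraints.Min;
import jakarta.validation.constraints.NotNull;
import jakarta.validation.constraints.Pattern;
import jakarta.validation.constraints.Size;

// Hypothetical request DTO illustrating restrictive-by-default input rules
public class CreateCustomerRequest {

    // country code: strict, well-defined pattern
    @NotNull
    @Pattern(regexp = "[A-Z]{2}")
    private String countryCode;

    // last name: loose enough for real names, but rejects control chars, separators and symbols/emojis
    @NotNull
    @Size(min = 1, max = 100)
    @Pattern(regexp = "^[^\\p{C}\\p{Z}\\p{So}]*[^\\p{C}\\p{So}]+[^\\p{C}\\p{Z}\\p{So}]*$")
    private String lastName;

    // proper types and ranges instead of a String that accepts anything
    @NotNull
    @Min(1)
    @Max(10)
    private Integer numberOfAccounts;
}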

Logging standards

Key things to consider:

  • logging format: comma separated key=value pairs? json objects? choose something which is friendly to your tooling
  • always include the CorrelationId/TraceId in each log line; this will make it easier for tools to create dashboards
  • include information in logs that will make it easier to understand what’s happening: for which entity? business area? is it success? failure?
  • some good practices
  • use an abstraction over the actual logging implementation; for example in Java: slf4j with logback as the implementation (a minimal sketch follows this list)
  • treat logging as a cross-cutting-concern; leverage Aspects; log within methods only exceptionally; this will limit people from logging sensitive stuff
  • don’t treat logging like let’s log everything and see if we need it afterwards, dumping full requests/responses; be deliberate in what you log, even when logging at debug or lower levels
  • more on Logging Data
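
As an illustration of the abstraction and CorrelationId/TraceId points above, here is a minimal sketch assuming slf4j (with logback underneath) is on the classpath; the handler class and the key=value format are hypothetical choices:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

// Hypothetical handler showing correlation id propagation and key=value log lines
public class PaymentHandler {

    private static final Logger LOGGER = LoggerFactory.getLogger(PaymentHandler.class);

    public void handle(String correlationId, String paymentId) {
        // put the correlation id in the MDC so every log line (and the logback pattern) can include it
        MDC.put("correlationId", correlationId);
        try {
            LOGGER.info("action=processPayment status=started paymentId={}", paymentId);
            // ... business logic ...
            LOGGER.info("action=processPayment status=success paymentId={}", paymentId);
        } finally {
            // always clean up so the id does not leak into unrelated requests on the same thread
            MDC.remove("correlationId");
        }
    }
}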

Data standards

Key things to consider:

  • use existing ISO standards for widely known objects: Currencies, Dates, Amounts, just to name a few (see the sketch after this list)
  • define business specific objects to be re-used
  • apply these standards for API objects, database entities and events
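
A minimal sketch of what re-using ISO standards can look like in Java, relying only on JDK types (the values are made up):

import java.math.BigDecimal;
import java.time.LocalDate;
import java.util.Currency;

// ISO 4217 for currencies, ISO 8601 for dates, exact decimals for amounts
public class MoneyExample {

    public static void main(String... args) {
        Currency currency = Currency.getInstance("EUR");      // ISO 4217 currency code
        LocalDate valueDate = LocalDate.parse("2023-02-01");  // ISO 8601 date format
        BigDecimal amount = new BigDecimal("19.99");          // avoid double for money

        System.out.println(amount + " " + currency.getCurrencyCode() + " on " + valueDate);
    }
}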

Processing Data

Key things to consider:

  • sanitize data before processing it; this is a good sanitization regex: ^[^\p{C}\p{Z}\p{So}]*[^\p{C}\p{So}]+[^\p{C}\p{Z}\p{So}]*$; it won’t prevent all problems, but it will strip weird chars that can cause your system to crash (a combined sketch follows this list)
  • make sure that you don’t transmit data from input towards internal elevated access operations like database queries, command line execution etc.; use parametrized queries for DB, be very specific around what you get and what you pass forward
  • favor whitelisting instead of blacklisting when you need to make decisions or when you plan to restrict processing for specific input
  • overall favor defensive programming practices
  • make sure you use efficient XML parsers that are not vulnerable to XXE or similar attacks; ideally do not accept XML as input unless forced by the context
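
A combined sketch of the first two points: sanitize the input, then let it reach the database only as a bind parameter through a plain JDBC PreparedStatement; the table and column names are hypothetical:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class CustomerLookup {

    // strip control characters and symbols/emojis before the value is used anywhere else
    private static String sanitize(String input) {
        return input.replaceAll("[\\p{C}\\p{So}]+", "").trim();
    }

    // the user-provided value never becomes part of the SQL text, only a bind parameter
    public String findCustomerName(Connection connection, String rawLastName) throws SQLException {
        String lastName = sanitize(rawLastName);
        String sql = "SELECT name FROM customers WHERE last_name = ?";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, lastName);
            try (ResultSet resultSet = statement.executeQuery()) {
                return resultSet.next() ? resultSet.getString("name") : null;
            }
        }
    }
}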

Logging Data

Key things to consider:

  • don’t log sensitive data; if you still need it for some reason, mask/obfuscate the data; what sensitive means depends on your business and regulations
  • create/use a library that masks the most sensitive data within your platform by default; for example, if you’re processing payments, card numbers must be masked by default; you shouldn’t leave this decision to each individual (see the sketch after this list)
  • consider extending the library each time new sensitive data is added; you must also balance performance when adding too much data
  • the logging library must also allow specific configuration so that each individual service can mask additional data without extending the library
  • the logging library must provide on-demand sanitization (i.e., by calling specific methods); this will make sure the same sanitization techniques are applied for all cases
  • the logging library must sanitize data before logging it (for example by removing all the characters matching \p{Z}\p{C}\p{So})
  • the logging library must also remove CR and LF characters in order to prevent CRLF injection
  • have a clear log archiving strategy
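
A minimal sketch of what such a logging helper could look like; the class and the (naive) card-number masking rule are hypothetical, not an actual library:

import java.util.regex.Pattern;

// Hypothetical log sanitizer: masks card numbers, removes CR/LF and strips odd Unicode chars before logging
public final class LogSanitizer {

    private static final Pattern CARD_NUMBER = Pattern.compile("\\b\\d{12,19}\\b");
    private static final Pattern UNSAFE_CHARS = Pattern.compile("[\\p{C}\\p{So}]");

    private LogSanitizer() {
    }

    public static String sanitize(String message) {
        if (message == null) {
            return null;
        }
        String masked = CARD_NUMBER.matcher(message).replaceAll("************");
        String noCrLf = masked.replace("\r", "").replace("\n", "");   // prevent CRLF injection
        return UNSAFE_CHARS.matcher(noCrLf).replaceAll("");
    }
}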

Storing Data

Key things to consider:

  • data must not be stored just in case you might need it; only store data that is relevant in the current context or the foreseeable future
  • storing data introduces compliance obligations; make sure you are aware of those
  • some data cannot be stored in clear (one example is credit card numbers); use hardware or software HSM for encryption
  • don’t store secrets (passwords, encryption keys, ssh keys, private keys) in version control or in plain-text files; use dedicated products or services for this like Vaults, HSMs
  • use salt and/or pepper when encrypting or hashing sensitive data; this will make brute-force and rainbow-table attacks much harder (a minimal sketch follows this list)
  • consider building (or using) a centralized service that will tokenize sensitive data
  • you should tokenize any data that is under some sort of regulation: card data, PII data, etc.; use tokens instead of the actual data in all (micro)services and detokenize only when needed; this will minimize the compliance footprint and will also give better control around the data
  • enhance the security of the tokenization solution; do not allow external access to its APIs
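
A minimal sketch of salted hashing using only the standard JCA APIs (PBKDF2); the iteration count and output encoding are illustrative choices, not a tuned recommendation:

import java.security.SecureRandom;
import java.util.Base64;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hashing a secret with a random salt using PBKDF2 from the standard JCA APIs
public class PasswordHasher {

    public static String hash(char[] password) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);                       // unique salt per secret

        PBEKeySpec spec = new PBEKeySpec(password, salt, 310_000, 256);
        SecretKeyFactory factory = SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256");
        byte[] hash = factory.generateSecret(spec).getEncoded();
        spec.clearPassword();                                     // do not keep the secret in memory

        return Base64.getEncoder().encodeToString(salt) + ":" + Base64.getEncoder().encodeToString(hash);
    }
}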

Events/messaging standards

Key things to consider:

  • create an event catalogue so that everyone is aware of the purpose of each event
  • use event schemas for validation
  • avoid using generic events where you dump everything; you might leak sensitive information without wanting it
  • consider exchanging Tokens instead of the actual data for sensitive information

Configuration handling

Key things to consider:

  • avoid hardcoding configuration in source files
  • consider using centralized configuration management
  • segregate configuration by environment
  • do not store secrets (passwords, api keys, ssh keys, private keys, etc) in source files or in version control; use proper Secrets Vault systems
  • do not leave default credentials for any deployable unit (either cloud service, off the shelf products, or your own (micro)services)
  • do not put test-only code or configuration in production
  • don’t build test only backdoors inside your (micro)service
  • use version control to track configuration changes
  • have mechanisms in place for configuration integrity checking

Error handling

Key things to consider:

  • consider treating exceptions and errors as a cross-cutting concern; leverage Aspects, use something like ControllerAdvice or similar
  • consider embedding the logic for the most common exceptions/errors (validation issues, resource not found, malformed messages) into a shared library; this will make the interaction between (micro)services predictable and with less friction
  • use an error catalogue
  • use error codes (e.g. MICRO-4221 - bad request due to structural validation, MICRO-4222 - bad request due to business validation)
  • do not leak internal state in responses; avoid passing e.getMessage(); each error returned must be deliberately created from the root cause, but without leaking internal data
  • use a catch-all mechanism in order to avoid leaking internal state for unexpected exceptions; you can just catch Exception in the global error handler and return a 500 (see the sketch after this list)
  • return the same object for all errors to enable a consistent experience
  • document all error cases in your API documentation with the appropriate HTTP Status code; if you use OpenAPI, document all possible HTTP status codes, even if they return the same OpenAPI object
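
A minimal sketch of a Spring-based catch-all handler; the error codes and messages are hypothetical, following the error catalogue idea above:

import java.util.Map;

import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.ExceptionHandler;
import org.springframework.web.bind.annotation.RestControllerAdvice;

// Hypothetical global error handler: same error shape everywhere, no internal details leaked
@RestControllerAdvice
public class GlobalErrorHandler {

    @ExceptionHandler(IllegalArgumentException.class)
    public ResponseEntity<Map<String, String>> handleValidation(IllegalArgumentException e) {
        // deliberate message taken from the error catalogue, not e.getMessage()
        return ResponseEntity.status(HttpStatus.BAD_REQUEST)
                .body(Map.of("code", "MICRO-4222", "message", "Request failed business validation"));
    }

    @ExceptionHandler(Exception.class)
    public ResponseEntity<Map<String, String>> handleUnexpected(Exception e) {
        // catch-all: log the root cause internally, return a generic 500 to the caller
        return ResponseEntity.status(HttpStatus.INTERNAL_SERVER_ERROR)
                .body(Map.of("code", "MICRO-5000", "message", "Something went wrong while processing the request"));
    }
}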

Branching strategy and commits

Key things to consider:

  • use a simple branching strategy; trunk-based, github-flow, etc.; just pick one
  • use meaningful names for your repos and branches
  • use descriptive commits; it will make it easier to trace changes in the future
  • use small commits to better isolate changes
  • use smart commits i.e., provide a link to the task from the task management system
  • consider using pre-commit hooks to validate the commits
  • do not include sensitive information in commit messages
  • pay attention when enabling remote access to your repos; especially when repos are hosted in cloud

Code review

Key things to consider:

  • do code reviews (be kind, assertive, specific, all the good stuff)
  • leave the boring stuff to the tools and focus on the functional aspects and alignment to standards and practices
  • if you find the same issue repeated over and over, add it within the standards
  • consider using checklists, at least initially, until people make a habit of focusing on the same things

Tooling and 3rd party libraries

Key things to consider:

  • have a process in place for introducing new tooling; do a trade-off analysis and present it in a wider group to get acceptance/agreement and make sure you address wider cases
  • when selecting open source software pay attention to the license(s)
  • create a list with licenses that can be used without asking, licenses that need to be discussed and licenses which are not allowed to be used
  • don’t take the first (or latest) shiny tool/library/product you find; consider things like: is it stable?, is it maintained? does it have a track record?
  • consider using tools such as OWASP Dependency Check, License Plugin or even more complex tools such as Black Duck
  • create a list with the agreed tooling/libraries where people can choose from
  • update your dependencies frequently

Code Analysis

Key things to consider:

  • use one or multiple tools to analyze your code
  • you must have (at least) one tool focused on the general coding practices and (at least) one focused on security practices
  • some good tools for general code analysis (Java): Sonarqube, PMD, SpotBugs
  • some good tools for security code analysis: Veracode, Checkmarx, Sonarqube
  • you don’t need to agree with all the practices that are part of the standard rule sets of these tools (although usually they are aligned with industry recommendations); you can create a subset of rules tailored to your context

Testing

Key things to consider:

  • automate testing at all levels: unit, integration, component, API, end-to-end, etc.
  • focus on negative and boundary testing, not only on happy scenarios; CATS is a good option for API testing
  • don’t ignore failing tests, even those failing intermittently; they might hide a serious underlying issue
  • tests must be resilient and self-sufficient
  • tests must use a similar and predictable approach
  • tests must not depend on complicated external setup; they must either be self-sufficient by mocking dependencies, using in-memory setups or testcontainers or just depend on the (micro)service being deployed; any other steps will just complicate the setup and introduce complexity
  • consider adding some security testing inside the pipeline
  • consider mutation testing

CI/CD

Key things to consider:

  • include Quality Gates for the most important stuff; they must act as checkpoints and fail the build if they are not met
  • Quality Gates must be in line with these standards and automate the process of checking that each (micro)service is aligned
  • a sample CI/CD pipeline might look like this:
    • compile and build
    • check formatting
    • run tests and check coverage
    • run mutation testing
    • run code analysis
    • run secure code analysis
    • check 3rd party libraries for vulnerabilities
    • check 3rd party library licenses
    • deploy
    • run API tests
    • run other types of testing
  • this might seem too much (or lengthy), but for a microservice this is quite fast
  • script your pipeline
  • don’t couple the pipeline to the (micro)services
  • use a template pipeline for all (micro)services

Authentication and Authorisation

Key things to consider:

  • don’t roll your own authentication and authorisation; use standard products and services
  • authenticate all your APIs, internal and external; just pick something proven
  • use separate authentication and authorisation mechanism for external and internal calls i.e., use one set of credentials/mechanism to authenticate external calls and a separate one for internal calls
  • credentials are always encrypted both in-flight and at-rest
  • use HTTPS for all APIs, internal or external
  • do not accept authentication credentials via HTTP GET; use only HTTP headers or HTTP POST/PUT
  • do not log credentials, not even when debug is on; have your logging library also act as a catch-all for credentials
  • make sure your authorisation and authentication mechanism allows granular control and management i.e., you can restrict the number of calls per operation, revoke access, issue additional credentials, etc.
  • consider using a centralized Identity Provider and common libraries
  • use enhanced security controls for highly sensitive APIs/services (mutual TLS for APIs, MFA for access to services)
  • use nonces to prevent replay attacks
  • always design and build with the least privilege principle in mind

General Security Practices

Key things to consider:

  • don’t ever roll your own encryption; you cannot reinvent the wheel in this space
  • use industry-recommended algorithms: AES-256, RSA 2048+, SHA-2 (e.g. SHA-512)
  • use TLS 1.3+ for transport security
  • use salt and/or pepper when encrypting or hashing sensitive data; this will make brute-force and rainbow-table attacks much harder
  • check your programming language’s practices for dealing with sensitive information; for example, in Java you should use byte[] or char[] rather than String to handle passwords, card numbers, social security numbers, etc.; you must minimize the time the data stays in memory and clear the objects after use (a minimal sketch follows this list)
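
A minimal sketch of the last point, keeping the secret in a char[] and wiping it as soon as it is no longer needed (the method itself is hypothetical):

import java.util.Arrays;

// Keep secrets in char[]/byte[] so they can be wiped, instead of immutable Strings that linger in memory
public class SecretHandling {

    public static boolean authenticate(char[] password) {
        try {
            // ... use the password (e.g. derive a key, call the identity provider) ...
            return true;
        } finally {
            Arrays.fill(password, '\0');   // clear the secret as soon as it is no longer needed
        }
    }
}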

Quality attributes

As we’ve seen above, SDLC standards and practices are not always directly related to security. The same applies to quality attributes. Shortcomings in design and approach can cause your application to go down, even if the cause is not a true security problem.

Key things to consider for Performance:

  • use pooling for connections to expensive resources like DBs, APIs, etc.
  • use thread pools
  • use caching
  • use proper collections when manipulating data
  • use parallel programming if applicable
  • make sure you understand how your ORM generates queries
  • avoid loading big resources in memory, use data streams
  • baseline your performance per (micro)service instance so that you know when to scale
  • do regular load and performance testing

Key things to consider for Resilience:

  • use circuit breakers, retries, timeouts, rate-limiting (a minimal retry/timeout sketch follows this list)
  • have clear fallback strategies when dependent APIs are not available
  • some great resources on the topic: Resilient Systems Part 1 and Resilient System Part 2
  • make all APIs Idempotent
  • don’t store state within one (micro)service instance; use a distributed cache for that
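
A minimal sketch of timeouts and retries using only the JDK HttpClient; a dedicated library (Resilience4j, for example) adds circuit breakers and rate limiting on top, and the timing values below are made up:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Minimal timeout + retry sketch; a fallback strategy would replace the final rethrow in a real service
public class ResilientClient {

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(2))
            .build();

    public String getWithRetry(String url, int maxAttempts) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(2))           // per-request timeout
                .GET()
                .build();

        Exception lastError = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception e) {
                lastError = e;
                Thread.sleep(200L * attempt);             // simple backoff between attempts
            }
        }
        throw lastError;
    }
}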

Key things to consider for Availability and Scalability:

  • don’t let your (micro)service design limit horizontal scaling
  • plan for failure, have automated mechanisms in place for auto-scaling based on load
  • consider sharding, read-only replicas
  • use multi-region deployments

Key things to consider for Observability and Monitoring:

  • all (micro)services must expose health endpoints covering both application and the underlying container
  • the health endpoint must return information about all its dependencies: db, encryption service, APIs it connects to, event bus, etc. (see the sketch after this list)
  • leverage the standardized logging to create meaningful operational dashboards
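
A minimal sketch of a dependency-aware health check, assuming Spring Boot Actuator is used; the event bus dependency and the ping method are hypothetical:

import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// Hypothetical health indicator reporting the status of one downstream dependency
@Component
public class EventBusHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        boolean reachable = pingEventBus();
        return reachable
                ? Health.up().withDetail("eventBus", "reachable").build()
                : Health.down().withDetail("eventBus", "unreachable").build();
    }

    private boolean pingEventBus() {
        // ... actual connectivity check against the event bus ...
        return true;
    }
}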

Automate

Automate everything. Automation makes things predictable and consistent. The CI/CD pipeline should be the place where you automate all the checks that assess your (micro)service from a quality perspective. Tools like Semgrep can bring automation with little effort, even for standards not obviously suited for automation.

Conclusion

This isn’t a final list, it’s more like a brain dump. It’s a starting point for building a security mindset. Once you apply all these, you are ready to dive deeper. Applying all these practices won’t give you only security benefits, but also more structure and alignment. This is particularly important in systems that evolve very fast, whether brand new or legacy. You don’t need to adopt all of these from day 1; it might seem overwhelming, especially if you are not used to following common standards and think they will limit your options. But maybe you can try it for a while and see what happens!