- there's a copypasta section at the bottom to clean up
- AAA: aws architecture center
- AAA: IBM the GOAT of documentation
- AWS leadership principles
- azure architecture center
- GCP patterns for scalable and resilient apps
- gcp: what is multicloud
- github architectural decision records
- how to make effective service blueprints
- learning domain driven design
- microservice architecture patterns and best practices
- pattern: action pattern
- pattern: circuit breaker
- pattern: fan out/in integration
- pattern: list of software architecture styles and patterns
- rapid prototyping
- rethinking service blueprints for agile delivery
- roy fielding
- selecting a rapid prototyping process
- Selina Liu: microservices @ airbnb
- shared nothing architecture
- state of devops 2021 (PDF)
- TCO: serverless vs traditional clouds
- TOGAF capability framework
- towards a human-centered workshop design
- zachman framework
- master's thesis by Tobias Blum, creator of the human-centered API methodology
- the document, diagram, or vision that encompasses all of the characteristics of your infrastructure that supports your hosted applications
- constraints and assumptions that address the requirements of systems and resources that support and host your business automation workflows
- requirements
- availability
- manageability
- performance
- recoverability
- security: data security patterns and a clear mapping of these patterns to Cloud security controls
- existing application that must run on premises, that uses databases or files, or that must perform backups, and you want to use cloud resources and scalability.
- fast, local access to data, but you also want to take advantage of cloud compute and analytics engines.
- years’ worth of security and compliance requirements, supported by processes and procedures in disparate systems on premises. And you want to use cloud management and monitoring capabilities, ideally from a single pane of glass.
- many physical locations to manage with data and applications and you want reliable connectivity and simplified maintenance.
- open source: often conflicts with the responsibilities of proprietary cloud-native services
- configuring VMs
- updating operating systems
- installing application runtime
- build and deploy pipelines
- scaling and load balancing
- monitoring and observability
- data storage
- multicloud
- requires attention to the network latency introduced for request pipelines that traverse cloud networking boundaries
- pay attention to vendor lock-in if utilizing proprietary services
- cost aside, it's all about
- where you want to spend your time
- level of server/host/application customization/control required to fulfill business needs
- technical maturity of team and access to subject matter experts
- you manage all aspects of scalability, availability, fault tolerance, software maintenance and host configuration
- use cases/considerations
- predictable, consistent workloads with long-running/intensive computations
- custom host configurations or software implementations
- ability to manage, configure and maintain the underlying software
- abstracting away the compute infrastructure to the point that you have no responsibilities for the servers on which your code runs
- key characteristics: any cloud provider's fully managed service can be part of a serverless architecture
- infrastructure externally managed
- high availability
- horizontally scales
- redundant and fault tolerant
- use cases/considerations
- purpose driven architectures and application flexibility
- quick to market
- serverless architecture: thinking in terms of patterns and applications, rather than in terms of individual functions or resources
- migration strategies
- types of compute and data stores you can select
- application architecture patterns you can use
- application design: choose services and patterns that suit your workloads based on characteristics such as
- expected throughput
- service limits
- cost
- SLOs, SLAs
- cost comparisons with other architecture models
- infrastructure cost to run workloads: e.g. server provisioning vs invocation costs
- upfront development cost: dev effort to plan, architect and provision resources
- maintenance cost:
- value comparisons with other architecture models
- increased speed and agility: of course this comes after the initial learning curve
- cost allocation: much easier to allocate costs to customers and events vs to servers
- i.e. the costs occur when events occur
- think through how to observe and react to events in your distributed services
- each component should scale independently
- implement circuit breakers to restrict the blast radius of failed services
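- a minimal circuit-breaker sketch in Python (the names `failure_threshold` and `reset_timeout` are illustrative, not from any particular library): trip open after consecutive failures, fail fast while open, then allow a probe call after a cooldown

```python
import time

class CircuitBreaker:
    """Naive circuit breaker: opens after N consecutive failures, fails fast
    while open, then half-opens after a cooldown so one probe can close it."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # cooldown elapsed: half-open, let this one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        else:
            self.failures = 0
            self.opened_at = None  # a healthy call closes the circuit
            return result
```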
- two broad domains: how do you
- implement compute infrastructure:
- capacity processes and cost models: reflects the three general ways of operating infrastructure
- server based
- containerized
- APIs and microservices
- approach application dev and deployment: operational processes and development models
- simple move to the cloud: but lacks flexible build and deploy processes
- api driven microservice-based applications with the most flexibility but requires the most rewrite of legacy tech stack
- considerations: it's all about documentation, planning and strategy
- what does each application do and how are the components organized
- how can you break up data based on CQRS? you have to strangle the database along with the microservices
- what belongs to the control plane?
- what belongs to the data plane?
- once the data is distributed, which set of db engines match the throughput, consistency, access patterns, etc requirements?
- how does the application scale, and which components drive the capacity you need?
- should you migrate leaf components first? or those with the most demanding capacity requirements?
- do you have scheduled based tasks?
- do you have workers listening to a queue?
- where can you refactor/enhance functionality without impacting the current implementation?
- perfect for load balancers and API gateways
- refactor/migration: service by service is moved to the cloud into a new architecture
- lift and shift: aka rehost; copypasta legacy into cloud services to save on hosting costs
- the idea is you take the entire legacy set of applications and move it into the cloud
- the core benefit is reducing infrastructure costs (no more on premise) while taking advantage of cloud compute reliability & availability
- replatform: lift and shift then replacement/refactor
- this is more incremental than a pure lift and shift
- you will need to connect the remaining legacy services with the new cloud services until everything is fully migrated
- leap frog: from legacy on-premise monoliths straight to serverless cloud architecture
- organic: migration with a lift-and-shift approach
- e.g. servers to EC2s, perhaps some ECS/lambdas thrown in, but limited rewrites
- the goal is to get things into the cloud, and experiment with serverless & microservices
- strangler: incrementally and systematically decomposes monolithic applications by creating APIs and building event-driven components that gradually replace components of the legacy application.
- Distinct API endpoints point to old and to new components and safe deployment options (such as canary deployments) let you point back to the legacy version with very little risk.
- strategy for deploying new application versions and infrastructure are equally important
- QA, qa, Qa, qA ;)~
- standardization and optimization
- auditing and reacting to changes
- halt/rollback mechanisms
- managing configuration changes
- similar to how in terraform you can consume deployed artifacts (e.g. ARNs, IPs, etc)
- bake it into the deployment package: refrain from this as much as possible
- 12factor it: the easiest, but should at least be encrypted; however it's difficult to share/update across nodes
- load at runtime from some secrets manager: the most robust (and complex) option
- all at once: 100% of traffic goes to the new version
- traffic shifting strategies
- canary deployment: e.g. 20% goes to the new version for 20 mins, and then once validated 100% goes to the new version
- linear deployment: e.g. 10% of traffic goes to new version every 20 minutes until 100% is shifted
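- a tiny sketch of the two shifting strategies above as weight functions (the 20%/20-minute numbers just mirror the examples):

```python
def canary_weight(minutes_elapsed, bake_minutes=20, initial=0.2):
    """Canary: hold a small share on the new version for a bake period,
    then cut over completely once validated."""
    return initial if minutes_elapsed < bake_minutes else 1.0

def linear_weight(minutes_elapsed, step=0.1, interval_minutes=20):
    """Linear: shift another `step` of traffic every interval until 100%."""
    return min(1.0, (minutes_elapsed // interval_minutes + 1) * step)

assert canary_weight(10) == 0.2 and canary_weight(25) == 1.0
assert linear_weight(0) == 0.1 and linear_weight(185) == 1.0
```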
- testing: you generally need a test account that mirrors the production account
- local tests within the dev environment
- unit tests focusing on business logic
- cloud native code is generally more difficult to test locally (see localstack)
- integration tests: targeting remote test accounts with prod parity
- automated integration and acceptance tests against other environments providing gates for production deployments
- To design for availability and durability
- understand the requirements of your workloads
- the level of service that you've promised to your business operation teams.
- business continuity: the ability of an organization to provide useful services in the event of failure
- service level agreements that outline acceptable downtime or promised uptime
- always includes audit and compliance objectives if stipulated by industry regulation/laws
- RTO: recovery time objective; maximum allowed time to restore a system to operational status
- i.e. if RTO is 6 hrs, engineers have up to 6 hrs to restore a failed system
- RPO: recovery point objective; the maximum allowed time since the last data recovery point (backup)
- i.e. if you backup every 10 hours, the greatest amount of data you can lose is the last 10 hours
- tier 1 SLA: Mission critical services: Outages have a severe impact on business operations.
- usually require four to five nines or better of availability.
- Usually include customer facing sites, revenue, or order processing systems
- Tier 2 SLA: Business critical services: Outages have a significant impact on business during operational hours.
- Usually three nines of availability
- Short outages may affect revenue but are not catastrophic
- Any application that isn't driving revenue but that supports Tier 1 applications
- Tier 3 SLA: Business-operational, non-critical services: Workloads support internal business operations but are not customer facing.
- Two nines of availability
- Low impact workloads such as analytics
- Tier 4 SLA: Administrative non-critical services: Office productivity. There is no impact to customers if these workloads are down.
- Uptime of 90 percent or better
- Internal applications
- the percentage of time a system is accessible.
- availability is expressed as a percentage of uptime in a given year or as a number of nines (see the sketch after this list)
- one nine: 90% uptime, 36.53 days downtime
- two nines: 99% uptime, 3.65 days downtime
- three nines: 99.9% uptime, 8.77 hours downtime
- three and a half nines: 99.95% uptime, 4.38 hours downtime
- four nines: 99.99% uptime, 52.60 minutes downtime
- four and a half nines: 99.995% uptime, 26.30 minutes downtime
- five nines: 99.999% uptime, 5.26 minutes downtime
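- the downtime figures follow directly from (1 - availability) x one year; a quick check in Python, assuming a 365.25-day year:

```python
def downtime_per_year(availability_pct):
    """Yearly downtime in minutes for a given availability percentage."""
    year_minutes = 365.25 * 24 * 60
    return (1 - availability_pct / 100) * year_minutes

for pct in (90, 99, 99.9, 99.95, 99.99, 99.995, 99.999):
    print(f"{pct}% uptime -> {downtime_per_year(pct):,.2f} minutes/year down")
# 99.99% works out to ~52.6 minutes; 90% to ~52,596 minutes (~36.5 days)
```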
- availability types
- active-passive: one of two resources are considered the primary, the other secondary and becomes primary if the current one fails
- challenges:
- statefulness: acceptable since there's only one primary
- availability: if failover fails, your resources will be unreachable
- scalability: since only one resource is considered primary, it will bear the brunt of the load
- since the passive resource doesn't share the load, you need vertical scaling to accommodate increased demand
- stop the passive resource > resize and restart > make primary and shift traffic > stop, resize and restart the new secondary
- you repeat this process when demand decreases
- active-active: more than one of many resources are considered primary
- challenges
- statefulness: managing state across resources is difficult, so resources should generally be stateless
- push state into a load balancer that fronts the fleet of resources
- scalability: since multiple resources share load you need horizontal scaling
- automation is key; adding and removing resources should match demand in near real-time
- strategies
- horizontal scaling: in/out; increasing the total number of load balanced resources
- vertical scaling: up/down; increasing perf characteristics of existing resources
- reducing or managing failures and minimizing downtime through the implementation of redundant components, deployment of parallel components to distribute traffic load, and elimination of single points of failure.
- strategize for the current scale requirements in addition to the anticipated growth
- know the capabilities and service limits of the services that you’re integrating
- especially the service limits, e.g. API Gateway's 10MB payload vs SQS's 256KB message size
- select patterns that optimize your application for the scale you need to support
- timeouts, retry behavior, throughput, payload size, etc
- elasticity: horizontal scaling + automation to match demand
- both compute + data
- organic demand
- merger and acquisition: demand increases dramatically within a short period
- management
- performance
- security
- increased need for more effective load tests since you can scale/tweak the perf of individual components
- you need to have a solid understanding of peak load capacity
- best practices
- load test with authentic data with appropriate volumes that exercise each integration point with production access patterns
- spin up a production-like environment, never load test locally
- identify the bottlenecks/failure points, modify perf characteristics and iterate
- this may move the bottleneck to other integration points
- you need to decide where a bottleneck is most acceptable, based on your workload and business requirements
- choose a percentile that reflects the business need when monitoring
- build in error handling logic to handle failures that are outliers
- don't mock services you can't control
- when you're load testing, you need the real beef
- components should output appropriate signals
- monitor by percentile, and not just avg/raw numbers (see the sketch after this list)
- log efficiently and effectively
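- a quick standard-library illustration of why percentiles beat averages (the latency numbers are made up):

```python
import statistics

# made-up request latencies in ms: mostly fast, one slow outlier
latencies_ms = [12, 14, 13, 15, 11, 12, 13, 950, 14, 12]

mean = statistics.mean(latencies_ms)
cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
p50, p99 = cuts[49], cuts[98]

print(f"mean={mean:.1f}ms p50={p50:.1f}ms p99={p99:.1f}ms")
# the mean (~107ms) makes everything look slow; p50 (~13ms) shows the typical
# request is fine and the pain lives in the tail, which p99 surfaces
```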
- systems that can perform their responsibilities even in the presence of failures
- the capability of a system to handle failures (hardware, networking, regional power loss) while protecting data from deletion or loss.
- durable storage: stores data reliably without data loss even during component failures
- strategy for increasing durability and availability
- it's all about duplicating data & servers across infrastructure in isolated geographic locations
- challenges
- replication process: keeping data, configuration, etc in sync across primary and secondary resources
- failover: redirecting traffic from primary to secondary on failure
- DNS strategy: updating a DNS record to point to a different IP address
- be careful of DNS caching and the time it takes to propagate DNS changes
- Load balancing strategy: the IP points to a load balancer that routes requests to health checked resources
- designing for zero downtime
- unlike with high availability, where some downtime is acceptable
- systems that continue to operate even when dependent services are faulty
- systems that operate through and recover from disasters
- spectrum of strategies ranging from backup & Restore to multi-site active/active
- Backups are an essential component of your business resiliency and operational commitments.
- developing a backup strategy for services and data
- is there documentation on the data, applications and workloads in your environment
- What data is required to keep the business running?
- What additional data does the business want backed up?
- What purpose does this backup data serve?
- Is this for disaster recovery planning?
- Compliance and audit purposes?
- Daily business operations or data recovery?
- Ransomware recovery?
- How do you intend to use this backed-up data?
- Allow users to recover their own files?
- For development and testing?
- Long-term retention? what are the retention requirements for each workload?
- Regulatory compliance?
- just in case backups?
- Backup granularity: defined by the depth the backup or restore requires.
- File-level recovery: e.g. any configuration files needed to restore an application
- Application-level recovery: e.g. a specific application version
- Application data-level recovery: e.g. a specific database within MySQL
- volume-level recovery: e.g. any persistent volumes used across nodes
- instance-level recovery: e.g. any VMs or containers
- managed-service recovery: e.g. if using some serverless/managed service
- designed to restore/preserve a copy of data from a desired point in time
- goals
- Restore data in a timeframe meeting Recovery Objectives
- recovery: protect it in case of a disaster, accident, or malicious action.
- compliance: there are bunches of regulations depending on the industry and type of data stored
- Recovery Point Objectives: RPO; recovering to a specific point in time, usually measured in milliseconds
- Recovery Time Objectives: RTO; time it takes to recover, usually measured in minutes
- backups: creating efficient and cost-effective data copies
- incremental
- frequency of backups
- capacity limits
- expiration
- disaster recovery drills
- Using a pure backup and restore strategy is associated with longer recovery time and recovery point objectives
- results in longer downtime and greater loss of data between when the disaster event occurs and the recovery of the systems.
- the easiest and least expensive strategy to implement.
- use cases
- tier 3 and 4 SLAs
- container: a standard unit of software; the highest level of abstraction
- the container runtime shares the hardware OS kernel
- you create isolation via file system layers
- provides the best utilization of the underlying hardware
- you can share OS libraries and software, or isolate them
- hypervisor: software/firmware that enables sharing physical hardware resources across one or more virtual machines
- virtual machine: emulates a physical server; virtualize the operating system and better utilize resources
- each application can now be isolated from other applications at the VM level
- the virtualization layer is still heavy as it requires a complete operating system
- which is potentially redundant as isolation requires redundant copies of OS, and OS libraries
- baremetal: the most inefficient; hardware costs are the same regardless of resource utilization
- applications fight for the same set of resources
- libraries are generally shared across applications and require synchronization
- from baremetal to VMs to containers, in increasing levels of abstraction
- baremetal
- server hardware
- vms
- operating system
- containers
- containerization platform
- libraries
- applications
- syscall: inspect system calls made at the namespace level
- OOM killer: out of memory killer; algorithm that finds the oldest & largest process in the system and kills it to free up memory for the other processes
- it's difficult to find out which process has been killed by the OOM killer algorithm
- stopping a process doesn't free up the memory allocated to it, you have to kill it
- cgroup: control groups; deals with CPU and memory
- orchestration: service that coordinates container deployment, placement strategy, failure, and resource utilization across multiple hosts
- scheduling and placement of containers
- automatic scaling in/out based on some strategy
- self-healing services by automatically removing unhealthy containers and deploying new ones
- integration with cloud and other services, e.g. networking and persistence
- security, monitoring and logging for the fleet
- don't install openssh/etc in a container
- instead log into a parent namespace and start a shell
- openssh: bypasses PAM modules, user-management, cert & key management is a whole other issue, etc
- forking (executing) too many small processes that run then die can eat up cpu time because the scheduler can't keep up with the start/stop
- pin CPUs to a namespace to provide a clear boundary at the container level
- this is difficult to get right, google it
- stay away from page swapping for performance reasons
- instead focus on proper memory management
- always think about the container architecture
- user-space: where your app lives: full container isolation
- kernel: shared space: all containers in the system share these resources; be aware of noisy neighbors
- don't believe all the container isolation stuff, it's still possible for a container to break out of this boundary into kernel land
- all isolation occurs at the kernel level, protect the king
- platform: this is the physical hardware
- consider which containers share a network namespace
- sharing network namespace increases perf (no need to go through all the tcp handshakes and things)
- ^ but also reduces your security posture (you dont go through all the tcp handshakes and things)
- managing users from the user-namespace is possible, but difficult to get right
- in general
- small containers, and always start from scratch
- starting from scratch forces you to choose your syslibraries wisely to match the needs of your application
- nologin from the outside world (no openssh, always exec sh for access)
- north-side communication must be guarded
- container technology !== network encryption: always implement e2e encryption at the application level
- risk process frameworks
- ISO 31000: risk management
- risk mitigation: considering which containers are on the same platform (hardware) & use the same namespaces
- risk assessment: isolation, segmentation and management of apps in containers, the containers themselves, and the system (kernel) and platform (hardware)
- confidentiality: it's all about segregation of communication
- container to container
- process to process
- container to outside
- access
- Who/When/Where
- Logging
- Start/Stop
- Content of the container image
- integrity
- kernel 2.x: cgroups & namespaces
- kernel 3.x and 4.x: namespaces v2 (you should focus here)
- availability
- Resource Usage: cpu, memory, data compression
- memory management is handled globally; you need to ensure a container doesn't consume all of the system's memory
- Noisy Neighbor effect: in multitenant systems with shared resources, the activity of one tenant can negatively impact another tenant's share of resources
- namespaces: you need to use the clone system call to create a namespace for a container
- types
- PID-namespace: isolated process namespaces per PID
- each process has a global and local PID
- the root namespace can see a child processes global and local pid
- inside a container only the processes local PID is visible
- the first process in a namespace has local PID 1, and subsequent PIDs are incremented by 1 and linked
- killing a process kills its process tree
- CPU/Memory-namespace (cgroup): it's all about memory management and the CPU scheduler
- policy-based scheduling: kernel system scheduler has a policy for executing tasks on a specific cpu
- CPU Affinity: pins a task to a specific CPU, instead of using policy-based scheduling
- memory limitation: Out-of-memory OOM killer; do you kill processes that exceed their specified memory limit?
- dirty/used/empty pages
- context: memory is attached directly to a specific CPU, switching/migration takes time
- context switching: the cpu scheduler switches between processes on the same cpu
- context migration: the cpu scheduler decides a process needs to move from one cpu to another cpu
- the memory used by a container IS NOT migrated with the process, so this impacts performance
- cpu threads
- Network-namespace: can be shared across PID-namespaces for communication across processes
- puts network interfaces into namespaces: processes in a namespace can only use the available interfaces for connectivity
- all processes in the network-namespace can talk to the interface
- kernel responsibilities: occurs at the kernel level, and thus global to all containers in the system
- routing/forwarding/filters/bridging still happens in the kernel
- TCP/UDP/ICMP stacks
- User-namespace: allows containers to have their own user-ids, which are mapped to global userids
- mapping table for users for local and global context
- FS/Mount-namespace: the filesystem namespace
- mapping table for filesystem paths
- from the container perspective: it looks like a local path
- from the kernel perspective: its an overlay to a system path
- use the open syscall to inspect it
- IPC-namespace: system file/inter-process communication; two containers shouldn't use IPC for comms, there's always a better way
- UTS-namespace: allows a container to have its own hostname (see the sketch at the end of this section)
- organization: a tree structure; processes of different namespace-tree-branches can't see other branches
- root > child x..y > a child can create a child
- but a child does not have visibility into a cousin, e.g. cousin X can't kill cousin Y
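- a hedged sketch of entering a new UTS namespace (described above) from Python via the raw unshare(2) syscall; needs root/CAP_SYS_ADMIN, and the CLONE_NEWUTS constant comes from <linux/sched.h>

```python
import ctypes
import socket

CLONE_NEWUTS = 0x04000000  # from <linux/sched.h>

libc = ctypes.CDLL("libc.so.6", use_errno=True)

# detach this process into its own UTS namespace (requires CAP_SYS_ADMIN)
if libc.unshare(CLONE_NEWUTS) != 0:
    raise OSError(ctypes.get_errno(), "unshare(CLONE_NEWUTS) failed")

# the hostname change is only visible inside this namespace
name = b"container-demo"
libc.sethostname(name, len(name))
print(socket.gethostname())  # "container-demo"; the host keeps its own name
```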
- uses events to communicate between and invoke decoupled services
- state and code are decoupled
- integration through a messaging layer enabling asynchronous connections
- events: an observable (change in state) that contains all the information required to take subsequent action
- event producers: entities that create and publish events, e.g. websites, apps, etc to unknown consumers usually through an event-bus like EventBridge
- event router: ingests, filters, and pushes events to known consumers through some other mechanism like SNS
- event consumers: subscribe to receive specific or monitor all events in a stream and act on those they are interested in
- messages vs streaming (see the sketch after this list)
- streaming
- core entity is the stream: a collection of messages for a specific period of time
- data remains in the stream for a period of time; each consumer must maintain a pointer
- messages in streams are reprocessed until success/timeout; you need to build logic into the consumer to skip corrupted messages
- messaging
- core entity is the message which varies
- messages are deleted once consumed
- configure retries and dead letter queues
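- a toy, in-memory illustration of the pointer difference above (stand-ins, not a real broker): a stream consumer advances its own offset over a shared log, while a queue hands each message out once

```python
from collections import deque

# streaming: the log is shared and replayable; each consumer owns its offset
stream = ["evt-1", "evt-2", "evt-3"]
offsets = {"consumer-a": 0, "consumer-b": 0}

def poll_stream(consumer):
    batch = stream[offsets[consumer]:]
    offsets[consumer] = len(stream)  # the consumer advances its own pointer
    return batch

# messaging: the queue hands each message out once, then it is gone
queue = deque(["msg-1", "msg-2"])

def poll_queue():
    return queue.popleft() if queue else None

assert poll_stream("consumer-a") == ["evt-1", "evt-2", "evt-3"]
assert poll_stream("consumer-b") == ["evt-1", "evt-2", "evt-3"]  # replayable
assert poll_queue() == "msg-1" and poll_queue() == "msg-2"       # consumed once
```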
- allows components of distributed systems to communicate with each other
- decentralized, evolutionary design
- each service is paired with the appropriate programming language and technology
- each component/system can evolve separately
- smart endpoints, dumb pipes
- there is no enterprise service bus
- data is not transformed when going through pipes, but each endpoint should transform as needed
- independent products, not projects
- designed for failure
- services are resilient, redundant, and tolerant of integration failures
- disposable and transient
- immutable services and infrastructure with graceful shutdowns
- start fast, and fail fast and release all file handles
- development and production parity
- microservice architecture: an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms, often an HTTP resource API. take a large, complex system and break it down into independent, decoupled services that are easy to manage and extend.
- serverless: abstracts away the infrastructure layer so you can focus on developing your core product
- networking boundary for communication between services
- dedicated infrastructure layer using an array of lightweight network proxies deployed as sidecars into containers
- it's all about abstracting away the network configuration from the application code
- service to service communication: east-west
- dynamically configuring how services are connected
- observability
- determining how services are performing end-to-end
- as a request falls through multiple service boundaries, are there bottlenecks?
- security & traffic management: authNZ
- securing comms between services
- controlling how packets are routed through the network in terms of policies, prioritization and resilience
- client polling: provide a message ID on the initial request, plus status and getResults endpoints to check the status of and get the results of the request (see the sketch after this list)
- perhaps the most wasteful pattern
- webhooks: user defined http callbacks; the client is responsible for providing the endpoint to be notified when work is complete instead of having to poll
- trusted: when you own both sides of the integration and can create a secure connection
- untrusted: when you only own one side of the integration
- websockets: create persistent connections between client and server
- ideal for streaming/requests that require more than one response
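- a minimal sketch of the client-polling contract above (the submit/status/result names and in-memory job store are hypothetical):

```python
import time
import uuid

jobs = {}  # in-memory stand-in for a job store

def submit(payload):
    """Initial request: enqueue the work and hand back a message ID."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "RUNNING", "result": None}
    return job_id

def get_status(job_id):
    return jobs[job_id]["status"]

def get_result(job_id):
    return jobs[job_id]["result"]

job_id = submit({"n": 42})
jobs[job_id] = {"status": "DONE", "result": 4242}  # pretend a worker finished

# client side: every poll cycle costs a request even when nothing has changed,
# which is exactly the wastefulness called out above
while get_status(job_id) != "DONE":
    time.sleep(1)
print(get_result(job_id))
```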
- homogeneous: EL; move the data in the same format from source to target
- focus is on speed of transfer, data protection & integrity, and automation for live data
- heterogeneous: ETL; move the data in a different format from source to target
- focus is on transformation of the data to meet destination requirements & running calculations/ML for derived output & new attributes
- specific strategies for resolving common architecture problems
- dont get too hung up on program vs system patterns
- focus on reducing a services blast radius
- if service A fails, how many other services will be disrupted?
- whether or not you describe your architecture in terms of Tiers, there will always be tiers
- lower tiered services generally have a bigger blast radius; require the most stability and resilience
- never let lower tiered services call higher tiered services, which may have reduced stability
- abstraction for the sake of abstraction increases complexity & cost for no value
- sets of practices that form the basis of design patterns
- favor composition over inheritance: don't overuse inheritance, composition should also be used to extend behavior
- identify application components that vary, and separate them from what stays the same
- HAS-A (a dog has an owner) is better than IS-A (a dog is an animal)
- objects get new behaviors by consuming (instead of inheriting from at compile time) objects that supply that behavior at runtime
- encapsulate what varies: anything expected to change often should be encapsulated behind some sort of interface
- it's all about letting one part of a system vary independently of others, e.g. moving the branching logic into a separate module that exports an interface
- all patterns implement this principle in some form or another
- program to interfaces: program to the most abstract behaviors as possible, allowing implementers of the interface the opportunity to change the specific behaviors as needed
- one class should not rely on a specific instance of another class
- objects should rely on interfaces, and shouldn't care which class implements that interface
- loose coupling between objects that interact
- objects should have little knowledge about implementation details of other objects
- changes to one should not impact another, if it does, you've got some tight coupling
- reduces the dependency between components
- design by contract: specify preconditions, postconditions & invariants; treat inputs and outputs the same way across implementations
- specific to the internals of a single application, restricted by the application architecture
- the structural composition of the software program
- the interactions between software elements
- during the process of writing software code, developers encounter similar problems multiple times within a project/company
- design patterns give engineers a reusable way to solve recurring problems
- bad design:
- rigid: hard to change, e.g. due to dependency changes
- fragile: easy to break, e.g. when making a change bugs cascade
- immobile: difficult to reuse
- OOP concepts that are not by themselves good design principles
- inheritance: reuse features & behaviors
- encapsulation: hide & protect data
- polymorphism: code that works by behavior, and works with types/subtypes with a similar interface
- abstraction: hide implementation details, and depend instead on declarative interfaces
- by Robert C. Martin
- S: single responsibility: all about limiting the impact of change
- an object should only have one reason to change
- each additional responsibility adds a reason for modifications to be required to the object as the nature of that responsibility changes in the future
- O: classes should be open for extension, but closed for modification
- objects should provide default behavior, but enable consumers to override behaviors when needed
- L: Liskov substitution: subtypes should be substitutable for their base types
- i.e. subtypes should adhere to the interfaces of their base types
- i.e. subtypes behave like their base types
- I: interface segregation: interfaces should be small and cohesive, and clients shouldn't be forced to depend on methods they don't use
- polluted interface: continuing to add new functionality to an interface as it evolves
- causes unwanted dependencies between consumers of the interface
- cohesion: how related a class/interfaces methods are to themselves
- if all consumers of a class/interface use the class/interface the same way, it generally means there is high cohesion
- else you should split the class/interface into additional subtypes that only implement the behavior with high cohesion
- D: dependency inversion: high level modules should not depend on low level modules; both should depend on abstractions
- instead of coupling high level component -> low level component
- ^ you should instead program to an interface, and have both the high and low components depend on the same interface abstraction
- ^ the abstraction should not depend on details (implementation), but the implementation (low level component) should depend on the interface
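- a minimal dependency-inversion sketch (hypothetical names), where the high-level service and the low-level detail both depend on the same abstraction:

```python
from typing import Protocol

class MessageSink(Protocol):
    """The abstraction both sides depend on."""
    def send(self, body: str) -> None: ...

class SmtpSink:
    """Low-level detail: implements (depends on) the interface."""
    def send(self, body: str) -> None:
        print(f"SMTP: {body}")

class OrderService:
    """High-level policy: depends on the interface, not on SmtpSink."""
    def __init__(self, sink: MessageSink) -> None:
        self.sink = sink

    def place_order(self, item: str) -> None:
        self.sink.send(f"order placed: {item}")

OrderService(SmtpSink()).place_order("book")  # swap SmtpSink freely in tests
```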
- adapter
- aggregator
- builder
- chain of responsibility
- decorator
- domain driven design
- facade
- factory: abstracts implementation details of instantiating entities
- simple factory: encapsulating a complex if statement/similar into a separate module
- interface oriented design
- observer
- repository: abstract implementation details on how to retrieve data; logic for retrieving, validating and translating responses from a data service
- singleton
- strategy
- tryparse: pattern for trying some logic and returning the response, but on error returning the given substitute instead of throwing the exception (see the sketch after this list)
- event sourcing: storing events instead of state, enabling you to rehydrate/replay timelines
- cascading failure pattern: where a failure at one integration point cascades to cause failures/disruptions throughout the application stack in a layered architecture
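- the tryparse pattern from the list above, as a tiny sketch:

```python
def try_parse(parser, raw, fallback=None):
    """Run parser over raw; on failure return fallback instead of raising."""
    try:
        return parser(raw)
    except (ValueError, TypeError):
        return fallback

assert try_parse(int, "42") == 42
assert try_parse(int, "not a number", fallback=0) == 0
```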
- architectural style: way of designing processes & building systems to facilitate an end goal, e.g. cloud native is an architectural style
- dependency graph:
- service types:
- data service: connects to a data source within a system
- business service: builds on top of data services; business domains that aggregate multiple data services to meet business objectives
- translation service: any abstraction/decoration/encapsulation of a third party operation under your own interface
- edge service: serving data to other services based on the consuming service's context
- platform: everything within a microservices environment constitutes its platform
- runtime: virtualization, baremetal services, containers, etc
- ancillary services: message queues, cache services, oauth, etc; some are first class, others aren't
- operational (devops) components: log & metric aggregators, deployment services, etc
- diagnostic components: enable you to connect to the runtime env of a microservice and inspect, analyze, diagnose, troubleshoot & improve performance
- business process: logic that consumes one/more data domains to solve an issue
- data pipeline: aggregating data from multiple sources in distinct formats, transforming each to match a single interface, and then storing the transformed data into a single graph/db
- scheduler: manages retrieval from multiple data sources at different rates, and sending each into the matching data collector
- data collectors: services geared toward retrieving data from a single data source, serializing and storing the raw data (e.g. in s3)
- data converters: convert raw serialized data into a common serialization format, with a defined interface, and storing the new formatted data (e.g. back into s3)
- data processors: take the converted data, and process it for storing into the final db (e.g. in a knowledge graph)
- specific to data problems
- a distinct service for reading from data sources
- a distinct service for writing to data sources
- useful for operational & yielding problems
- all services in a system are within a single compile, yield and runtime environment
- stack services based on the flow of data; each stack/tier/layer exists at a different level of abstraction and is responsible for a specific function
- tier: abstracts services by what they are, instead of what they do, so some duplication still exists relative to service-oriented architectures
- the bottom tier(s) generally closer to the data and provides services to the tier above it
- data flows down from the requester until it reaches a lower tier that can respond to the request
- a common theme being the hierarchical and layered flow of data, generally unidirectional but in some patterns bidirectional (e.g. server sent events)
- common to think of N-Tier architectures through
- presentation: UI and pure UI logic; usually the presentation of the business logic solution to external users
- business: the logic of the problem domain, what does the application do?
- data: database related stuff
- common goals
- increase the number of tiers depending on the complexity of the model & types of requests & optimizations required
- introduce extra layers of defense between attackers and your sensitive resources
- the most sensitive data should be at the bottom of the stack
- type of n-tier in which communication between tiers are forced to flow in a single direction, from top to bottom
- presentation: UI
- application: abstracts away implementation details to interact with the business layer
- business: business logic
- persistence: abstracts away the implementation details to interact with the database
- data: database
- core logic that can be extended via plugins (sidecars), and defines the interface of each plugin
- each plugin/sidecar then implements the contract
- extends the multi-tier pattern: divides an application into 3 tiers
- model: manages application data and data related functionality
- the central component of the pattern
- the dynamic data structure of the software application
- controls the data and logic of the applications
- view: displays application data and interacts with the user
- accesses the data in the model and presents it to the user
- controller: manages the interaction between the model & view layers
- handles input from the user and mediates between the model and view
- listens to external inputs from the view/user and creates appropriate outputs
- interacts with the model by calling methods to generate appropriate responses
- communication: the messaging channel connecting the model, view and controller
- e.g. an event/callback notification system
- contain state information that is passed between the 3 tiers
- e.g. an external event from the user may be transmitted to the controller to update the view
- key concepts:
- advantages:
- disadvantages:
- examples:
- evolution of MVC
- model: see MVC
- view: see MVC
- presenter: 2 types
- passive view: all UI logic is in the presenter layer
- presenter responsible for dynamically creating a static view of the current state of the model
- and the view is responsible for rendering the static content given to it by the presenter
- supervising controller: rendering logic is in within the UI, but data transformation logic is within the presenter
- model: business logic & data
- view model: interacts with the model
- view: interacts with the view model
- two-way databinding between the view data, and the model data
- extends the multi-tier pattern: two main components taking roles of service requester (client) and service provider (server)
- client: initiates & sends service requests to the server
- the client provides ports for each service it needs
- server: responds to requests from the client, potentially fulfilling the request or giving reasons why it can't
- the server provides ports for each service it provides
- communication: generally linked request/reply connectors from client/server M:1
- key concepts
- advantages:
- central computing of data: generally all files are stored in the central location for the network
- disadvantages:
- generally need a lot of compute power for the server to handle many requests
- examples:
- the internet operates on a client-server pattern
- extends the multi-tier pattern, consisting of two components: the controller and responder(s)
- controller: distributes work and copies of work-data among identical responders for processing
- aggregates the responses into a single composite value
- responsible for writing/storage of data & results from work
- manages how work is distributed among responders
- responders: processes & returns a slice of work given to it by the controller
- responsible for processing work and reading data
- communication:
- key concepts:
- advantages:
- analytic applications can be read from the responder component without changing the data content of the controller component
- responders can be taken offline and synced back without data loss
- disadvantages:
- single point of failure if the controller fails
- responders can be promoted to controller, but technical deficits exist in the time it takes to transition
- examples:
- beyond tiers, which group by function (what it is), this groups services by activity (what it does)
- heavily dependent on interface design & contracts and communication architecture patterns
- enterprise service bus: the controller responsible for orchestrating communication between services
- scoped units of services, that work in unison but scale independently to achieve a goal
- breaking endpoints into distinct units of work that can be scaled independently
- focus on data, business and function domains, analyze call patterns and dependency graphs, and determine boundaries between services that need to be scaled independently
- the extreme in service oriented architectures, with the removal of the enterprise service bus
- each service should be single purpose, stateless, independently scalable and composable
- involves creating multiple applications (i.e. microservices) that work interdependently
- although each service can be developed and deployed independently, its functionality is interwoven with other microservices
- each application is free to take on another architecture pattern if complex enough, or use a design pattern
- messaging:
- service discovery:
- key concepts
- separate CI/CD of each service: creates a streamlined delivery pipeline that increases scalability
- callable via APIs/Events
- apis: asynchronous/synchronous and typically over HTTP
- events: always asynchronous: useful to think of events as messages
- advantages:
- disadvantages:
- examples
- render data for frontend clients in a friendly and common format
- this layer is prime for a graphql service in a data access layer
- perform simple data transformations/aggregation
- permissions checks
- localization and other external business logic
- data fetching and hydration
- shared business logic consumed/sidecarred by many services
- distinct from presentation as this relates to internal business logic
- in complex situations it may be more appropriate to extract data access from the presentation layer
- encapsulate data sources into a uniform data layer
- Queues
- Messages
- Streams
- break a component into logical steps, convert each step into a service that can be reused and scaled
- core goal is to make services smaller, thereby more scalable as demand fluctuates across service boundaries
- decoupled services: the degree of decoupling has significant implications on architecture, performance and scalability
- direct service calls: either sync/async
- producer/consumer: a single producer orchestrates the invocation of multiple consumers, and handles the responses
- pipeline architectures: a form of producer/consumer, however the producer doesn't expect a response from the consumer, but accepts input from the previous stage, does its thing, and passes its output as input to the next stage
- functional pipelines in a system; create services that fulfill the needs of a particular domain
- largely based on domain-driven design patterns, as that methodology lends itself to decomposing services by domains
- product domain
- inventory domain
- etc etc, really scoped to a particular application/biz/tech context
- services driven by the data itself; focused on serving data that's used by the system, as well as data specific logic
- usually the lowest level of decomposition
- start with the data model (from the perspective of the outside world): how it's needed and how it's consumed, and what are the actions (not in terms of CRUD/REST, just plainspeak) that need to be performed on the model
- scope the service contract, don't worry about implementation details just yet
- define your service boundary and build APIs around your actions
- define a schema that supports the model (but doesnt have to match 1:1 to the model) and implement your datastore (db)
- breakdown complex business processes into discrete services each fulfilling a specific role in the overall system
- higher level of service for reusing business logic across other microservices
- enables you to encapsulate related domains, that depend on similar business processes
- business processes should never have direct access to datasources, but instead are given the data they need to operate on
- this is a hard boundary between data domains & business domains
- identify each process you want to expose
- identify the data domain each process requires as input
- define the APIs for each business process
- business processes always change, and sometimes frequently, so encapsulate the actual business process logic into its own module that can be iterated on separately from the service contract & interface
- you build your decomposition model around the atomic transaction at the data domain level
- these are synchronous blocking calls, be sure this is what you need and understand the performance implications on the system as a whole (this could become a bottleneck)
- domains requiring atomic transactions should be in a single shared database in order to build the atomic service
- when eventual consistency isn't an acceptable model, e.g. within fintech when you need to guarantee ACID transactions across domains
- atomicity: each statement (CRUD) in a transaction is treated as a single unit; either the entire statement is executed, or none of it is executed
- consistency: transactions only make changes to tables in predefined, predictable ways
- isolation: ensures concurrent transactions don't interfere with one another
- durability: changes to the data made by successfully executed transaction will be saved, even in the event of system failure
- provide failure domains & rollbacks; blocking API calls until the previous API call is complete; synchronous APIs work best, or asynchronous with a guaranteed callback mechanism
- always have clearly defined transactions, especially the conditions which cause commits to be rolled back
- don't depend on distributed transactions, the complexity is far greater than just using a synchronous API
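- a minimal atomicity sketch using stdlib sqlite3 (the account-transfer scenario is a made-up example): either both statements commit, or the failed guard rolls both back

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 150 "
                     "WHERE name = 'alice'")
        balance = conn.execute("SELECT balance FROM accounts "
                               "WHERE name = 'alice'").fetchone()[0]
        if balance < 0:
            raise ValueError("insufficient funds")  # triggers the rollback
        conn.execute("UPDATE accounts SET balance = balance + 150 "
                     "WHERE name = 'bob'")
except ValueError:
    pass

# both statements were rolled back together: atomicity held
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)]
```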
- migrate from a monolithic system into a microservices architecture
- carving/sharding functionality out of a monolith and into a microservice endpoint
- the most common pattern for moving from monolith to microservice
- start with the monolith, and encapsulate/strangle each dependency, migrating each one-by-one to a microservice endpoint, then deprecate & remove the dependency from the monolith once the microservice is fully implemented
- top down approach: start at the API/service level, and work down to the data domain
- bottom up approach: start at the data domain, and move up to the API/service level
- promote separation of concerns; all about removing repetitive code from multiple microservices into a single embed that can be utilized by any service
- offload services (e.g. operational/security functions) into distinct components that can be deployed alongside dependent services as runtime dependencies (inversion of control)
- e.g. monitoring, logging, & security etc all lend themselves to this pattern, and can either be deployed as an embedded module (e.g. logging) or as a distinct microservice itself (e.g. oauth)
- this pattern generally shouldnt create more microservices, but instead create child processes to existing microservices
- solve orchestration & ingress needs across a system as a whole
- pattern for external clients communicating with your system
- if distinct clients have diverging demands/special business logic, use the edge pattern instead
- works best when all fronted services have similar demand profiles
- clients shouldn't be able to call any/all services, but instead call a single gateway which acts as a reverse proxy to all of the services you provide
- the gateway provides aggregation/buffer/facade/proxy/decoration/oauth/etc to the services behind its fence, and is responsible for mutating, limiting and proxying requests
- keep business logic out of the gateway, while it can be done, there are more appropriate patterns (see process aggregator) that should be responsible
- adhere to strict api version control and ensure all changes are passive (non-breaking)
- implement clients (service wrappers) as distinct modules for services behind the gateway
- client specific API gateways for optimizing cost by creating distinct ingress APIs for clients with distinct business logic & demand profiles
- provides aggregation, consolidation, composition and complexity isolation away from the main API gateway used by the other clients
- directly addresses specific client scaling needs, e.g. if your customer is microsoft/apple, you likely want an edge service scoped to their needs
- identify the client, their needs & constraints, build contracts, interfaces and models
- generally recommended over pure api gateways, unless the cost of the additional hardware/service is too much
- aggregate multiple business processes into a distinct flow to solve specific goals
- usually the processes are dependent, and must be executed consecutively (and not in parallel) and can cause choke points as wait times increase
- whenever you have several business processes that must be called together and have a composite payload (multiple data domains)
- the aggregator interface should provide clients with a single API that subsequently calls each downstream process, assembles the responses from each, and returns a composite payload
- the aggregator is responsible for processing the responses and returning a unified response payload
- this is the main usecase, else you might as well use an aggregator at the api-gateway level
- however, this can cause long blocking calls if too many business processes are aggregated
- determine the business processes
- determine the processing rules (should be encapsulated and an interface presented instead of hardcoding the logic directly in the service)
- design a consolidated model (for the composite payload) based on how the processing rules modify the responses
- design an API for the actions on that model (using either REST/Actions pattern)
- wire the service and implement the internal processing
- the most common pattern for all data domain based services
- single service, single database
- perfect when the scalability demands between the db and service are related (usually proportional)
- as the demand for the service increases, so do the demands on the db, and you can allocate resources effectively by monitoring both
- thus, each data domain should have its own dedicated datastore in this pattern
- can potentially isolate the data per region, for a more sophisticated architecture
- all data domains exist within a single database
- still, you should break the data domains into schemas/keystores/etc to keep them somewhat distinct at the data level
- you should definitely still treat the data distinctly at the application code level
- each user/service that consumes each distinct data domain should also have distinct credentials for accessing those services & data domains, even though they are really accessing the same data store
- ensure you have proper segmentation at the data & service level
- this enables you to move to a single service database in the future with the least amount of friction
- you see this more in enterprises where there are contractual obligations/not enough resources to fully move to a true microservice architecture
- data distribution should be handled by the database, and not by code
- else synchronization/replication issues will eventually develop if application code is responsible for pushing data across regions/datacenters
- the most complex pattern of all the data patterns
- if implemented correctly, provides the most benefits for the use cases where its appropriate
- task based UI operations: complex write queries commit tasks, and complex read queries to support the system state at a given point in time
- eventual consistency is a must, as usually there are multiple and disparate data domains to be written to and read from
- event driven models work well, as you can react from triggers & events
- you must spend significant time in the design phase
- else it's a nightmare to maintain
- data access patterns diverge from traditional CRUD, into multi-model patterns within specific bounded contexts or data domain
- multi-interface operations, write versus read
- query interfaces may transform & aggregate the actual data schema to represent the access pattern being modeled
- write interfaces may inject behavior and other characteristics based on a specific use case being modeled
- whenever CRUD becomes a bottleneck in complex write & read scenarios
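- a minimal CQRS sketch (hypothetical names; the event log doubles as an event-sourcing store): commands go through validation and emit events, while queries serve a separate denormalized projection

```python
events = []     # append-only event log: the write side's source of truth
read_view = {}  # denormalized projection that the query side serves from

def handle_command(order_id, item, qty):
    """Write side: validate, then record what happened."""
    if qty <= 0:
        raise ValueError("qty must be positive")
    events.append({"type": "OrderPlaced", "order_id": order_id,
                   "item": item, "qty": qty})
    project(events[-1])  # real systems do this async -> eventual consistency

def project(event):
    """Update the read model from the event stream."""
    if event["type"] == "OrderPlaced":
        read_view[event["order_id"]] = f"{event['qty']} x {event['item']}"

def handle_query(order_id):
    """Read side: no business logic, just serve the projection."""
    return read_view.get(order_id)

handle_command("o-1", "widget", 3)
print(handle_query("o-1"))  # "3 x widget"
```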
- whenever you have long running transactions/complex workflows and a single blocking API call becomes unfeasible
- some problems/processes cannot be achieved in real time
- service API to trigger an event, and events cascade asynchronously
- events can trigger from posting to a messaging queue
- super powerful in distributed systems
- how you run systems, vs how you build them
- helps you answer:
- what is happening?
- when is it happening?
- where is it happening?
- why is it happening?
- who is involved?
- provide detailed visibility into the runtime characteristics of the system
- the source logs need to have a clearly defined model and structure and be consistent across the entire system for effective aggregation and holistic analysis
- in distributed systems, it's often useful to link the logs across systems
- processing logs via aggregation requires each source log to have a consistent taxonomy, similar keys & formats
- log aggregation: each service generally writes its own log output for observability, but all logs should end up in a single stream of data
- aggregation involves retrieving the logged output, then parsing, labeling/tagging, and storing it in a time-ordered fashion
- the faster you can aggregate logs, the faster you can diagnose & troubleshoot issues
- correlation of logs via tracing identifiers requires uniform design across the entire system
- enables you to recreate call stacks from errant processes
- log indexing: enables rapid searching; almost required in high-value production systems
- understand what's going on at a system level
- ensure the taxonomy is structured across all system components
- dashboard design is critical: both system and user events should be injected into dashboards
- high-level dashboards: useful for spotting trends across the system as a whole
- detailed dashboards: useful for drilling down into specific system components for further investigation
- especially if you can embed links to the log aggregation system
- trace alarms on dashboards are useful, that way oncall devs can see why they are being paged visually without having to run queries to figure out where issues are occurring
- ensure you have runbooks for all alarms
- the more call stacks span processes and networks, the less valuable code-level traces become
- you need to implement a higher level trace ID that can be delivered across processes & services
- enables you to recreate the call stack by injecting a trace ID into every call
- the trace ID should be injected at the edge/entry point to your system, and span all the way down into the data layer, perhaps even stored in the database itself
- every log message should embed the trace ID through structured logging with a common taxonomy
- always use an open-standards approach (usually specify the trace ID in the request header)
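- a sketch of trace-ID propagation into structured logs with the standard library (the x-trace-id header name is illustrative; real systems tend to use an open standard like W3C traceparent):

```python
import logging
import uuid
from contextvars import ContextVar

trace_id_var: ContextVar[str] = ContextVar("trace_id", default="-")

class TraceFilter(logging.Filter):
    """Stamp every log record with the current trace ID."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

logging.basicConfig(format="%(asctime)s trace=%(trace_id)s %(message)s",
                    level=logging.INFO)
logging.getLogger().addFilter(TraceFilter())

def handle_request(headers):
    # honor an inbound ID from the edge, or mint one at the entry point
    trace_id_var.set(headers.get("x-trace-id", uuid.uuid4().hex))
    logging.info("request received")    # every line now carries trace=...
    logging.info("calling data layer")  # forward the same ID downstream

handle_request({"x-trace-id": "abc123"})
```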
- critical when resources and services are moved across systems
- its main benefit is operational: when you need to reconfigure resources & services, it's great to have a single place to manage the runtime environment of microservices
- use tooling that makes external configuration & environment variables easy to find and manipulate
- use consistent naming conventions across services and their configurations
- ensure secrets are kept separate from configurations
- configurations are generally injected into the service, or retrieved as part of the service startup routine
- services should prioritize externalized values over embedded (default) values
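- a minimal externalized-configuration sketch (hypothetical keys): environment values win over embedded defaults, and secrets stay out of both

```python
import os

# embedded defaults; externalized values (env vars here) take priority.
# secrets should come from a secrets manager, not from this mechanism.
DEFAULTS = {"DB_HOST": "localhost", "DB_PORT": "5432", "LOG_LEVEL": "INFO"}

def load_config():
    return {key: os.environ.get(key, default)
            for key, default in DEFAULTS.items()}

os.environ["LOG_LEVEL"] = "DEBUG"  # simulate an externally injected value
assert load_config()["LOG_LEVEL"] == "DEBUG"    # externalized value wins
assert load_config()["DB_HOST"] == "localhost"  # falls back to the default
```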
- mechanism for provider services to be found by consuming services in a dynamic runtime where service identifiers/locations/etc can and do change in response to system activity
- a central location should exist that can be queried to find what services exist to accomplish each task in the system
- as services boot, they should advertise themselves to this datastore, describing their location and what services they offer
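- a toy advertise/discover sketch of that central store (in-memory stand-in; real systems use something like Consul, etcd, or cloud-native discovery):

```python
import time

registry = {}  # task -> list of {location, seen}; the central lookup store

def advertise(task, location):
    """Called by a service as it boots: where it lives, what it offers."""
    registry.setdefault(task, []).append(
        {"location": location, "seen": time.monotonic()})

def discover(task, max_age=30.0):
    """Called by consumers: only locations seen (heartbeated) recently."""
    now = time.monotonic()
    return [e["location"] for e in registry.get(task, [])
            if now - e["seen"] <= max_age]

advertise("thumbnailing", "10.0.0.7:8080")
print(discover("thumbnailing"))  # ['10.0.0.7:8080']
```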
- not as extreme (implementation-wise) as microservices
- currently restricted by the available cloud technology for advanced use-cases
- backend as a service: abstracts security (e.g. authnz), operational (e.g. logging), and database (e.g. RDS) implementation details away from the core application
- function as a service: simple utilities callable over a network
- extends the client-server architecture, utilizing a decentralized system in which peers communicate with each other directly
- abstraction X...
- communication:
- clients can connect to each other for services and connect to the server generally to retrieve the list of available clients
- key concepts:
- clients are peers and share work + responsibilities
- advantages:
- disadvantages:
- examples: