By Nkosinathi Sangweni.
At Kaleidocode, we spend a lot of our time in real delivery environments. That means real deadlines, real distributed systems, and real production failures. Not the clean kind that sit nicely in one service with one owner, but the messy kind: the same class of failure surfacing across subsystems, half-useful traces, repeated edge cases, and engineers burning hours to confirm what the logs already suggested.

That was the pattern we wanted to break. As a consultancy, our value is not in having senior engineers repeatedly perform the first 80% of incident work by hand. Our value is in solving the harder problems: the failures that are genuinely new, the architecture decisions that need judgment, and the product work clients are actually paying us to move forward. Kaleidocode already operates as a full-service software consultancy across software delivery, testing, DevOps and AI systems, so the gap was obvious to us: too much high-value engineering time was being spent on repeatable production triage.

So we built a workflow that starts where the pain starts: in the logs. We connected our logging and operational tooling into an agentic flow built around specialized subagents. One agent reads recent production failures and groups them into meaningful error clusters. Another ranks them by frequency, subsystem and urgency. Another investigates the likely cause by pulling in the surrounding code path, recent changes and known failure patterns. When the issue is narrow enough and the fix is bounded, a resolution agent generates a patch. A validation step checks the change. Then the workflow opens a pull request and produces a triage report that shows what was fixed, what is pending, and what still needs human eyes.

The important part is that this is not one giant prompt pretending to be a platform. It is a sequence of narrow responsibilities. Triage is not investigation. Investigation is not resolution. Resolution is not validation.
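
To make the idea of narrow responsibilities concrete, here is a minimal sketch of the first two stages, clustering and ranking, as separate functions. Everything in it is an assumption for illustration: the `LogError` and `Cluster` types, the signature rule that collapses digits, and ranking by frequency alone (the workflow described above also weighs subsystem and urgency). It is not the actual Kaleidocode implementation.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class LogError:
    # Hypothetical shape of one production error record.
    subsystem: str
    error_type: str
    message: str


@dataclass
class Cluster:
    signature: Tuple[str, str, str]
    count: int
    share: float  # fraction of all errors in the window


def signature(err: LogError) -> Tuple[str, str, str]:
    """Replace digit runs so repeats that differ only by ids or durations collapse."""
    normalised = re.sub(r"\d+", "<n>", err.message)
    return (err.subsystem, err.error_type, normalised)


def triage(errors: List[LogError]) -> List[Cluster]:
    """Triage stage: group errors into clusters and nothing more."""
    counts = Counter(signature(e) for e in errors)
    total = len(errors)
    return [Cluster(sig, n, n / total) for sig, n in counts.items()]


def rank(clusters: List[Cluster]) -> List[Cluster]:
    """Ranking stage: most frequent clusters first (frequency only in this sketch)."""
    return sorted(clusters, key=lambda c: c.count, reverse=True)


# Usage with made-up records.
errors = [
    LogError("billing", "TimeoutError", "request 4812 timed out after 30s"),
    LogError("billing", "TimeoutError", "request 4813 timed out after 30s"),
    LogError("auth", "KeyError", "missing key 'user_id'"),
]
top = rank(triage(errors))[0]
print(top.count, round(top.share, 2))  # prints: 2 0.67
```

Keeping `triage` and `rank` as separate functions mirrors the separation of roles: each stage can be tested, replaced, or handed to a different subagent without touching the others.
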
We found that once each step had a defined role, the system stopped behaving like a clever demo and started behaving like an engineering workflow.

One report made the value obvious. In a single 24-hour window, the workflow processed 173 production errors across six subsystems. One issue alone accounted for 63% of the total. The system identified it, classified it, generated the relevant fix path, and surfaced the exact items that still needed human handling. Instead of asking an engineer to spend the morning reading through repeated failures, the workflow reduced the problem to a reviewable set of actions.

That changed the conversation inside the team. We were no longer asking, “Who is free to go through these logs?” We were asking, “Which of these items actually deserves an engineer?”

That distinction matters in consulting. Clients do not benefit when strong developers spend large parts of the week rewriting the same defensive fixes or producing the same summary by hand from production noise. They benefit when those developers are available for the work that requires judgment: reshaping an unreliable flow, solving a client-specific integration problem, challenging assumptions in the design, or delivering the next piece of roadmap value.

The agentic workflow gave us a way to protect that time without pretending every production issue can be automated. Some errors still need people. Missing tracebacks, unclear ownership, infrastructure-level failures, or anything with broader architectural impact still gets escalated. The workflow does not try to be brave where it should be cautious. It takes away the repetitive operational drag, and lets us get on with the real engineering work.
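
The escalation rule in the last paragraph can be sketched as a small gate that runs before any patch is attempted. This is an illustrative assumption, not the production logic: the `Finding` fields and the `route` function are hypothetical names, and the four checks simply mirror the four escalation cases listed above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    # Hypothetical output of the investigation stage for one error cluster.
    cluster_id: str
    traceback: Optional[str]    # None when the logs carried no traceback
    owner: Optional[str]        # None when no team clearly owns the code path
    infrastructure_level: bool  # failure sits below the application code
    architectural_impact: bool  # a fix would reach beyond a bounded change


def escalation_reasons(finding: Finding) -> List[str]:
    """Collect every reason this finding should go to a human."""
    reasons = []
    if not finding.traceback:
        reasons.append("missing traceback")
    if not finding.owner:
        reasons.append("unclear ownership")
    if finding.infrastructure_level:
        reasons.append("infrastructure-level failure")
    if finding.architectural_impact:
        reasons.append("broader architectural impact")
    return reasons


def route(finding: Finding) -> str:
    """Escalate if any reason applies; otherwise allow automated resolution."""
    return "escalate" if escalation_reasons(finding) else "auto-resolve"
```

Returning the list of reasons, rather than a bare yes or no, lets the triage report say why an item still needs human eyes.
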