My first production deployment didn’t start with confidence. It started with a tight deadline and a system that suddenly stopped working one day before release.
The project itself sounded simple enough, with an expected transaction volume of around 15 million rupiah in the first week. Not huge, but definitely not something you want to break in production.
What I underestimated was how quickly things would become real.
On the first day alone, the system processed over 2 million rupiah in transactions. The next day, it reached about 4 million. The only thought in my mind was: this wasn’t just a school project anymore. Every request hitting the backend carried actual value.
And naturally, things started breaking.
The biggest issue appeared at the worst possible time: the day before launch. The original implementation relied on dynamic QRIS generation through a payment gateway. It worked flawlessly in the sandbox environment, until it didn’t. Without any clear changes, the API suddenly refused to generate QRIS due to permission issues, even though dynamic QRIS was enabled in the dashboard.
With no time to wait for support, I had to think quickly.
The solution was to use static QRIS and simulate dynamic behavior by adding a unique value into each transaction amount. This value acted as an identifier, allowing the backend to map incoming payments to the correct user. Once a transaction was completed, the value could be reused for future payments with the same nominal amount.
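The matching logic can be sketched roughly like this. This is a minimal, in-memory illustration of the idea; the class and method names are hypothetical, not the actual implementation:

```typescript
// Sketch of the static-QRIS workaround: each pending payment gets a small
// unique value added to its amount, so an incoming transfer can be matched
// back to a user by its exact total. All names here are illustrative.

type PendingPayment = { userId: string; amount: number; createdAt: number };

class PaymentMatcher {
  private pending = new Map<number, PendingPayment>(); // key: exact total amount

  // Reserve a unique total for this user by adding 1..999 rupiah
  // until we find an amount no other pending payment is using.
  reserve(userId: string, baseAmount: number, now = Date.now()): number {
    for (let suffix = 1; suffix <= 999; suffix++) {
      const total = baseAmount + suffix;
      if (!this.pending.has(total)) {
        this.pending.set(total, { userId, amount: total, createdAt: now });
        return total;
      }
    }
    throw new Error("no free suffix left for this base amount");
  }

  // Match an incoming payment by its exact amount.
  // Deleting the entry frees the suffix for reuse, as described above.
  match(amount: number): string | undefined {
    const p = this.pending.get(amount);
    if (!p) return undefined;
    this.pending.delete(amount);
    return p.userId;
  }
}
```

Because the suffix is released once the payment completes, the same nominal amount can serve many users over time, as long as not too many of them are paying the same base amount simultaneously.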
To improve reliability, I implemented webhook-based verification with hash validation. Combined with a 15-minute expiration window, this approach created a lightweight but functional payment tracking system.
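A hedged sketch of that verification step, assuming an HMAC-SHA256 signature over the raw body (the header format, secret handling, and payload shape are assumptions, not the gateway’s actual contract):

```typescript
import { createHmac, timingSafeEqual } from "crypto";

const EXPIRY_MS = 15 * 60 * 1000; // the 15-minute expiration window

// Compute the expected signature for a raw webhook body.
function signPayload(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody).digest("hex");
}

// Verify the webhook signature, then enforce the expiration window
// against the pending payment's creation time.
function verifyWebhook(
  rawBody: string,
  signatureHeader: string,
  secret: string,
  paymentCreatedAt: number,
  now = Date.now()
): boolean {
  const expected = Buffer.from(signPayload(rawBody, secret), "hex");
  const received = Buffer.from(signatureHeader, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (expected.length !== received.length || !timingSafeEqual(expected, received)) {
    return false;
  }
  return now - paymentCreatedAt <= EXPIRY_MS;
}
```

Using a timing-safe comparison for the hash is a small detail, but it avoids leaking signature bytes through response-time differences.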
Of course, the first deployment didn’t go smoothly.
On that day, I deployed changes directly to production without a proper understanding of the environment. This led to a critical issue where webhook callbacks were received, but failed hash validation. As a result, legitimate transactions were not recognized by the system.
There’s a unique kind of pressure when your system fails to acknowledge real payments.
The immediate fix was manual intervention. I updated the affected records directly through MongoDB Compass to ensure users weren’t impacted. After stabilizing the situation, I corrected the validation logic and properly tested the flow before deploying again.
That incident alone was enough to change how I approached development.
By day 4, I introduced a separate development environment using a different subdomain. While still sharing the same infrastructure, this separation allowed me to test changes without risking production stability. I also implemented CORS policies to better control frontend access.
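The CORS side of that separation can be sketched as a simple origin allowlist. The subdomain names below are placeholders, not the real ones:

```typescript
// Minimal origin-allowlist sketch for CORS, assuming production and
// development frontends live on separate subdomains (hypothetical names).
const ALLOWED_ORIGINS = new Set([
  "https://app.example.sch.id", // production frontend (assumed name)
  "https://dev.example.sch.id", // development frontend (assumed name)
]);

// Returns the CORS headers to attach to a response,
// or null if the request's origin should be rejected.
function corsHeaders(origin: string | undefined): Record<string, string> | null {
  if (!origin || !ALLOWED_ORIGINS.has(origin)) return null;
  return {
    "Access-Control-Allow-Origin": origin,
    "Access-Control-Allow-Methods": "GET,POST,OPTIONS",
    "Access-Control-Allow-Headers": "Content-Type,Authorization",
  };
}
```

Echoing back only recognized origins keeps the production API usable from both frontends without opening it to arbitrary sites.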
Around the same time, another issue surfaced: rate limiting.
The initial implementation used IP-based limits, which seemed reasonable at first. However, in a school environment where most users share the same Wi-Fi network, this quickly became a problem. Multiple users were unintentionally blocked because they appeared to come from the same public IP.
The solution was straightforward but important: key the rate limiter on the authenticated user rather than the IP address. This ensured both usability and basic protection against abuse.
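A per-user limiter along those lines can be sketched as a fixed-window counter. The window size and limit below are illustrative values, not the ones actually used:

```typescript
// Rate limiter keyed per user instead of per IP, so students sharing one
// school Wi-Fi address aren't blocked together. Fixed-window counting is
// the simplest variant; limit and window are illustrative.
class UserRateLimiter {
  private hits = new Map<string, { count: number; windowStart: number }>();

  constructor(
    private limit = 30,        // max requests per window
    private windowMs = 60_000  // 1-minute fixed window
  ) {}

  allow(userId: string, now = Date.now()): boolean {
    const entry = this.hits.get(userId);
    // No entry yet, or the previous window has elapsed: start a fresh window.
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.hits.set(userId, { count: 1, windowStart: now });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count++;
    return true;
  }
}
```

Because the key is the user rather than the connection, a hundred students behind one NAT each get their own quota.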
To improve visibility, I also added basic observability tools. Using Grafana and Loki, I tracked request logs per endpoint, making it easier to identify and debug issues. Additionally, I implemented a simple audit log system to record important actions and transactions within the same database.
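The audit log itself needs very little machinery. A minimal sketch, with an in-memory array standing in for the MongoDB collection the real entries went into (field names are assumptions):

```typescript
// Minimal audit-log sketch: record who did what, and when, alongside the
// transaction data. An array stands in here for the database collection.
type AuditEntry = {
  at: string;                        // ISO timestamp
  actor: string;                     // user ID or system component
  action: string;                    // e.g. "payment.confirmed" (illustrative)
  details: Record<string, unknown>;  // action-specific context
};

const auditLog: AuditEntry[] = [];

function audit(
  actor: string,
  action: string,
  details: Record<string, unknown>
): AuditEntry {
  const entry = { at: new Date().toISOString(), actor, action, details };
  auditLog.push(entry);
  return entry;
}
```

Even this much structure pays off during debugging: when a webhook fails validation, the surrounding audit entries show exactly what the system believed at the time.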
These additions didn’t eliminate problems, but they made bug tracking much easier.
Looking back, the system was far from perfect. It was built under pressure, adjusted in real-time, and improved through trial and error. But it worked. It handled real users, real transactions, and real constraints.
It highlighted a key difference that often goes unnoticed: the difference between code that works and a system that survives real use.
This experience forced me to think beyond code to consider edge cases, user behavior, infrastructure limitations, and failure scenarios. It wasn’t just about solving problems, but solving them fast enough while the system was already in use.
For a first deployment, it was chaotic.
But it was exactly the kind of chaos that teaches the right lessons and makes me love programming.