In this post, we'll walk through the end-to-end process of isolating logs for a sample Kubernetes service (myapp-service), sending them to OpenSearch for real-time querying, and archiving them to Amazon S3 to keep your search cluster lean. We'll cover configuring Fluent Bit via Helm and Terraform, setting up OpenSearch indices, managing IAM roles, troubleshooting VPC networking, and writing a scalable Python archival script.
Whether you're managing critical service logs or optimizing storage, this guide will provide a robust, automated solution to archive service-specific logs to S3 and maintain your OpenSearch cluster's performance.
1. Fluent Bit Configuration on EKS
We used the AWS EKS Fluent Bit Helm chart (aws-for-fluent-bit), managed via Terraform, to filter and route logs:
resource "helm_release" "fluentbit" {
name = "fluentbit"
repository = "https://aws.github.io/eks-charts"
chart = "aws-for-fluent-bit"
namespace = "kube-system"
values = [<<-EOT
opensearch:
enabled: true
index: "app-logs-001"
tls: "On"
awsAuth: "Off"
host: "your-opensearch-domain.amazonaws.com"
awsRegion: "ap-south-1"
httpUser: "admin"
httpPasswd: "${module.opensearch_credentials.password}"
cloudWatchLogs:
enabled: false
filters:
extraFilters: |
[FILTER]
Name grep
Match kube.*
Regex kubernetes.container_name myapp-service
# disable built-in S3, we'll inject a custom output next
s3:
enabled: false
outputs:
extraOutputs: |
[OUTPUT]
Name s3
Match *
Match_Regex kubernetes.container_name myapp-service
region ap-south-1
bucket myapp-logs-archive
use_put_object On
total_file_size 100M
s3_key_format /service-logs/$TAG/%Y/%m/%d/%H/%M/%S
EOT]
}
Key Points:
- We applied a grep filter to select only myapp-service logs
- We disabled the chart's built-in S3 output and added a custom [OUTPUT] s3 block capturing just those logs (a quick verification sketch follows this list)
- All other logs still flow to OpenSearch under the original index app-logs-001
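To sanity-check the split, a small boto3 sketch can list the first few objects the custom output has written to the archive bucket. The bucket name comes from the config above; the prefix is an assumption based on the s3_key_format shown:

# List a handful of archived objects to confirm the custom S3 output is writing
# under the expected prefix (prefix derived from the s3_key_format above).
import boto3

s3 = boto3.client("s3", region_name="ap-south-1")
resp = s3.list_objects_v2(
    Bucket="myapp-logs-archive",
    Prefix="service-logs/",
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])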
2. OpenSearch Index Setup
We created a new index app-logs-002 with an explicit date mapping so Dashboards can filter by time:
PUT /app-logs-002
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "@timestamp": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"
      }
    }
  }
}
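If you prefer to manage indices from code, the same request can be issued with opensearch-py; the endpoint and credentials below are placeholders mirroring the Helm values:

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "your-opensearch-domain.amazonaws.com", "port": 443}],
    http_auth=("admin", "your-password"),
    use_ssl=True,
)

# Same settings and @timestamp mapping as the PUT request above
client.indices.create(
    index="app-logs-002",
    body={
        "settings": {"number_of_shards": 3, "number_of_replicas": 1},
        "mappings": {
            "properties": {
                "@timestamp": {
                    "type": "date",
                    "format": "strict_date_optional_time||epoch_millis",
                }
            }
        },
    },
)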
Configuring OpenSearch Dashboards:
In OpenSearch Dashboards under Stack Management → Data Views, we:
- Selected the wildcard pattern app-logs-*
- Clicked Refresh fields and set @timestamp as the time filter
- Saved and verified logs from app-logs-002 appear in Discover (see the query sketch below)
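Outside Dashboards, a quick spot-check with opensearch-py (endpoint and credentials are placeholders, as in the earlier sketch) confirms that recent myapp-service documents are queryable through the @timestamp field:

from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "your-opensearch-domain.amazonaws.com", "port": 443}],
    http_auth=("admin", "your-password"),
    use_ssl=True,
)

# Last 15 minutes of myapp-service logs across all app-logs-* indices
resp = client.search(
    index="app-logs-*",
    body={
        "size": 5,
        "query": {
            "bool": {
                "must": [{"match_phrase": {"kubernetes.container_name": "myapp-service"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-15m"}}}],
            }
        },
    },
)
print(resp["hits"]["total"])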
3. IAM Roles: IRSA vs. Node Instance Role
Our cluster did not use IRSA, so Fluent Bit inherited the worker node instance role. We identified and configured the appropriate permissions by:
- Inspecting the aws-auth ConfigMap to find the node IAM role under mapRoles
- Attaching an inline policy granting s3:PutObject on our archive bucket:
aws iam put-role-policy \
  --role-name eks-yourCluster-NodeInstanceRole \
  --policy-name fluentbit-s3-put \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::myapp-logs-archive/*"
    }]
  }'
No new IAM roles were needed—just an extension of the existing node role.
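If you manage IAM from Python rather than the CLI, a rough boto3 equivalent of the command above (same role, policy, and bucket names) looks like this:

import json

import boto3

iam = boto3.client("iam")

# Attach the same inline s3:PutObject policy to the node instance role
iam.put_role_policy(
    RoleName="eks-yourCluster-NodeInstanceRole",
    PolicyName="fluentbit-s3-put",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::myapp-logs-archive/*",
        }],
    }),
)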
4. Networking: VPC Endpoint for S3
Our EC2 instance couldn't reach S3 initially (no NAT). We resolved this by:
- Creating a Gateway VPC Endpoint for com.amazonaws.ap-south-1.s3 in our VPC
- Associating it with the private subnets of our EC2 instance
- Applying a bucket policy restricting access to that endpoint:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "AllowFromVpcEndpoint",
    "Effect": "Allow",
    "Principal": "*",
    "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
    "Resource": "arn:aws:s3:::myapp-logs-archive/*",
    "Condition": {
      "StringEquals": {
        "aws:sourceVpce": "vpce-0abcdef1234567890"
      }
    }
  }]
}
We confirmed connectivity via curl, the AWS CLI, and a small Python test (along the lines of the sketch below).
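The Python test was a one-off PUT along these lines (the object key is illustrative); if the gateway endpoint and bucket policy are correct, it succeeds without a NAT gateway:

import boto3

s3 = boto3.client("s3", region_name="ap-south-1")

# A tiny object proves both the route to S3 and the s3:PutObject permission
s3.put_object(
    Bucket="myapp-logs-archive",
    Key="service-logs/connectivity-test.txt",
    Body=b"vpc endpoint check",
)
print("S3 reachable via the gateway endpoint")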
5. Archival Script on EC2
We developed a Python script (archive_logs_to_s3.py) that connects to OpenSearch, scrolls through logs, and uploads them to S3 in parallel batches.
Script Overview:
The script performs the following operations:
- Connects to OpenSearch with opensearch-py using elevated timeouts and retries
- Scrolls through app-logs-001 matching myapp-service
- Uploads batches in parallel to S3 under service-logs/<DATE>/<TIME>/batch-XXXXXX.json using ThreadPoolExecutor (a sketch of the upload helper follows the snippet below)
# snippet from archive_logs_to_s3.py
# (es, ES_INDEX, BATCH_SIZE, executor and upload_batch are defined earlier in
#  the script; count comes from itertools)
resp = es.search(
    index=ES_INDEX,
    scroll="2m",
    size=BATCH_SIZE,
    body={"query": {"match_phrase": {"kubernetes.container_name": "myapp-service"}}},
    request_timeout=120,
)
sid = resp["_scroll_id"]

for batch_no in count(1):
    hits = resp["hits"]["hits"]
    if not hits:
        break  # scroll exhausted; every matching document has been queued
    executor.submit(upload_batch, batch_no, hits)
    resp = es.scroll(scroll_id=sid, scroll="2m", request_timeout=120)
    sid = resp["_scroll_id"]
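The upload_batch helper is not shown in full; here is a minimal sketch of what it could look like, assuming one newline-delimited JSON file per batch and the service-logs/<DATE>/<TIME>/batch-XXXXXX.json layout from the overview (the exact implementation in archive_logs_to_s3.py may differ):

import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3", region_name="ap-south-1")
RUN_STAMP = datetime.now(timezone.utc)  # one date/time prefix per archival run

def upload_batch(batch_no, hits):
    # service-logs/<DATE>/<TIME>/batch-XXXXXX.json
    key = (
        f"service-logs/{RUN_STAMP:%Y-%m-%d}/{RUN_STAMP:%H%M%S}/"
        f"batch-{batch_no:06d}.json"
    )
    body = "\n".join(json.dumps(h["_source"]) for h in hits)
    s3.put_object(Bucket="myapp-logs-archive", Key=key, Body=body.encode("utf-8"))
    print(f"✅ Batch {batch_no}: {len(hits):,} docs → s3://myapp-logs-archive/{key}")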
Running the Script:
We executed the script in the background and monitored its progress:
nohup python3 -u archive_logs_to_s3.py > archive_full.log 2>&1 &
tail -f archive_full.log
And saw each batch upload successfully:
✅ Batch 1: 5,000 docs → s3://myapp-logs-archive/service-logs/2025-05-16/123456/batch-000001.json
...
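Progress can also be gauged independently of the nohup log by counting the batch objects already written to the archive prefix, for example with a boto3 paginator:

import boto3

s3 = boto3.client("s3", region_name="ap-south-1")
paginator = s3.get_paginator("list_objects_v2")

# Count objects written so far under the archive prefix
total = 0
for page in paginator.paginate(Bucket="myapp-logs-archive", Prefix="service-logs/"):
    total += page.get("KeyCount", 0)
print(f"{total} batch files archived so far")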
6. Performance and Cost Analysis
Performance Metrics:
- Measured throughput (first 143 batches in 254 seconds): ~2,815 docs/s
- Projected run time: ~88M docs ÷ 2,815 docs/s ≈ 8 hours 42 minutes (recomputed in the sketch below)
Cost Breakdown:
- S3 PUT requests: ~17,662 → $0.09 in API fees
- Data transfer: All in-region via gateway → no egress charges
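The figures above can be reproduced in a few lines; batch size and document count come from earlier in the post, and the $0.005 per 1,000 PUT requests is an approximation of S3 Standard request pricing:

BATCH_SIZE = 5_000          # docs per scroll page / per S3 object
TOTAL_DOCS = 88_000_000     # approximate number of myapp-service docs

docs_per_s = 143 * BATCH_SIZE / 254             # ≈ 2,815 docs/s (143 batches in 254 s)
runtime_hours = TOTAL_DOCS / docs_per_s / 3600  # ≈ 8.7 hours
put_requests = TOTAL_DOCS / BATCH_SIZE          # ≈ 17,600 PUTs (the run logged ~17,662)
put_cost_usd = put_requests / 1_000 * 0.005     # ≈ $0.09

print(docs_per_s, runtime_hours, put_requests, put_cost_usd)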
Conclusion
By combining Helm/Terraform for infrastructure management, OpenSearch for real-time querying, IAM and VPC best practices for security, and a scalable Python script for archival, we created a robust pipeline to archive service-specific logs to S3. This approach keeps your search cluster performant, archives critical logs for compliance requirements, and minimizes AWS costs.
The solution demonstrates how thoughtful architecture and automation can solve common challenges in log management at scale. Feel free to adapt these examples to your own services and infrastructure. Happy archiving!