Tempesta Technologies

Defending Against L7 DDoS and Web Bots with Tempesta FW

By Alexander Krizhanovsky | Posted on December 1, 2025

Application-layer DDoS attacks (L7 DDoS) have existed almost since the beginning of the Internet. Unsurprisingly, UNIX and Linux system-administration tools have evolved as well, allowing engineers to fight these attacks more efficiently. There are plenty of open-source tools intended to mitigate L7 DDoS, but in practice many of them still rely heavily on processing text files — which is no longer feasible given the scale of modern application-layer attacks.

While L7 DDoS targets computational resources, other types of malicious bots target different resources and may also cause denial of service. Examples include:

  • Web-scraping bots, which drove LWN to resource exhaustion
  • Shopping bots, which hoard inventory and make goods unavailable to real customers
  • Booking bots, which consume all available appointment slots, preventing legitimate users from booking (I have personally struggled with this)
  • Security scanners and password-cracking bots

The AI boom amplifies the problem. At a recent startup-founders event, I had quick conversations with attendees: roughly two-thirds were building AI products, and their applications rely on data scraped from countless sources — from universities to LinkedIn. It’s no surprise that many companies, including well-funded ones, are building specialized proxy services to enable large-scale web scraping and make their bots invisible to protection systems.

With Tempesta FW 0.8, we provide a high-performance open-source foundation for data analysis to efficiently combat modern bot attacks. The upcoming 0.9 release, already available in the main development branch, introduces a comprehensive toolset for automated detection and blocking of sophisticated L7 DDoS attacks. WebShield builds on this by offering an open-source, flexible toolkit to defend against diverse classes of web bots.

Teaser: this article discusses only open-source technologies and, in addition to many technical details, explains why open source can defend against even the most sophisticated attacks more effectively than large proprietary vendors such as Cloudflare or Akamai.

Tempesta Fingerprints

Tempesta FW 0.8 introduces traffic filtering based on several types of client fingerprints.

Client fingerprinting has been around for a while and is widely used in application-level (L7) DDoS mitigation. JA3 is a popular fingerprinting method for client identification, including for DDoS protection. JA4 extends this further and provides hashes for TCP, TLS, HTTP, and even some destination-level characteristics.

JA3 relies on MD5 for hash calculation, while JA4 moves to SHA-256. Cryptographically strong hash functions are normally used to prevent an attacker from finding an alternate clear text for a particular signature. However, there is no need to guess network characteristics for client impersonation — an attacker can simply reproduce the required protocol stack (see curl-impersonate as an example). Moreover, the original JA3 blog post notes that fuzzy hashing may be more beneficial, and also mentions MD5 performance issues.

Since cryptographic computations are expensive, especially under DDoS conditions, we want to avoid unnecessary CPU overhead. For this reason, we designed Tempesta Fingerprints (TF) — a lightweight client-fingerprinting mechanism with a fixed structure in a binary hash format. Currently we compute TF at the TLS (tft) and HTTP (tfh) layers (hence the suffixes `t` and `h` in the hash names). The hashes are computed as follows:

TLS (tft)

  1. 3 bits: ALPN: “h2”, “http/1.1”, “http/1.1,h2”, “h2,http/1.1”
  2. 1 bit: set if handshake has unknown ALPN
  3. 1 bit: found vhost for SNI
  4. 1 bit: abbreviated handshake
  5. 1 bit: TLS version
  6. alignment to 1 byte
  7. 2 bytes: sum * 11 + cipher_suite (11 is just a small prime, relatively far from a power of 2). This scheme represents the order of ciphersuites.
  8. 2 bytes: sum * 11 + extension_type
  9. 2 bytes: sum * 11 + elliptic_curve

HTTP (tfh)

  1. 1 bit: http version (h1 or h2)
  2. 5 bits: HTTP method (tfw_http_meth_t value)
  3. 5 bits: number of Cookie values (all bits set for 31 and more cookies, within one or several headers)
  4. 6 bits: number of headers (all bits set for 63 and more headers)
  5. 1 bit: has Referer
  6. alignment to 3 bytes
  7. 4 bytes: sum * 11 + header, where header is a 4-byte HTTP/1 header prefix, or a value from the static table or the decoded dynamic table for HTTP/2
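For intuition, the order-sensitive rolling sum used in the 2-byte and 4-byte fields above can be sketched in Python. This is an illustration of the scheme, not the actual kernel code; the sample cipher-suite values are arbitrary:

```python
# Illustrative sketch of the order-sensitive rolling sum used for the
# cipher-suite, extension, and elliptic-curve fields (not the kernel code).
def rolling_hash(values, width_bits=16):
    """Fold a sequence into a fixed-width value: sum = sum * 11 + v."""
    mask = (1 << width_bits) - 1
    s = 0
    for v in values:
        s = (s * 11 + v) & mask  # 11: a small prime, far from a power of 2
    return s

# The scheme encodes order: permuting the input changes the result.
a = rolling_hash([0x1301, 0x1302, 0x1303])  # arbitrary TLS 1.3 cipher suites
b = rolling_hash([0x1303, 0x1302, 0x1301])  # same suites, different order
```

Unlike a cryptographic digest, this keeps similar inputs numerically close, which is exactly the property the fingerprint design aims for.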

The structure of the hash is simple and transparent, and you can even infer certain connection or request properties from it. More importantly, the hash encodes similarity. For example, if two requests differ only in their HTTP version, their hashes will be “close” to each other. Conversely, if two HTTP requests have nothing in common, their hashes will be very “far apart.”

This behavior is extremely useful for machine-learning classification, as it allows us to cluster clients efficiently. In contrast, the key property of a strong cryptographic hash function is that a single-bit change in the input produces a completely different output. This is ideal for cryptography, but undesirable for ML-driven similarity classification.

When TF is enabled, Tempesta FW computes TLS and HTTP fingerprints and logs them in the access log. This allows third-party systems — such as WAFs or bot-detection engines — to use these hashes to identify and classify clients (IP addresses are not reliable identifiers: they may change, and many legitimate users can share the same IP address). The classification system can then issue filtering rules back to Tempesta FW to accelerate WAF operations:

tft storage_size=1073741824 {
    hash 66cbda9cafc40009 0 0;
    hash 66cbda9cafc40010 10 1000;
}

tfh storage_size=1073741824 {
    hash b1c0008c0280 0 0;
    hash b1c0008c0281 10 1000;
}

The hash statements in the tft and tfh sections define two rate limits for a particular TF hash. The first limit is the maximum number of connections per second, applied to both TLS and HTTP. The second limit refers to the maximum number of TLS records or HTTP requests per second, respectively. A limit of zero means that clients presenting that fingerprint are blocked.
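To make the two-limit semantics concrete, here is a hypothetical per-fingerprint limiter. This is a sketch of the rule semantics only, not Tempesta FW's implementation:

```python
import time
from collections import defaultdict

class TfLimiter:
    """Sketch of a `hash <fp> <conn/s> <req/s>` rule: a zero limit blocks."""
    def __init__(self, rules):
        self.rules = rules  # fp -> (conn_per_sec, req_per_sec)
        # fp -> [connections, requests, window_start]
        self.counters = defaultdict(lambda: [0, 0, 0.0])

    def _bump(self, fp, idx, limit, now):
        c = self.counters[fp]
        if now - c[2] >= 1.0:        # start a new one-second window
            c[0] = c[1] = 0
            c[2] = now
        c[idx] += 1
        return c[idx] <= limit       # True: allow, False: drop

    def allow_connection(self, fp, now=None):
        if fp not in self.rules:
            return True              # no rule: only statistics are kept
        conn_limit, _ = self.rules[fp]
        if conn_limit == 0:
            return False             # a zero limit means "block"
        return self._bump(fp, 0, conn_limit, time.time() if now is None else now)

    def allow_request(self, fp, now=None):
        if fp not in self.rules:
            return True
        _, req_limit = self.rules[fp]
        if req_limit == 0:
            return False
        return self._bump(fp, 1, req_limit, time.time() if now is None else now)
```

With the example rules above, the first fingerprint is blocked outright, while the second is allowed up to 10 connections and 1000 requests per second.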

Under normal circumstances, you start Tempesta FW with an empty set of TF filtering rules, like:

tft storage_size=1073741824 {
}

tfh storage_size=1073741824 {
}

This empty configuration tells Tempesta FW that it should begin tracking client rate-limit statistics.

When a DDoS attack or unwanted bot activity occurs, you collect the TF hashes specific to the malicious traffic. Then you generate the corresponding tfh and/or tft configuration sections and reload the Tempesta FW configuration. It is important to reload the configuration rather than restart Tempesta FW, so that legitimate client connections are not disrupted. At the moment the TF hash filtering rules are reloaded, Tempesta FW already has the necessary statistics for all currently active clients and can begin filtering malicious traffic immediately.
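The regeneration step is easy to script. For example, a small helper (hypothetical; the hash values are taken from the example configuration above) could render collected hashes into a tft or tfh section:

```python
def render_tf_section(name, storage_size, rules):
    """Render a tft/tfh config section from (hash, conn_limit, rate_limit) tuples."""
    lines = [f"{name} storage_size={storage_size} {{"]
    for h, conn_limit, rate_limit in rules:
        lines.append(f"    hash {h} {conn_limit} {rate_limit};")
    lines.append("}")
    return "\n".join(lines)

# Block one fingerprint outright, rate-limit another:
print(render_tf_section("tft", 1073741824,
                        [("66cbda9cafc40009", 0, 0),
                         ("66cbda9cafc40010", 10, 1000)]))
```

The output can be written to a file included by the main configuration, followed by a reload.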

There may be millions of clients, and accounting for all of them is impractical. Instead, Tempesta FW allocates a fixed amount of memory — defined by the storage_size parameter — for client accounting. The accounting (rate-limiting) records are stored in an LRU structure, meaning that only the most active clients occupy accounting space; this design is extremely efficient for DDoS scenarios. For better performance, Tempesta DB uses 2MB huge pages to store accounting records, so storage_size must be a multiple of 2MB (2,097,152 bytes).
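A quick, illustrative sanity check of the 2MB constraint:

```python
HUGE_PAGE = 2 * 1024 * 1024  # 2MB huge pages used by Tempesta DB

def valid_storage_size(n):
    """storage_size must be a positive multiple of the 2MB huge-page size."""
    return n > 0 and n % HUGE_PAGE == 0

# 1073741824 bytes (1GB), as used in the examples above, is 512 huge pages.
assert valid_storage_size(1073741824)
assert 1073741824 // HUGE_PAGE == 512
```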

Access Logs in ClickHouse

There are several problems with traditional web-server access logs:

  1. They produce an enormous number of records and therefore a huge volume of data. It’s common for a busy HTTP server to generate 100k+ log entries per second, and each entry can easily exceed 500 bytes depending on the application. Efficient storage and management of this data becomes a challenge.
  2. This volume also creates performance issues. In a caching HTTP proxy, where most data is stored in RAM, the access log is typically the component that generates the highest number of disk writes. Small bursts can be buffered by the OS, but sustained high write rates quickly become a bottleneck. For example, we observed access-log writing becoming the bottleneck for a 100Gbps CDN node running Nginx.
  3. The logs are written to be analyzed later, but grep and similar tools aren’t efficient enough to process huge logs quickly and react to an incident such as an L7 DDoS or a bot attack.

To address these problems, sophisticated log-shipping pipelines are often used. An HTTP server writes logs either to a file (e.g., Nginx) or to syslog via a UNIX domain socket (e.g., HAProxy). Then log shippers such as Filebeat + Logstash, Vector, or Fluent Bit collect the logs in an intermediate format and forward them to a database.

Many types of databases can serve as log destinations — we have seen ClickHouse, MongoDB, TimescaleDB, and InfluxDB used for this purpose. However, we settled on ClickHouse for two main reasons:

  1. Ingestion performance that is among the best available
  2. Powerful analytical capabilities, which are critical for identifying security incidents

With an analytical database like ClickHouse storing access logs, you can derive nearly any form of analytics or performance monitoring: top page views, clusters of the most active users, the most visited pages, the slowest endpoints, and more. It’s essentially Google Analytics on your own infrastructure.

Tempesta FW Fast Log Shipping

We designed a new access-log shipping daemon, tfw_logger, from scratch to achieve maximum performance and scalability. The daemon’s configuration is described in our wiki, but now let’s look at the technical details of the solution.

Tempesta FW is a Linux-kernel HTTP accelerator that operates as part of the TCP/IP stack. Essentially, you can think of it as a protocol extension: the vanilla kernel handles IP, TCP, and TLS, and we extend this stack with HTTP processing. The Linux TCP/IP stack and Tempesta FW both run in the context of deferred interrupts — softirqd kernel threads. Each softirqd runs on a dedicated CPU.

tfw_logger also spawns threads with per-CPU affinity, so each kernel softirqd thread has a corresponding user-space tfw_logger thread. Each local-CPU pair, softirqd and tfw_logger, communicates through an mmap()-ed ring buffer, meaning all access-log events are passed from the kernel to user space with zero copying, and there is no contention on multi-core hardware.

The events are written in a binary format to avoid data conversion overhead and to reduce the total data volume. Each tfw_logger thread maintains a dedicated connection to ClickHouse and uses a write buffer to send data in sufficiently large batches.

The ring buffer may overflow if tfw_logger or ClickHouse cannot keep up (in benchmarks, ClickHouse is typically the bottleneck unless it has significantly more hardware resources than Tempesta FW). If this happens, Tempesta FW drops events, but the next successfully written event will include a counter indicating how many events were dropped.
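The drop-accounting behavior can be modeled with a simplified bounded queue (a single-producer sketch, not the actual mmap()-ed ring):

```python
from collections import deque

class LogRing:
    """Bounded event queue: on overflow, events are dropped, and the next
    successfully enqueued event carries the number of drops since the last one."""
    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity
        self.dropped = 0                 # drops since the last successful write

    def push(self, event):
        if len(self.buf) >= self.capacity:
            self.dropped += 1            # consumer too slow: drop and count
            return False
        self.buf.append({"event": event, "dropped_events": self.dropped})
        self.dropped = 0
        return True

    def pop(self):
        return self.buf.popleft() if self.buf else None
```

This is how a non-zero dropped_events value ends up attached to an otherwise ordinary log record, as in the query below.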

:) SELECT DISTINCT timestamp,uri,user_agent,tfh,dropped_events
   FROM access_log LIMIT 15;

┌───────────────timestamp─┬─uri─┬─user_agent─┬─────────────────tfh─┬─dropped_events─┐
│ 2025-11-05 16:10:30.250 │ /   │ baremetal  │ 6576814795386782464 │             11 │
│ 2025-11-05 16:10:30.402 │ /   │ vm.web     │ 6576814795386782464 │              0 │
│ 2025-11-05 16:10:29.145 │ /   │ tempesta-1 │ 6576814795386782464 │              0 │
│ 2025-11-05 16:10:30.574 │ /   │ tempesta-2 │ 6576814795386782464 │              0 │
│ 2025-11-02 16:00:03.285 │ /   │ tempesta-2 │ 6576814795386782464 │              3 │
│ 2025-11-02 16:00:04.104 │ /   │ tempesta-1 │ 6576814795386782464 │              0 │
│ 2025-11-02 16:00:04.234 │ /   │ baremetal  │ 6576814795386782464 │              0 │
│ 2025-11-02 16:00:04.630 │ /   │ vm.web     │ 6576814795386782464 │              0 │
│ 2025-11-02 16:00:05.286 │ /   │ tempesta-2 │ 6576814795386782464 │           1380 │
│ 2025-11-02 16:00:06.104 │ /   │ tempesta-1 │ 6576814795386782464 │              0 │
│ 2025-11-02 16:00:06.234 │ /   │ baremetal  │ 6576814795386782464 │              0 │
│ 2025-11-02 16:00:06.632 │ /   │ vm.web     │ 6576814795386782464 │              0 │
│ 2025-11-02 16:00:07.284 │ /   │ tempesta-2 │ 6576814795386782464 │              0 │
│ 2025-11-02 16:00:08.103 │ /   │ tempesta-1 │ 6576814795386782464 │              0 │
│ 2025-11-02 16:00:08.235 │ /   │ baremetal  │ 6576814795386782464 │              0 │
└─────────────────────────┴─────┴────────────┴─────────────────────┴────────────────┘

ClickHouse running in the same VM as Tempesta FW and tfw_logger, with the default configuration (i.e., all traces switched on), can ingest about 34K records per second. This is a decent number for a 4-vCPU, 8GB RAM KVM guest running on the performance cores of an i9-12900HK laptop. For this benchmark, we used "max_events": 100000 – the number of batched records for tfw_logger.

Identifying and Blocking Shopping Bots

Let’s look at a couple of bot problems and how we can solve them using analytics. We’ll start with shopping bots that hoard shopping carts. These bots often rotate through thousands of IP addresses, sometimes using residential proxies. Their User-Agent headers are usually identical to those of real browsers, making simple pattern-based detection ineffective.

One key difference from legitimate traffic is the large number of “add to cart” requests. We begin by identifying clients — based on their fingerprints and IP addresses — who made at least 20 requests in the last 10 minutes, and whose requests are 70% or more to /cart/:

:) SELECT tfh, tft, address, tot_n, cart_n,
          round((100. * cart_n) / tot_n, 2) AS cart_pct
   FROM (
       SELECT tfh, tft, address, count() AS tot_n,
              sum(if(startsWith(uri, '/cart/'), 1, 0)) AS cart_n
       FROM access_log
       WHERE timestamp >= now() - INTERVAL 10 MINUTE
       GROUP BY tfh, tft, address
   ) WHERE tot_n >= 20 AND (cart_n / tot_n) >= 0.7
   ORDER BY cart_pct DESC, cart_n DESC
   LIMIT 10;

┌──────────────────tfh─┬──────────────────tft─┬─address────────┬─tot_n─┬─cart_n─┬─cart_pct─┐
│ 10096416430024820544 │  7407100078318223377 │ ::ffff:x.x.x.x │    71 │     71 │      100 │
│ 12147574827842808512 │  7407100078318223377 │ ::ffff:x.x.x.x │    30 │     30 │      100 │
│ 15535994163971229248 │  7407183517036511248 │ ::ffff:x.x.x.x │    20 │     20 │      100 │
│  1563756704385729984 │ 10415425527688331280 │ ::ffff:x.x.x.x │    20 │     20 │      100 │
│  8706882686000311104 │  7407100078318223377 │ ::ffff:x.x.x.x │    20 │     20 │      100 │
│  4625652633599487168 │ 10415504629338275857 │ ::ffff:x.x.x.x │    20 │     20 │      100 │
│  9762503995733845952 │ 10415504629338275857 │ ::ffff:x.x.x.x │    20 │     20 │      100 │
│  4895812304220465472 │  7407100078318223377 │ ::ffff:x.x.x.x │    20 │     20 │      100 │
│ 12159293637520000448 │  7407183517036511248 │ ::ffff:x.x.x.x │    20 │     20 │      100 │
│ 15535994163971229248 │  7407183517036511248 │ ::ffff:x.x.x.x │    20 │     20 │      100 │
└──────────────────────┴──────────────────────┴────────────────┴───────┴────────┴──────────┘

The query result showed masked IP addresses, but all of them were different. The HTTP fingerprints were also all different. However, we noticed that the TLS fingerprints exhibited much lower variability, so we reran the query grouping clients only by TLS fingerprints.

To obtain more than one row, we adjusted the previous query by lowering the threshold to (cart_n / tot_n) >= 0.3. The result was as follows:

┌──────────────────tft─┬─tot_n─┬─cart_n─┬─cart_pct─┐
│  7407100078318223377 │ 32920 │  32081 │    97.45 │
│  7407183517036511248 │   130 │     63 │    48.46 │
│  6567475626450092048 │   140 │     48 │    34.29 │
│  6567475626450092080 │    45 │     15 │    33.33 │
│ 10415504629338275857 │  1262 │    406 │    32.17 │
└──────────────────────┴───────┴────────┴──────────┘

There is only one TLS fingerprint for which cart requests exceed 50%. This fingerprint generated significantly more requests than all the others. A practical explanation for this result is that, in many cases, vulnerable network devices (e.g., home routers with outdated firmware or software) are exploited and used as residential proxies. This gives the attacker thousands of IP addresses that are not classified as “data center IPs.”

Sometimes such devices expose a characteristic TLS fingerprint, while the proxy code operating at the application layer focuses on mimicking real browser HTTP fingerprints.

Security Scanning Bots

Roughly 40% of all websites run on WordPress, so it was important for us to evaluate Tempesta FW solutions against WordPress. xmlrpc.php has long been a security concern in WordPress, and unsurprisingly, we observe large volumes of requests targeting this endpoint:

:) SELECT uri, count(*) AS hits FROM access_log
   GROUP BY uri ORDER BY hits DESC LIMIT 5;

┌─uri────────────┬───hits─┐
│ /              │ 894986 │
│ //xmlrpc.php   │  56386 │
│ /xmlrpc.php    │   6905 │
│ //wp-login.php │   2664 │
│ /wp-login.php  │   2348 │
└────────────────┴────────┘

Usually this endpoint is exploited through POST requests — let’s verify that (in this query, we use the HTTP methods table):

:) SELECT if(method=3,'GET',if(method=10,'POST','OTHER')) AS http_method,
          status, count() AS hits
   FROM access_log WHERE uri LIKE '%xmlrpc.php'
   GROUP BY http_method, status ORDER BY hits DESC;

┌─http_method─┬─status─┬──hits─┐
│ POST        │    200 │ 63350 │
│ GET         │    405 │    18 │
│ POST        │    504 │     7 │
│ GET         │    404 │     1 │
└─────────────┴────────┴───────┘

Now let’s run roughly the same query as in the previous case, but this time excluding IP addresses, since they are also quite different:

:) SELECT tfh, tft, tot_n, wp_n, round(100.0 * wp_n / tot_n, 2) AS wp_pct
   FROM (
       SELECT tfh, tft, count() AS tot_n,
              sum(if(endsWith(uri, 'xmlrpc.php'), 1, 0)) AS wp_n
       FROM access_log WHERE timestamp >= now() - INTERVAL 1 MONTH
       GROUP BY tfh, tft
) WHERE tot_n >= 100 AND (wp_n / tot_n) >= 0.7
ORDER BY wp_pct DESC, wp_n DESC;

┌──────────────────tfh─┬──────────────────tft─┬─tot_n─┬──wp_n─┬─wp_pct─┐
│     4052383250776832 │ 16510119553361641488 │ 53808 │ 53808 │    100 │
│     4052383250776832 │                    0 │  4038 │  4038 │    100 │
│    44583087680004992 │ 16510119553361641488 │  2660 │  2660 │    100 │
│    44583087680004992 │                    0 │   830 │   830 │    100 │
│ 11419835049519350848 │                    0 │   301 │   301 │    100 │
│ 14842101498202555840 │   952257393946198037 │   268 │   268 │    100 │
│  1846240697765069696 │                    0 │   156 │   156 │    100 │
│ 12574953975358948224 │   127210661679530000 │   130 │   130 │    100 │
│ 12083946343356696000 │   952257393946198037 │   111 │   111 │    100 │
│  6203221905573938624 │   952257393946198037 │   108 │   108 │    100 │
│ 15360351046914803136 │   952257393946198037 │   108 │   108 │    100 │
│  9197298789346968640 │                    0 │   302 │   300 │  99.34 │
└──────────────────────┴──────────────────────┴───────┴───────┴────────┘

Zero values for the TLS fingerprint indicate plain HTTP. Note that this query covers a one-month period, yet there are only 3 different TLS fingerprints! The bots also repeatedly hit the exact same URI.

HTTP fingerprints are more diverse, but still — only 11 different fingerprints for more than 60,000 requests.

Now we need to decide how to handle these malicious requests. We can block them based on their TLS hashes or by using Tempesta FW’s HTTP tables. To avoid blocking legitimate users along with the bots, let’s check the TLS hashes against the accessed URIs and print the hashes in hex format:

:) SELECT hex(tft) AS tft_hex, left(uri, 40) AS uri_short,
          count() AS hits
   FROM access_log WHERE tft IN ('16510119553361641488',
                                 '952257393946198037',
                                 '127210661679530000')
   GROUP BY tft_hex, uri_short
   ORDER BY hits DESC;

┌─tft_hex──────────┬─uri_short────────────────────────────────┬──hits─┐
│ E51FBA42695A0010 │ //xmlrpc.php                             │ 60084 │
│ E51FBA42695A0010 │ //wp-login.php                           │  2944 │
│ 0D37190DF4E70015 │ /xmlrpc.php                              │  1039 │
│ E51FBA42695A0010 │ //?author=1                              │    61 │
│ E51FBA42695A0010 │ //wp-json/wp/v2/users/                   │    61 │
│ E51FBA42695A0010 │ /                                        │    53 │
│ 01C3F16C3D0D0010 │ /xmlrpc.php                              │    45 │
│ E51FBA42695A0010 │ //wp-includes/wlwmanifest.xml            │    42 │
│ E51FBA42695A0010 │ //?author=2                              │    42 │
│ E51FBA42695A0010 │ //?author=3                              │    21 │
│ E51FBA42695A0010 │ //wp-includes/ID3/license.txt            │    19 │
│ E51FBA42695A0010 │ //feed/                                  │    19 │
│ E51FBA42695A0010 │ /blog/lean-video-co2000nferencing-billin │     4 │
│ E51FBA42695A0010 │ /blog/fast-programming-languages-c-cpp-r │     3 │
│ 0D37190DF4E70015 │ /?author=9                               │     1 │
│ 0D37190DF4E70015 │ /?author=4                               │     1 │
│ E51FBA42695A0010 │ /blog/tempesta-fw-0-7-release-wordpress- │     1 │
│ 0D37190DF4E70015 │ /?author=8                               │     1 │
│ E51FBA42695A0010 │ /blog/web-cache-poisoning/               │     1 │
│ 0D37190DF4E70015 │ /?author=6                               │     1 │
│ 0D37190DF4E70015 │ /?author=7                               │     1 │
│ 01C3F16C3D0D0010 │ /wp-json/wp/v2/users                     │     1 │
│ 0D37190DF4E70015 │ /?author=10                              │     1 │
│ 01C3F16C3D0D0010 │ /blog/                                   │     1 │
│ 0D37190DF4E70015 │ /?author=5                               │     1 │
│ 0D37190DF4E70015 │ /?author=3                               │     1 │
│ 0D37190DF4E70015 │ /?author=2                               │     1 │
└──────────────────┴──────────────────────────────────────────┴───────┘

While most of the requests are invalid, there are several requests, in lines 11, 15, 21-23, 26, 28 and 33, that look normal. However, all of them originate from the same TLS fingerprint hash, with the four least significant bits equal to 0.

We developed our own fingerprinting hash instead of using JA3 or JA4 so that it can be decoded directly from the logs. The hash is stored as a raw C structure:

typedef struct {
    unsigned char alpn:3;
    unsigned char has_unknown_alpn:1;
    unsigned char vhost_found:1;
    unsigned char is_abbreviated:1;
    unsigned char is_tls1_3:1;
    unsigned short cipher_suite_hash;
    unsigned short extension_type_hash;
    unsigned short elliptic_curve_hash;
} TlsTft;

i.e., the 4 least significant bits encode the ALPN (Application-Layer Protocol Negotiation; typically “http/1.1”, or “h2” for HTTP/2) value, and a value of 0 means “no ALPN at all.” This is not normal for regular browsers. Because of this, we can safely conclude that no legitimate users share these three TLS fingerprints, and we can simply block all such traffic. However, in this particular case there are more efficient ways to block these requests than relying on TLS fingerprints.
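Under this layout, checking a logged hash for a missing ALPN is straightforward. A sketch, assuming (as in the tables above) that the low nibble of the printed hash carries the ALPN bits:

```python
def has_no_alpn(tft_hex):
    """The 4 least-significant bits of tft hold the ALPN code (plus the
    unknown-ALPN flag); all zeros means the client sent no ALPN at all."""
    return int(tft_hex, 16) & 0xF == 0

# Hashes from the tables above: the suspicious client presents no ALPN,
# which regular browsers never do.
assert has_no_alpn("E51FBA42695A0010")
assert not has_no_alpn("0D37190DF4E70015")
```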

XML-RPC is a well-known security issue in WordPress, and it is still enabled by default. Although it can be disabled on the WordPress side, the endpoint is typically exploited via POST requests, which means your backend will still receive them. Since XML-RPC pingback can be abused for large-scale L7 DDoS attacks, the volume of such requests can be significant. We recommend disabling this endpoint at the web-accelerator layer. You can do this with the following Tempesta FW configuration:

http_chain {
        uri == "*/xmlrpc.php" -> block;
}

Now, let’s take a look at one more L7 DDoS attack. 

Application-level DDoS: Slow HTTP Attacks

One of the main goals of Tempesta FW is efficient mitigation of DDoS attacks, and the system includes many rate-limiting and filtering mechanisms to combat L7 DDoS. At the application layer, we can distinguish two major categories: flood attacks, which have a clear signature of sending a large number of HTTP requests, and slow attacks, which also heavily exhaust server resources but typically have much more subtle signatures. The second type is far more interesting, so let’s consider an example.

Instead of using an easy-to-detect technique, such as opening many connections and sending an HTTP request over each at an extremely low rate, a targeted DDoS attack may address the heaviest endpoint and request it from thousands of IP addresses. If they mimic real browsers, such an attack can be difficult to mitigate without degrading user experience with challenges such as CAPTCHAs.

We cannot simply select clients with the highest cumulative response time, because the most popular fingerprint generates the largest number of requests; even if each request is fast, the total response time becomes huge. We also cannot simply query the top request_time values, because normal clients occasionally produce rare outliers. However, the intersection of the top 20 talkers and the top 20 cumulative response times gives us an interesting result:

:) SELECT hex(tft) AS tft_hex,
          sum(response_time) AS tot_resp_time, count() AS tot_req,
          sum(response_time)/count() AS avg_resp_time
   FROM access_log
   GROUP BY tft
   HAVING tft IN (
       SELECT tft FROM (
           SELECT tft, sum(response_time) AS s
           FROM access_log
           GROUP BY tft ORDER BY s DESC LIMIT 20
       )
   ) AND tft IN (
       SELECT tft FROM (
           SELECT tft, count() AS c
           FROM access_log
           GROUP BY tft ORDER BY c DESC LIMIT 20
       )
   ) ORDER BY avg_resp_time DESC;

┌─tft_hex──────────┬─tot_resp_time───────┬─tot_req────────┬──avg_resp_time─────┐
│ 398A2452C0320030 │            21264088 │           7735 │ 2749.0740788623143 │
│ 398A9769C0320010 │            24137024 │          11469 │ 2104.5447728659865 │
│ B94C7202A1480035 │            12519004 │          82918 │ 150.98053498637208 │
│ E51FBA42695A0015 │             8315604 │          58670 │ 141.73519686381456 │
│ 66CB4E46EF170015 │             2869480 │         887197 │ 3.2343211259731492 │
└──────────────────┴─────────────────────┴────────────────┴────────────────────┘

The top 2 hashes have empty ALPN values (zero in the 4 least-significant bits) and an average response time an order of magnitude higher than the others.

Automatic Bot and L7 DDoS Protection

But how do we decide when exactly to run the ClickHouse queries to detect malicious bots? How do we block them? Frankly, running these queries manually is not very practical.

My colleague Maksym Stukalo built Tempesta WebShield – a lightweight Python daemon that analyzes Tempesta FW access logs in ClickHouse, classifies traffic, and dynamically blocks bad actors, including L7 DDoS bots, web scrapers, shopping bots, and other automated threats.

You can find a detailed installation and configuration guide in our wiki. For now, let’s walk through a simple configuration example and test it using the MHDDoS tool.

I have a current build of the Tempesta FW master branch, installed from source, so my /etc/tempesta-webshield/app.env contains the following lines:

PATH_TO_TFT_CONFIG="/root/tempesta/etc/tft/blocked.conf"
PATH_TO_TFH_CONFIG="/root/tempesta/etc/tfh/blocked.conf"
TEMPESTA_EXECUTABLE_PATH="/root/tempesta/scripts/tempesta.sh"
TEMPESTA_CONFIG_PATH="/root/tempesta/etc/tempesta_fw.conf"

Next, I have ClickHouse installed on the local machine, so I configure the WebShield connection as follows:

CLICKHOUSE_HOST="127.0.0.1"
CLICKHOUSE_PORT=9000
CLICKHOUSE_USER="default"
CLICKHOUSE_PASSWORD=""
CLICKHOUSE_TABLE_NAME="access_log"
CLICKHOUSE_DATABASE="default"

For a simple test, we can enable only one detector – requests per second (RPS) grouped by TLS fingerprints (tft). This detector will block bad bots using those same fingerprints at the Tempesta FW layer:

DETECTORS=["tft_rps"]
BLOCKING_TYPES=["tft"]

Besides these detectors and filters, WebShield supports HTTP fingerprints, IP addresses, filtering by accumulated response time, number of HTTP error codes, and geolocation. (See the wiki for the full list of detectors and blockers.) For the sake of simplicity, we disable training mode:

TRAINING_MODE="off"

and define the threshold as 10 standard deviations:

DETECTOR_TFT_RPS_DEFAULT_THRESHOLD=10
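Conceptually, such a detector compares the current per-fingerprint request rate against statistics learned from past traffic. A hypothetical sketch, not WebShield's actual code:

```python
from statistics import mean, stdev

def rps_outliers(baseline_rps, current_rps, threshold=10):
    """Flag fingerprints whose current RPS exceeds the baseline
    mean + threshold * stddev, where the baseline is learned from
    past (presumed legitimate) traffic."""
    mu, sigma = mean(baseline_rps), stdev(baseline_rps)
    limit = mu + threshold * sigma
    return [fp for fp, rps in current_rps.items() if rps > limit]

# baseline: per-fingerprint RPS samples from a quiet period (made-up numbers)
baseline = [8, 12, 10, 9, 11, 10, 10]
suspects = rps_outliers(baseline, {"66cb...": 12, "398a...": 40000})
```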

Make WebShield fetch new records from ClickHouse every 3 seconds:

BLOCKING_WINDOW_DURATION_SEC=3

A threshold of 10 standard deviations is generally good, but legitimate traffic spikes may still occur depending on the time of day, the day of the week, or business activities (e.g., a marketing campaign). To prevent false positives, WebShield can fetch past users who should never be blocked. We call such users persistent. Enable the option as:

PERSISTENT_USERS_ALLOW=True
# 1 week: 3600 * 24 * 7 = 604800
PERSISTENT_USERS_WINDOW_OFFSET_MIN=604800
PERSISTENT_USERS_WINDOW_DURATION_MIN=604800

This way, we do not block any users who accessed our service within the last week.

However, false positives may still happen, so WebShield can block clients only for a certain period of time:

BLOCKING_TIME_MIN=1
BLOCKING_RELEASE_TIME_MIN=1

In this case, any client exceeding the threshold is blocked for 1 minute, and WebShield checks once per minute whether it should unblock them.
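The timed-blocking behavior can be modeled as follows (an illustrative sketch with an explicit clock; WebShield's actual mechanism may differ):

```python
class TimedBlocklist:
    """Block a client for a fixed number of minutes, then auto-release."""
    def __init__(self, block_minutes):
        self.block_seconds = block_minutes * 60
        self.blocked_until = {}          # fingerprint -> release timestamp

    def block(self, fp, now):
        self.blocked_until[fp] = now + self.block_seconds

    def sweep(self, now):
        """Periodic release pass, as a once-per-minute release timer would do."""
        for fp in [f for f, t in self.blocked_until.items() if t <= now]:
            del self.blocked_until[fp]

    def is_blocked(self, fp, now):
        return fp in self.blocked_until and self.blocked_until[fp] > now
```

In Tempesta FW terms, block() corresponds to writing a `hash <fp> 0 0;` rule and sweep() to removing it again on the next release pass.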

Everything is almost ready; the only remaining step is to run Tempesta FW with a configuration like the following (the options relevant to this article are the tft, tfh, and access_log directives and the http_chain block rule):

cache 2;

listen 192.168.100.4:443 proto=https,h2;
listen 192.168.100.4:80;

srv_group default {
    server 192.168.100.4:8000;
}
vhost default {
    proxy_pass default;
}

tls_certificate /root/tempesta/etc/tfw-root.crt;
tls_certificate_key /root/tempesta/etc/tfw-root.key;
tls_match_any_server_name;

block_action attack reply;
block_action error reply;

tft {
    !include /root/tempesta/etc/tft
}
tfh {
    !include /root/tempesta/etc/tfh
}

access_log mmap logger_config=/root/tempesta/etc/tfw_logger.json;

http_chain {
    uri == "*/xmlrpc.php" -> block;
    -> default;
}

Now start WebShield:

$ python3 app.py -c /etc/tempesta-webshield/app.env -l DEBUG
[2025-11-19 21:06:14,465][webshield][INFO]: Found protected user agents. Total user agents: `0`

Generate some HTTP traffic to let WebShield learn the baseline mean, then run MHDDoS (100 connections and 1000 requests per connection for 180 seconds):

MHDDoS$ ./start.py GET https://tempesta-tech.com/ 1 100 proxies.list 1000 180

Check that the access_log table is receiving more and more records:

:) SELECT hex(tft), status, count() AS cnt
   FROM access_log GROUP BY tft, status;

┌─hex(tft)─────────┬─status─┬────cnt─┐
│ 66CB9FD8EF170010 │    200 │   2923 │
│ 66CB9FD8EF170010 │    502 │ 318841 │
│ C6610B269D970015 │    304 │      3 │
│ 66CB0DAC148C0010 │    504 │      1 │
│ C6610B269D970015 │    200 │     43 │
│ 0D378F00A5F5001D │    200 │   1051 │
│ 66CB0DAC148C0010 │    403 │      1 │
│ C6610B269D970015 │    206 │      2 │
│ 908BA2B344C30015 │    200 │     11 │
│ 66CB9FD8EF170010 │    400 │    853 │
│ 66CB0DAC148C0010 │    200 │     84 │
│ 398A4371C0320015 │    304 │      1 │
│ 66CB0DAC148C0010 │    502 │      2 │
│ 398A4371C0320015 │    200 │      6 │
│ 66CB8F0004E3001A │    200 │      6 │
│ 66CB9FD8EF170010 │    403 │   1326 │
└──────────────────┴────────┴────────┘

Shortly, you should see a report message from WebShield:

[2025-11-19 21:07:36,828][webshield][WARNING]: Blocked user User(tft=['66cb9fd8ef170010'], tfh=['0', 'f589c3f000c0a00'], ip=[IPv4Address('192.168.100.1')], value=37994, type=None, blocked_at=1763586454) by tft

Tempesta FW configuration should now contain a new filtration rule:

$ cat /root/tempesta/etc/tft/blocked.conf 
hash 66cb9fd8ef170010 0 0;

The access_log table should no longer receive new records with TLS fingerprint 66cb9fd8ef170010.

Now, if you stop MHDDoS and wait for 1 minute, you’ll see that WebShield automatically removes the filtration rule:

$ cat /root/tempesta/etc/tft/blocked.conf
$
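The blocked.conf file shown above is plain text, so generating and clearing it programmatically is straightforward. A hypothetical sketch (the `hash <tft> 0 0;` rule format is taken from the example above; the helper name is ours):

```python
import os
import tempfile

def write_block_rules(path: str, fingerprints: list[str]) -> None:
    """Atomically (re)write `hash <tft> 0 0;` rules; an empty list clears the file."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        for tft in fingerprints:
            f.write(f"hash {tft} 0 0;\n")
    os.replace(tmp, path)  # atomic rename: readers never see a partial file

# Block one fingerprint, then lift the block.
write_block_rules("blocked.conf", ["66cb9fd8ef170010"])
write_block_rules("blocked.conf", [])
```

Writing to a temporary file and renaming it over the target matters here, because Tempesta FW may re-read the rule file while it is being updated.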

HTTP Floods

With the MHDDoS test we were limited in the number of source IP addresses, so we relied on TLS fingerprints only. However, while working on this article, our website actually experienced a small real-world HTTP flood attack involving thousands of different IP addresses.

This incident clearly demonstrated a practical WebShield use case, even for setups that already have solid rate limiting.

Our website is a typical SMB informational site about our products and services. Other than new blog posts, there is nothing that could generate anywhere near 100 RPS.

A normal browser visiting our home page for the first time issues around 50–80 HTTP requests and 6–10 TCP connections (we serve both HTTP/1.1 and HTTP/2). So rate limits like:

frang_limits {
  request_rate 200 20;
  request_burst 200;

  tcp_connection_rate 20 60;
  tcp_connection_burst 20;
  concurrent_tcp_connections 20;
}

look reasonably generous. These allow bursts of up to 200 RPS and 20 CPS (connections per second), while keeping sliding-window limits at 200 requests per 20 seconds and 20 new TCP connections per 60 seconds, effectively around 10 RPS and 0.3(3) CPS.
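The sustained rates implied by these sliding windows are simple arithmetic over the values in the frang_limits block:

```python
# Sliding-window limits from the frang_limits block above.
requests, req_window_s = 200, 20   # request_rate 200 20
conns, conn_window_s = 20, 60      # tcp_connection_rate 20 60

print(requests / req_window_s)     # 10.0 RPS sustained per client
print(conns / conn_window_s)       # ~0.333 CPS sustained per client
```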

The thing is that 10 RPS and ⅓ CPS can overwhelm your infrastructure when a botnet scales to tens or hundreds of thousands of sources. You can tune the Linux TCP/IP stack to handle the volume (which we, naturally, adjusted only after the incident 🙂 ), but there is no point in wasting resources on clearly parasitic traffic.

WebShield identifies traffic anomalies in the aggregate, either overall or per client group, even if each individual IP stays well below the limits, and blocks the offending clients automatically. And yes, this HTTP flood was very simple and exposed only 2 distinct TLS fingerprints.

As a bonus piece of advice to avoid silly operational mistakes: always keep your monitoring and ClickHouse database separate from the edge servers.

Not Smart Enough, Not Fast Enough

A system administrator with experience in DDoS attacks and malicious bots may immediately notice several issues with this approach:

  1. The statistical model is very basic, which makes false positives possible and isn’t strong enough to counter more advanced bot behavior.
  2. There are plenty of open-source tools capable of forging both HTTP and TLS fingerprints – for example, curl_cffi.
  3. Even with high-performance access-log ingestion into ClickHouse, fetching data may still introduce too much delay for timely blocking.

Detection Accuracy and Your Secret Weapon

The current version of WebShield is 0.1, i.e. a very early release, intended primarily to explore how access-log analytics can be used to detect bot attacks. Future versions of WebShield will introduce more advanced detection mechanisms.

But you already have a secret weapon: your data. You know which user agents and operating systems your legitimate clients use, how long they stay on your site, which pages they visit, in what sequence, and with what delays. An attacker knows none of this.

WebShield is a small open-source Python project, which means you can easily extend it with your own custom logic. (And yes — GPLv2 allows customization and does not require you to publish security-sensitive modifications.)

A custom solution becomes unique to your specific application. Its strength is exactly that uniqueness: an attacker must spend disproportionately more effort to break your setup. Today, attackers can build a bypass for Cloudflare or Akamai and sell it to everyone. The economics are simple: bypassing a major vendor is difficult, but once achieved, it compromises thousands of companies at once. You don’t need a “perfect” solution. You need one that is expensive enough to bypass compared to the revenue an attacker could make by defeating it.

For large vendors, bypass kits eventually become cheaper.

For a custom setup, the economics work in your favor.

Recently, the Shopify community discussed the growing bot problem. For an e-commerce site, for example, you can incorporate internal database queries into the WebShield classifier to detect bots creating fake accounts.

Fingerprint Impersonalization

Selenium, Playwright and curl_cffi are just a few of the many open-source tools for building modern bots, such as anti_bot_scraper. These tools not only mimic regular browsers but can also rotate the fingerprints they expose. In the earlier MHDDoS example we saw 3 different TLS fingerprints.

The rotated fingerprints we saw in that case were still distinctive enough to stand out against our normal traffic. In other words, you may suddenly get a large volume of requests claiming to be “Chrome 123 on macOS,” even though this fingerprint is rare or nonexistent in your normal traffic. Or you might mostly serve mobile users and then observe an instant spike of “Windows 10 desktop browsers.”
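One way to exploit this knowledge is to compare the fingerprint mix of a recent traffic window against your historical baseline. A hypothetical sketch (not a WebShield feature today; the fingerprint labels and thresholds are illustrative):

```python
from collections import Counter

def rare_fingerprints(baseline: Counter, window: Counter,
                      min_share: float = 0.001, min_hits: int = 100) -> list[str]:
    """Fingerprints prominent in the current window but rare (below
    `min_share` of requests) or absent in the historical baseline."""
    total = sum(baseline.values()) or 1
    return [fp for fp, n in window.most_common()
            if n > min_hits and baseline[fp] / total < min_share]

baseline = Counter({"chrome_mac": 90_000, "safari_ios": 60_000, "firefox_win": 9_000})
window = Counter({"chrome_mac": 400, "win10_desktop": 5_000})  # sudden desktop spike
print(rare_fingerprints(baseline, window))  # ['win10_desktop']
```

Here a mobile-heavy site suddenly sees thousands of “Windows 10 desktop” requests, a fingerprint that is essentially absent from its history, so that group is flagged while normal Chrome-on-macOS traffic passes.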

The point is simple: you know your traffic; the attacker doesn’t.

But fingerprint variability is still a real problem. The solution is behavior analysis based on your data: the sequence of requested pages, the time delays between them and other high-level patterns.
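As a toy illustration of the delay-based angle (a real feature set would be much richer): scripted clients often show near-constant inter-request gaps, while human browsing is bursty with long pauses. The coefficient of variation of the gaps separates the two:

```python
import statistics

def delay_regularity(timestamps: list[float]) -> float:
    """Coefficient of variation of inter-request delays.
    Values near 0 suggest machine-like, evenly spaced requests."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean = statistics.fmean(gaps)
    return statistics.pstdev(gaps) / mean if mean else 0.0

bot = [i * 2.0 for i in range(20)]                  # a request exactly every 2 s
human = [0, 0.5, 0.9, 7.0, 7.3, 30.0, 31.1, 90.0]   # bursty, with long pauses
print(delay_regularity(bot))    # 0.0
print(delay_regularity(human))  # > 1, highly irregular
```

A scraper that randomizes its delays defeats this single metric, which is exactly why such signals work best combined with page-sequence and other behavioral features.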

One particularly interesting angle is that browsers handle HTTP/2 streams in subtly different ways, even across minor versions of the same browser. It’s relatively easy to fake a user-agent string or a set of headers, but it’s much harder to mimic this deeper protocol behavior. However, an HTTP/2 stream fingerprint cannot be computed from the very first request; you need a sequence of requests, which again makes it part of behavioral analysis rather than a simple per-request filter.

Incident Response Time

This problem is especially critical for L7 DDoS attacks, since many bots, especially scrapers, intentionally introduce long delays to evade detection.

We are addressing the response-time issue by extending tfw_logger to feed records directly into a machine-learning daemon, reducing the delay between observation and action.

Conclusion

While working on this article, we realized how far behind open-source bot protection solutions are compared to the open-source tools designed to scrape websites and evade detection.

Tempesta FW and WebShield provide a toolkit for analyzing live web traffic and dynamically blocking bots. These are early versions, but I believe this is a step toward smart bot protection – without user-unfriendly challenges (e.g., CAPTCHA) and without simplistic “proof-of-work” mechanisms that attackers bypass easily.

We’d be glad to hear about your experience with bad bots and feature requests inspired by real-world incidents.

We are hiring! Take a look at our opportunities

