Saturday, December 9, 2023
HomeBig DataFraud Detection with Cloudera Stream Processing Half 1

Fraud Detection with Cloudera Stream Processing Half 1

In a earlier weblog of this collection, Turning Streams Into Knowledge Merchandise, we talked in regards to the elevated want for decreasing the latency between knowledge era/ingestion and producing analytical outcomes and insights from this knowledge. We mentioned how Cloudera Stream Processing (CSP) with Apache Kafka and Apache Flink might be used to course of this knowledge in actual time and at scale. On this weblog we’ll present an actual instance of how that’s performed, how we will use CSP to carry out real-time fraud detection.

Constructing real-time streaming analytics knowledge pipelines requires the power to course of knowledge within the stream. A important prerequisite for in-stream processing is having the aptitude to gather and transfer the information as it’s being generated on the level of origin. That is what we name the first-mile downside. This weblog shall be revealed in two elements. Partly one we’ll look into how Cloudera DataFlow powered by Apache NiFi solves the first-mile downside by making it simple and environment friendly to purchase, remodel, and transfer knowledge in order that we will allow streaming analytics use instances with little or no effort. We will even briefly focus on some great benefits of operating this stream in a cloud-native Kubernetes deployment of Cloudera DataFlow.

Partly two we’ll discover how we will run real-time streaming analytics utilizing Apache Flink, and we’ll use Cloudera SQL Stream Builder GUI to simply create streaming jobs utilizing solely SQL language (no Java/Scala coding required). We will even use the knowledge produced by the streaming analytics jobs to feed completely different downstream techniques and dashboards. 

The use case

Fraud detection is a good instance of a time-critical use case for us to discover. All of us have been by means of a state of affairs the place the main points of our bank card, or the cardboard of somebody we all know, has been compromised and illegitimate transactions have been charged to the cardboard. To reduce the harm in that state of affairs, the bank card firm should be capable to determine potential fraud instantly in order that it could possibly block the cardboard and get in touch with the consumer to confirm the transactions and probably subject a brand new card to exchange the compromised one.

The cardboard transaction knowledge often comes from event-driven sources, the place new knowledge arrives as card purchases occur in the true world. In addition to the streaming knowledge although, we even have conventional knowledge shops (databases, key-value shops, object shops, and so forth.) containing knowledge that will have for use to complement the streaming knowledge. In our use case, the streaming knowledge doesn’t comprise account and consumer particulars, so we should be a part of the streams with the reference knowledge to supply all the knowledge we have to test in opposition to every potential fraudulent transaction.

Relying on the downstream makes use of of the knowledge produced we could must retailer the information in numerous codecs: produce the listing of potential fraudulent transactions to a Kafka matter in order that notification techniques can motion them at once; save statistics in a relational or operational dashboard, for additional analytics or to feed dashboards; or persist the stream of uncooked transactions to a sturdy long-term storage for future reference and extra analytics.

Our instance on this weblog will use the performance inside Cloudera DataFlow and CDP to implement the next:

  1. Apache NiFi in Cloudera DataFlow will learn a stream of transactions despatched over the community.
  2. For every transaction, NiFi makes a name to a manufacturing mannequin in Cloudera Machine Studying (CML) to attain the fraud potential of the transaction.
  3. If the fraud rating is above a sure threshold, NiFi instantly routes the transaction to a Kafka matter that’s subscribed by notification techniques that may set off the suitable actions.
  4. The scored transactions are written to the Kafka matter that may feed the real-time analytics course of that runs on Apache Flink.
  5. The transaction knowledge augmented with the rating can also be endured to an Apache Kudu database for later querying and feed of the fraud dashboard.
  6. Utilizing SQL Stream Builder (SSB), we use steady streaming SQL to research the stream of transactions and detect potential fraud based mostly on the geographical location of the purchases.
  7. The recognized fraudulent transactions are written to a different Kafka matter that feeds the system that may take the mandatory actions.
  8. The streaming SQL job additionally saves the fraud detections to the Kudu database.
  9. A dashboard feeds from the Kudu database to indicate fraud abstract statistics.

Buying with Cloudera DataFlow

Apache NiFi is a part of Cloudera DataFlow that makes it simple to accumulate knowledge in your use instances and implement the mandatory pipelines to cleanse, remodel, and feed your stream processing workflows. With greater than 300 processors obtainable out of the field, it may be used to carry out common knowledge distribution, buying and processing any sort of information, from and to nearly any sort of supply or sink.

On this use case we created a comparatively easy NiFi stream that implements all of the operations from steps one by means of 5 above, and we’ll describe these operations in additional element under.

In our use case, we’re processing monetary transaction knowledge from an exterior agent. This agent is sending every transaction because it occurs to a community deal with. Every transaction comprises the next data:

  • The transaction time stamp
  • The ID of the related account
  • A novel transaction ID
  • The transaction quantity
  • The geographical coordinates of the place the transaction occurred (latitude and longitude)

The transaction message is in JSON format as appears like the instance under:


  "ts": "2022-06-21 11:17:26",

  "account_id": "716",

  "transaction_id": "e933787c-f0ff-11ec-8cad-acde48001122",

  "quantity": 1926,

  "lat": -35.40439536601375,

  "lon": 174.68080620053922


NiFi is ready to create community listeners to obtain knowledge coming over the community. For this instance we will merely drag and drop a ListenUDP processor into the NiFi canvas and configure it with the specified port. It’s doable to parameterize the configuration of processors to make flows reusable. On this case we outlined a parameter known as #{enter.udp.port}, which we will later set to the precise port we’d like.


Describing the information with a schema

A schema is a doc that describes the construction of the information. When sending and receiving knowledge throughout a number of purposes in your surroundings and even processors in a NiFi stream, it’s helpful to have a repository the place the schema for all various kinds of knowledge are centrally managed and saved. This makes it simpler for purposes to speak to one another.

Cloudera Knowledge Platform (CDP) comes with a Schema Registry service. For our pattern use case, we’ve got saved the schema for our transaction knowledge within the Schema Registry service and have configured our NiFi stream to make use of the right schema identify. NiFi is built-in with Schema Registry and it’ll mechanically connect with it to retrieve the schema definition at any time when wanted all through the stream.

The trail that the information takes in a NiFi stream is decided by visible connections between the completely different processors. Right here, for instance, the information acquired beforehand by the ListenUDP processor is “tagged” with the identify of the schema that we wish to use: “transaction.”

Scoring and routing transactions

We skilled and constructed a machine studying (ML) mannequin utilizing Cloudera Machine Studying (CML) to attain every transaction based on their potential to be fraudulent. CML supplies a service with a REST endpoint that we will use to carry out scoring. As the information flows by means of the NiFi knowledge stream, we wish to name the ML mannequin service for knowledge factors to get the fraud rating for every certainly one of them.

We use the NiFi’s LookupRecord for this, which permits lookups in opposition to a REST service. The response from the CML mannequin comprises a fraud rating, represented by an actual quantity between zero and one.

The output of the LookupRecord processor, which comprises the unique transaction knowledge merged with the response from the ML mannequin, was then related to a really helpful processor in NiFi: the QueryRecord processor.

The QueryRecord processor permits you to outline a number of outputs for the processor and affiliate a SQL question with every of them. It applies the SQL question to the information that’s streaming by means of the processor and sends the outcomes of every question to the related output.

On this stream we outlined three SQL queries to run concurrently on this processor:


Word that some processors additionally outline further outputs, like “failure,” “retry,” and so forth., as a way to outline your individual error-handling logic in your flows.

Feeding streams to different techniques

At this level of the stream we’ve got already enriched our stream with the ML mannequin’s fraud rating and reworked the streams based on what we’d like downstream. All that’s left to finish our knowledge ingestion is to ship the information to Kafka, which we’ll use to feed our real-time analytical course of, and save the transactions to a Kudu desk, which we’ll later use to feed our dashboard, in addition to for different non-real-time analytical processes down the road.

Apache Kafka and Apache Kudu are additionally a part of CDP, and it’s quite simple to configure the Kafka- and Kudu-specific processors to finish the duty for us.

Working the information stream natively on the cloud

As soon as the NiFi stream is constructed it may be executed in any NiFi deployment you might need. Cloudera DataFlow for the Public Cloud (CDF-PC) supplies a cloud-native elastic stream runtime that may run flows effectively.

In comparison with fixed-size NiFi clusters, the CDF’s cloud-native stream runtime has a number of benefits:

  • You don’t must handle NiFi clusters. You possibly can merely connect with the CDF console, add the stream definition, and execute it. The required NiFi service is mechanically instantiated as a Kubernetes service to execute the stream, transparently to the consumer.
  • It supplies higher useful resource isolation between flows.
  • Circulation executions can auto-scale up and down to make sure the correct quantity of sources to deal with the present quantity of information being processed. This avoids useful resource hunger and likewise saves prices by deallocating pointless sources when they’re now not used.
  • Constructed-in monitoring with user-defined KPIs that may be tailor-made to every particular stream are completely different granularities (system, stream, processor, connection, and so forth.).

Safe inbound connections

Along with the above, configuring safe community endpoints to behave as ingress gateways is a notoriously tough downside to resolve within the cloud, and the steps differ with every cloud supplier. 

It requires establishing load balancers, DNS data, certificates, and keystore administration. 

CDF-PC abstracts away these complexities with the inbound connections characteristic, which permits the consumer to create an inbound connection endpoint by simply offering the specified endpoint identify and port quantity.

Parameterized and customizable deployments

Upon the stream deployment you possibly can outline parameters for the stream execution and likewise select the scale and auto-scaling traits of the stream:


Native monitoring and alerting

Customized KPIs might be outlined to observe the elements of the stream which can be essential to you. Alerts might be additionally outlined to generate notifications when the configured thresholds are crossed:

After the deployment the metrics collected for the outlined KPI might be monitored on the CDF dashboard:

Cloudera DataFlow additionally supplies direct entry to the NiFi canvas for the stream as a way to test particulars of the execution or troubleshoot points, if mandatory. All of the performance from the GUI can also be obtainable programmatically, both by means of the CDP CLI or the CDF API. The method of making and managing stream might be totally automated and built-in with CD/CI pipelines.


Amassing knowledge on the level of origination because it will get generated, and shortly making it obtainable on the analytical platform, is important for the success of any challenge that requires knowledge streams to be processed in actual time. On this weblog we confirmed how Cloudera DataFlow makes it simple to create, check, and deploy knowledge pipelines within the cloud.

Apache NiFi’s graphical consumer interface and richness of processors permits customers to create easy and complicated knowledge flows with out having to jot down code. The interactive expertise makes it very simple to check and troubleshoot flows in the course of the improvement course of.

Cloudera DataFlow’s stream runtime provides robustness and effectivity to the execution of the flows in manufacturing in a cloud-native and elastic surroundings, which allows it to broaden and shrink to accommodate the workload demand.

Within the half two of this weblog we’ll take a look at how Cloudera Stream Processing (CSP) can be utilized to finish the implementation of our fraud detection use case, performing real-time streaming analytics on the information that we’ve got simply ingested.

What’s the quickest option to be taught extra about Cloudera DataFlow and take it for a spin? First, go to our new Cloudera DataFlow residence web page. Then, take our interactive product tour or join a free trial



Please enter your comment!
Please enter your name here

Most Popular

Recent Comments