Phishing Html Code



by Sven Morgenroth

Phishing is a method of trying to gather personal information using deceptive e-mails and websites. Here's what you need to know about this venerable, but increasingly sophisticated, form of cyber. Phishing is a method of trying to gather personal information using deceptive e-mails and websites. Here's what you need to know about this venerable, but increasingly sophisticated, form of cyber.

Introduction

Last year, an intriguing blog post about an innovative phishing technique was making the news. In this white paper, we piece together what happened, and deobfuscate the JavaScript code that was responsible for the clever parts of the professional – yet fake – website.

What Happened?

The phishing website in question ticked all the Phishing 101 checkboxes:

  • The website name looked similar to the original (tradeit.cash instead of skins.cash)
  • Its design was almost identical, except for the logo which was replaced with a similar one
  • It had a valid SSL certificate

As a side note, this warning popup is displayed on skins.cash (the original website).

Judging from these three factors, the fake tradeit.cash website appeared to be just like any other phishing page. When you tried to log in to the website, by clicking on the Steam Login button, a browser popup would be displayed that prompted the user to connect their Steam account with tradeit.cash. It is at this point that a user would supply their username and password, which would in turn be logged by the scammer. However, the address bar in the browser window clearly displayed 'https://steamcommunity.com/' (a valid Steam domain that takes care of Steam's SSO logins, among other things).

The Scammer Used a Clever Trick

That was confusing, to say the least. If tradeit.cash wanted to get the victim's password, this was clearly not the way to go about it, as Steam's SSO login can be considered secure and therefore the only party that would receive the password was Steam itself, not the scammer. However, if you hovered over the Chrome icon in his task bar, you would notice that only one instance of the browser was running. The screenshot shows two Chrome instances, the result that was actually expected, as seen on skins.cash (the original website).

Instead of the two anticipated instances, only one window was open. On further inspection, you can notice that the https://steamcommunity.com popup wasn't really a new browser window, just an HTML element of tradeit.cash.

This was what would have allowed the scammer to trick users, who would naturally assume they were looking at a popup from Steam's SSO portal (when, in actual fact they were just about to enter their username and password directly into a form field on the tradeit.cash phishing website). If you're anything like us, you'll be keen to know more about the HTML and JavaScript code that allowed the scammer to create such a realistic-looking browser window. Be warned: the code responsible for it is heavily obfuscated and requires some painfully tedious reconstruction in order to read it… So, let's get into it!

A Word Before We Start Deobfuscating

Here is what we can find out by deobfuscating the JavaScript code on the page:

  • How the obfuscation works
  • How the window-in-window trick was implemented

This will help us get a better understanding on how the attacker approached different problems during development of the phishing page, such as whether the user visits the page with a mobile device, which would make the phishing attempt incredibly obvious, as you don't have browser windows on most mobile devices.

However, there is a catch. Knowing either of these things doesn't help prevent the scammer from using this technique. In fact, each time the scammer sets up a new phishing page, they can use another obfuscation tool which makes it hard for anti-virus tools to detect the malicious code using a signature-based detection technique, even if we know its cleartext version.

Deobfuscating JavaScript code is mostly useful if you try to detect bugs, or in the specific case of malicious code that was planted on an infected website in order to find out what it does. However, if you are interested in this particular implementation of the technique, which is admittedly quite clever and detailed, read on.

How to Read & Make Sense of This White Paper

This post includes a lot of code snippets and technical explanations. This is how deobfuscation works. At times you might not understand what a certain object or string is used for until you've read all of the code and conducted some tests.

So, in the next section, we've linked each section of obfuscated code to its deobfuscated counterpart. If you don't immediately understand a piece of code, simply scroll to the end of the section and click on deobfuscate. This will make your browser jump to the next step in the deobfuscation process. If you click on obfuscate you will get back to the previous section. However we'd still recommend reading the post in the correct order.

It's worth noting that the obfuscation method used here has a fundamental flaw. It will become obvious how later, and this flaw will allow us to break the second and third obfuscation layer with ease. We could break the obfuscation immediately if we abused this flaw, but then we wouldn't learn anything..

Brief Overview of the Obfuscated Code

Fortunately, TehAurum, the author of the post we mentioned also posted a Github gist containing the obfuscated code. There are quite a few components that are responsible for running the script. Let's examine each one.

First Variables at the Top of the Script

Let's get the 'use strict' in line 1 out of the way first. This will most likely not interfere with our deobfuscation procedure.

Among other things, it's just a common way to ensure that you don't accidentally create global variables when you want to use them only in a specific scope, like a function or a loop. In other words, it forces you to use the var, const or let keyword when you declare a variable.

Next, there are weird-looking comments like this: /** @type {string} */. They are most likely instructions for a documentation generator, such as JSDoc, that allows you to parse the script and automatically generate documentation for the code. At first, this sounds like a good thing, as it could later reveal the types of the return values of the obfuscated functions. However, there's simply no way for us to know whether these are correct. So we will resist the urge to use them and will remove them completely later.

This leaves us with the actual variables. These strange looking names probably won't decode to something meaningful, and are likely just some random characters generated to make it harder for a human to read and track. Let's look at their values. The first looks like a base64-encoded string, while the next two look like hex encoding and the final one consists of the string 'true'.

First Array that Might Contain the Actual Payload

The rehb04de array contains a lot of gibberish: something that looks like native JavaScript function names (line 21: split, line 23: replace), part of a regular expression (line 24: w+), some numbers and hex encoded bytes separated by pipe (|) symbols (line 20, line 22), and an interesting string on line 25.

Line 25 is interesting because it looks like JavaScript, or more precisely, an abstract version of a variable declaration with an array. For now, because of its format, we will assume that this might contain an obfuscated payload.

A Simple Shuffle Function

Finally, we have some code where we can immediately see what it does. What we have here is an Immediately Invoked Function Expression (IIFE).

An IIFE is a JavaScript function that is immediately executed. This works by first encapsulating the function in parentheses and then calling it with or without parameters. In our case, the two parameters are the rehbr04de array (the one with the obfuscated payload) and the number 263. rehbr04de is now called data and the variable i contains the number 263.

Now to the interesting part. The write function is defined on line 38 and called on line 43 with the value ++i. So the isLE variable in the write function is now 264.

Let's see what the write function actually does with the number. There is a for loop on line 39 that decreases the value of isLE by 1 on each iteration. It will eventually stop when isLE has the value 0. Line 40 contains the actual code, data['push'](data['shift']());, which will remove the first element of the array (shift) and then add that element to the end of the array (push), effectively shuffling the content of the rehbr04de array and changing the order of its values. Let's move on to the next code block.

A Function that Allows You to Get an Array Element

On line 52, the rehbre04d function is defined. It has the parameters level and ai_test. However, ai_test is never used in the function.

This was done in order to suggest either that the function does something different than its true purpose, if you just quickly look at its parameters, or it exists simply to confuse someone who tries to deobfuscate it. In either case, it's not a very convincing attempt.

Now on to line 54. Here we have the code level = level - 0;, which is probably there to convert the variable level to an integer if it's a string. That would allow you, for example, to convert the string 0x5 to the integer 5. In the next line, the rowsOfColums variable contains a key of the rehbr04de array. Which key it will contain depends on the number in the level variable. After that the rehbr04de[level] value is returned.

An Object with Confusing Function Names

This block of code, and all the code that follows it, is inside an IIFE. Since it doesn't contain any parameters, we can treat it like it was there without the IIFE.

On line 61, an object is assigned to the rehbr2fb7b6$jscomp$0 variable. This object contains a few interesting functions, and if you were paying attention, some of the object keys were already present as values in the rehbr04de array. We'll explore that more later.

Let's now examine one of the functions. The first one has the key ALAuJ on line 62. It has the name handleSlide and two parameters, isSlidingUp and $cont. However, the function's content doesn't have much to do with its name. All it does is add the isSlidingUp variable to the $cont variable and return the result. The confusing name was probably designed to make it harder for the average user to uncover what these functions are doing just by looking at them. Now, it looks like they are just some JavaScript functions responsible for UI animation or to process user input from a slider. There are different functions with different purposes here. We'll examine them later.

Line 77 contains the DGmUY element, that consists of the first element (the one with the key 0) of the rehbr04de array. You might think that this was the string HpRsu, but keep in mind that we've shuffled the array earlier. We'll check which value this is later.

The Ouwac element in line 78 contains the string fromCharCode, which is often used when obfuscating JavaScript code. It returns the corresponding unicode symbol for any given number. For example, String.fromCharCode(120) would return the letter x. You can also pass an array of numbers instead of a single number.

So far, we've gained a basic understanding of what this code does, but we haven't yet replaced or looked up any values. We'll achieve that once we've read through the full script and cleaned it up to make it easier to read.

Another Array with References and a Preview of What's to Come

Let's get to the next array, rehbr139530$jscomp$0. It gives us a slight hint of what will happen to the array elements in the rehbr04de array that contains the pipe (|) character.

Since we have values separated by it, and there is both the pipe character as well as the keyword split, we can guess that the code will do something like this, which would turn the string into an array: arr = '2|0|4|3|1'.split('|'). Other than this, the array is filled with values from the rehbr2fb7b6$jscomp$0 object. We can also see the strings b and g. If we were right in our guess before, these strings may actually be part of a regular expression.

Old Functions with New Names

In this code snippet, you see a function call on line 99. We can guess that this function will just use the eval function on the second parameter, which is the function with the parameters app, index, key, a, fn and result.

We'll know for certain later, but it looks like that's what's happening. After that, an object called args is created. The format you see looks surprisingly similar to the one of the functions we've observed before. And, sure enough, there is also a handleSlide function on line 116. Also, if you look closely, it calls functions from the rehbr2fb7b6$jscomp$0 object. These functions are basically an alias for the ones we've examined above. This was most likely done to confuse the person that deobfuscates the code and further bloat the codebase in order to make it harder to keep track of it.

So the question is: What are the values of these function parameters?

Let's find out by going to the end of the file.

As you can see, the first one is the first value of the rehbr139530$jscomp$0 array. We haven't found out what it is yet. After that, there are some numbers, the result of a function call and an empty object. So far, this isn't very useful.

The First Function that will Help Deobfuscate the Code

Finally, there is another function. Yet again, it looks bloated and confusing. It returns the result of one of the functions in the args object.

Then we've got a function call in line 128. Below, there is a question mark (?) and a colon (:). This is short for if (…) { … } else { … }. So, if the result of the function in line 128 returns 'true', the result of line 129 will be returned; if it is 'false', the result of line 130 will be returned. In this case, it calls itself again, with a new data object based on the current one and the index variable. In case you were wondering, the index variable was one of the function parameters in line 99 in the previous section and its content was 62.

An Interesting Switch/Case Statement

Finally, we get to the interesting part. If a function call containing a regular expression does not result in a match, the code beginning in line 140 is executed.

From our previous guess, we can assume that the variable callbackVals will be filled with the result of '2|0|4|3|1'.split('|'). The for loop in line 143 will essentially loop forever, since the condition after the first semicolon (;) will never be false. That is, until it hits the break in line 171. On the first iteration, it will get the first element of callbackVals. In this case, it's the string '2'. The continue keyword makes the for loop continue with the next iteration. The next string is '0', then '4', '3' and finally '1'. Since at that point the code runs out of callbackVals array elements, it will return undefined. And since undefined doesn't match any case in the switch statement, it will just continue to line 171 and break out of the loop.

By the time it ends the loop, it will have filled the result object, reassigned the a variable to contain an array with a function that will return any given key from the result object, reassigned the fn variable to a new function, and set the key variable to '1'.

The Final Loop

Websites

The final for loop will check if the a object contains a key. If it does, the variable app will be reassigned with a function that is a method of app itself.

We don't know yet what app is, but judging from the regular expression in line 179-188, we can assume the whole for loop will trigger a replace operation.

Cleaning Up the Code

Now let's rename some variables, clean up the code, and move some array keys. We want to mention that renaming variables is not without risk this early in the process (we're still in an early phase). While you can rename all the variables that you can see, there might be references to them in parts of the code we haven't deobfuscated yet. It's often better to use comments to describe variables, unless you can be certain that you can rename every instance of the variable, which is most likely at the end or sometimes in deobfuscated functions. However, we'll rename the variables for now, as it's much easier to follow along when they are appropriately named.

First Variables at the Top of the Script (First Cleanup)

Unfortunately, we can't rename these variables yet. We still have no idea what they do. There are two possibilities, since there aren't yet any visible references to them in the code:

  • They might be unused and are just there to confuse someone who deobfuscates the code
  • They are referenced in an obfuscated part of the script

Therefore, the only change we've made was removing the documentation generator comments.

But, don't worry, it will be much clearer in a moment what this code does!

DEOBFUSCATE | OBFUSCATE

First Array that Might Contain the Actual Payload (First Cleanup)

As we saw earlier, the values in the first array are shuffled around. During this initial cleanup, we will put them in the right order, add comments to the array keys from which we can guess what they do, and finally add some numbers to make looking up the array keys easier in future.

This still looks pretty confusing, but it will help us out. We put it into the right order by copying both the array and the shuffle function into the Google Chrome devtools console and running it. Obviously, you shouldn't run obfuscated code in the developer console on a site you're logged into, but about:blank should be fine. And please, don't use an editor configuration that automatically allows you to run JavaScript code while coding if you have unknown, probably malicious, code like this! Most of the time those editors will use Node.js and you can't know whether one part of the obfuscated code looks like this or not.

Phishing Html Code

Since it's in a try/catch block, the script won't break when you run it in a browser. But if you run it in a node environment, like in an editor that runs JavaScript, or simply in node.js in general (yes, I've observed someone running obfuscated code in node), it might just ruin your day.

A Simple Shuffle Function (First Cleanup)

It was already obvious what the shuffle function was doing, but the variable names were still incorrect. Therefore, we can rename them as in the screenshot below, and look at the first completely deobfuscated function. We still have no idea what code for the fake browser window looks, but we're getting closer, one step at a time.

I've commented out the function call of the IIFE, simply because we've already brought the keys of the first array into the right order. And that function call would shuffle it a second time, which would make the deobfuscation fail later on.

Function That Allows You to Get an Array Element (First Cleanup)

This function is also completely deobfuscated now. Yet again, we simply renamed some variables and added a comment.

We've seen what this function does before, and it will first convert the key to an integer. The whole code is littered with calls to this function. It often looks like this:

This will, thanks to the -0, fill the key variable with the number 3 and then return firstArray[3].

An Object with Confusing Function Names (First Cleanup)

We'll get to our first function object now. We've named it misnamedFunctionsObj due to the misleading names its functions have been given before. Now the names make sense, but the Object keys still have these weird looking names.

Unfortunately, at this point it wouldn't make much sense to rename them to something that is less confusing. That's simply because there are many references to this object and we had to change some other array elements as well. We can't be sure that this won't have any negative effects later on. For example, some array elements might be a base64-encoded string that was split up and used as key, which we could later need for another part of the deobfuscation.

This is not a very common thing, but then again, it doesn't hurt to be prepared for such a trick. It would have the additional advantage that anybody who deobfuscates the code couldn't rename the object keys without breaking the code, which makes it harder to follow along. However, since we get to name the functions, we can still quickly find out what this is doing. Otherwise, we would need to add a comment above each key.

DEOBFUSCATE | OBFUSCATE

Second Array (First Cleanup)

Now let's get to the second array. After cleaning it up, it is much clearer.

Most of the array elements are taken from the misnamedFunctionsObj object. Some of the keys are there in cleartext, while others were taken from the first array. In the next run, we'll see what exactly its content is.

DEOBFUSCATE | OBFUSCATE

A Second Function object (First Cleanup)

We've established that this object mostly contains references to other functions. This becomes clear after we clean up the code.

As you can see, almost all functions are taken from the misnamed functions object. We'll call this the secondFunctionObj and we've already renamed those functions that don't have an obfuscated function body. We have also cleaned up the function call a little bit as shown.

You can see that this exclusively consists of static numbers and elements from the second array.

DEOBFUSCATE | OBFUSCATE

First Function for Deobfuscation (First Cleanup)

Although matters are much clearer now, we still don't understand everything. There are too many functions whose purpose is unclear, since they are merely there as a reference. We will fix this in the next step. Until then, we can only guess what this function is doing.

DEOBFUSCATE |OBFUSCATE

An Interesting Switch/Case Statement (First Cleanup)

We changed quite a bit here. We didn't just rename variables, but we also brought each case into the right order.

It's way easier to find out what happens now, as it's in the correct sequence. After the next step, it will become even clearer. For now, we have too much bloat due to switch.

DEOBFUSCATE | OBFUSCATE

The Final Loop

Our final for loop doesn't change much in terms of structure. However, the function names make more sense now, which is a great start for our next step – replacing the references with the actual code.

DEOBFUSCATE | OBFUSCATE

Replacing All the References

This part will remove a lot of confusion. What we've done so far was preparation for this step. It helped us to get an understanding of what's going on. We'll try to replace as many references with their actual values as possible. What that means is that we'll condense a lot of the bloated codebase into something more readable and understandable. We can now remove some functions that aren't important to us anymore, like the shuffle function and even the getFromFirstArray function, as we will replace the function calls with their actual return value.

An Object With Confusing Function Names (Resolved References)

Now this looks far more comprehensible! We've already seen in the previous steps that this is more of an intermediate object to artificially increase the code base size.

Its content will eventually end up in the secondArray and the secondFunctionObj as we will see later. If we wanted to, we could do a third cleanup and remove this object, but we won't do this for now. We still have a lot of obfuscated code, and we don't know if it needs this object.

Second Array (Resolved References)

Here, we have the second array, with all its references resolved. It becomes much clearer now what will happen later on.

We are up to the point of engaging in decoding (fromCharCode) and making some replacements (replace, w+, b, g). We're one step closer to find out how the code works. Yet again, we could probably remove this, as we will replace all occurrences of secondArray[key] with their actual value. But we will keep it until the script is completely deobfuscated.

A Second Function Object (Resolved References)

Here we have our second functions object. We've commented out the references to the first functions object and replaced them with the actual function body.

This is another one of the objects we don't need anymore. We'll still keep it, just like the others. Above, we see the parameters app, index, key, a, fn and result. Let's find out what their actual values are.

It seems like the app parameter is what we believe to be our payload. The index is 62, the key variable is 120, a is an array consisting of the content of the string with the pipe characters. It was split up using the split function, so each value between the pipe characters is now an array element. Fn is the number 0 and result is just an empty object.

First Function for Deobfuscation (Deobfuscated)

If there was any doubt that replacing the references with their actual values or function bodies would make a big difference in terms of readability, this will convince you of the opposite.

This is a huge difference compared to the previous version of the function. But what does it do? Let's first take a look at what this function returns. There are two variables, first and second.

In order to get the value of the first variable:

  • We need to compare data, the only parameter of the function to index
    • As we've seen above, index is just the number 62
  • If the data parameter is smaller than 62, first becomes an empty string.
  • If it's larger than 62
    • We divide it by 62 and use the Math.floor function to make sure it's an integer
    • fn calls itself with the result. The return value becomes the new first.

Next, data is reassigned. Its new value is the value of the old data variable modulo 62. So let's say that the old value of data is 250. The new data variable will be 250 % 62, which is 2.

Now back to the second variable and the new data.

  • If it's larger than 35, we use the String.fromCharCode function on the value of data + 29.
  • If it's smaller, we use base36 to encode the data.

At first, all of these random magic numbers are quite confusing. Let's try to find out what's going on by looking at some of the components of the function:

  • Magic numbers 62, 35, 29
  • Modulo operator (%)
  • String.fromCharCode
  • Base36

What modulo does in this case is make sure that we get a number that's between 0 and 61. So, if our number is 68, it will become 6.

Next, we'll see if our new number in data is larger than 35. If it is, String.fromCharCode(data + 29) is returned. If it's not, data is converted to base36. That allows us to get the magic number 35 out of the way. If the number is 35 or smaller, it will be converted to base36. Base36 consists of the numbers 0-9 and the characters a-z (lowercase). This is interesting.

Now, if the number is larger than 35, it will be passed to the String.fromCharCode function and 29 will be added to it. Let's check out what numbers can be passed to String.fromCharCode. The highest number is 61 + 29 or 90, as the modulo operation from before caps data to 61. The smallest number is 36 + 29 or 65, as anything below 36 will be converted to base36. If we take a look at this ASCII table, we find out that 65 becomes an uppercase A and 90 becomes an uppercase Z.

So that's what this function is actually about!

We've got a function that converts a number to a string with the following character set: 0-9a-zA-Z. That explains the magic number 62, because what we have here is a function that converts a number to base62. Let's check whether we are correct with our assumption.

Gmail Phishing Html Code

I've declared the index variable and pasted the function into the Google Chrome developer console. Then I've used this online converter to convert the base62 number 'secret' (yes, that's a number, not a word) to base10, which resulted in the number 50071489099. And sure enough, it worked.

Uppercase and lowercase letters were swapped, but that shouldn't really matter. Now that we know what it does, we can check which data was fed into the function.

The Switch / Case Statement (Deobfuscated)

Now this makes a huge difference as well. It appears you don't actually need a switch / case statement to influence the order in which your code is executed. Instead, you can just write it into the right order at the start. Crazy.

So what does this bit of code do? Remember - result is an empty object and key is 120. We've seen that in line 185 in one of the previous sections. a is the array that resulted from splitting the very long string with the pipe characters. Now, the first number is 119 (key--). So the code fills the result object.

In the first run, the key is base62(119) which is the string 1V. Its value is either the 119th array element of a if it exists, or, if the a array has less than 119 array keys it becomes 1V as well. After that, the loop starts again, this time with 118. That goes on until the key is 0. We can check how the result object would look like now by adding console.log(result) in line 153 and running the code in the developer console. This is the result.

That looks promising! We get something that looks like hex, some function names, and even a var keyword. We're getting close!

After the result object was filled, some variables were reassigned. A is now an array, containing a function that returns an element of the result object by key. fn is a new function now, that just returns the string 'w+' and key becomes the number 1. Exciting.

The Final Loop (Deobfuscated)

If you've just skipped all of the deobfuscation procedures above and went straight to this point, you should know that you've missed a lot of interesting stuff along the way.

We're just one single line of code away from finding out how all of it works, and hopefully we'll be rewarded with some deobfuscated JavaScript code that shows us how the window in window trick worked.

So this is our final loop, deobfuscated.

The only thing the loop did was decrementing the key variable, so that it's '0' now. All of the references resulted in the code you can see in line 166. To find out what this regular expression does, we can check regex101.com, which is probably the best tool available that helps you make sense out of regular expressions.

This is interesting because it will match anything consisting of a-zA-z0-9 but no special characters. The second parameter of the replace function is a callback function defined in line 154 in the previous section. So if the regular expression matches 1C in the app string it will look into the result object and return the value of result['1C'], which is the string 'while' in this case. So, in order to read our deobfuscated code, we just need to add a console.log(app) below that line or swap the return app statement with console.log(app).

I can hear you asking, 'Well… wait a minute now! Are you serious?'.

Yes we are! Flawed is the wrong word for this obfuscation method. It's downright broken. At the beginning we mentioned that we don't actually need to understand any of the obfuscation mechanisms. Now you see why. The return app statement was there from the very beginning, not obfuscated or otherwise obscured. The code it returns will be executed by the eval statement in line 106, in the section with the deobfuscated secondFunctionObj. We could have swapped that eval statement with a console.log (which we would always recommend) and would produce the deobfuscated text without much effort. It shows that obfuscation might sometimes be as hard as deobfuscation. If you don't really know what you are doing, you can chain hundreds of functions to confuse someone who wants to read your cleartext code. But that doesn't help you much if you just return the cleartext code and eval() it then in such an obvious place. Let's try it and see.

Finally We are Done! Or are We?

Finally, we can now see the deobfuscated code and find out how the window in window trick works. But there's a problem.

This looks familiar. It seems like they have obfuscated it again, using the same method. Let's first format it a bit more clearly, so we can see the structure of the code a little better.

This looks like a stripped down version of the code we dealt with before, even though the beautifier we used might just resolved some of it for us. You can see the many parts that we've deobfuscated earlier, but this code seems to omit most of the references and array shuffling. In line 23, we see our return statement from before. We can swap the return rehbr727dx1 part with a console.log(rehbr727dx1), run it in the developer console and copy the result.

Surprise: it's layer number three! After formatting (we just used a JavaScript beautifier again), this third layer is yet another variation of the obfuscated code from before.

The return statement is in line 22. You know the drill. So after swapping the return with console.log, we end up with this.

Is this the same result as before? Let's beautify the code and see what we end up with. This is the result.

At this point, we're almost done! The beautifier did a good job deobfuscating it to a certain degree and resolved all the array references for us. This may seem like cheating. But honestly after resolving roughly 600 different references across what feels like 20 objects and functions, we've earned the luxury of using an automated beautifier. However, there is still a lot of encoding going on. All the numbers and letters with the leading x are hex-encoded bytes. We somehow need to decode them to show their actual plaintext values.

This is a moderately complicated task. There are probably better ways than this, but we finally need some results. So let's start by replacing all the occurences of x with x in our editor. We will now wrap the whole code in backticks and use the following replacement operation.

This replaces every hex-encoded byte with its decoded counterpart. We don't want to replace x22 and x27 since these are double and single quotes. That might later break our code in an unexpected way, for example by prematurely ending a string, which would lead to errors. We still need to fix some code, like the regex at the beginning, as this replacement method is not completely accurate. That's not too critical, since all the regex does is check whether the website is viewed with a mobile device. So let's just remove that code for now.

What remains is actual, functional JavaScript code that does something meaningful. What does it do? We don't know yet. But it's significant progress to see proper code instead of some encoded strings and weird for loops. If you aren't familiar with it, the $ function is part of jQuery. Unfortunately we don't have the full HTML code, but that shouldn't matter too much. Let's see if we can find out what it does!

Mobile Redirect

It's clear what the code below does, even if it's a little hard to read. jQuery adds an event listener to the element with the goAuth class and a href of ?login. Within this listener function it first seems to set a cookie, using the setCookie function. This is not a native JavaScript function, but a custom one. It's defined a little bit further down the page.

Then it reassigns window.location.href, which results in a redirect. You can see the URL to which it redirects below the setCookie call. This looks exactly like a legitimate steam SSO url. However, as you might have noticed, it doesn't redirect to steamcommunity.com, but to document.location.origin (which would be https://tradeit.cash). Finally it returns false, which would make sure that the the link, once clicked, wouldn't actually visit the ?login url that's in its href attribute.

Gmail phishing html code

But this code doesn't run by default. First of all the isMobile variable needs to be set to true. But if it's not true, the code may still run if the weird looking second variable in the if statement has the value 'full'. You may not remember, but we've seen this variable before. It was one of the very first variables we encountered, the ones on top of the obfuscated code, that we haven't seen anywhere else yet. As you see, even the deobfuscated code may contain references to code we've encountered during other phases of the deobfuscation process. That's why randomly renaming variables or object keys might make it harder later when you try to make sense out of some parts of the code.

Finally, HTML Code!

After a very long search, we've finally found some HTML code. It's appended to the document body by jquery, but only if the variable we've seen above is true and if the user doesn't use a mobile device. Let's see.

It also becomes clear now why we didn't decode the double quotes. If we had literal double quotes after <script src =, the string would break. Let's copy this HTML code into another file and format it a little bit better. There are a lot of calls to document.location.origin, which we will just replace with http://localhost for now. So here's the final HTML code from line 17. First there are a few script and link tags. Initially they contained absolute URLs leading to a local resource, but we'll just use an external CDN for now.

What then follows are lines upon lines of CSS. We'll just look at the first few of them. The first one seems to refer to the container div that holds all of the other elements in this HTML file. It sets a few styles, such as display: none, which means that it's not visible by default, a white background color as well as a white border and a z-index of 999999999. The z-index basically defines whether the element is in the foreground, somewhere in the middle, or in the background. A z-index value of 999999999, means that the element is most likely always in the foreground.

In line 34 you see that there is an element with a favicon as background. That's probably the steam favicon, on the upper left side of the browser window.

Further down on the page we'll see the actual HTML elements that make up the layout of the page. We'll probably get a good understanding of what's going on once we examine this code.

That looks interesting! In line 284 there is a mention of Steam Community, which is the headline of the popup window from the Stuff With Aurum blog post. Next there is a placeholder for the browser name. It looks like the code we see here is actually the one that's responsible for showing the fake popup window. What follows are a few images. Since it contained document.location.origin, which is JavaScript but not HTML code, we replaced it with http://localhost. It looks like what we have here are six different buttons: b1, b2, b3 as well as b1_hover, b2_hover and b3_hover. We can assume that these are the minimize, maximize and close buttons.

That means what now follows should be the URL bar! And sure enough, there are several input fields, with the values 'https', '://', 'steamcommunity.com' and '/openid/login?openidns=..', which seems to be the URL we've seen on the fake popup. Then there is another input tag, containing the full URL.

At the bottom of the page is an iframe tag with an empty src attribute, which might contain the content of the phishing popup later on. Easeus todo pctrans 9.8 review. When opening the HTML file we are greeted with an empty page though. But why is that the case?

The display: none css property is likely at fault here. And even if it did show the page, the placeholder such as the browser name would still be empty, since we haven't included the deobfuscated JavaScript code. So let's get back to that first, before we can check out the fake popup.

We can see three variables. One of them has the value 1040, the other two have the value null. Below is a function that seems to show and hide the *_hover buttons from the HTML code we just saw. So if we hover with our mouse over one of the buttons it will just show another version of them.

The next piece of code seems to emulate the behaviour of a URL bar. It will select the URL once you click on it and hide the other URL, which consists of different input fields in order to show different colours for the https://, steamcommunity.com and /openid/… parts. If you click anywhere else on the page or press Enter, these input fields will appear again.

What follows now is the code that makes the popup window draggable. This allows you to move the window around when you click on the URL bar and move your mouse. Of course this only works within the boundaries of the parent browser window, but it still makes it somewhat more convincing.

On line 57 the there an event listener for the click event was added to the hover buttons. That means if you press the minimize, maximize or close button, the popup window will disappear. Additionally the iframe containing the phishing content will be cleared. Also in line 62 and 63, the clearTimeout and clearInterval functions are called on the two variables we've seen before (the ones that had the value null). It seems like further down the page those will be filled with the result of a setTimeout and a setInterval function.

Phishing Html Code

Now follows a function call, which seems to get the name of the current browser, which will fill the placeholder on our HTML page.

Let's see what the _0x428dx9 function does.

It looks like it queries some browser specific variables and functions and checks their return value in order to fingerprint the user's browser and show the right name in the window title.

What now follows is another event listener, but this time it's added to an a tag with both href='?login' and class='goAuth' properties.

This seems to show the phishing page (display: block), give it a fixed width and height and then calls the setTimeout function. The function defined within setTimeout will run after 1.2 seconds and load what's most likely the phishing page. That 1.2 seconds delay likely aims to make the loading of the phishing page more realistic.

clearInterval and clearTimeout will end any interval or timeout that is still active. After that setTimeout will wait for one second and start an interval. Every 5 milliseconds, the interval will check whether the the width of the URL bar is the same as the variable _0x428dx2. It has the value '1040' and was one of the three variables that were initially defined. If the width has not the same value as _0x428dx2, then _0x428dx2 will be reassigned. It now contains the width of the URL bar. Next the sizes of the input fields within the URL bar container are reduced. This most likely makes sure that the content of the URL bar always fits into the container.

If we take a closer look at the URL bar, we see three very familiar variables (line 78, 81 and 82). These variables were on top of the very first lines of our obfuscated piece of code. Let's resolve the dependencies of this URL and rename the variables.

The Purpose of the Very First Variables

So we can now see what the variables were used for. They were just path and cookie data. The first one was indeed base64-encoded and contained a path to where the phishing page was served.

Therefore we renamed them to base64EncodedPath. The other two variables are the names of cookies. gen_string just returns a random string. The final URL looks like this:

Opening this page with the correct parameters will show the fake Steam login prompt.

The doWindowInWindow variable is probably not perfectly named. It may either have the value 'full' or 'true', where 'full' just means that there is no popup, but instead the user is redirected to the phishing content on tradeit.cash.

Let's Test It!

Since we now know how the script works, we can test it. We need to get our own pictures such as the minimize, maximize and close buttons, the Extended Validation box containing the company name and the favicon. We also need to create a link that opens the fake window. As seen before, this link needs to have the class goAuth and the href ?login. This is what our version looks like.

The colors of our button are a bit off maybe, but overall it's a pretty convincing trick.

What Can We Learn?

  • First of all, you should always be careful when you type your username and password into any form field. Even if the site looks like a legitimate, well-known site, you should make sure to check your browser bar for the correct domain name.
  • Additionally you need to make sure that the browser window you see is real and not just part of the page. The easiest way to achieve this is by trying to drag the window over your browser bar. If it's a legitimate browser window you can drag it anywhere without limitations. If it's a fake window, it will stay within the boundaries of its parent browser window.

As already mentioned, deobfuscating this code won't prevent the scammer from using this technique and, as alarming as it is, there will be plenty of people who fall for this trick and provide their credentials. We might even see more phishing sites like this, as it makes it easy to abuse the trust we put into companies like Steam, Google or Facebook to protect our data and makes it more likely for us to provide our password to scammers like this. The only protection against this trick is to stay vigilant.

Further Reading

An Innovative Phishing Style
Gist

.

By Faizan Ahmad, University of Virginia

There is hardly a week when you go to Google News and don’t find a news article about Phishing. Just in the last week, hackers are sending phishing emails to Disney+ subscribers, ‘Shark Tank’ star Barbara Corcoran lost almost $400K in phishing scam, a bank issues phishing warnings, and almost three-quarter of all phishing websites now use SSL. Since phishing is such a widespread problem in the cybersecurity domain, let us take a look at the application of machine learning for phishing website detection. Although there have been many articles and research papers on this topic [Malicious URL Detection] [Phishing Website Detection by Visual Whitelists] [Novel Techniques for Detecting Phishing], they do not always provide open-source code and dive deeper into the analysis. This post is written to address these gaps. We will use a large phishing website corpus and apply a few simple machine learning methods to garner highly accurate results.

Data

Roblox phishing html code


The best part about tackling this problem with machine learning is the availability of well-collected phishing website data sets, one of which is collected by folks at the Universiti Malaysia Sarawak. The ‘Phishing Dataset – A Phishing and Legitimate Dataset for Rapid Benchmarking’ dataset consists of 30,000 websites out of which 15,000 are phishing and 15,000 are legitimate. Each website in the data set comes with HTML code, whois info, URL, and all the files embedded in the web page. This is a goldmine for someone looking to apply machine learning for phishing detection. There are several ways this data set can be used. We can try to detect phishing websites by looking at the URLs and whois information and manually extracting features as some previous studies have done [1]. However, we are going to use the raw HTML code of the web pages to see if we can effectively combat phishing websites by building a machine learning system. Among URLs, whois information, and HTML code, the last is the most difficult to obfuscate or change if an attacker is trying to prevent a system from detecting his/her phishing websites, hence the use of HTML code in our system. Another approach is to combine all three sources, which should give better and more robust results but for the sake of simplicity, we will only use HTML code and show that it alone garners effective results for phishing website detection. One final note on the data set: we will only be using 20,000 total samples because of computing constraints. We will also only consider websites written in English since data for other languages is sparse.

Byte Pair Encoding for HTML Code


For a naive person, HTML code does not look as simple as a language. Moreover, developers often do not follow all the good practices while writing code. This makes it hard to parse HTML code and extract words/tokens. Another challenge is the scarcity of many words and tokens in HTML code. For instance, if a web page is using a special library with a complex name, we might not find that name on other websites. Finally, since we want to deploy our system in the real world, there might be new web pages using completely different libraries and code practices that our model has not seen before. This makes it harder to use simple language tokenizers and split code into tokens based on space or any other tag or character. Fortunately, we have an algorithm called Byte Pair Encoding (BPE) that splits the text into sub-word tokens based on the frequency and solves the challenge of unknown words. In BPE, we start by considering each character as a token and iteratively merge tokens based on the highest frequencies. For instance, if a new word “googlefacebook” comes, BPE will split it into “google” and “facebook” as these words could be frequently there in the corpus. BPE has been widely used in recent deep learning models [2].

There have been numerous libraries to train BPE on a text corpus. We will use a great one called tokenizer by Huggingface. It is extremely easy to follow the instruction on the github repository of the library. We train BPE with a vocabulary size of 10,000 tokens on top of raw HTML data. The beauty of BPE is that it automatically separates HTML keywords such as “tag”, “script”, “div” into individual tokens even though these tags are mostly written with brackets in an HTML file e.g <tag>, <script>. After training, we get a saved instance of the tokenizer which we can use to tokenize any HTML file into individual tokens. These tokens are used with machine learning models.


Figure: Histogram of number of BPE tokens in HTML Code

TFIDF with Byte Pair Encoding


Once we have tokens from an HTML file, we can apply any model. However, contrary to what most people do these days, we will not be using a deep learning model such as a Convolutional Neural Network (CNN) or Recurrent Neural Network (RNN). This is mainly because of the computational complexity and the relatively small size of the data set for deep learning models. The figure above shows a histogram of tokens from BPE in 1000 HTML files. We can see that these files contain thousands of tokens whose processing will incur high computational cost in more complex models like CNN and RNN. Moreover, it is not necessary that token order matters for phishing detection. This will be empirically evident once we look at the results. Therefore, we will simply apple TFIDF weights on top of each token from the BPE.

As explained in the previous post on Authorship Attribution, TFIDF stands for term frequency, inverse document frequency and can be calculated by the formula given below. Term frequency (tf) is the count of a term i in a document j while inverse document frequency (idf) indicates the rarity and importance of each word in the corpus. Document frequency is calculated by totaling the number of times a term i appears in all documents. TF-IDF gives us weights as tfidf scores for each term in a document which is a product of tf and idf.

(1)

Machine Learning Classifier


Sticking with simplicity, we will use a Random Forest Classifier (RF) from scikit-learn. For training the classifier, we split the data into 90% training and 10% testing. No cross-validation is done since we are not trying to extensively tune any hyper-parameters. We will stick with the default hyperparameters of Random Forest from the scikit-learn implementation. Contrary to deep learning models that take a long time to train, RF takes less than 2 minutes on a CPU to train and demonstrate effective results as are shown next. To show robustness in performance, we train the model 5 times on different splits of the data and report the average test results.

Results


AccuracyPrecisionRecallFscoreAUC
98.5598.2998.8298.5599.68

Free Phishing Tools

Phishing Website Detection Results

The table above shows the results on test data averaged across 5 experiments. Looking at the surface, these seem like great results especially without any hyperparameter tuning and with a simple model. However, these are not so great. The model has 98% precision for both classes which means it gives around 2% false positives when it is detecting phishing websites. That is a huge number in the security context. False positives are the websites that the machine learning model deems to be phishing but are in fact legitimate. If users frequently encounter false positives, they have a bad user experience and they might not want to use the model anymore. Moreover, the security folks encounter threat alert fatigue when dealing with false positives. False positives are further quantified in the confusion matrix below where x-axis shows the actual classes and y-axis has the predicted classes. Even though the model is achieving a high accuracy score, there are 11 instances where the model predicted “Phishing” for the website but in reality, it was a safe website.

16 (False Negative)912 (True Negative)Legitimate
920 (True Positive)11 (False Positive)Phishing
PhishingLegitimatePredicted Class
Actual Class

Confusion matrix for the model

Now that we know there is still a problem with the model and we cannot deploy it as it is, let us look at a potential solution. We are going to use the Receiver Operating Curve (ROC) to look at the false and true positive rates. In the figure below, it is easy to see that for up to 80% true positive rate, we have a 0% false-positive rate which is something we can use for decision making.

The ROC curve demonstrates that for a particular confidence threshold (red dot), the true positive rate would be around 80-90% while the false positive rate would be close to zero. To prove this, let us look at different confidence thresholds and plot metrics against them. To apply a confidence threshold of x%, We will only keep websites where the model is more than x% confident that the website is either legitimate or a phishing one. When we do this, the total number of phishing websites (true positive rate) we can identify decreases but our accuracy increases considerably and precision also becomes close to 100%.

Facebook Phishing Page Html Code


Phishing Email Examples

Figure: Effect of Confidence Threshold on Accuracy, TPR, and FP

Phishing Websites List

The above figure demonstrates the effect of confidence threshold on test accuracy, the number of false positives, and the true positive rate. We can see that when we are using the default threshold of 0.5, we have 11 false positives. As we start to increase our confidence score, our true positive rate decreases but the number of false positives starts getting very low. Finally, at the last point in the graph, we have zero false positives for precision. This means that whenever our model says a website is trying to phish, it is always accurate. However, since our true positive rate has declined to 82%, the model can only detect around 82% phishing websites now. This is how machine learning could be used in cybersecurity by looking at the tradeoff between false positives and true positives. Most of the time, we want an extremely low false-positive rate. In such settings, one can adopt the approach above to get effective results from the model.

Limitations


Before concluding this post, let us discuss a few limitations of the methods we have seen above. First, our data set is pretty decent sized but it is not comprehensive at all for all the types of phishing websites out there. There might have been millions of phishing websites in the last couple of years but the data set contains 15,000 only. As hackers are advancing their techniques, newly made phishing websites might not be making the same mistakes that the old ones were making which might make them hard to detect using the model above. Secondly, since TFIDF feature representation does not take into account the order in which code is written, we can potentially lose information. This problem does not arise in deep learning methods as they can sequentially process sequences and take into account the order of the code. Moreover, since we are using raw HTML code, an attacker can observe the predictions of the model and spend some time trying to come up with obfuscations in the code that will render the model ineffective. Finally, someone can use off the shelf code obfuscators to obfuscate the HTML code which will again render the model useless since it has only seen plain HTML code files. However, despite some of these limitations, machine learning can still be very effective in complementing phishing blacklists such as the ones used by Google Safe Browsing. Combining blacklists with machine learning systems can provide better results than relying on blacklists alone.

Open-Source Code


As I discussed in the first post of this blog, I will always open-source the code for the projects I discuss in this blog. Keeping the tradition alive, here is the link for replicating all experiments, training your own phishing detection models, and testing new websites using my pre-trained model.

Github Repository: https://github.com/faizann24/phishytics-machine-learning-for-phishing


Bio: Faizan Ahmad is currently a Masters student at the University of Virginia (UVA) and works as a graduate research assistant at the Mcintire School of Commerce in UVA. He will be joining Facebook as a Security Engineer in June 2020. His interests lie at the intersection of cyber security, machine learning, and business analytics and he has done plenty of research and industrial projects on these topics.

Roblox Phishing Html Code

Original. Reposted with permission.

Facebook Phishing Site Html Code

Related:





Comments are closed.