One of the issues we see regularly on the Support Desk is memory problems on the LEM appliance.
The minimum requirements for LEM (which you can check out here: http://web.swcdn.net/creative/pdf/datasheets/SW_LEM_Datasheet.pdf) are a dual-core 3GHz CPU and 8GB of memory.
I think one of the common misunderstandings about LEM is how it uses memory. The LEM appliance loads your rules and correlations into memory, and every event goes into memory and through those rules before being written to disk. That means LEM will grab all the memory it gets assigned and use it. So what costs memory and causes problems?
- Hyper-active rules: if you're getting more "Rule fired" events than actual events, that's going to eat the LEM appliance alive. This is one of the reasons we tell people to avoid the "AnyAlert" event type in Rules: AnyAlert sends EVERY alert to the rules engine so it can be checked against whatever field values you've specified, and that takes a lot of memory.
- Virtual machine hosts: because LEM is so dependent on memory, in a virtual environment where memory might get reassigned, LEM may suddenly lose information that it's using for your rules. In some cases we see this handicap the LEM appliance: it mostly keeps working, but behaves unpredictably. In the worst case, LEM crashes entirely and you lose data. This is one reason support will recommend that you set reservations in VMware or Hyper-V to make sure that nothing else can ever grab memory away from LEM (there's a scripted example of setting a reservation after this list). It also means that LEM's memory assignment and reservation should match: assigning 16GB of memory and reserving 8GB only moves the problem, it doesn't solve it.
- Also, if your host machine only has 16GB of memory and you assign the LEM appliance all of it or more, you're asking for trouble. Your host OS needs memory too!
- Number of nodes and events per time period: this one isn't easy to quantify, partly because a node can be so many different things. I used to work with Cisco a lot, so I'll use that for my examples.
- If you're monitoring 30 Windows workstations with a really conservative audit-policy, then the minimum system will probably be fine. If you're monitoring 30 Catalyst switches that only log warning and critical events, then the minimum system will probably be fine.
- If you're monitoring 30 ASAs that are trapping debug information and generating millions of events an hour, then the minimum requirements will probably not be adequate for the task and LEM will eventually encounter problems.
- If you're generating a lot of Windows Filtering Platform noise, then workstations can also cause problems (we have a guide for some of that: SolarWinds Knowledge Base :: Disabling Windows Filtering Platform Alerts Using Alert Distribution Policy).
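If your host is vSphere, here's a minimal sketch of setting that matching memory reservation with pyVmomi. This isn't an official SolarWinds tool; the vCenter hostname, credentials, and VM name are placeholders, and the same setting lives in the VM's memory settings in the vSphere client. (Hyper-V's rough equivalent is to disable Dynamic Memory and assign a fixed amount.)

```python
# Hedged sketch: reserve the LEM VM's assigned memory so the hypervisor can never
# reclaim it. Hostname, credentials, and the VM name "LEM" are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="admin@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

# Find the LEM appliance VM by name
view = content.viewManager.CreateContainerView(content.rootFolder,
                                               [vim.VirtualMachine], True)
lem_vm = next(vm for vm in view.view if vm.name == "LEM")

# Reserve exactly what the VM is assigned (reservation is specified in MB)
spec = vim.vm.ConfigSpec()
spec.memoryAllocation = vim.ResourceAllocationInfo(
    reservation=lem_vm.config.hardware.memoryMB)
lem_vm.ReconfigVM_Task(spec)

Disconnect(si)
```

Reading the VM's own assigned memory and reserving exactly that amount is the point from the list above: the reservation should match the assignment, not sit below it.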
So when we see a system fully utilizing a LEM2500 license, with dozens of file servers auditing their entire file systems, firewalls logging everything down to TCP and UDP connection setup and teardown, and a lot of rule activity, all running at the minimum requirements with no reservations on a shared VM host, it's no surprise to us that the LEM appliance might be having issues.
As a frame of reference, when the TriGeo SIM appliances were retired and the LEM appliance went virtual, the last generation of appliances was built on Dell R610 servers, which usually had 16 CPU cores and 32GB of memory.
"But wait!" I already hear you cry, "Are you saying I have to assign this huge chunk of resources to LEM? And reserve them? That totally defeats the purpose of virtualization!"
My answer is "Maybe, and sorta."
The maybe is this: you may need to expand beyond the base LEM requirements. If you're having problems and you have resources available, you should consider expanding LEM. If you have to make a choice, memory is more important to us than CPUs (and in some cases, adding more CPUs to a virtual machine will actually make it perform worse). We'd rather have 32GB of memory and 2 cores than 32 cores and 8GB of memory.
The sorta is this: for the LEM appliance, virtualization opens up the ability to quickly and (more or less) painlessly expand and contract resources. Memory adjustments are done with a slider in a UI, not with a screwdriver. It also means that you don't have to own a Dell R610 to run LEM: you can deploy on any hardware that supports ESX or Hyper-V, so you can buy what you want and we don't have to worry about whether LEM has the drivers to support it.
What else can help LEM cope with a lot of traffic? Anything that optimizes any virtual machine:
- Thick provisioning (with eager-zeroed disks) can really increase the number of IOPS supported, and it'll also catch the problem of assigning a 300GB drive on a 200GB datastore before it becomes an issue (there's a sketch of eager-zeroing an existing disk after this list)
- Aligning sectors/clusters on the datastore to the system you're deploying
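If the appliance was already deployed lazy-zeroed, you don't necessarily have to redeploy it to get eager zeroing. Here's a hedged sketch using pyVmomi's VirtualDiskManager; the vCenter details and VMDK path are placeholders, it only applies to thick-provisioned disks, and the LEM VM should be powered off while it runs.

```python
# Hedged sketch: convert an existing lazy-zeroed thick disk to eager-zeroed thick.
# vCenter details and the VMDK path are placeholders; power the LEM VM off first.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="admin@vsphere.local",
                  pwd="password", sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()

datacenter = content.rootFolder.childEntity[0]   # first datacenter; adjust as needed
disk_path = "[datastore1] LEM/LEM.vmdk"          # placeholder VMDK path

# Zero every block now so writes never pay the first-touch penalty later
content.virtualDiskManager.EagerZeroVirtualDisk_Task(name=disk_path,
                                                     datacenter=datacenter)
Disconnect(si)
```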
LEM-specific:
- Manage the noise: for example, what are you really looking for in Windows Filtering Platform traffic? Or in the ASA debug logs? Or in file-access audits for the root drive of a file server?
- Windows might be logging what you want somewhere else, in some other way. Can we isolate and look for those events more specifically? Maybe another system is capturing the data you want in a more meaningful way (a Domain Controller vs. a workstation, for example).
- Cisco allows you to escalate syslog messages, so if you only care about certain debug messages, escalate them to Warning level and then change your syslog trap level (check these configuration examples: http://www.cisco.com/c/en/us/td/docs/security/asa/asa84/configuration85/guide/asa_cfg_cli_85/monitor_syslog.pdf). There's a scripted sketch of this after the list.
- Do your users have access to the root share on your file server? Trust AD authentication to keep users out of restricted files, and audit the stuff they can touch. You have certain users that can access those restricted areas? Audit them more severely. You don't need to put the HVAC guy under a microscope unless he's also a Domain/Schema/Global Admin (or you're Target).
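For the Cisco escalation above, here's a sketch of what pushing that change to an ASA might look like with netmiko (this isn't taken from the linked guide). The device details are placeholders, and message ID 711001 is only an example stand-in; substitute the debug-level message IDs you actually care about.

```python
# Hedged sketch: escalate one debug-level syslog message to warnings on an ASA,
# then cap what gets trapped to the LEM appliance at warnings and above.
# Device details are placeholders; 711001 is just an example message ID.
from netmiko import ConnectHandler

asa = {
    "device_type": "cisco_asa",
    "host": "192.0.2.1",
    "username": "admin",
    "password": "password",
    "secret": "enablepassword",
}

commands = [
    "logging message 711001 level 4",   # escalate this debug message to warnings (4)
    "logging trap warnings",            # only send warnings and above to syslog
]

with ConnectHandler(**asa) as conn:
    conn.enable()
    conn.send_config_set(commands)
    conn.save_config()                  # write mem
```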
Sometimes the solution isn't just to throw more resources at LEM. Sometimes you need to lessen the flow, so the LEM appliance isn't trying to drink from a fire hose. You'll reduce resource requirements, and the data you collect will actually be meaningful and useful (as much fun as it might be to watch 60,000 events stream by in the Monitor console in a second). You'll actually be able to identify threats and respond to events because they won't be buried in so much noise.