Wild speculations for fun and science!


CodeGlitch0


I'm a developer, you're a developer, she's probably a developer too! Let's face it, DU is born to attract the developer types to it. So let's start a (serious) wild speculation thread about what we think the back-end server architecture might look like!

 

I have to imagine that since NQ is targeting a single-shard server structure, the back end will probably need to be broken up into microservices. But since it is a real-time game, it will also need to be as compact as possible, with as few network boundaries to cross as possible. A server cluster, perhaps similar to Azure Service Fabric, would do the job very well, with services partitioned out and distributed across nodes for density, automatic failover, and scaling.

 

First off, there will be a single point of contact for establishing and authenticating connections to the servers.  The authentication service will check credentials, negotiate the session security keys, and reserve a socket endpoint on the gateway layer for the client. The client then connects to the specified endpoint using the provided symmetric keys to prevent tampering.
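To make that concrete, here's a minimal sketch of the handshake I have in mind, in Python. The service names (AuthService, GatewayPool), the key scheme, and the load heuristic are all my own placeholders, not anything NQ has confirmed:

```python
import secrets
from dataclasses import dataclass

@dataclass
class SessionTicket:
    endpoint: str       # host:port of the reserved gateway socket
    session_key: bytes  # symmetric key shared with the gateway layer

class GatewayPool:
    """Reserves a socket endpoint for the client on the least-loaded gateway node."""

    def __init__(self, nodes):
        self.nodes = nodes      # e.g. ["gw1.example:7777", "gw2.example:7777"]
        self.reservations = {}  # username -> (endpoint, session_key)

    def reserve(self, username, session_key):
        endpoint = min(self.nodes, key=self._load_of)
        self.reservations[username] = (endpoint, session_key)
        return endpoint

    def _load_of(self, node):
        return sum(1 for ep, _ in self.reservations.values() if ep == node)

class AuthService:
    """Single point of contact: checks credentials, negotiates keys, reserves a gateway slot."""

    def __init__(self, gateway_pool):
        self.gateway_pool = gateway_pool

    def login(self, username, password):
        if not self._check_credentials(username, password):
            raise PermissionError("bad credentials")
        session_key = secrets.token_bytes(32)                        # negotiated symmetric key
        endpoint = self.gateway_pool.reserve(username, session_key)
        return SessionTicket(endpoint=endpoint, session_key=session_key)

    def _check_credentials(self, username, password):
        return True  # stand-in for a real credential store

# Client side: authenticate, then open an encrypted socket to the reserved endpoint.
auth = AuthService(GatewayPool(["gw1.example:7777", "gw2.example:7777"]))
ticket = auth.login("player1", "hunter2")
print(ticket.endpoint, len(ticket.session_key))
```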

 

The gateway layer will be scaled as necessary to provide raw network throughput. At 5 KB/s of bandwidth per client, that gives a theoretical maximum of roughly 200 clients per server before the gigabit network ports are fully saturated. Incoming packets (messages) are dropped into an event queue system for processing by the back-end processing services.

 

The regions are partitioned spatially using an octree (essentially a three-dimensional binary search, where each cubic region of space is split in half along every dimension, into eight smaller cubes, as necessary).  The regions are separate services and are spread amongst server nodes for density and scalability.  Each region is responsible for calculating physics on region objects and routing events between players in the region.  A level-of-detail system is also in place for sending important messages across regions at a lower frequency.  As more players move into a region, it is split cubically into sub-regions that are redistributed amongst nodes in the cluster. As players leave, sub-regions are collapsed back into the parent region to conserve resources.
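Purely as an illustration of that split/collapse idea, here's a toy octree region in Python. The player-count thresholds and the re-homing logic are invented for the sketch; a real system would hand sub-regions off to other cluster nodes instead of keeping everything in one process:

```python
from dataclasses import dataclass, field

SPLIT_THRESHOLD = 64   # invented: split a region once it holds this many players
MERGE_THRESHOLD = 16   # invented: collapse children when their total falls below this

@dataclass
class Region:
    center: tuple                                 # (x, y, z) center of this cube
    half_size: float                              # half the edge length of the cube
    players: dict = field(default_factory=dict)   # player_id -> (x, y, z)
    children: list = field(default_factory=list)  # the eight sub-cubes once split

    def add(self, player_id, pos):
        if self.children:
            self._child_for(pos).add(player_id, pos)
            return
        self.players[player_id] = pos
        if len(self.players) > SPLIT_THRESHOLD:
            self._split()

    def _split(self):
        h = self.half_size / 2
        cx, cy, cz = self.center
        self.children = [
            Region((cx + dx * h, cy + dy * h, cz + dz * h), h)
            for dx in (-1, 1) for dy in (-1, 1) for dz in (-1, 1)
        ]
        for pid, pos in self.players.items():     # re-home players into the sub-cubes
            self._child_for(pos).add(pid, pos)
        self.players.clear()

    def _child_for(self, pos):
        cx, cy, cz = self.center
        # Octant index: +x contributes 4, +y contributes 2, +z contributes 1.
        return self.children[(pos[0] > cx) * 4 + (pos[1] > cy) * 2 + (pos[2] > cz)]

    def collapse_if_sparse(self):
        """Called as players leave: fold the sub-regions back into this parent."""
        if self.children and sum(len(c.players) for c in self.children) < MERGE_THRESHOLD:
            for c in self.children:
                self.players.update(c.players)
            self.children = []
```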

 

When the server cluster hits specified usage limits, nodes are added to or removed from the cluster to scale up or down as needed.  The service cluster framework (for example, Azure Service Fabric) is responsible for redistributing partitions across server nodes and replicating services for failover purposes.

 

Each of these scale units is also distributed geographically to maintain low latency, with a backplane in place to keep geographic regions in sync.

 

That's my initial spitball idea for the architecture. Feel free to elaborate, correct, or share your own architecture ideas. Let's get a conversation going.


I don't have the same kind of experience with networking architectures, but I guess you could call what I posted in the Q&A thread speculation, in the form of questions.

 

In case we don't get official answers by the time the campaign ends, how feasible would you speculate this hypothetical architecture is in practice?

I'm skeptical myself, and my gut feeling is that the project is more likely to fail than succeed, but I'm tempted to back it anyway, to some extent.

 

I'm not sure I understand a couple parts of your post. Firstly, what kind of other overhead are you counting in for the GbE ports? 200*5KB/s only accounts for about 1% of the total throughput, right? Secondly, could you elaborate on how the Azure-type system works, specifically in terms of inter-cell connectivity and optimizing geographically dispersed clients' connections?

 

The model I hypothesized in my other post would have clients in the same world-region (leaf of the space-dividing tree) connected to the same physical hardware server. But would it be smarter if what we see in the demo video is just a logical division of avatars in space, with clients actually connected on a client-by-client basis to physically proximal server hardware?

 

As I said, I've no clue what I'm talking about when it comes to "smarter" networking with load balancing and such, I've never looked into it and it might just be way over my head.

 

P.S. FYI - not a dev (by trade, anyway)


I'm not sure I understand a couple parts of your post. Firstly, what kind of other overhead are you counting in for the GbE ports? 200*5KB/s only accounts for about 1% of the total throughput, right?

 

Did I do derp math again?  I always mess up some mundane detail.  (I hope at least one of you gets that reference).  The actual math doesn't matter anyways. The point is: you have a layer for ingress traffic.

 

 

 

Secondly, could you elaborate on how the Azure-type system works, specifically in terms of inter-cell connectivity and optimizing geographically dispersed clients' connections?

 

Service Fabric doesn't handle things like inter-cell connectivity or geographic dispersion. That is something you'd have to write yourself.  Service Fabric is a framework/engine for clustering, service partitioning, and node and failover management.

 

You'd handle regions by making each region an "actor" or "service", for example, and distributing those services across available nodes.  You're responsible for the spatial divisions (an octree is just a highly efficient way of indexing spatial data, and many games use them).

 

 

Age of Ascent (a space game supporting 50,000+ players in a single twitch-combat battle) has a lot of publicly available architecture information, which I based a lot of my hypothesis on.  It's pretty well documented too. http://web.ageofascent.com/blog/


Service Fabric doesn't handle things like inter-cell connectivity or geographic dispersion. That is something you'd have to write yourself.  Service Fabric is a framework/engine for clustering, service partitioning, and node and failover management.

 

You'd handle regions by making each region an "actor" or "service", for example, and distributing those services across available nodes.  You're responsible for the spatial divisions (an octree is just a highly efficient way of indexing spatial data, and many games use them).

 

It's kind of painful reading the docs on that tech. It's so bloated with Microsoft-speak and jargon, to a degree that doesn't seem entirely necessary (but that's MS for ya I guess). Their introductory page seems like 10% technical explanation and examples and 90% marketing spiel.

 

Regardless, I guess the gist of it is that you break a problem down into small components that can be replicated, like their example of thousands of user profiles or databases or whatever (in your hypothetical example, regions of space). So basically you delegate the responsibility of distributing the load evenly (and densely) across server hardware, instead of designing that by hand, which means you just have to design the server program in a more granular way (broken up into these "service" things). That's cool and all, but it leaves many questions unanswered. I'm also not too happy about how the effectively black-box nature of the service fabric technology obscures implementation details (and hence makes estimating efficiency and feasibility difficult, as far as I can tell).

 

Given all that, how would you design the actual inter-region communications? Do they all talk to each other directly? Do they only talk to adjacent regions? How do you deal with network delay between regions if they're not geographically adjacent in real life?

 

Best I can think of is a thing where you do the octree division thing to figure out which clients will be in the same region (cell as NQ calls them), figure out which of their server locations is closest to the center of the clients' geographic locations (ping-wise, in a least squares sense), and assign a node from that cluster to them. Then the avatars closest to you in the game world would have, on average, as low a ping as possible. But things then get more complicated if you want to minimize the distances between the chosen geographic locations for nodes for adjacent regions... given a completely random geographic distribution of connected clients.
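To illustrate what I mean by "ping-wise, in a least squares sense", here's a toy version in Python, with made-up datacenter names and ping numbers; a real system would obviously measure these continuously rather than use a static table:

```python
def pick_datacenter(cell_clients, ping_table):
    """cell_clients: ids of the clients sharing one octree cell.
    ping_table: {client_id: {datacenter: ping_ms}} as measured by each client."""
    datacenters = next(iter(ping_table.values())).keys()
    # Least-squares flavour: minimize the sum of squared pings over the cell's clients.
    return min(datacenters, key=lambda dc: sum(ping_table[c][dc] ** 2 for c in cell_clients))

ping_table = {
    "alice": {"us-east": 30, "eu-west": 110, "ap-east": 200},
    "bob":   {"us-east": 90, "eu-west": 40,  "ap-east": 240},
    "carol": {"us-east": 80, "eu-west": 35,  "ap-east": 230},
}
print(pick_datacenter(["alice", "bob", "carol"], ping_table))  # -> eu-west
```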

 

You seem to be much more informed on these sorts of problems in practice with typical solutions -- what would you propose?

 

Age of Ascent (a space game supporting 50,000+ players in a single twitch-combat battle) has a lot of publicly available architecture information, which I based a lot of my hypothesis on.  It's pretty well documented too. http://web.ageofascent.com/blog/

 

Holy shit that's actually a thing? So at least a bare-bones version of this is technically feasible? That does make me... somewhat less nervous. Time to read up on them next, I suppose.


My second idea is that there are a bunch of gerbils in a back room, powering the servers that chimpanzees are typing wildly into until a game emerges.

Oh just like eve online! That's the only sentence I made sense of.


 

Given all that, how would you design the actual inter-region communications? Do they all talk to each other directly? Do they only talk to adjacent regions? How do you deal with network delay between regions if they're not geographically adjacent in real life?

 

Best I can think of is a thing where you do the octree division thing to figure out which clients will be in the same region (cell as NQ calls them), figure out which of their server locations is closest to the center of the clients' geographic locations (ping-wise, in a least squares sense), and assign a node from that cluster to them. Then the avatars closest to you in the game world would have, on average, as low a ping as possible. But things then get more complicated if you want to minimize the distances between the chosen geographic locations for nodes for adjacent regions... given a completely random geographic distribution of connected clients.

 

You seem to be much more informed on these sorts of problems in practice with typical solutions -- what would you propose?

 

I don't believe it would work to have them communicate directly.  That'd be too much traffic, essentially bottlenecking it back down to the problem you get with a single machine.  Instead, everything would need to go through some sort of event router service.  The service would have a level-of-detail algorithm to intelligently route messages to the relevant regions (and players) at a scaled-back frequency based on distance and the importance of the message.  Regions adjacent to the source would get the messages on a semi-frequent basis, further regions would get them even less frequently, and even further regions may not get them at all. You would also need to take into account that regions can be really small, so region size would also be a factor in messaging frequency.
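As a rough sketch of that frequency rule (every constant here is invented purely for illustration):

```python
BASE_HZ = 20.0   # invented: update rate inside the source region itself

def update_rate_hz(region_distance, region_size_m, importance):
    """How often a destination region should hear about events from the source.

    region_distance: number of cells between source and destination (0 = same cell)
    region_size_m:   edge length of the destination cell; small cells mean dense space
    importance:      0..1 weight per message type (ambient movement low, combat high)
    """
    distance_falloff = 1.0 / (1 + region_distance) ** 2
    size_factor = min(1.0, 1000.0 / region_size_m)   # invented 1 km reference size
    rate = BASE_HZ * distance_falloff * size_factor * importance
    return rate if rate >= 0.5 else 0.0              # too far / too unimportant: drop it

print(update_rate_hz(1, 500, 1.0))     # adjacent, small, important cell -> 5 Hz
print(update_rate_hz(6, 50_000, 0.2))  # distant, huge, low importance   -> 0.0 (not sent)
```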

 

The regions could be comparable to the old-style notion of a server shard, to a degree.  A region knows about all the players in it, and all players know the region they are in.  This is an in-game region, a chunk of the universe.  The region is also responsible for all the players, NPCs, and constructs in that chunk of space and performs all of the physics/actions/voxel manipulation within it. Because this is all processed on the server cluster, ping time in a region would depend on your and their connections to the server, rather than on connections to each other like you get in typical multiplayer games.  Your typical game creates a lobby and may try to establish more of a peer-to-peer connection, which won't be possible in DU.

 

RL geographic regions would need to be handled differently.  To minimize latency you would want to distribute the cluster globally, but this creates other problems, such as keeping copies of a region in sync across datacenters.  I'm not sure what the best solution is there: distribute globally and minimize player latency while increasing some server-side latency (more predictable than user latency) for synchronization, or keep exactly one copy of each region and have players deal with additional latency depending on how far they are from the datacenter hosting that region. That is a difficult problem to solve and I don't have an answer yet.

 

 

Holy shit that's actually a thing? So at least a bare-bones version of this is technically feasible? That does make me... somewhat less nervous. Time to read up on them next, I suppose.

 

Yeah, it's actually a thing.  I've been in a couple of their monthly playtest sessions (30 min each month).  It works shockingly well.  And the battles are absolutely insane. I love it.  The server tech will likely prove somewhat similar to DU's in the end.

 

 

It's kind of painful reading the docs on that tech. It's so bloated with Microsoft-speak and jargon, to a degree that doesn't seem entirely necessary (but that's MS for ya I guess). Their introductory page seems like 10% technical explanation and examples and 90% marketing spiel.

 

To a degree, yes.  But I am a Microsoft / .NET developer, so it is what I'm used to, and it's actually quite clear to me.  Linux docs make me crazy.  But yes, there is a lot of marketing in there.  That was likely the trade-off Illyriad Games made with Microsoft.  They got early access to Service Fabric and a lot of direct assistance from Microsoft in the architectural design/development.  In exchange, they share some cross-marketing.

 

Illyriad has actually made a lot of code contributions to the .NET Core codebase, making huge leaps in performance such that ASP.NET Core can actually surpass Node.js now in raw throughput. I'm impressed by them.


This is a cool idea for a thread.

 

Mere speculation here, and I'm not a game developer. But I have a few ideas.

 

DDOS is always an issue. Here are some ideas that I would consider if I were to create a game.

 

THE INITIAL CONNECTION:

The client should have the ability to connect to multiple URLs (server names) for the initial connection. I would probably program an algorithm to generate those URLs so they're not statically stored in the client. Maybe even a second algorithm to generate a set of URLs that are dynamic (such as date-specific). If the client can't attach to the first URL, it attempts the second, and so on.

 

The algorithms would also be used server side to generate valid URLs such as S3de932.dualthegame.com (S3de932 is randomly generated by the algorithm). You have to have the algorithm on both sides, so that the names the client and server generate are identical.

 

The algorithms ARE client side, so there is the potential that a hacker could eventually decipher the algorithm. So possibly update the algorithm every so often to throw them off.
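As a toy example of what I mean, assuming a date-seeded HMAC-style scheme (the secret, the parent domain, and the rotation period are placeholders):

```python
import hashlib
import hmac
from datetime import date, timedelta

SHARED_SECRET = b"rotated-with-each-client-update"  # placeholder secret
PARENT_DOMAIN = "dualthegame.com"                   # placeholder domain

def server_names_for(day, count=3):
    """Derive that day's candidate hostnames; client and server run identical code."""
    names = []
    for i in range(count):
        msg = f"{day.isoformat()}:{i}".encode()
        digest = hmac.new(SHARED_SECRET, msg, hashlib.sha256).hexdigest()[:7]
        names.append(f"s{digest}.{PARENT_DOMAIN}")
    return names

# The server registers tomorrow's names in DNS ~24 hours ahead; the client tries
# today's list in order until one of them answers.
today = date.today()
for name in server_names_for(today) + server_names_for(today + timedelta(days=1)):
    print(name)   # names in the style of s3de932f.dualthegame.com, different every day
```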

 

The above solution means a DDOS attack can't focus on a specific server name, and by extension, on a single IP address.

 

Then I would use ANYCAST for IP addressing. There are two good reasons to use anycast. First, you can geographically disperse servers with identical IP addresses. This would allow a user from America and a user from China to connect to the same IP address, but reach servers that are close to them. The second reason to use anycast is that it can dynamically load balance to other servers. A DDOS attack might hit the IP address, but the load would be distributed to multiple servers.

 

Coupling the two above solutions would increase performance geographically, and not limit the initial client connection to a single server or anycast IP address. Each URL would point to a different anycast IP address.

 

GAMEPLAY SERVERS:

 

Once the players are logged in, they're automatically going to be passed to the server that hosts that 3D portion of virtual space. So, if the DDOS were to get past the initial login, it would be limited to a specific sector of space within the game.

 

The attacker would need an authenticated client connection, which could be BANNED if a DDOS were attempted. And once they're disconnected, the "3D space server" (whatever it's called) could have a new IP address assigned to it if needed. This would prohibit the banned user from continuing to hit the server by spamming junk packets at the old IP.

 

Configure the server/firewall to only accept packets from authenticated users. This would prevent an authenticated game client from passing useful IP addressing information to a botnet.

 

Also, the code used to dynamically scale the servers shouldn't rely solely on player count. If the botnet WERE able to bypass the above firewall rules, the server service would become overloaded. There should be a mechanism to scale or even offload legitimate users to a separate server and allow the unauthenticated packets to continue to hit a server that's no longer being used. By selectively offloading users over time, the "hacker" account could be deduced by noticing when the new server was hit by the botnet.
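Purely illustrative, that "offload and watch" step could work like a binary search over the connected accounts; the is_attacked() callback below is a made-up stand-in for real firewall/traffic telemetry:

```python
def find_leaking_account(accounts, is_attacked):
    """Narrow down which authenticated account is feeding server IPs to the botnet.

    accounts:    account ids currently on the overloaded server
    is_attacked: callback(endpoint, group) -> True if junk traffic follows that
                 group to its fresh endpoint (stand-in for firewall telemetry)
    """
    suspects = list(accounts)
    step = 0
    while len(suspects) > 1:
        step += 1
        mid = len(suspects) // 2
        left, right = suspects[:mid], suspects[mid:]
        # Migrate each half to its own brand-new endpoint and watch which one
        # the botnet follows; the leaker must be in that half.
        suspects = left if is_attacked(f"new-endpoint-{step}", left) else right
    return suspects[0]

# Toy demo: "mallory" is the account leaking server IPs.
accounts = ["alice", "bob", "carol", "dave", "mallory", "trent"]
print(find_leaking_account(accounts, lambda endpoint, group: "mallory" in group))  # mallory
```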

 

 

What do you think?

 

I'd love for you to shoot holes in my solution.


I don't believe it would work to have them communicate directly.  That'd be too much traffic, essentially bottlenecking it back down to the problem you get with a single machine.  Instead, everything would need to go through some sort of event router service.  The service would have a level-of-detail algorithm to intelligently route messages to the relevant regions (and players) at a scaled-back frequency based on distance and the importance of the message.  Regions adjacent to the source would get the messages on a semi-frequent basis, further regions would get them even less frequently, and even further regions may not get them at all. You would also need to take into account that regions can be really small, so region size would also be a factor in messaging frequency.

 
Well, the notion of "direct" communication and frequency of updates are kind of separate topics, right? Would this routing service be running on every cluster? Even so, I reckon physically messages would still be direct rather than daisy-chained or whatever, to answer my own question... I think my notion of it causing congestion in some special way was naive.
 
I still think update frequency will be a function of not only distance/adjacency, but also player density (you said as much above, I misread -- also you add message-type importance, which I forgot to consider)... So a naive way to implement it would be to use the number of cells between you and the target cell to figure out the update rate.
 
Still, we agree that messages, physically speaking, should be addressed directly to a target region, rather than propagating/attenuating recursively through directly adjacent nodes? I figure since we don't know the geographic distribution of the clusters (or do we?), we don't know what kind of delay we'd introduce by doing it the second way.
 

The regions could be comparable to the old-style notion of a server shard, to a degree.  A region knows about all the players in it, and all players know the region they are in.  This is an in-game region, a chunk of the universe.  The region is also responsible for all the players, NPCs, and constructs in that chunk of space and performs all of the physics/actions/voxel manipulation within it. Because this is all processed on the server cluster, ping time in a region would depend on your and their connections to the server, rather than on connections to each other like you get in typical multiplayer games.  Your typical game creates a lobby and may try to establish more of a peer-to-peer connection, which won't be possible in DU.

 
I concur, although I believe the part in bold italics is incorrect. I haven't seen that since GunZ: The Duel, and I believe most games I play have a server-client networking model rather than P2P (at least all the Source games and ArmA 3).
 
Do you think moving from one region/subdivision to another would also move you to a different cluster? I could see that causing some overhead... How would a swarm of players moving rapidly on an FTL-speed ship be handled? Or even just a group of 1k people running in the same direction? Would there be a hand-off every n meters where n is the current width of a region-subdivision (cell-width)? How about keeping the same cluster that just moves with the players from cell to cell? What about if they disperse, or some blend of dispersion/moving together? How would you blend between... am I making a problem out of something that is actually trivial and simple? I feel like I might be.
 

 

RL geographic regions would need to be handled differently.  To minimize latency you would want to distribute the cluster globally, but this creates other problems, such as keeping copies of a region in sync across datacenters.  I'm not sure what the best solution is there: distribute globally and minimize player latency while increasing some server-side latency (more predictable than user latency) for synchronization, or keep exactly one copy of each region and have players deal with additional latency depending on how far they are from the datacenter hosting that region. That is a difficult problem to solve and I don't have an answer yet.

 

The notion of server-side sync didn't even cross my mind, to be honest. It seems like it wouldn't be worth it to trade redundantly copying data for a likely small increase in predictability, though. The source of uncertainty/fluctuation in the one model would be between one client and the server at a time, whereas the other model effectively spreads those routing (and whatever other) issues to other clients, if my intuition is right...

 

Though I guess the real benefit would be larger scale homogeneity of the distribution of players' pings... I think that's a good thing...? I mean if you did distribute, your region would be disperse, your interaction with the world would be snappier, your interaction with other clients in your region would be dependent on roughly the same sort of geographic distance as in the other case, plus some (constant?) term added due to overhead... Your interactions with other regions' avatars would be similarly distributed, I think... Yeah, perhaps that would be the smart way.

 

 

To a degree, yes.  But I am a Microsoft / .NET developer, so it is what I'm used to, and it's actually quite clear to me.  Linux docs make me crazy.  But yes, there is a lot of marketing in there.  That was likely the trade-off Illyriad Games made with Microsoft.  They got early access to Service Fabric and a lot of direct assistance from Microsoft in the architectural design/development.  In exchange, they share some cross-marketing.

 

Illyriad has actually made a lot of code contributions to the .NET Core codebase, making huge leaps in performance such that ASP.NET Core can actually surpass Node.js now in raw throughput. I'm impressed by them.

 

I actually think Illyriad's communications are slightly clearer and more to the point; I was referring to these pages by MS. After I forced myself to read through the overview, I think I eventually formed a decent enough picture to get what's going on, roughly. Like on this page, if you skip over the BS to the pictures and the text next to them, it's not that bad.


@Ripper:
Those are some great points! The addition of anycast (perhaps through something like the new cloud DNS services) would definitely be required for the initial connections and would help greatly with geographic distribution and DDoS protection.  That would pair wonderfully with the authentication / connection reservation / encryption key generation I mentioned.
 
This brought to mind a couple of other benefits of having an abstracted ingress gateway layer. The ingress layer would be responsible for all encryption/decryption of over-the-wire traffic to players and could easily reject any connection attempts that have not been reserved.  If they were targeted by DDoS attacks, it would be easy to scale up additional instances and just dump the attacked nodes.  Player connections would simply reconnect to a new ingress node, and players would only notice some lag until automatic failover happens, plus the blip of that reconnection attempt.  Everything else would happen behind that ingress layer in traffic that is not publicly accessible, so the actual regions wouldn't be affected. The primary downside would be a slight addition to latency, but because it is all part of the cluster it is pretty much just traffic over a local network switch, so we're talking a few milliseconds or less.

@LurkNautili:
Connections between nodes would be direct, yes, but messages would need to be queued once received to ensure they happen sequentially.  But once in the queue, they can be processed and acted on asynchronously, with some intelligence to resolve conflicts. Individual services, such as a region, can also make use of replicas to allow certain read operations to occur across nodes, while write operations happen on a single primary node to protect the system's integrity.  A lot of this can just happen automatically thanks to the cloud services / Service Fabric layer.
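A heavily simplified sketch of that read-replica / primary-write split, with replica selection and conflict handling reduced to almost nothing:

```python
import itertools
import random

class RegionService:
    """One game region: writes go through the primary, reads can hit any replica."""

    def __init__(self, replica_count=3):
        self.replicas = [dict() for _ in range(replica_count)]  # index 0 is the primary
        self.event_queue = []                                   # incoming events, in order
        self._seq = itertools.count()

    def enqueue(self, event):
        """The ingress layer drops decrypted, validated events here sequentially."""
        self.event_queue.append((next(self._seq), event))

    def process_writes(self):
        """Apply queued events on the primary, then fan the new state out to replicas."""
        for _, event in self.event_queue:
            self.replicas[0][event["entity"]] = event["state"]
        for replica in self.replicas[1:]:
            replica.update(self.replicas[0])   # naive replication, just for the sketch
        self.event_queue.clear()

    def read(self, entity):
        """Reads are cheap and can be served from any replica."""
        return random.choice(self.replicas).get(entity)

region = RegionService()
region.enqueue({"entity": "ship-42", "state": {"pos": (10, 2, 7)}})
region.process_writes()
print(region.read("ship-42"))
```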
 
Because datacenter connections are much more likely to be consistent and predictable than player connections, geographic distribution and synchronization might actually make a lot of sense in a cloud situation.  There might be some lag for events from players in one RL region to players in another RL region, but that would happen regardless of implementation.  It is inevitable.  Distributing RL geographic regions via the cloud would likely actually normalize a lot of those inconsistencies.
 
Now you got me thinking even deeper.  Perhaps I need a diagram...  That'll be later.

I'd love for you to shoot holes in my solution.

 

Well, as stated, I don't know a whole lot about networking...

 

However, wouldn't anycast require you to have access to the routing tables of the ISPs carrying the connections? Like, say I get an IP address for a US server from Level 3, which is from a block of IPs assigned to them; how would I set that as the IP of e.g. a German server connected to the internet via Deutsche Telekom? I mean, I know that as an autonomous system you can set up anycast within your network, but how do you set something like that up for IPs you don't own (or from within networks to which that IP is not assigned)? How does NQ, who isn't even an AS, do something like that? I'm confused.

 

Also, what's the point of using DNS for a system where they make the client and the backend infrastructure? It won't obscure the endpoints clients connect to, IPs aren't secret information, you can just look at the traffic and aggregate a list of servers over time.

 

Wouldn't it be best to have your client connect to a service (behind some IP, maybe the client could store a list of alternatives) to log in, and at the same time a backend service would negotiate a connection to a cluster local to you?

 

As for DoS protection in general, the kind of architecture CodeGlitch0 mentioned would, on its own, go a long way toward mitigating DoS: since an in-game region's corresponding cluster might be geographically dispersed, DoSing would have unpredictable effects, and since you're probably trying to mess with some specific org or player, you can't point the attack very effectively (you'd have to know where they live and target their closest NQ server).

Of course if you're just trying to bring down game service across the board, you just target all the NQ servers, whose IPs would be public record -- just the nature of IPs and routing, nothing you can do about it. In order to protect the login service from being brought down by a cheaper DoS attack, the login backend would just have to be distributed across all the NQ servers as microservices, as CodeGlitch0 outlined above.


@LurkNautili:

Connections between nodes would be direct, yes, but messages would need to be queued once received to ensure they happen sequentially.  But once in the queue, they can be processed and acted on asynchronously, with some intelligence to resolve conflicts. Individual services, such as a region, can also make use of replicas to allow certain read operations to occur across nodes, while write operations happen on a single primary node to protect the system's integrity.  A lot of this can just happen automatically thanks to the cloud services / Service Fabric layer.

 

Doesn't TCP just do this on its own (sequence numbering in packet headers)? If I send that player A moves to point x, then y, then z, those TCP packets have to arrive in the order I send them and I don't need to worry about it, right? Or if you mean players A and B, both in regions distinct from mine and from each other's, violating causality... Well, I guess you'd have to maintain some sort of server simulation time indexing as well, which you could use to sort data arriving from different regions -- is this what you meant?

 

[EDIT] That could cause delays, though, if you allow a lagged-up packet from one region (at, say, server tick 1500) to delay the processing of events from other regions (at ticks 1501, 1600, whatever), right? It would have to be smarter than just a queue that you sort and execute in order. Is this what you're alluding to with read vs. write (some kind of analogy I don't get there)?


I apologize for constantly going back to Micro$oft tech.  I use M$ dev stack every day, so it is what I know best.

 

 

Well, as stated, I don't know a whole lot about networking...

 

However, wouldn't anycast require you to have access to the routing tables of the ISPs carrying the connections? Like, say I get an IP address for a US server from Level 3, which is from a block of IPs assigned to them; how would I set that as the IP of e.g. a German server connected to the internet via Deutsche Telekom? I mean, I know that as an autonomous system you can set up anycast within your network, but how do you set something like that up for IPs you don't own (or from within networks to which that IP is not assigned)? How does NQ, who isn't even an AS, do something like that? I'm confused.

 

Also, what's the point of using DNS for a system where they make the client and the backend infrastructure? It won't obscure the endpoints clients connect to, IPs aren't secret information, you can just look at the traffic and aggregate a list of servers over time.

 

Wouldn't it be best to have your client connect to a service (behind some IP, maybe the client could store a list of alternatives) to log in, and at the same time a backend service would negotiate a connection to a cluster local to you?

 

As for DoS protection in general, the kind of architecture CodeGlitch0 mentioned would, on its own, go a long way toward mitigating DoS: since an in-game region's corresponding cluster might be geographically dispersed, DoSing would have unpredictable effects, and since you're probably trying to mess with some specific org or player, you can't point the attack very effectively (you'd have to know where they live and target their closest NQ server).

Of course if you're just trying to bring down game service across the board, you just target all the NQ servers, whose IPs would be public record -- just the nature of IPs and routing, nothing you can do about it. In order to protect the login service from being brought down by a cheaper DoS attack, the login backend would just have to be distributed across all the NQ servers as microservices, as CodeGlitch0 outlined above.

 

There can be a number of benefits to having a level of protection at the DNS level.  With the advent of things like Azure DNS, you can go even further than just security with it. In the cloud, IP addresses can be very fluid.  There isn't a whole lot of need to hold on to a single public IP for long periods of time, unless that is how your system is designed.  In a microservices approach, IPs can come and go and it wouldn't really matter.  DNS protection would only help with the initial connection attempt for each client; most clients will cache results and then just use the same given IP until something fails and a reconnect is attempted.

 

Diving in... What can you achieve with DNS-layer protection? A couple of things, primarily: automatic geographic distribution, initial load balancing, and DDoS protection. Because IPs can be in flux, a cloud-based DNS service can protect against DNS-level DoS attacks by simply blocking evil IPs at the DNS layer. That prevents a level of DoS traffic from even getting to your app.  This is more beneficial if IPs change regularly; however, most apps' public access points are fairly static.

 

Second is geographic distribution and load balancing. In something like cloud DNS, you can have the server know about a number of affinity points for your app (read: datacenter-location specific).  Let's say you have "connect.dualthegame.com" as your primary connection point.  This service is distributed amongst 30 nodes/IPs in 3 geographically dispersed datacenters.  The DNS layer can know that you also have "america.connect.dualthegame.com," "europe.connect.dualthegame.com," and "asia.connect.dualthegame.com."  When that initial IP resolution query comes in, the DNS service can look at your IP, check its geographic location, and return the appropriate CNAME for your region.  It is also possible to select based on latency, I believe. Each of the geographic CNAMEs has a list of IP addresses, one for each cloud load balancer on the connection point for that datacenter.  When multiple IPs are returned, the client will essentially just choose one at random and connect to that.  This gives a degree of initial load balancing before your client even tries an actual connection to the app.
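In rough pseudocode terms, that resolution step might look like this; the CNAMEs are the ones from the example above, while the IPs and the GeoIP lookup are placeholders:

```python
import random

# The regional CNAMEs from the example above; the IPs and the GeoIP lookup are invented.
REGIONAL_ENDPOINTS = {
    "america": ("america.connect.dualthegame.com", ["52.0.0.10", "52.0.0.11"]),
    "europe":  ("europe.connect.dualthegame.com",  ["51.0.0.20", "51.0.0.21"]),
    "asia":    ("asia.connect.dualthegame.com",    ["13.0.0.30", "13.0.0.31"]),
}

def geoip_region(client_ip):
    """Placeholder for a real GeoIP or latency-based lookup."""
    return "europe"

def resolve(client_ip):
    """Return the regional CNAME plus that region's load balancer IPs."""
    cname, ips = REGIONAL_ENDPOINTS[geoip_region(client_ip)]
    return cname, ips

cname, ips = resolve("203.0.113.7")
print(cname)               # europe.connect.dualthegame.com
print(random.choice(ips))  # the client picks one load balancer more or less at random
```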

 

A malicious botnet would be mitigated in some respect on that initial connection attempt, if it uses DNS resolution.  The cloud DNS / front-end load balancers can build a list of evil IPs based on network traffic and block them before they even touch your app.  The load balancer denial also protects your app from direct IP address attacks.  If the attackers get through even that, the ingress layer has its own protection now, as I mentioned, because individual nodes are fluid and disposable.


TCP is a poor choice for gaming.  It would never be used for FPS, and I wouldn't use it for "tabbed targeting" RPG games either.  There's a lot of redundancy and overhead with TCP.  However, it won't make much of a difference for DU.  I just believe the server resources could be used more effectively elsewhere.

 

UDP can be used for unreliable data AND reliable ordered packets.

 

For more information:

http://www.gafferongames.com


Doesn't TCP just do this on its own (sequence numbering in packet headers)? If I send that player A moves to point x, then y, then z, those TCP packets have to arrive in the order I send them and I don't need to worry about it, right? Or if you mean players A and B, both in regions distinct from mine and from each other's, violating causality... Well, I guess you'd have to maintain some sort of server simulation time indexing as well, which you could use to sort data arriving from different regions -- is this what you meant?

 

[EDIT] That could cause delays, though, if you allow a lagged-up packet from one region (at, say, server tick 1500) to delay the processing of events from other regions (at ticks 1501, 1600, whatever), right? It would have to be smarter than just a queue that you sort and execute in order. Is this what you're alluding to with read vs. write (some kind of analogy I don't get there)?

 

I'm talking in the context of a "dumb" ingress layer.  Yes, at the TCP level, packets would be ordered and retried as normal.  But once through to the ingress nodes, that traffic is authenticated, validated, decrypted, and passed on to the actual game logic nodes.  It is at this level that I imagine a queuing system might be necessary.  If you have 100 ingress nodes and no queuing, that is a lot of asynchronous traffic (100 streams) being dumped into the game services (let's say 20 server nodes), and it all needs to be handled and processed at once. By implementing a queue (either in-process or separate/distributed) you can quickly accept that traffic, then process it as quickly as possible without fully flooding the services. If you use something like Azure Service Bus or Event Hubs (again, sorry for always posting M$ tech), those cloud services can ingest millions of messages per second and use message queue "topics" to intelligently route to only the nodes that require a given message, instead of every connected node receiving every message, always.  Again, it's a trade-off between raw performance and adding a bit of latency to be more intelligent and reduce overall traffic ingestion on specific nodes.
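A minimal in-process stand-in for that topic routing (no real Service Bus here, just a dictionary of subscriptions):

```python
from collections import defaultdict, deque

class TopicQueue:
    """Ingress nodes publish to topics; only subscribed game nodes see the message."""

    def __init__(self):
        self.subscriptions = defaultdict(list)  # topic -> [node names]
        self.queues = defaultdict(deque)        # node  -> pending messages

    def subscribe(self, node, topic):
        self.subscriptions[topic].append(node)

    def publish(self, topic, message):
        for node in self.subscriptions[topic]:  # route only where it's actually needed
            self.queues[node].append(message)

    def drain(self, node):
        """A game node works through its backlog at its own pace."""
        while self.queues[node]:
            yield self.queues[node].popleft()

bus = TopicQueue()
bus.subscribe("region-node-7", "region/7/events")
bus.subscribe("region-node-8", "region/8/events")
bus.publish("region/7/events", {"type": "voxel_broken", "pos": (1, 2, 3)})

for msg in bus.drain("region-node-7"):
    print(msg)  # region-node-8 never sees this message
```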


TCP is a poor choice for gaming.  It would never be used for FPS, and I wouldn't use it for "tabbed targeting" RPG games either.  There's a lot of redundancy and overhead with TCP.  However, it won't make much of a difference for DU.  I just believe the server resources could be used more effectively elsewhere.

 

UDP can be used for unreliable data AND reliable ordered packets.

 

For more information:

http://www.gafferongames.com

 

Yeah, there are definitely options out there.  But I imagine something like DU won't actually require a huge amount of network traffic per client.  Latency will be first and foremost.  But without a requirement for twitch combat (i.e. FPS shooters), the traffic will likely be mostly game events (voxels added/broken, shots fired/targeted, and player movement).  It can probably be accomplished within 50-100 kbps (~5-10 KB per second) on average per player. I could be way off on this number, but hey! It's wild speculation day!

 

Also, an ingress layer helps immensely with that.  The responsibility of those nodes is solely: accept traffic, decrypt it, validate/authenticate it, and pass it on to the game services.  The major difference between cloud/microservices and traditional game servers is that the cloud doesn't require every single bit of processing to happen on a single node.  So you aren't required to micromanage performance issues in the same regard as traditional servers.  The distribution means you can handle a lot more than normal.

 

EDIT: But you are right.  UDP is generally better for things like gaming.  When the game logic is primarily server-based, though, it might become a different beast. Dropped packets, except for movement and the like, can be really bad in that scenario.  Possibly a hybrid / dual-stream approach is the best choice.


Well, as stated, I don't know a whole lot about networking...

 

However, wouldn't anycast require you to have access to the routing tables of the ISPs carrying the connections? Like, say I get an IP address for a US server from Level 3, which is from a block of IPs assigned to them; how would I set that as the IP of e.g. a German server connected to the internet via Deutsche Telekom? I mean, I know that as an autonomous system you can set up anycast within your network, but how do you set something like that up for IPs you don't own (or from within networks to which that IP is not assigned)? How does NQ, who isn't even an AS, do something like that? I'm confused.

Anycast or an alternative solution is already in use by most cloud hosting companies.  There's no need to hack into their infrastructure.

 

Well, as stated, I don't know a whole lot about networking...

 

Also, what's the point of using DNS for a system where they make the client and the backend infrastructure? It won't obscure the endpoints clients connect to, IPs aren't secret information, you can just look at the traffic and aggregate a list of servers over time.

 

DNS allows for changes to IP addressing.  Novaquark would not need to know the IP address provided by the hosting company until the server name was created.  There's most definitely a reason servers have names instead of just having IP addresses.  Ask any Admin.

 

 

Well, as stated, I don't know a whole lot about networking...

Wouldn't it be best to have your client connect to a service (behind some IP, maybe the client could store a list of alternatives) to log in, and at the same time a backend service would negotiate a connection to a cluster local to you?

A single IP address is the SOLE reason why DDOS is so effective.  But once you do authenticate, your client will be connected to the server service that hosts the sector your avatar is in.  It may not be an actual server.  It could be a service, but it will most likely have a dedicated ethernet interface and IP address.

 

Well, as stated, I don't know a whole lot about networking...

As for DoS protection in general, the kind of architecture CodeGlitch0 mentioned would, on its own, go a long way toward mitigating DoS: since an in-game region's corresponding cluster might be geographically dispersed, DoSing would have unpredictable effects, and since you're probably trying to mess with some specific org or player, you can't point the attack very effectively (you'd have to know where they live and target their closest NQ server).

Of course if you're just trying to bring down game service across the board, you just target all the NQ servers, whose IPs would be public record -- just the nature of IPs and routing, nothing you can do about it. In order to protect the login service from being brought down by a cheaper DoS attack, the login backend would just have to be distributed across all the NQ servers as microservices, as CodeGlitch0 outlined above.

DDOSing is not about gaining an advantage over one player within a game. 

 

It's about holding the game hostage and asking for a ransom from Novaquark.  There's a good Wired article about how hackers held an online gambling site hostage, asking for a ransom.

 

A hacker doesn't need to send encrypted packets to the service.  All they have to do is overwhelm the network infrastructure with invalid packets (junk packets).  This will take the entire game down, thereby denying revenue to Novaquark.  At the very least, it turns off players and makes NQ look bad to the gaming community as a whole (which ultimately impacts revenue).

 

As for IPs..

 

I can go onto my EC2 servers and obtain an IP within seconds.  Registered to Amazon.  And I can drop that IP address seconds later.  There's no need for NQ to purchase an entire block of IPs.  This ability coupled with DNS resolution allows the hosting company to deal with the DDOS packets in multiple ways. 

 

The key concept is to not make anything a single point of failure, or define a static environment.  It should be completely dynamic.  This will give NQ and their hosting company several methods to address the mass spamming of invalid packets.


I don't know if this was mentioned elsewhere in the thread, but I believe in the AMA they said there would not be multiple game server clusters, only a single one active at any given time.

 

 

EDIT from AMA:


What server architecture are you using? ie. your own dedicated setup (eg. eve online) or cloud (eg. google compute, aws, etc). Will there be regional servers? US, EU, Oceania, S.America & Asia? Or is it dynamic global like star citizen is aiming for?

 

We won't have our own datacenter at start, that would be too costly and really not necessary in 2016 when there are so many high quality cloud offers dedicated to high performance gaming. So we will work with third parties to host the datacenter, and we are currently reviewing several offers. Now, due to our single-shard approach, we cannot have regional servers, there will be one central cluster to connect to. We are considering the possibility to switch from a US based cluster or Europe based cluster depending on the time of day (doing sync in the background or during down times), but this might not be necessary.

DNS allows for changes to IP addressing.  Novaquark would not need to know the IP address provided by the hosting company until the server name was created.  There's most definitely a reason servers have names instead of just having IP addresses.  Ask any Admin.

Yeah, I realized I kinda derped with that remark; obviously you wouldn't actually store the IP in the client regardless of what backend you had, rather you'd find the first point of contact via DNS. I should've more specifically referred to the part about generating addresses algorithmically.

 

TCP is a poor choice for gaming.  It would never be used for FPS, and I wouldn't use it for "tabbed targeting" RPG games either.  There's a lot of redundancy and overhead with TCP.  However, it won't make much of a difference for DU.  I just believe the server resources could be used more effectively elsewhere.

 

UDP can be used for unreliable data AND reliable ordered packets.

 

For more information:

http://www.gafferongames.com

 
You have a point there, I somehow had the notion in my head that games I play use TCP, but now that I think about it, the ones I know about probably use UDP.
 

I'm talking in the context of a "dumb" ingress layer.  Yes, at the TCP level, packets would be ordered and retried as normal.  But once through to the ingress nodes, that traffic is authenticated, validated, decrypted, and passed on to the actual game logic nodes.  It is at this level that I imagine a queuing system might be necessary.  If you have 100 ingress nodes and no queuing, that is a lot of asynchronous traffic (100 streams) being dumped into the game services (let's say 20 server nodes), and it all needs to be handled and processed at once. By implementing a queue (either in-process or separate/distributed) you can quickly accept that traffic, then process it as quickly as possible without fully flooding the services. If you use something like Azure Service Bus or Event Hubs (again, sorry for always posting M$ tech), those cloud services can ingest millions of messages per second and use message queue "topics" to intelligently route to only the nodes that require a given message, instead of every connected node receiving every message, always.  Again, it's a trade-off between raw performance and adding a bit of latency to be more intelligent and reduce overall traffic ingestion on specific nodes.

 
Yeah ignore what I said about TCP, hah.

Diving in... What can you achieve with DNS-layer protection? A couple of things, primarily: automatic geographic distribution, initial load balancing, and DDoS protection. Because IPs can be in flux, a cloud-based DNS service can protect against DNS-level DoS attacks by simply blocking evil IPs at the DNS layer. That prevents a level of DoS traffic from even getting to your app.  This is more beneficial if IPs change regularly; however, most apps' public access points are fairly static.

 

How would you distinguish between a malicious user asking for your IP via DNS vs. an innocent user? Many DoS attacks spoof source IPs for instance, right? What would prevent someone from playing the game innocently on their computer, mapping out the backend IPs and attacking those IPs with a completely unrelated botnet? I don't see how it's possible to keep IPs secret, and treating them like secrets seems like a bad security practice.

 

There's a reason why DDoS is an unsolved (possibly unsolvable) problem.

 

 

Second is geographic distribution and load balancing. In something like cloud DNS, you can have the server know about a number of affinity points for your app (read: datacenter-location specific).  Let's say you have "connect.dualthegame.com" as your primary connection point.  This service is distributed amongst 30 nodes/IPs in 3 geographically dispersed datacenters.  The DNS layer can know that you also have "america.connect.dualthegame.com," "europe.connect.dualthegame.com," and "asia.connect.dualthegame.com."  When that initial IP resolution query comes in, the DNS service can look at your IP, check its geographic location, and return the appropriate CNAME for your region.  It is also possible to select based on latency, I believe. Each of the geographic CNAMEs has a list of IP addresses, one for each cloud load balancer on the connection point for that datacenter.  When multiple IPs are returned, the client will essentially just choose one at random and connect to that.  This gives a degree of initial load balancing before your client even tries an actual connection to the app.

 

See that's the sort of thing I meant might make more sense than using anycast (I still don't understand how that would work, I've never seen anything like what was mentioned -- as I don't work in that industry).

 

 

A malicious botnet would be mitigated in some respect on that initial connection attempt, if it uses DNS resolution.  The cloud DNS / front-end load balancers can build a list of evil IPs based on network traffic and block them before they even touch your app.  The load balancer denial also protects your app from direct IP address attacks.  If the attackers get through even that, the ingress layer has its own protection now, as I mentioned, because individual nodes are fluid and disposable.

 

If I can get a list of IPs as I mentioned above, I wouldn't have to have my hypothetical botnet connect via the provided interface at all, I could just launch an attack of my choosing on any of the IPs I've discovered.

 

Distribution, on the other hand, is a valid strategy to mitigate traffic-concentration (network congestion) based DoS attacks, I agree.


Yeah, I realized I kinda derped with that remark; obviously you wouldn't actually store the IP in the client regardless of what backend you had, rather you'd find the first point of contact via DNS. I should've more specifically referred to the part about generating addresses algorithmically.

Sorry about the misunderstanding.

 

To clarify about the algorithm: my suggestion means the server name doesn't necessarily need to be static.  It just needs to be known on both ends.  That's why DDOS is so effective against web browsers and websites.  There's no way to do this with a generic browser, but it's completely doable with a game client.  A generic browser has one URL that it's attempting to connect to.  As long as the attacker floods junk packets at the URL, it doesn't matter if the hosting company changes IP addresses or points the URL somewhere else; the DDOS just hits the new location.

 

The server needs to know the randomly generated name so it can set up the name in DNS.  There's no charge for a third-level name, so there's no additional expense.  The server DOES need to create the name at least 24 hours in advance to account for DNS propagation.

 

The client needs to know the name so it can connect to the server.  The server names could be stored statically, but a hacker would eventually find them. 

 

An algorithm could be used to generate random names on both the server side and client side.  The client could also have 1 or 2 static names in the list for redundancy, but they're the ones most likely to be attacked.  The client just needs to go down the list of valid server names until it connects.
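On the client side, "going down the list" would look something like this; try_connect() is a placeholder for a real DNS lookup plus handshake, and the fallback names are invented:

```python
STATIC_FALLBACKS = ["login1.dualthegame.com", "login2.dualthegame.com"]  # placeholders

def try_connect(hostname):
    # Placeholder: a real client would do a DNS lookup and a handshake with a timeout.
    return False

def connect(generated_names):
    """Try today's generated names first, then fall back to the well-known static ones."""
    for name in generated_names + STATIC_FALLBACKS:
        if try_connect(name):
            return name
    raise ConnectionError("no reachable login server")
```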

 

The static server names AND the algorithm could be changed at NQs discretion via the standard client updates.  This would make it more difficult for a hacker, because they would need to start all over to hack the algorithm.


How would you distinguish between a malicious user asking for your IP via DNS vs. an innocent user? Many DoS attacks spoof source IPs for instance, right? What would prevent someone from playing the game innocently on their computer, mapping out the backend IPs and attacking those IPs with a completely unrelated botnet? I don't see how it's possible to keep IPs secret, and treating them like secrets seems like a bad security practice.

 

There's a reason why DDoS is an unsolved (possibly unsolvable) problem.

 

My previous post indicated why a generic client (browser) can't solve the DDOS issue the way a dedicated game client can.

 

DDOS attacks target a known resource. For example http://www.whitehouse.gov

 

My solution moves the critical game infrastructure from a single "known resource" (known to the attacker) to multiple new unknown resources.  The DDOS can continue to attack the "known resource", or the hosting company can block all traffic to the "known resource" entirely.  The game clients can then connect to the next server on the list.


Sorry about the misunderstanding.

 

To clarify about the algorithm: my suggestion means the server name doesn't necessarily need to be static.  It just needs to be known on both ends.  That's why DDOS is so effective against web browsers and websites.  There's no way to do this with a generic browser, but it's completely doable with a game client.  A generic browser has one URL that it's attempting to connect to.  As long as the attacker floods junk packets at the URL, it doesn't matter if the hosting company changes IP addresses or points the URL somewhere else; the DDOS just hits the new location.

The server needs to know the randomly generated name so it can set up the name in DNS.  There's no charge for a third-level name, so there's no additional expense.  The server DOES need to create the name at least 24 hours in advance to account for DNS propagation.

 

The client needs to know the name so it can connect to the server.  The server names could be stored statically, but a hacker would eventually find them. 

 

An algorithm could be used to generate random names on both the server side and client side.  The client could also have 1 or 2 static names in the list for redundancy, but they're the ones most likely to be attacked.  The client just needs to go down the list of valid server names until it connects.

 

The static server names AND the algorithm could be changed at NQs discretion via the standard client updates.  This would make it more difficult for a hacker, because they would need to start all over to hack the algorithm.

 

But my point is, the IP addresses behind the DNS aren't going to be changing much while the server is running. Once an attacker knows where your server is, all he has to do is get enough traffic to concentrate somewhere in the logical vicinity of your server, and he'll bring that whole region of the network down.

 

So within your system I could:

1. Set up packet capture locally.

2. Play the game normally, like any other legitimate customer would.

3. Obtain the NQ server(s) IP(s) by reading the traffic leaving my machine.

4. Use my C&C to issue an order to my botnet to nuke NQ servers. NQ servers are now offline.

5. ????

6. Profit.

Lesson? IP addresses aren't secrets, and you can't treat them as such in this kind of model.

 

Point being, that there is no easy defense against DDoS, all the mitigation systems I know of primarily work based on distributing the load of the attack, but you can't do that if you just have the one or two servers. Only players like Google can do that sort of thing.

