Monday, 12 August 2013

WebResponse HTTPS GetResponseStream Encoded

I'm just trying to get the HTML of an HTTPS web page, but I'm getting
back question marks and other junk characters. This is my main method:
    Public Function PostPage(ByVal URL As String, ByVal enc As Encoding) As String
        Try
            ServicePointManager.ServerCertificateValidationCallback =
                New RemoteCertificateValidationCallback(AddressOf ValidateCertificate)
            Dim htmlRequest As HttpWebRequest =
                DirectCast(WebRequest.Create(URL), HttpWebRequest)
            Dim htmlResponse As HttpWebResponse =
                DirectCast(htmlRequest.GetResponse(), HttpWebResponse)
            Return New System.IO.StreamReader(
                htmlResponse.GetResponseStream(), enc).ReadToEnd()
        Catch ex As Exception
            Console.WriteLine("Error: " & ex.Message)
        End Try
        Return ""
    End Function
You might notice I am bypassing certificate validation, and that the
encoding is a parameter.
Sometimes I include other headers such as Accept-Encoding: gzip, deflate
and User-Agent. But the main thing here is how I call this function. I use
the following:
    Sub LearnEncoding(ByVal MyURL As String)
        Dim dctResults As New Dictionary(Of String, String)
        For Each objEncoding In System.Text.Encoding.GetEncodings
            If dctResults.ContainsKey(objEncoding.DisplayName) = False Then
                Dim MySpider As New clsWebSpider
                dctResults.Add(objEncoding.DisplayName,
                               MySpider.PostPage(MyURL, objEncoding.GetEncoding))
            End If
        Next
    End Sub
So I try every encoding in the framework (139 of them), and the Dictionary
gives me a quick glance at the result of every attempt. Most are different
from each other, but all are junk.
However, when I run this and look at the response in Fiddler, it's perfect
HTML. So I'm getting the response back correctly; I just don't know how to
decode the HTML.
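
When every text encoding produces junk, the bytes may not be plain text at all. One possibility, offered as an assumption rather than a confirmed diagnosis, is that the body is gzip-compressed, since the post mentions sending Accept-Encoding: gzip, deflate and Fiddler can decode compressed bodies for display. The sketch below reads the raw bytes and checks for the gzip magic number (0x1F 0x8B). The module and helper names are hypothetical and are not part of the original class.

```vbnet
Imports System
Imports System.IO
Imports System.Net

Module RawBytesSketch
    ' Hypothetical helper: True when the buffer starts with the gzip magic number.
    Public Function LooksLikeGzip(ByVal body As Byte()) As Boolean
        Return body IsNot Nothing AndAlso body.Length >= 2 AndAlso
               body(0) = &H1F AndAlso body(1) = &H8B
    End Function

    ' Hypothetical helper: downloads the body without decoding it as text.
    Public Function ReadRawBody(ByVal url As String) As Byte()
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        ' Mirror the header the post says is sometimes sent.
        request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip, deflate")
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            ' Empty when the server did not declare a content encoding.
            Console.WriteLine("Content-Encoding: " & response.ContentEncoding)
            Using buffer As New MemoryStream()
                response.GetResponseStream().CopyTo(buffer)
                Return buffer.ToArray()
            End Using
        End Using
    End Function
End Module
```

If LooksLikeGzip(ReadRawBody(MyURL)) returns True, or the Content-Encoding line shows gzip or deflate, the problem is compression rather than character encoding.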
Could it have something to do with the certificate? ValidateCertificate
just returns True:
    Public Function ValidateCertificate(ByVal sender As Object,
                                        ByVal certificate As X509Certificate,
                                        ByVal chain As X509Chain,
                                        ByVal sslPolicyErrors As SslPolicyErrors) As Boolean
        Return True
    End Function
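
The validation callback only decides whether the server certificate is trusted during the TLS handshake. It does not change how the body bytes are decoded, so it is unlikely to be the cause here. If the cause is a compressed body (an assumption, not something the post confirms), a minimal sketch is to let HttpWebRequest inflate the response through its AutomaticDecompression property, which defaults to None. The module and function names below are hypothetical.

```vbnet
Imports System.IO
Imports System.Net
Imports System.Text

Module DecompressionSketch
    ' Hypothetical variant of PostPage that asks the framework to decompress.
    Public Function GetPage(ByVal url As String, ByVal enc As Encoding) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        ' Sends Accept-Encoding for gzip and deflate, and inflates the body before it is read.
        request.AutomaticDecompression =
            DecompressionMethods.GZip Or DecompressionMethods.Deflate
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(response.GetResponseStream(), enc)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

With this set, there is no need to add the Accept-Encoding header by hand, and the Using blocks also dispose the response and reader, which the original method leaves open.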
I also tried setting the encoding to GetEncoding(htmlResponse.CharacterSet).
Could it be an encoding that I haven't heard of? How would I find out which
one? Like I said, IE, Chrome, FF, Fiddler, etc. all decode it correctly, but
I don't know how to see what encoding they are using to get the HTML. The
charset in the response headers and in the meta tags both say UTF-8, but
decoding as UTF-8 gives me symbols too.
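
Once the bytes really are text, the charset named in the headers can be turned into an Encoding directly. Encoding.GetEncoding throws an ArgumentException for names it does not recognise, and CharacterSet can be empty, so a guarded lookup is safer than a bare call. This is a sketch only: the helper name is hypothetical and the UTF-8 fallback is an assumption based on what the post reports for this page.

```vbnet
Imports System
Imports System.Text

Module CharsetSketch
    ' Hypothetical helper: maps a charset name to an Encoding, falling back to UTF-8.
    Public Function ResolveEncoding(ByVal charsetName As String) As Encoding
        If String.IsNullOrWhiteSpace(charsetName) Then Return Encoding.UTF8
        Try
            ' Some servers wrap the charset value in double quotes; strip them first.
            Return Encoding.GetEncoding(charsetName.Trim().Trim(""""c))
        Catch ex As ArgumentException
            ' Unknown or malformed charset name.
            Return Encoding.UTF8
        End Try
    End Function
End Module
```

A call such as ResolveEncoding(htmlResponse.CharacterSet) could then supply the enc argument instead of looping over every encoding in the framework.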
